# bugfix — Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:50Z
## Inputs and source artifacts

Same inputs as the canonical final-report.md — only the aggregation rule changes here.

- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md · README.md · landing page
## Methodology

- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting — every judge contributes weight 1 (vs. the published weighted mean's opus×3, gpt54pro×2, others×1).
- Use this to verify rank stability under operator-neutral weighting.
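The two aggregation rules can be sketched as follows. The judge names and scores below are hypothetical illustrations; only the weighting rule (opus×3, gpt54pro×2, others×1) comes from the report.

```python
# Sketch: equal-weight vs. published weighted judge aggregation.
# Judge names/scores are hypothetical; weights follow the published rule.

WEIGHTS = {"opus": 3, "gpt54pro": 2}  # every other judge defaults to weight 1

def equal_weight_mean(scores: dict[str, float]) -> float:
    """Every judge contributes weight 1 (the rule used in this companion)."""
    return sum(scores.values()) / len(scores)

def weighted_mean(scores: dict[str, float]) -> float:
    """Published rule from final-report.md: opus x3, gpt54pro x2, others x1."""
    num = sum(WEIGHTS.get(j, 1) * s for j, s in scores.items())
    den = sum(WEIGHTS.get(j, 1) for j in scores)
    return num / den

scores = {"opus": 190.0, "gpt54pro": 180.0, "sonnet": 170.0}
print(equal_weight_mean(scores))  # 180.0
print(weighted_mean(scores))      # (3*190 + 2*180 + 170) / 6 ≈ 183.33
```

Under equal weighting every judge moves the mean by the same amount, which is what makes the rule operator-neutral.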
## Ranking (Equal-Weight Mean)
- claudekit — 185.47/200
- ecc — 181.29/200
- pure — 176.76/200
- bmad — 175.02/200
- superpower — 169.20/200
- compound — 168.00/200
- omc — 167.47/200
- gstack — 161.18/200
## Detail
| Tool | Equal-Weight Mean | Pooled σ | Within σ | Between σ | N |
|---|---|---|---|---|---|
| claudekit | 185.47 | 11.48 | 8.51 | 8.05 | 45 |
| ecc | 181.29 | 12.25 | 8.16 | 10.14 | 45 |
| pure | 176.76 | 13.24 | 11.22 | 8.55 | 45 |
| bmad | 175.02 | 16.05 | 11.81 | 12.41 | 45 |
| superpower | 169.20 | 13.02 | 6.21 | 12.64 | 45 |
| compound | 168.00 | 12.27 | 8.58 | 10.09 | 45 |
| omc | 167.47 | 20.89 | 15.07 | 15.64 | 45 |
| gstack | 161.18 | 16.05 | 7.10 | 15.98 | 45 |
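One plausible reading of the Within σ / Between σ columns is a spread decomposition over the 3-round layout. The exact pooling rule is not restated in this companion, so the sketch below is an assumption: it groups scores by round, takes the pooled within-round spread and the spread of round means. The sample data is hypothetical.

```python
import statistics

def spread_components(rounds: list[list[float]]) -> tuple[float, float]:
    """Return (within_sigma, between_sigma) for scores grouped by round.

    Assumed decomposition (not confirmed by the report):
    - within: scores' spread around their own round mean, pooled across rounds
    - between: spread of the round means around the grand mean
    """
    within_var = statistics.mean(statistics.pvariance(r) for r in rounds)
    round_means = [statistics.mean(r) for r in rounds]
    between_var = statistics.pvariance(round_means)
    return within_var ** 0.5, between_var ** 0.5

# Hypothetical two-round example:
w, b = spread_components([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(w, b)  # small within-round spread, larger between-round spread
```

A large Between σ relative to Within σ (as for superpower or gstack in the table) would indicate that a tool's scores shift round-to-round more than they vary inside any single round.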
## Cross-rule comparison

Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 may swap by at most two positions.
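The rank-stability check above can be sketched as a per-tool position comparison between the two orderings. The `equal` order mirrors this report's ranking; the `weighted` order is a hypothetical stand-in, since final-report.md's exact ordering is not reproduced here.

```python
def position_shifts(rank_a: list[str], rank_b: list[str]) -> dict[str, int]:
    """Absolute position change of each tool between two rank orders."""
    pos_b = {tool: i for i, tool in enumerate(rank_b)}
    return {tool: abs(i - pos_b[tool]) for i, tool in enumerate(rank_a)}

# Equal-weight order from this report; weighted order is hypothetical.
equal = ["claudekit", "ecc", "pure", "bmad",
         "superpower", "compound", "omc", "gstack"]
weighted = ["claudekit", "ecc", "bmad", "pure",
            "compound", "superpower", "omc", "gstack"]

shifts = position_shifts(equal, weighted)
print(max(shifts.values()))  # largest swap between the two rules
```

A max shift of 0 for rank 1 and small shifts mid-pack would confirm the stability claim above.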