refactor — Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:51Z
Inputs and source artifacts
Same inputs as the canonical final-report.md — only the aggregation rule changes here.
- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md, README.md, landing page
Methodology
- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting — every judge contributes weight 1 (vs the published weighted mean’s opus×3, gpt54pro×2, others×1).
- Use this to verify rank stability under operator-neutral weighting.
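The two aggregation rules can be sketched as follows. The judge names, per-judge mean scores, and the third judge's label are hypothetical illustrations, not values from this corpus; only the weight assignments (opus×3, gpt54pro×2, others×1) follow the published rule described above.

```python
# Equal-weight vs weighted aggregation of per-judge mean scores.
# Scores below are hypothetical; weights follow the published weighted rule.
scores = {"opus": 184.0, "gpt54pro": 181.5, "judge_c": 179.0}  # hypothetical judge means
weights = {"opus": 3, "gpt54pro": 2, "judge_c": 1}             # published weighting

# Equal weighting: every judge contributes weight 1.
equal_weight = sum(scores.values()) / len(scores)

# Weighted mean: each judge's score scaled by its weight.
weighted = sum(scores[j] * weights[j] for j in scores) / sum(weights.values())

print(round(equal_weight, 2), round(weighted, 2))  # → 181.5 182.33
```

With these illustrative numbers the weighted rule pulls the aggregate toward the heavier judges, which is exactly the operator influence the equal-weight variant removes.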
Ranking (Equal-Weight Mean)
- pure — 182.87/200
- superpower — 181.67/200
- bmad — 181.02/200
- claudekit — 180.80/200
- ecc — 180.29/200
- compound — 177.07/200
- gstack — 173.56/200
- omc — 173.07/200
Detail
| Tool | Equal-Weight Mean | Pooled σ | within_σ | between_σ | N |
|---|---|---|---|---|---|
| pure | 182.87 | 12.89 | 4.54 | 13.37 | 45 |
| superpower | 181.67 | 13.92 | 4.07 | 14.72 | 45 |
| bmad | 181.02 | 14.96 | 5.71 | 15.20 | 45 |
| claudekit | 180.80 | 17.80 | 4.38 | 19.02 | 45 |
| ecc | 180.29 | 15.86 | 4.70 | 16.72 | 45 |
| compound | 177.07 | 16.49 | 5.28 | 17.26 | 45 |
| gstack | 173.56 | 21.49 | 12.70 | 19.14 | 45 |
| omc | 173.07 | 18.46 | 7.90 | 18.41 | 45 |
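One plausible reading of the σ columns is a within-group vs between-group decomposition of each tool's 45 scores. The sketch below assumes grouping by judge, with within_σ as the average round-to-round spread inside each judge and between_σ as the spread of per-judge means; the grouping, estimator choices, and all score values are assumptions for illustration, not the report's exact procedure.

```python
import statistics

# Hypothetical layout: 3 rounds per judge for one tool.
rounds_by_judge = {
    "judge_a": [180.0, 182.0, 181.0],
    "judge_b": [170.0, 173.0, 171.0],
    "judge_c": [190.0, 188.0, 189.0],
}

# Pooled σ: sample std dev over all scores for the tool.
all_scores = [s for rs in rounds_by_judge.values() for s in rs]
pooled_sigma = statistics.stdev(all_scores)

# within_σ: mean of the per-judge round-to-round std devs.
within_sigma = statistics.mean(
    statistics.stdev(rs) for rs in rounds_by_judge.values()
)

# between_σ: std dev of the per-judge mean scores.
between_sigma = statistics.stdev(
    [statistics.mean(rs) for rs in rounds_by_judge.values()]
)
```

Under this reading, a tool like gstack (within_σ 12.70) is noisy round-to-round for the same judge, while claudekit's dispersion (between_σ 19.02) comes mostly from judges disagreeing with each other.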
Cross-rule comparison
Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 may swap by at most two positions.
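The rank-stability check above can be sketched as a per-tool rank-shift count between the two orderings. The equal-weight order is the ranking from this report; the weighted order is a hypothetical stand-in (final-report.md holds the real one), shown here only to illustrate the computation.

```python
# Per-tool rank movement between two aggregation rules.
equal_rank = ["pure", "superpower", "bmad", "claudekit",
              "ecc", "compound", "gstack", "omc"]       # this report
weighted_rank = ["pure", "superpower", "claudekit", "bmad",
                 "ecc", "compound", "gstack", "omc"]    # hypothetical

pos = {tool: i for i, tool in enumerate(weighted_rank)}
shifts = {tool: abs(i - pos[tool]) for i, tool in enumerate(equal_rank)}
max_shift = max(shifts.values())
print(max_shift)  # → 1 (largest per-tool movement in this illustration)
```

A max shift of 0 at rank 1 and small mid-pack shifts is what "rank-stable under operator-neutral weighting" means operationally.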