# Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:49Z
## Inputs and source artifacts

Same inputs as the canonical final-report.md; only the aggregation rule changes here.

- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md, README.md, landing page
## Methodology

- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting: every judge contributes weight 1 (vs. the published weighted mean's opus×3, gpt54pro×2, others×1).
- Use this report to verify rank stability under operator-neutral weighting.
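The difference between the two rules can be sketched in a few lines. Judge names and weights below follow the report's description (opus×3, gpt54pro×2, others×1); the third judge name and all score values are illustrative, not taken from the benchmark data.

```python
# Equal-weight vs. weighted aggregation of per-judge scores.
# "judge3" and the score values are hypothetical placeholders.
scores = {"opus": 150.0, "gpt54pro": 148.0, "judge3": 145.0}
weights = {"opus": 3, "gpt54pro": 2, "judge3": 1}

# Equal weighting: every judge contributes weight 1.
equal_weight_mean = sum(scores.values()) / len(scores)

# Published rule: weighted mean with the weights above.
weighted_mean = (
    sum(weights[j] * s for j, s in scores.items()) / sum(weights.values())
)
```

With these placeholder scores the two rules differ by under a point, which is the kind of gap the cross-rule comparison below is probing.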
## Ranking (Equal-Weight Mean)
- ecc — 152.22/200
- compound — 148.07/200
- bmad — 147.87/200
- pure — 146.51/200
- superpower — 146.22/200
- omc — 140.00/200
- claudekit — 135.53/200
- gstack — 127.13/200
## Detail
| Tool | Equal-Weight Mean | Pooled σ | within_σ | between_σ | N |
|---|---|---|---|---|---|
| ecc | 152.22 | 14.42 | 5.46 | 14.72 | 45 |
| compound | 148.07 | 17.30 | 7.09 | 17.29 | 45 |
| bmad | 147.87 | 20.61 | 6.79 | 21.54 | 45 |
| pure | 146.51 | 16.71 | 6.12 | 17.30 | 45 |
| superpower | 146.22 | 15.27 | 7.49 | 14.72 | 45 |
| omc | 140.00 | 20.91 | 9.56 | 20.56 | 45 |
| claudekit | 135.53 | 20.33 | 14.32 | 16.30 | 45 |
| gstack | 127.13 | 21.88 | 12.10 | 20.00 | 45 |
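A minimal sketch of the variance decomposition the table columns suggest, assuming within_σ is the average per-round spread across judges, between_σ the spread of round means, and pooled σ the sample standard deviation over all N scores; the grouping and the score values here are illustrative assumptions, not the report's data.

```python
import statistics

# Hypothetical scores: one score per judge, grouped by round.
rounds = [
    [150.0, 148.0, 155.0],  # round 1
    [140.0, 142.0, 139.0],  # round 2
    [160.0, 158.0, 161.0],  # round 3
]

# Pooled σ: sample std over all scores, ignoring grouping.
all_scores = [s for r in rounds for s in r]
pooled_sigma = statistics.stdev(all_scores)

# within_σ (assumed): mean of per-round sample stds across judges.
within_sigma = statistics.mean(statistics.stdev(r) for r in rounds)

# between_σ (assumed): sample std of the round means.
between_sigma = statistics.stdev(statistics.mean(r) for r in rounds)
```

Under this reading, a tool like claudekit (within_σ 14.32 vs. between_σ 16.30) has judges disagreeing nearly as much inside a round as its scores vary across rounds.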
## Cross-rule comparison

Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 swap by at most two positions.
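The rank-stability check described above can be sketched as follows, assuming each rule yields a tool-to-score mapping; the weighted-mean values here are hypothetical, not the published figures.

```python
def ranks(scores):
    """Map each tool to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {tool: i + 1 for i, tool in enumerate(ordered)}

# Equal-weight scores from this report (top three tools);
# weighted scores are illustrative placeholders.
equal = {"ecc": 152.22, "compound": 148.07, "bmad": 147.87}
weighted = {"ecc": 153.10, "compound": 147.50, "bmad": 148.20}

r_eq, r_w = ranks(equal), ranks(weighted)

# Largest rank shift any tool undergoes between the two rules.
max_shift = max(abs(r_eq[t] - r_w[t]) for t in r_eq)
```

In this toy example rank 1 is stable (ecc under both rules) while the mid-pack pair swaps by one position, mirroring the pattern the comparison reports.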