# Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:49Z
## Inputs and source artifacts

Same inputs as the canonical final-report.md; only the aggregation rule changes here.

- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md, README.md, landing page
## Methodology

- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting: every judge contributes weight 1 (vs. the published weighted mean's opus×3, gpt54pro×2, others×1).
- Use this report to verify rank stability under operator-neutral weighting.
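The difference between the two rules can be sketched in a few lines. Judge names and weights below follow the report's description (opus×3, gpt54pro×2, others×1); the third judge name and all score values are illustrative, not taken from the benchmark data.

```python
# Equal-weight vs. weighted aggregation of per-judge scores.
# "judge3" and the score values are hypothetical placeholders.
scores = {"opus": 150.0, "gpt54pro": 148.0, "judge3": 145.0}
weights = {"opus": 3, "gpt54pro": 2, "judge3": 1}

# Equal weighting: every judge contributes weight 1.
equal_weight_mean = sum(scores.values()) / len(scores)

# Published rule: weighted mean with the weights above.
weighted_mean = (
    sum(weights[j] * s for j, s in scores.items()) / sum(weights.values())
)
```

With these placeholder scores the two rules differ by under a point, which is the kind of gap the cross-rule comparison below is probing.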
## Ranking (Equal-Weight Mean)
- ecc — 152.22/200
- compound — 148.07/200
- bmad — 147.87/200
- pure — 146.51/200
- superpower — 146.22/200
- omc — 140.00/200
- claudekit — 135.53/200
- gstack — 127.13/200
## Detail
| Tool | Equal-Weight Mean | Pooled σ | within_σ | between_σ | N |
|---|---|---|---|---|---|
| ecc | 152.22 | 14.42 | 5.46 | 14.72 | 45 |
| compound | 148.07 | 17.30 | 7.09 | 17.29 | 45 |
| bmad | 147.87 | 20.61 | 6.79 | 21.54 | 45 |
| pure | 146.51 | 16.71 | 6.12 | 17.30 | 45 |
| superpower | 146.22 | 15.27 | 7.49 | 14.72 | 45 |
| omc | 140.00 | 20.91 | 9.56 | 20.56 | 45 |
| claudekit | 135.53 | 20.33 | 14.32 | 16.30 | 45 |
| gstack | 127.13 | 21.88 | 12.10 | 20.00 | 45 |
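A minimal sketch of the variance decomposition the table columns suggest, assuming within_σ is the average per-round spread across judges, between_σ the spread of round means, and pooled σ the sample standard deviation over all N scores; the grouping and the score values here are illustrative assumptions, not the report's data.

```python
import statistics

# Hypothetical scores: one score per judge, grouped by round.
rounds = [
    [150.0, 148.0, 155.0],  # round 1
    [140.0, 142.0, 139.0],  # round 2
    [160.0, 158.0, 161.0],  # round 3
]

# Pooled σ: sample std over all scores, ignoring grouping.
all_scores = [s for r in rounds for s in r]
pooled_sigma = statistics.stdev(all_scores)

# within_σ (assumed): mean of per-round sample stds across judges.
within_sigma = statistics.mean(statistics.stdev(r) for r in rounds)

# between_σ (assumed): sample std of the round means.
between_sigma = statistics.stdev(statistics.mean(r) for r in rounds)
```

Under this reading, a tool like claudekit (within_σ 14.32 vs. between_σ 16.30) has judges disagreeing nearly as much inside a round as its scores vary across rounds.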
## Cross-rule comparison

Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 swap by at most two positions.
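The rank-stability check described above can be sketched as follows, assuming each rule yields a tool-to-score mapping; the weighted-mean values here are hypothetical, not the published figures.

```python
def ranks(scores):
    """Map each tool to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {tool: i + 1 for i, tool in enumerate(ordered)}

# Equal-weight scores from this report (top three tools);
# weighted scores are illustrative placeholders.
equal = {"ecc": 152.22, "compound": 148.07, "bmad": 147.87}
weighted = {"ecc": 153.10, "compound": 147.50, "bmad": 148.20}

r_eq, r_w = ranks(equal), ranks(weighted)

# Largest rank shift any tool undergoes between the two rules.
max_shift = max(abs(r_eq[t] - r_w[t]) for t in r_eq)
```

In this toy example rank 1 is stable (ecc under both rules) while the mid-pack pair swaps by one position, mirroring the pattern the comparison reports.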