# bugfix — Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:50Z
## Inputs and source artifacts

Same inputs as the canonical final-report.md — only the aggregation rule changes here.

- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md · README.md · landing page
## Methodology

- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting — every judge contributes weight 1 (vs. the published weighted mean's opus×3, gpt54pro×2, others×1).
- Use this to verify rank stability under operator-neutral weighting.
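The two aggregation rules can be sketched as follows. The judge names and scores below are hypothetical illustrations; only the weighting rule (opus×3, gpt54pro×2, others×1) comes from the report.

```python
# Sketch: equal-weight vs. published weighted judge aggregation.
# Judge names/scores are hypothetical; weights follow the published rule.

WEIGHTS = {"opus": 3, "gpt54pro": 2}  # every other judge defaults to weight 1

def equal_weight_mean(scores: dict[str, float]) -> float:
    """Every judge contributes weight 1 (the rule used in this companion)."""
    return sum(scores.values()) / len(scores)

def weighted_mean(scores: dict[str, float]) -> float:
    """Published rule from final-report.md: opus x3, gpt54pro x2, others x1."""
    num = sum(WEIGHTS.get(j, 1) * s for j, s in scores.items())
    den = sum(WEIGHTS.get(j, 1) for j in scores)
    return num / den

scores = {"opus": 190.0, "gpt54pro": 180.0, "sonnet": 170.0}
print(equal_weight_mean(scores))  # 180.0
print(weighted_mean(scores))      # (3*190 + 2*180 + 170) / 6 ≈ 183.33
```

Under equal weighting every judge moves the mean by the same amount, which is what makes the rule operator-neutral.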
## Ranking (Equal-Weight Mean)
- claudekit — 185.47/200
- ecc — 181.29/200
- pure — 176.76/200
- bmad — 175.02/200
- superpower — 169.20/200
- compound — 168.00/200
- omc — 167.47/200
- gstack — 161.18/200
## Detail
| Tool | Equal-Weight Mean | Pooled σ | Within σ | Between σ | N |
|---|---|---|---|---|---|
| claudekit | 185.47 | 11.48 | 8.51 | 8.05 | 45 |
| ecc | 181.29 | 12.25 | 8.16 | 10.14 | 45 |
| pure | 176.76 | 13.24 | 11.22 | 8.55 | 45 |
| bmad | 175.02 | 16.05 | 11.81 | 12.41 | 45 |
| superpower | 169.20 | 13.02 | 6.21 | 12.64 | 45 |
| compound | 168.00 | 12.27 | 8.58 | 10.09 | 45 |
| omc | 167.47 | 20.89 | 15.07 | 15.64 | 45 |
| gstack | 161.18 | 16.05 | 7.10 | 15.98 | 45 |
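One plausible reading of the Within σ / Between σ columns is a spread decomposition over the 3-round layout. The exact pooling rule is not restated in this companion, so the sketch below is an assumption: it groups scores by round, takes the pooled within-round spread and the spread of round means. The sample data is hypothetical.

```python
import statistics

def spread_components(rounds: list[list[float]]) -> tuple[float, float]:
    """Return (within_sigma, between_sigma) for scores grouped by round.

    Assumed decomposition (not confirmed by the report):
    - within: scores' spread around their own round mean, pooled across rounds
    - between: spread of the round means around the grand mean
    """
    within_var = statistics.mean(statistics.pvariance(r) for r in rounds)
    round_means = [statistics.mean(r) for r in rounds]
    between_var = statistics.pvariance(round_means)
    return within_var ** 0.5, between_var ** 0.5

# Hypothetical two-round example:
w, b = spread_components([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(w, b)  # small within-round spread, larger between-round spread
```

A large Between σ relative to Within σ (as for superpower or gstack in the table) would indicate that a tool's scores shift round-to-round more than they vary inside any single round.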
## Cross-rule comparison

Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 may swap by at most two positions.
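The rank-stability check above can be sketched as a per-tool position comparison between the two orderings. The `equal` order mirrors this report's ranking; the `weighted` order is a hypothetical stand-in, since final-report.md's exact ordering is not reproduced here.

```python
def position_shifts(rank_a: list[str], rank_b: list[str]) -> dict[str, int]:
    """Absolute position change of each tool between two rank orders."""
    pos_b = {tool: i for i, tool in enumerate(rank_b)}
    return {tool: abs(i - pos_b[tool]) for i, tool in enumerate(rank_a)}

# Equal-weight order from this report; weighted order is hypothetical.
equal = ["claudekit", "ecc", "pure", "bmad",
         "superpower", "compound", "omc", "gstack"]
weighted = ["claudekit", "ecc", "bmad", "pure",
            "compound", "superpower", "omc", "gstack"]

shifts = position_shifts(equal, weighted)
print(max(shifts.values()))  # largest swap between the two rules
```

A max shift of 0 for rank 1 and small shifts mid-pack would confirm the stability claim above.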