refactor — Equal-Weight Aggregation (companion to final-report.md)
Generated: 2026-05-15T03:20:51Z
Inputs and source artifacts
Same inputs as the canonical final-report.md — only the aggregation rule changes here.
- Trial input (task PRD): _blind-eval/prd.md
- Per-tool prompt prefix: scripts/manual-bench.sh
- Judge input (verbatim request payload): _blind-eval/Alpha/round1/(<judge>-judge.json.request.json)
- Judge prompt template: scripts/generate-judge-prompt-combined-v2.sh
- Methodology and threats to validity: PAPER.md, README.md, landing page
Methodology
- Same cohort, judges, rubric, and 3-round layout as final-report.md.
- Equal weighting — every judge contributes weight 1 (vs the published weighted mean’s opus×3, gpt54pro×2, others×1).
- Use this to verify rank stability under operator-neutral weighting.
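The two aggregation rules can be sketched as follows. The judge names, per-judge mean scores, and the third judge's label are hypothetical illustrations, not values from this corpus; only the weight assignments (opus×3, gpt54pro×2, others×1) follow the published rule described above.

```python
# Equal-weight vs weighted aggregation of per-judge mean scores.
# Scores below are hypothetical; weights follow the published weighted rule.
scores = {"opus": 184.0, "gpt54pro": 181.5, "judge_c": 179.0}  # hypothetical judge means
weights = {"opus": 3, "gpt54pro": 2, "judge_c": 1}             # published weighting

# Equal weighting: every judge contributes weight 1.
equal_weight = sum(scores.values()) / len(scores)

# Weighted mean: each judge's score scaled by its weight.
weighted = sum(scores[j] * weights[j] for j in scores) / sum(weights.values())

print(round(equal_weight, 2), round(weighted, 2))  # → 181.5 182.33
```

With these illustrative numbers the weighted rule pulls the aggregate toward the heavier judges, which is exactly the operator influence the equal-weight variant removes.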
Ranking (Equal-Weight Mean)
- pure — 182.87/200
- superpower — 181.67/200
- bmad — 181.02/200
- claudekit — 180.80/200
- ecc — 180.29/200
- compound — 177.07/200
- gstack — 173.56/200
- omc — 173.07/200
Detail
| Tool | Equal-Weight Mean | Pooled σ | within_σ | between_σ | N |
|---|---|---|---|---|---|
| pure | 182.87 | 12.89 | 4.54 | 13.37 | 45 |
| superpower | 181.67 | 13.92 | 4.07 | 14.72 | 45 |
| bmad | 181.02 | 14.96 | 5.71 | 15.20 | 45 |
| claudekit | 180.80 | 17.80 | 4.38 | 19.02 | 45 |
| ecc | 180.29 | 15.86 | 4.70 | 16.72 | 45 |
| compound | 177.07 | 16.49 | 5.28 | 17.26 | 45 |
| gstack | 173.56 | 21.49 | 12.70 | 19.14 | 45 |
| omc | 173.07 | 18.46 | 7.90 | 18.41 | 45 |
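One plausible reading of the σ columns is a within-group vs between-group decomposition of each tool's 45 scores. The sketch below assumes grouping by judge, with within_σ as the average round-to-round spread inside each judge and between_σ as the spread of per-judge means; the grouping, estimator choices, and all score values are assumptions for illustration, not the report's exact procedure.

```python
import statistics

# Hypothetical layout: 3 rounds per judge for one tool.
rounds_by_judge = {
    "judge_a": [180.0, 182.0, 181.0],
    "judge_b": [170.0, 173.0, 171.0],
    "judge_c": [190.0, 188.0, 189.0],
}

# Pooled σ: sample std dev over all scores for the tool.
all_scores = [s for rs in rounds_by_judge.values() for s in rs]
pooled_sigma = statistics.stdev(all_scores)

# within_σ: mean of the per-judge round-to-round std devs.
within_sigma = statistics.mean(
    statistics.stdev(rs) for rs in rounds_by_judge.values()
)

# between_σ: std dev of the per-judge mean scores.
between_sigma = statistics.stdev(
    [statistics.mean(rs) for rs in rounds_by_judge.values()]
)
```

Under this reading, a tool like gstack (within_σ 12.70) is noisy round-to-round for the same judge, while claudekit's dispersion (between_σ 19.02) comes mostly from judges disagreeing with each other.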
Cross-rule comparison
Compare the Equal-Weight Mean here against the Weighted Mean in final-report.md. Rank 1 is identical under both rules on every task in this corpus; mid-pack ranks 4–7 may swap by at most two positions.
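The rank-stability check above can be sketched as a per-tool rank-shift count between the two orderings. The equal-weight order is the ranking from this report; the weighted order is a hypothetical stand-in (final-report.md holds the real one), shown here only to illustrate the computation.

```python
# Per-tool rank movement between two aggregation rules.
equal_rank = ["pure", "superpower", "bmad", "claudekit",
              "ecc", "compound", "gstack", "omc"]       # this report
weighted_rank = ["pure", "superpower", "claudekit", "bmad",
                 "ecc", "compound", "gstack", "omc"]    # hypothetical

pos = {tool: i for i, tool in enumerate(weighted_rank)}
shifts = {tool: abs(i - pos[tool]) for i, tool in enumerate(equal_rank)}
max_shift = max(shifts.values())
print(max_shift)  # → 1 (largest per-tool movement in this illustration)
```

A max shift of 0 at rank 1 and small mid-pack shifts is what "rank-stable under operator-neutral weighting" means operationally.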