# Feature Cohort — Cross-Tool Analysis
Cohort: 8 tools × 3 trials × 5 judges = 120 valid judgments
Base model: claude-opus-4-7 (all tools)
Task: greenfield TD-CD Mode 2 feature on infina-partner-sdk
Generated: 2026-05-12
## Ranking (equal-weight 5-judge mean / 200)
Values are the Pooled Mean column from results/final-report.md. The weighted-mean ranking (the canonical one, with pre-registered 3/2/1/1/1 weights) is identical in order except for ranks 6/7, which swap: under the pooled mean omc edges claudekit; under the weighted mean claudekit edges omc. A sketch of both aggregation rules follows the table.
| Rank | Tool | Pooled Mean | Pooled σ |
|---|---|---|---|
| 1 | ecc | 152.80 | 15.15 |
| 2 | compound | 151.07 | 17.83 |
| 3 | pure | 148.27 | 17.79 |
| 4 | superpower | 145.60 | 14.26 |
| 5 | bmad | 145.27 | 21.97 |
| 6 | omc | 139.60 | 20.63 |
| 7 | claudekit | 138.87 | 21.14 |
| 8 | gstack | 126.73 | 23.81 |
Spread: 26.1 pts. Top 3 within ~4.5 pts. Lowest σ: superpower (14.26, the most consistent across trials).
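A minimal sketch of the two aggregation rules, assuming one tool's 5 judges × 3 trials of scores. The judge names and 3/2/1/1/1 weights come from this report; the example scores and function names are illustrative only:

```python
import statistics

# Pre-registered judge weights (3/2/1/1/1) from this report.
WEIGHTS = {"gpt54pro": 3, "opus": 2, "glm51": 1, "mimo25pro": 1, "grok420": 1}

def pooled_mean(judgments):
    """Equal-weight mean over all (judge, trial) scores for one tool."""
    scores = [s for per_judge in judgments.values() for s in per_judge]
    return statistics.mean(scores)

def weighted_mean(judgments):
    """Weight each judge's per-trial mean by the pre-registered weights."""
    total = sum(WEIGHTS[j] * statistics.mean(v) for j, v in judgments.items())
    return total / sum(WEIGHTS.values())

# Illustrative 5 judges x 3 trials for one tool (not real data).
example = {
    "gpt54pro":  [110, 118, 117],
    "opus":      [135, 140, 141],
    "glm51":     [147, 150, 146],
    "mimo25pro": [155, 158, 160],
    "grok420":   [157, 161, 158],
}
print(pooled_mean(example), weighted_mean(example))
```

The weighted rule pulls each tool toward its gpt54pro and opus scores, which is why only near-tied pairs (ranks 6/7 here) can swap between the two rules.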
## Effort & Cost (per-trial mean)
| Tool | Score | Files | +Lines | Cost | Wall (s) | Turns | Subagents |
|---|---|---|---|---|---|---|---|
| ecc | 152.8 | 27.0 | 1924 | $156.6 | 13313 | 216 | 1.7 |
| compound | 151.1 | 13.7 | 960 | $68.8 | 1317 | 165 | 1.7 |
| pure | 148.3 | 16.3 | 821 | $73.8 | 1822 | 138 | 1.0 |
| superpower | 145.6 | 36.0 | 2706 | $172.0 | 5079 | 183 | 18.3 |
| bmad | 145.3 | 8.3 | 521 | $43.6 | 1148 | 156 | 1.0 |
| omc | 139.6 | 34.3 | 1837 | $554.7 | 5046 | 171 | 14.0 |
| claudekit | 138.9 | 31.0 | 1621 | $93.8 | 2330 | 116 | 2.3 |
| gstack | 126.7 | 12.7 | 1000 | $63.6 | 1992 | 156 | 2.7 |
## Cost-efficiency (pts per $)
| Tool | Score / $ |
|---|---|
| bmad | 3.33 |
| compound | 2.20 |
| pure | 2.01 |
| gstack | 1.99 |
| claudekit | 1.48 |
| ecc | 0.98 |
| superpower | 0.85 |
| omc | 0.25 |
bmad delivers 95% of ecc’s score at 28% of the cost. omc burns 13× more than bmad for 6 fewer points.
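The score-per-dollar column is a straight ratio over the effort table; a sketch that reproduces it (data transcribed from that table, names ours):

```python
# (tool, pooled mean score, mean cost per trial in $), from the effort table above.
effort = [
    ("ecc", 152.8, 156.6), ("compound", 151.1, 68.8), ("pure", 148.3, 73.8),
    ("superpower", 145.6, 172.0), ("bmad", 145.3, 43.6), ("omc", 139.6, 554.7),
    ("claudekit", 138.9, 93.8), ("gstack", 126.7, 63.6),
]

# Points per dollar, sorted descending; reproduces the cost-efficiency table.
for tool, score, cost in sorted(effort, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{tool:11s} {score / cost:.2f}")
```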
## Key observations
### Quality vs verbosity is non-monotonic
Top scorer (ecc, 1924 lines, $156) and bottom scorer (gstack, 1000 lines, $63) both ship measurable code, yet sit 26 pts apart. compound matches ecc within 1.7 pts on under half the lines and under half the cost: judges weight design choices over diff size.
### Hard-gate pass ≠ judge score
- superpower: highest gate pass rate (5.0/6) but 4th in judge score
- gstack, bmad: 2.0/6 gates avg, yet bmad ranks 5th

Hard gates measure mechanical compliance (G1-G7); judges weight reasoning, structure, and test design. The two are orthogonal axes.
### Multi-agent fan-out doesn’t predict quality
- superpower: 18.3 subagents/trial → rank 4
- omc: 14.0 subagents/trial → rank 7 (last among full-power tools)
- bmad, pure: 1.0 subagent → ranks 5 and 3

Subagent expansion is a cost multiplier, not a quality multiplier on this task.
### omc is the cost outlier
$554.7/trial, 6.6× the median of the per-trial mean costs above (~$84; the arithmetic is sketched below). The driver is 14.0 subagents/trial combined with heavy cache reuse, and the spend is not justified by score (rank 7).
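The median and ratio, recomputed from the effort table (variable names ours):

```python
import statistics

# Per-trial mean costs ($) transcribed from the effort table above.
costs = [156.6, 68.8, 73.8, 172.0, 43.6, 554.7, 93.8, 63.6]

median_cost = statistics.median(costs)   # 83.8
print(median_cost, 554.7 / median_cost)  # ~6.6x for omc
```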
### Judge harshness gradient
| Judge | Cohort mean (/ 200) |
|---|---|
| gpt54pro | 115.0 |
| opus | 138.5 |
| glm51 | 148.0 |
| mimo25pro | 157.0 |
| grok420 | 159.0 |
gpt54pro is the floor, 44 pts below grok420. A judge's harshness is (roughly) a constant offset applied to every tool, so it shifts each tool's aggregate by the same amount under any fixed judge weights; both the weighted (3/2/1/1/1) and equal-weight aggregations are therefore insensitive to it, and rank 1 is identical under both rules. The sketch below shows the offset cancelling.
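A hypothetical two-tool, two-judge sketch of why a per-judge offset leaves the ranking intact (all names and numbers here are illustrative, not report data):

```python
# Two hypothetical tools scored by two judges; judge A is 30 pts harsher.
raw = {"tool_x": {"A": 120, "B": 150}, "tool_y": {"A": 115, "B": 145}}
weights = {"A": 3, "B": 1}

def agg(scores):
    """Fixed-weight aggregate of one tool's per-judge scores."""
    return sum(weights[j] * s for j, s in scores.items()) / sum(weights.values())

# Add judge A's harshness back in: every tool's aggregate moves by the same
# weighted amount (+30 * 3/4 here), so the gap and the ranking are unchanged.
adjusted = {t: {j: s + (30 if j == "A" else 0) for j, s in scores.items()}
            for t, scores in raw.items()}
print(agg(raw["tool_x"]) - agg(raw["tool_y"]))            # 5.0
print(agg(adjusted["tool_x"]) - agg(adjusted["tool_y"]))  # 5.0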
### Trial-to-trial volatility (pooled σ)
| Tool | Pooled σ |
|---|---|
| superpower | 14.26 (most stable) |
| ecc | 15.15 |
| pure | 17.79 |
| compound | 17.83 |
| omc | 20.63 |
| claudekit | 21.14 |
| bmad | 21.97 |
| gstack | 23.81 (most volatile) |
bmad, claudekit, omc, and gstack all show pooled σ above 20 pts, but most of that variance is judge disagreement rather than tool instability: the within_σ column in results/final-report.md shows that trial-to-trial noise within a judge is much smaller (4.7–15.5), so the dominant component is between_σ (judge base-rate spread). The decomposition is sketched below.
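A minimal sketch of that decomposition, assuming equal trial counts per judge; the scores below are illustrative, and the real within_σ / between_σ values live in results/final-report.md:

```python
import statistics

# Illustrative 5 judges x 3 trials for one tool (not real data).
scores = {
    "gpt54pro":  [112, 118, 116],
    "opus":      [136, 141, 139],
    "glm51":     [146, 150, 148],
    "mimo25pro": [156, 158, 157],
    "grok420":   [158, 160, 162],
}

flat = [s for per_judge in scores.values() for s in per_judge]
judge_means = [statistics.mean(v) for v in scores.values()]

pooled_var = statistics.pvariance(flat)
between_var = statistics.pvariance(judge_means)  # judge base-rate spread
within_var = statistics.mean(statistics.pvariance(v) for v in scores.values())

# Exact here because every judge scored the same number of trials:
# pooled variance = within-judge variance + between-judge variance.
print(pooled_var, within_var + between_var)
```

With judge base rates 40+ pts apart, the between term dominates, which is why a 20+ pt pooled σ says more about the judge panel than about the tool.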
## Headline takeaways
- ecc wins on quality but at high cost.
- bmad is the value pick — 95% of the score at 28% of the price.
- compound is the consensus pick — 2nd in score, 2nd in cost-efficiency.
- omc and superpower invest heavily in fan-out without commensurate judge reward on this task.
- gstack trails #7 by 12+ pts; the gap suggests a structural weakness, not noise.
## Caveats
- n=3 trials per tool × 5 judges = 15 judgments per cell. With pooled σ of 14–24, the naive standard error of a tool's mean is σ/√15 ≈ 4–6 pts (before accounting for judge correlation), so pairs within ~5 weighted pts should be read as ties on this n.
- One feature task. Per-tool generalization across the three tasks (feature, bugfix, refactor) is reported in the three per-task final-report.md files; cross-task synthesis as a single leaderboard is intentionally not reported (see README caveat 7).
- Judges sample at provider defaults (no temperature pin exposed for opus or gpt54pro).
- gpt54pro is the harshest judge (23.5 pts below the next-harshest, opus); rank-1 is stable under both weighted (3/2/1/1/1) and equal-weight aggregation on every task.