Feature Cohort — Cross-Tool Analysis

Cohort: 8 tools × 3 trials × 5 judges = 120 valid judgments
Base model: claude-opus-4-7 (all tools)
Task: greenfield TD-CD Mode 2 feature on infina-partner-sdk
Generated: 2026-05-12

Ranking (equal-weight 5-judge mean / 200)

Values are the Pooled Mean column from results/final-report.md. The weighted-mean ranking (the canonical one, with pre-registered 3/2/1/1/1 weights) is identical in order except that ranks 6 and 7 swap: omc edges claudekit under the pooled mean, while claudekit edges omc under the weighted mean.

| Rank | Tool | Pooled Mean | Pooled σ |
|------|------|-------------|----------|
| 1 | ecc | 152.80 | 15.15 |
| 2 | compound | 151.07 | 17.83 |
| 3 | pure | 148.27 | 17.79 |
| 4 | superpower | 145.60 | 14.26 |
| 5 | bmad | 145.27 | 21.97 |
| 6 | omc | 139.60 | 20.63 |
| 7 | claudekit | 138.87 | 21.14 |
| 8 | gstack | 126.73 | 23.81 |

Spread: 26.1 pts. Top 3 within 4.5 pts. Lowest σ: superpower (14.26 — most consistent across trials).
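The two aggregation rules above can be sketched directly. The per-judge scores below are illustrative (the cohort-mean values from the judge table, not any single tool's row), and which judge receives which weight is an assumption of this sketch, not something the report specifies:

```python
# Sketch: equal-weight "pooled mean" vs the pre-registered weighted mean.
# Scores are illustrative per-judge means (out of 200); the pairing of
# weights to judges is an assumption of this sketch.
scores = [115.0, 138.5, 148.0, 157.0, 159.0]
weights = [3, 2, 1, 1, 1]  # pre-registered 3/2/1/1/1 scheme

pooled = sum(scores) / len(scores)
weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(f"pooled={pooled:.2f} weighted={weighted:.2f}")
```

With this (assumed) ordering the weighted mean leans toward the first two judges; the two rules can therefore disagree on close pairs, which is exactly the ranks-6/7 swap noted above.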

Effort & Cost (per-trial mean)

| Tool | Score | Files | +Lines | Cost ($) | Wall (s) | Turns | Subagents |
|------|-------|-------|--------|----------|----------|-------|-----------|
| ecc | 152.8 | 27.0 | 1924 | 156.6 | 13313 | 216 | 1.7 |
| compound | 151.1 | 13.7 | 960 | 68.8 | 1317 | 165 | 1.7 |
| pure | 148.3 | 16.3 | 821 | 73.8 | 1822 | 138 | 1.0 |
| superpower | 145.6 | 36.0 | 2706 | 172.0 | 5079 | 183 | 18.3 |
| bmad | 145.3 | 8.3 | 521 | 43.6 | 1148 | 156 | 1.0 |
| omc | 139.6 | 34.3 | 1837 | 554.7 | 5046 | 171 | 14.0 |
| claudekit | 138.9 | 31.0 | 1621 | 93.8 | 2330 | 116 | 2.3 |
| gstack | 126.7 | 12.7 | 1000 | 63.6 | 1992 | 156 | 2.7 |

Cost-efficiency (pts per $)

| Tool | Score / $ |
|------|-----------|
| bmad | 3.33 |
| compound | 2.20 |
| pure | 2.01 |
| gstack | 1.99 |
| claudekit | 1.48 |
| ecc | 0.98 |
| superpower | 0.85 |
| omc | 0.25 |

bmad delivers 95% of ecc’s score at 28% of the cost. omc burns 13× more than bmad for 6 fewer points.
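The pts-per-$ column follows mechanically from the effort table; a minimal sketch (scores and per-trial costs copied from the tables above):

```python
# Sketch: derive the cost-efficiency ranking (score / cost) from the
# per-trial means in the effort table.
data = {  # tool: (score, cost in $)
    "ecc": (152.8, 156.6), "compound": (151.1, 68.8),
    "pure": (148.3, 73.8), "superpower": (145.6, 172.0),
    "bmad": (145.3, 43.6), "omc": (139.6, 554.7),
    "claudekit": (138.9, 93.8), "gstack": (126.7, 63.6),
}
eff = {tool: score / cost for tool, (score, cost) in data.items()}
for tool in sorted(eff, key=lambda t: -eff[t]):
    print(f"{tool:<10} {eff[tool]:.2f}")
```

Reproduces the table above: bmad first at 3.33 pts/$, omc last at 0.25.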

Key observations

Quality vs verbosity is non-monotonic

Top scorer (ecc, 1924 lines, $156) and bottom scorer (gstack, 1000 lines, $63) both ship substantial code, yet their scores sit 26 pts apart. compound matches ecc within 1.7 pts on under half the lines and under half the cost; judges weight design choices over diff size.

Hard-gate pass ≠ judge score

Multi-agent fan-out doesn’t predict quality

superpower (18.3 subagents/trial) ranks 4th while pure (1.0 subagent) ranks 3rd; omc (14.0) sits near the bottom. Heavy fan-out does not buy a higher judge score on this task.
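This claim can be checked directly against the effort table; a sketch computing the Pearson correlation between subagent count and score (values copied from the table above, rows in table order):

```python
# Sketch: Pearson correlation between subagent fan-out and judge score,
# using the per-trial means from the effort table.
subagents = [1.7, 1.7, 1.0, 18.3, 1.0, 14.0, 2.3, 2.7]
scores = [152.8, 151.1, 148.3, 145.6, 145.3, 139.6, 138.9, 126.7]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(subagents, scores)
print(f"r = {r:.2f}")  # near zero: fan-out does not track score
```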

omc is the cost outlier

$554.7/trial, roughly 6.6× the cohort median (≈$84). Heavy fan-out (14 subagents/trial) with deep cache reuse drives the bill, and the spend is not justified by score (rank 7 under the canonical weighted mean).

Judge harshness gradient

| Judge | Cohort mean |
|-------|-------------|
| gpt54pro | 115.0 |
| opus | 138.5 |
| glm51 | 148.0 |
| mimo25pro | 157.0 |
| grok420 | 159.0 |

gpt54pro is the floor, 44 pts below grok420. Because every tool faces the same five judges, a constant per-judge harshness offset shifts all tools equally and drops out of both the weighted (3/2/1/1/1) and equal-weight aggregations; rank 1 is identical under both rules.
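Why the gradient is harmless for ranking: a per-judge additive offset contributes the same weighted-average shift to every tool's mean. A toy sketch with synthetic numbers (tool qualities and offsets are illustrative, not cohort data):

```python
# Sketch: a constant per-judge harshness offset cancels out of any
# fixed-weight aggregation, leaving the tool ranking unchanged.
tools = {"A": 150.0, "B": 145.0, "C": 130.0}       # synthetic quality
offsets = [-30.0, -5.0, 3.0, 12.0, 14.0]           # harsh ... lenient
weights = [3, 2, 1, 1, 1]

def weighted_mean(quality):
    # Each judge scores quality + its own offset; the weighted average
    # of the offsets is added identically to every tool's mean.
    return sum(w * (quality + o) for w, o in zip(weights, offsets)) / sum(weights)

ranked = sorted(tools, key=lambda t: -weighted_mean(tools[t]))
print(ranked)  # same order as sorting by raw quality
```

The offsets would matter only if judges were harsh on some tools and lenient on others (a judge × tool interaction), which a constant-offset model rules out.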

Trial-to-trial volatility (pooled σ)

| Tool | Pooled σ |
|------|----------|
| superpower | 14.26 (most stable) |
| ecc | 15.15 |
| pure | 17.79 |
| compound | 17.83 |
| omc | 20.63 |
| claudekit | 21.14 |
| bmad | 21.97 |
| gstack | 23.81 (most volatile) |

bmad, claudekit, omc, and gstack all show pooled σ above 20 pts, but most of that variance is judge disagreement rather than tool instability. The within_σ column in results/final-report.md shows trial-to-trial within-judge noise is much smaller (4.7–15.5); the dominant component is between_σ (judge base-rate spread).
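Assuming the report's columns decompose in the standard law-of-total-variance way (pooled² ≈ within² + between², an assumption about how results/final-report.md is computed), the between-judge component can be backed out; a sketch with one illustrative within_σ value:

```python
import math

# Sketch: split pooled variance into within-judge (trial noise) and
# between-judge (base-rate spread) components via the law of total
# variance: pooled^2 ~ within^2 + between^2.
pooled_sigma = 21.97   # e.g. bmad's pooled sigma from the table above
within_sigma = 8.0     # illustrative within-judge trial noise

between_sigma = math.sqrt(pooled_sigma**2 - within_sigma**2)
print(f"between ~= {between_sigma:.2f}")
```

Even with within_σ at the top of the reported 4.7–15.5 range, between_σ stays the larger component for the four high-σ tools, consistent with the judge-disagreement reading.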

Headline takeaways

  1. ecc wins on quality but at high cost.
  2. bmad is the value pick — 95% of the score at 28% of the price.
  3. compound is the consensus pick — 2nd in score, 2nd in cost-efficiency.
  4. omc and superpower invest heavily in fan-out without commensurate judge reward on this task.
  5. gstack trails rank 7 by 12+ pts; a gap that size suggests a structural weakness, not noise.

Caveats