# Feature Cohort — Cross-Tool Analysis
Cohort: 8 tools × 3 trials × 5 judges = 120 valid judgments
Base model: claude-opus-4-7 (all tools)
Task: greenfield TD-CD Mode 2 feature on infina-partner-sdk
Generated: 2026-05-12
## Ranking (equal-weight 5-judge mean / 200)
Values are the Pooled Mean column from results/final-report.md. The weighted-mean ranking (the canonical one, with pre-registered 3/2/1/1/1 weights) is identical in order except for ranks 6/7, which swap: under the pooled mean omc edges claudekit; under the weighted mean claudekit edges omc. A sketch of both aggregation rules follows the table.
| Rank | Tool | Pooled Mean | Pooled σ |
|---|---|---|---|
| 1 | ecc | 152.80 | 15.15 |
| 2 | compound | 151.07 | 17.83 |
| 3 | pure | 148.27 | 17.79 |
| 4 | superpower | 145.60 | 14.26 |
| 5 | bmad | 145.27 | 21.97 |
| 6 | omc | 139.60 | 20.63 |
| 7 | claudekit | 138.87 | 21.14 |
| 8 | gstack | 126.73 | 23.81 |
Spread: 26.1 pts. Top 3 within ~4.5 pts. Lowest σ: superpower (14.26, the most consistent across trials).
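A minimal sketch of the two aggregation rules, assuming one tool's 5 judges × 3 trials of scores. The judge names and 3/2/1/1/1 weights come from this report; the example scores and function names are illustrative only:

```python
import statistics

# Pre-registered judge weights (3/2/1/1/1) from this report.
WEIGHTS = {"gpt54pro": 3, "opus": 2, "glm51": 1, "mimo25pro": 1, "grok420": 1}

def pooled_mean(judgments):
    """Equal-weight mean over all (judge, trial) scores for one tool."""
    scores = [s for per_judge in judgments.values() for s in per_judge]
    return statistics.mean(scores)

def weighted_mean(judgments):
    """Weight each judge's per-trial mean by the pre-registered weights."""
    total = sum(WEIGHTS[j] * statistics.mean(v) for j, v in judgments.items())
    return total / sum(WEIGHTS.values())

# Illustrative 5 judges x 3 trials for one tool (not real data).
example = {
    "gpt54pro":  [110, 118, 117],
    "opus":      [135, 140, 141],
    "glm51":     [147, 150, 146],
    "mimo25pro": [155, 158, 160],
    "grok420":   [157, 161, 158],
}
print(pooled_mean(example), weighted_mean(example))
```

The weighted rule pulls each tool toward its gpt54pro and opus scores, which is why only near-tied pairs (ranks 6/7 here) can swap between the two rules.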
## Effort & Cost (per-trial mean)
| Tool | Score | Files | +Lines | Cost | Wall (s) | Turns | Subagents |
|---|---|---|---|---|---|---|---|
| ecc | 152.8 | 27.0 | 1924 | $156.6 | 13313 | 216 | 1.7 |
| compound | 151.1 | 13.7 | 960 | $68.8 | 1317 | 165 | 1.7 |
| pure | 148.3 | 16.3 | 821 | $73.8 | 1822 | 138 | 1.0 |
| superpower | 145.6 | 36.0 | 2706 | $172.0 | 5079 | 183 | 18.3 |
| bmad | 145.3 | 8.3 | 521 | $43.6 | 1148 | 156 | 1.0 |
| omc | 139.6 | 34.3 | 1837 | $554.7 | 5046 | 171 | 14.0 |
| claudekit | 138.9 | 31.0 | 1621 | $93.8 | 2330 | 116 | 2.3 |
| gstack | 126.7 | 12.7 | 1000 | $63.6 | 1992 | 156 | 2.7 |
## Cost-efficiency (pts per $)
| Tool | Score / $ |
|---|---|
| bmad | 3.33 |
| compound | 2.20 |
| pure | 2.01 |
| gstack | 1.99 |
| claudekit | 1.48 |
| ecc | 0.98 |
| superpower | 0.85 |
| omc | 0.25 |
bmad delivers 95% of ecc’s score at 28% of the cost. omc burns 13× more than bmad for 6 fewer points.
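The score-per-dollar column is a straight ratio over the effort table; a sketch that reproduces it (data transcribed from that table, names ours):

```python
# (tool, pooled mean score, mean cost per trial in $), from the effort table above.
effort = [
    ("ecc", 152.8, 156.6), ("compound", 151.1, 68.8), ("pure", 148.3, 73.8),
    ("superpower", 145.6, 172.0), ("bmad", 145.3, 43.6), ("omc", 139.6, 554.7),
    ("claudekit", 138.9, 93.8), ("gstack", 126.7, 63.6),
]

# Points per dollar, sorted descending; reproduces the cost-efficiency table.
for tool, score, cost in sorted(effort, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{tool:11s} {score / cost:.2f}")
```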
## Key observations
### Quality vs verbosity is non-monotonic
Top scorer (ecc, 1924 lines, $156) and bottom scorer (gstack, 1000 lines, $63) both ship measurable code, yet sit 26 pts apart. compound matches ecc within 1.7 pts on under half the lines and under half the cost: judges weight design choices over diff size.
### Hard-gate pass ≠ judge score
- superpower: highest gate pass rate (5.0/6) but 4th in judge score
- gstack, bmad: 2.0/6 gates avg, yet bmad ranks 5th

Hard gates measure mechanical compliance (G1-G7); judges weight reasoning, structure, and test design. The two are orthogonal axes.
### Multi-agent fan-out doesn’t predict quality
- superpower: 18.3 subagents/trial → rank 4
- omc: 14.0 subagents/trial → rank 7 (last among full-power tools)
- bmad, pure: 1.0 subagent → ranks 5 and 3

Subagent expansion is a cost multiplier, not a quality multiplier on this task.
### omc is the cost outlier
$554.7/trial, 6.6× the median of the per-trial mean costs above (~$84; the arithmetic is sketched below). The driver is 14.0 subagents/trial combined with heavy cache reuse, and the spend is not justified by score (rank 7).
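The median and ratio, recomputed from the effort table (variable names ours):

```python
import statistics

# Per-trial mean costs ($) transcribed from the effort table above.
costs = [156.6, 68.8, 73.8, 172.0, 43.6, 554.7, 93.8, 63.6]

median_cost = statistics.median(costs)   # 83.8
print(median_cost, 554.7 / median_cost)  # ~6.6x for omc
```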
### Judge harshness gradient
| Judge | Cohort mean (/ 200) |
|---|---|
| gpt54pro | 115.0 |
| opus | 138.5 |
| glm51 | 148.0 |
| mimo25pro | 157.0 |
| grok420 | 159.0 |
gpt54pro is the floor, 44 pts below grok420. A judge's harshness is (roughly) a constant offset applied to every tool, so it shifts each tool's aggregate by the same amount under any fixed judge weights; both the weighted (3/2/1/1/1) and equal-weight aggregations are therefore insensitive to it, and rank 1 is identical under both rules. The sketch below shows the offset cancelling.
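A hypothetical two-tool, two-judge sketch of why a per-judge offset leaves the ranking intact (all names and numbers here are illustrative, not report data):

```python
# Two hypothetical tools scored by two judges; judge A is 30 pts harsher.
raw = {"tool_x": {"A": 120, "B": 150}, "tool_y": {"A": 115, "B": 145}}
weights = {"A": 3, "B": 1}

def agg(scores):
    """Fixed-weight aggregate of one tool's per-judge scores."""
    return sum(weights[j] * s for j, s in scores.items()) / sum(weights.values())

# Add judge A's harshness back in: every tool's aggregate moves by the same
# weighted amount (+30 * 3/4 here), so the gap and the ranking are unchanged.
adjusted = {t: {j: s + (30 if j == "A" else 0) for j, s in scores.items()}
            for t, scores in raw.items()}
print(agg(raw["tool_x"]) - agg(raw["tool_y"]))            # 5.0
print(agg(adjusted["tool_x"]) - agg(adjusted["tool_y"]))  # 5.0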
### Trial-to-trial volatility (pooled σ)
| Tool | Pooled σ |
|---|---|
| superpower | 14.26 (most stable) |
| ecc | 15.15 |
| pure | 17.79 |
| compound | 17.83 |
| omc | 20.63 |
| claudekit | 21.14 |
| bmad | 21.97 |
| gstack | 23.81 (most volatile) |
bmad, claudekit, omc, and gstack all show pooled σ above 20 pts, but most of that variance is judge disagreement rather than tool instability: the within_σ column in results/final-report.md shows that trial-to-trial noise within a judge is much smaller (4.7–15.5), so the dominant component is between_σ (judge base-rate spread). The decomposition is sketched below.
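A minimal sketch of that decomposition, assuming equal trial counts per judge; the scores below are illustrative, and the real within_σ / between_σ values live in results/final-report.md:

```python
import statistics

# Illustrative 5 judges x 3 trials for one tool (not real data).
scores = {
    "gpt54pro":  [112, 118, 116],
    "opus":      [136, 141, 139],
    "glm51":     [146, 150, 148],
    "mimo25pro": [156, 158, 157],
    "grok420":   [158, 160, 162],
}

flat = [s for per_judge in scores.values() for s in per_judge]
judge_means = [statistics.mean(v) for v in scores.values()]

pooled_var = statistics.pvariance(flat)
between_var = statistics.pvariance(judge_means)  # judge base-rate spread
within_var = statistics.mean(statistics.pvariance(v) for v in scores.values())

# Exact here because every judge scored the same number of trials:
# pooled variance = within-judge variance + between-judge variance.
print(pooled_var, within_var + between_var)
```

With judge base rates 40+ pts apart, the between term dominates, which is why a 20+ pt pooled σ says more about the judge panel than about the tool.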
## Headline takeaways
- ecc wins on quality but at high cost.
- bmad is the value pick — 95% of the score at 28% of the price.
- compound is the consensus pick — 2nd in score, 2nd in cost-efficiency.
- omc and superpower invest heavily in fan-out without commensurate judge reward on this task.
- gstack trails #7 by 12+ pts; the gap suggests a structural weakness, not noise.
## Caveats
- n=3 trials per tool × 5 judges = 15 judgments per cell. With pooled σ of 14–24, the naive standard error of a tool's mean is σ/√15 ≈ 4–6 pts (before accounting for judge correlation), so pairs within ~5 weighted pts should be read as ties on this n.
- One feature task. Per-tool generalization across the three tasks (feature, bugfix, refactor) is reported in the three per-task final-report.md files; cross-task synthesis as a single leaderboard is intentionally not reported (see README caveat 7).
- Judges sample at provider defaults (no temperature pin exposed for opus or gpt54pro).
- gpt54pro is the harshest judge (23.5 pts below the next-harshest, opus); rank-1 is stable under both weighted (3/2/1/1/1) and equal-weight aggregation on every task.