Worktree clone
Pinned base SHA · isolated HOME · PRD prepared
Open benchmark · 2026-05 snapshot
A blind, multi-judge evaluation of the Claude Code ecosystem (plugins, skill packs, hook kits, and a no-addon baseline) on feature, bugfix, and refactor work in a real TypeScript monorepo. 1800 judgments: 8 setups × 3 tasks × 5 trials × 5 judges × 3 rounds, i.e. 75 per (tool, task) cell. No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.
Same PRD, same base commit, isolated worktrees produce the artifacts. Then judging runs in two parallel modes: a 5-judge panel scores each diff independently (canonical, weighted mean), and a comparative-rank sidecar has a single judge rank all 8 diffs head-to-head, run as two independent model lanes (validity probe). Their Spearman ρ measures whether the rank order survives a different judging regime.
Pinned base SHA · isolated HOME · PRD prepared
Skills · agents · hooks · MCP dropped into HOME
Fire slash command · approve-only · tool implements the task
auto-metrics · diff · session-logs · time · tokens
Committed under results/<tool>/t<N>/
Each blinded diff scored independently by 5 judges against a 20-item / 200-pt rubric. Weighted mean is the headline number.
All 8 blinded diffs ranked 1–8 head-to-head by a single judge in one prompt — run as two independent lanes (Opus 4.7 1M + GPT-5.4) and triangulated against the panel. Produces a relative order; never enters the weighted mean.
triangulation & per-tool Δ → (omitted from public release)
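As a rough illustration of the triangulation step, the sketch below computes Spearman's ρ between a panel ordering and a head-to-head ordering. It uses the feature-task numbers published further down this page (panel weighted means and pooled head-to-head mean ranks); the helper names are hypothetical, and the benchmark's own ρ is derived from the omitted triangulation artifacts, so its value may differ.

```python
from statistics import mean

def average_ranks(values, reverse=False):
    """Rank values 1..n (1 = smallest, or largest if reverse); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-indexed
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(rank_a, rank_b):
    """Spearman's rho as the Pearson correlation of two rank vectors."""
    ma, mb = mean(rank_a), mean(rank_b)
    cov = sum((a - ma) * (b - mb) for a, b in zip(rank_a, rank_b))
    var_a = sum((a - ma) ** 2 for a in rank_a)
    var_b = sum((b - mb) ** 2 for b in rank_b)
    return cov / (var_a * var_b) ** 0.5

# Feature task, same tool order in both lists (values copied from this page).
TOOLS = ["bmad", "claudekit", "compound", "ecc", "gstack", "omc", "pure", "superpower"]
PANEL_FEATURE = [141.3, 135.0, 134.7, 153.3, 132.0, 139.5, 143.1, 140.2]   # higher = better
HEAD_TO_HEAD_FEATURE = [5.64, 3.06, 5.84, 2.54, 6.44, 4.78, 3.32, 4.38]   # lower = better

panel_ranks = average_ranks(PANEL_FEATURE, reverse=True)
h2h_ranks = average_ranks(HEAD_TO_HEAD_FEATURE)
rho_feature = spearman_rho(panel_ranks, h2h_ranks)
print(f"feature-task Spearman rho: {rho_feature:.3f}")
```

A ρ near 1 would mean the two judging regimes agree on the order; a value well below 1 is the regime drift the sidecar is meant to expose.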
Built purely from the head-to-head lanes — the canonical benchmark ranking is the 5-judge panel weighted mean (per-task final-report.md); this never enters it. One judgment = one model ranking all 8 tools 1–8 in one prompt; per (task) = 50 pooled observations (Opus-1M + GPT-5.4, 25 cells each, equal weight). Score = mean pooled rank, lower = better. Overall = mean of the 3 per-task mean ranks.
| # | Tool | Comp. mean rank | feature | bugfix | refactor |
|---|---|---|---|---|---|
| 1 | pure | 3.03 | 3.32 | 3.86 | 1.92 |
| 2 | claudekit | 4.27 | 3.06 | 5.42 | 4.34 |
| 3 | superpower | 4.45 | 4.38 | 4.20 | 4.76 |
| 4 | compound | 4.48 | 5.84 | 2.62 | 4.98 |
| 5 | bmad | 4.71 | 5.64 | 4.10 | 4.38 |
| 6 | ecc | 4.79 | 2.54 | 5.72 | 6.10 |
| 7 | omc | 5.07 | 4.78 | 5.48 | 4.96 |
| 8 | gstack | 5.20 | 6.44 | 4.60 | 4.56 |
This ordering diverges from the canonical panel (panel per-task rank-1s: ecc / claudekit / pure). For example, ecc ranks first head-to-head on feature (2.54) but last on bugfix (5.72) and refactor (6.10), which drags it to rank 6 overall: the regime drift that §2.5 quantifies. Per-task tables + σ + per-lane means: _comparative-ranking.md → (omitted from public release)
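The overall column can be re-derived from the three per-task columns. The sketch below copies the per-task mean ranks from the table above and averages them with equal weight; the variable names are illustrative only.

```python
from statistics import mean

# (feature, bugfix, refactor) mean pooled ranks, copied from the table above.
PER_TASK_MEAN_RANK = {
    "pure":       (3.32, 3.86, 1.92),
    "claudekit":  (3.06, 5.42, 4.34),
    "superpower": (4.38, 4.20, 4.76),
    "compound":   (5.84, 2.62, 4.98),
    "bmad":       (5.64, 4.10, 4.38),
    "ecc":        (2.54, 5.72, 6.10),
    "omc":        (4.78, 5.48, 4.96),
    "gstack":     (6.44, 4.60, 4.56),
}

# Overall = equal-weight mean of the three per-task mean ranks; lower is better.
overall = {tool: mean(ranks) for tool, ranks in PER_TASK_MEAN_RANK.items()}
ordering = sorted(overall, key=overall.get)

for position, tool in enumerate(ordering, start=1):
    print(position, tool, round(overall[tool], 2))
```

One sanity check that follows from the design: because every judgment ranks all 8 tools 1 to 8, each task column should sum to 36 (8 tools × mean rank 4.5).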
Rank-1 by task is the canonical claim of this benchmark. Each panel plots the weighted mean (200 max) with a mean ± standard-error envelope (N=75 judgments per cell: 5 trials × 5 judges × 3 rounds — every trial ran 3 rounds, the canonical run plus two added stability rounds, so 15 judgments per judge per cell). The dashed line is the cohort mean. Where the horizontal bars overlap, the pair should be read as a tie at this sample size.
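The sample-size arithmetic and the "overlapping bars mean a tie" reading can be sketched as follows. This is a simplified illustration: it uses an unweighted mean and made-up scores, whereas the published panels plot the weighted mean, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Judgment counts as stated in the text.
TOOLS, TASKS, TRIALS, JUDGES, ROUNDS = 8, 3, 5, 5, 3
per_cell = TRIALS * JUDGES * ROUNDS      # judgments per (tool, task) cell
per_judge_per_cell = TRIALS * ROUNDS     # judgments each judge contributes to a cell
total = TOOLS * TASKS * per_cell         # judgments in the whole corpus

def se_envelope(scores):
    """Return (mean - SE, mean + SE), with SE = sample stdev / sqrt(n)."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores))
    return (m - se, m + se)

def envelopes_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical scores for two setups, for illustration only.
setup_a = se_envelope([140, 150, 160, 145, 155])
setup_b = se_envelope([150, 152, 154, 156, 158])
print(per_cell, total, envelopes_overlap(setup_a, setup_b))
```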
z̄ is the equal-weight mean of per-task z-scores (each task's z computed against its 8-tool cohort mean / stdev). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline (see caveat 08) — read the per-task panels above for the canonical claim. Shown here as a visual aid only; collapsing three tasks into one number masks the task-specific specialisation that is the main finding.
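A minimal sketch of the z̄ computation, using the rounded per-task scores from the cost tables on this page. It assumes the population standard deviation (`statistics.pstdev`), which reproduces the gstack feature value of z = −1.28 quoted in the feature observation; the dictionary layout and function name are illustrative.

```python
from statistics import mean, pstdev

# Weighted-mean scores per task, copied from the Score columns on this page.
SCORES = {
    "feature": {"bmad": 141.3, "claudekit": 135.0, "compound": 134.7, "ecc": 153.3,
                "gstack": 132.0, "omc": 139.5, "pure": 143.1, "superpower": 140.2},
    "bugfix": {"bmad": 165.7, "claudekit": 178.9, "compound": 166.2, "ecc": 172.3,
               "gstack": 160.0, "omc": 164.8, "pure": 169.5, "superpower": 166.4},
    "refactor": {"bmad": 177.7, "claudekit": 178.0, "compound": 174.4, "ecc": 173.6,
                 "gstack": 144.9, "omc": 170.1, "pure": 180.2, "superpower": 177.6},
}

def task_z(scores):
    """z-score of each tool against its 8-tool cohort mean and population stdev."""
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    return {tool: (score - mu) / sd for tool, score in scores.items()}

Z = {task: task_z(scores) for task, scores in SCORES.items()}

# z-bar: equal-weight mean of the three per-task z-scores.
z_bar = {tool: mean(Z[task][tool] for task in SCORES) for tool in SCORES["feature"]}

print(round(Z["feature"]["gstack"], 2))
```

Because z̄ averages away the per-task sign flips, the per-task entries in `Z` remain the better place to look for specialisation.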
Chips show per-task z-scores. ★ Orange = tool's best task · dashed = tool's worst task. Click a row for the tool's transcript-grounded profile: mechanism, invocation, observed behaviors, failure modes.
No setup is top-2 on all three tasks. The right framing is "best by task", not "best overall". Three task-rooted observations; read the per-task reports for the full per-tool breakdown.
On the greenfield Mode-2 CD Batch feature, ecc (153.30) leads pure (143.13, rank-2) and bmad (141.33, rank-3), but the rank-1 lead (10.17 pts) is within the 19.33-pt feature MDE — a statistical tie, like every per-task rank-1 lead. The only separation that clears MDE anywhere in the corpus is ecc − gstack on feature (21.3 > 19.33); gstack at 131.98 (z = −1.28) is the lowest. On the per-judge z-normalized sensitivity, ecc is the only setup cleanly outside the pack (z = +2.17 vs next +0.23).
Read the feature report →
On a near-maturity filter bugfix, claudekit (178.93) and ecc (172.31) sit ≈ 3–9 pts above the rest by point estimate. pure (baseline) lands rank-3 at 169.53. The formal MDE at the cohort's n=5 is 22.17 pts (bugfix): claudekit's 9.4-pt lead over pure is well below the detection threshold, so the strict "tools add no value over the bare CLI" null is not rejected at α=0.05 / 80% power. An earlier reading rejected it against a heuristic ~5-pt tie envelope; the power analysis retracts that claim.
Read the bugfix report →
On the aggregate-ownership refactor, the bare CLI takes rank-1 (180.19) over a 3-way tied cluster: claudekit (178.04), bmad (177.74), and superpower (177.56). The top-4 span is 2.6 weighted pts, well within the between-judge σ envelope, so this is "tools do not outperform baseline" rather than "pure dominates". Across the corpus pure is the only setup that lands top-3 on all three tasks (feature rank-2, bugfix rank-3, refactor rank-1), the strongest counter-claim to the "you need addons" prior.
Read the refactor report →
Source artifacts per task: what the tool saw, what the judges saw, the aggregated report.
| Task | PRD | Blind labels & judge requests | Report | Equal-weight |
|---|---|---|---|---|
| feature | PRD → (omitted from public release) | labels & requests → (omitted from public release) | aggregated → | equal-weight → |
| bugfix | PRD → (omitted from public release) | labels & requests → (omitted from public release) | aggregated → | equal-weight → |
| refactor | PRD → (omitted from public release) | labels & requests → (omitted from public release) | aggregated → | equal-weight → |
Mean across t1–t3 per (tool, task) cell, mined from the raw session JSONL. (t4–t5 session-audit re-run pending — the score panel above is already at n=5.)
Surfaces the most surprising finding: most setups' multi-agent architecture either does not fire on this corpus or fires inconsistently. Generated by scripts/audit-sessions.py from session-audit.
| Tool | feature | bugfix | refactor |
|---|---|---|---|
| bmad | 1.0 | 0.0 | 1.3 |
| claudekit | 2.3 | 0.7 | 0.3 |
| compound | 1.7 | 0.0 | 0.0 |
| ecc | 1.7 | 1.0 | 1.0 |
| gstack | 2.7 | 0.0 | 2.7 |
| omc | 10.7 | 3.7 | 12.3 |
| pure | 1.0 | 0.3 | 1.3 |
| superpower | 18.3 | 0.0 | 0.7 |
compound dispatches no sub-agents on bugfix or refactor, and bmad none on bugfix (1.3 on refactor, the same as the bare baseline). superpower fan-out is 18.3 on feature and 0 on bugfix. "Multi-agent" is task-conditional, not architectural.
| Tool | feature | bugfix | refactor |
|---|---|---|---|
| bmad | 6.0 | 2.7 | 4.7 |
| claudekit | 0.0 | 0.0 | 0.0 |
| compound | 0.0 | 0.0 | 0.0 |
| ecc | 0.0 | 0.0 | 0.0 |
| gstack | 0.0 | 0.0 | 0.0 |
| omc | 17.3 | 0.3 | 4.7 |
| pure | 0.0 | 0.0 | 0.0 |
| superpower | 1.7 | 0.0 | 0.0 |
"Setup tax": reads of the tool's own scaffolding during execution. omc dominates this metric (~17 per feature trial); bmad is the only other tool that re-reads its scaffolding on every task (2.7–6.0 per trial). Of the remaining six, five record zero reads on every task and superpower records 1.7 on feature only; they load their scaffolding once into the prompt and are done with it.
Mean per trial across the full n=5 cohort (24 (tool, task) cells × 5 trials = 120 trials). Each table is ordered by $ cost ascending so the row at the top is the cheapest setup for that task. Input tokens include cache-creation and cache-read volume — Anthropic prices cache reads at 0.1× input, so dollar cost is the better cross-tool axis than raw input volume. Score column is the canonical n=5 weighted-mean per task (opus×3, gpt54pro×2, others×1). The $/pt column is cost-efficiency: dollars per rubric point. Token columns use SI scale: K = thousand, M = million (so 15.9M = 15.9 million input tokens per trial).
| Tool | $ cost | Out tok | In tok | Score | $/pt |
|---|---|---|---|---|---|
| bmad | $38.92 | 72.6K | 15.9M | 141.3 | 0.275 |
| compound | $52.15 | 87.1K | 24.0M | 134.7 | 0.387 |
| gstack | $68.69 | 137.2K | 30.3M | 132.0 | 0.520 |
| pure | $70.98 | 101.5K | 36.0M | 143.1 | 0.496 |
| ecc | $118.23 | 132.2K | 57.6M | 153.3 | 0.771 |
| superpower | $127.51 | 223.6K | 47.8M | 140.2 | 0.910 |
| claudekit | $132.47 | 175.4K | 63.3M | 135.0 | 0.981 |
| omc | $417.92 | 422.9K | 148.0M | 139.5 | 2.996 |
bmad is the cheapest setup on feature (~$39/trial) and the most cost-efficient at $0.275/pt despite a mid-pack score. omc spends 10.7× as many dollars per trial as bmad for a comparable score, driven by heavy sub-agent fan-out (10.7 dispatches per trial, second only to superpower's 18.3) and ~17 setup-tax re-reads per trial (see Behavioral fingerprints above). ecc takes rank-1 by score while still coming in under $120 per trial.
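The $/pt column is simply cost divided by score. The sketch below recomputes it for the feature table from the rounded figures shown; small third-decimal differences from the published column (for example superpower) are expected if the published values were derived from unrounded scores, which is an assumption here.

```python
# tool: (mean $ cost per trial, weighted-mean score, published $/pt), feature task.
FEATURE_ROWS = {
    "bmad":       (38.92, 141.3, 0.275),
    "compound":   (52.15, 134.7, 0.387),
    "gstack":     (68.69, 132.0, 0.520),
    "pure":       (70.98, 143.1, 0.496),
    "ecc":        (118.23, 153.3, 0.771),
    "superpower": (127.51, 140.2, 0.910),
    "claudekit":  (132.47, 135.0, 0.981),
    "omc":        (417.92, 139.5, 2.996),
}

# Dollars per rubric point, recomputed from the rounded table values.
dollars_per_point = {tool: cost / score for tool, (cost, score, _) in FEATURE_ROWS.items()}

# How many times more omc spends per trial than bmad.
omc_vs_bmad = FEATURE_ROWS["omc"][0] / FEATURE_ROWS["bmad"][0]

for tool, value in sorted(dollars_per_point.items(), key=lambda kv: kv[1]):
    print(f"{tool:<11} {value:.3f}")
print(f"omc / bmad cost ratio: {omc_vs_bmad:.1f}x")
```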
| Tool | $ cost | Out tok | In tok | Score | $/pt |
|---|---|---|---|---|---|
| ecc | $18.27 | 27.6K | 7.5M | 172.3 | 0.106 |
| pure | $20.43 | 38.2K | 8.3M | 169.5 | 0.121 |
| bmad | $21.14 | 45.2K | 8.3M | 165.7 | 0.128 |
| compound | $22.41 | 35.9K | 9.7M | 166.2 | 0.135 |
| claudekit | $24.00 | 39.9K | 10.9M | 178.9 | 0.134 |
| superpower | $27.18 | 42.5K | 12.0M | 166.4 | 0.163 |
| gstack | $27.32 | 48.6K | 11.1M | 160.0 | 0.171 |
| omc | $59.97 | 85.0K | 17.7M | 164.8 | 0.364 |
Cost spread compresses on bugfix: the seven cheapest setups fit in $18–$27, all within ~1.5× of each other. ecc is both rank-2 by score and rank-1 by efficiency at $0.106/pt. omc is the lone outlier: 2.2× the cost of the next most expensive setup (about 2.6× the median trial cost) for a mid-pack score.
| Tool | $ cost | Out tok | In tok | Score | $/pt |
|---|---|---|---|---|---|
| ecc | $50.26 | 80.5K | 21.9M | 173.6 | 0.290 |
| pure | $52.14 | 109.5K | 23.6M | 180.2 | 0.289 |
| superpower | $54.02 | 103.7K | 19.8M | 177.6 | 0.304 |
| bmad | $56.03 | 114.7K | 22.1M | 177.7 | 0.315 |
| claudekit | $56.70 | 91.2K | 27.5M | 178.0 | 0.318 |
| compound | $61.45 | 101.3K | 30.6M | 174.4 | 0.352 |
| gstack | $95.87 | 147.4K | 45.8M | 144.9 | 0.662 |
| omc | $191.23 | 216.7K | 60.6M | 170.1 | 1.124 |
On refactor the top six tools cluster tightly ($50–$62, all under $0.36/pt). pure (no-addons baseline) is rank-1 by score at near-cheapest cost — the strongest "addons don't pay for themselves" signal in the corpus. gstack and omc separate into a high-cost tail; gstack also drops to rank-8 by score.
Read this column-by-column, not row-aligned across tables. The same tool can be cheap on one task and a high spender on another (compare omc's $/pt of 2.996 on feature vs 0.364 on bugfix — an 8× swing). Source: per-trial phase1-metrics.json (cost_usd Anthropic-billing-derived) aggregated by scripts/audit-sessions.py; scores from the per-task reports rendered below.
The repository's docs/ folder is organized by reader intent.
Four routes in, depending on what you want.
Clone → pick → tool run → judge → aggregate. Minimum viable reproduction path with the full command sequence.
Quickstart →
The canonical flow: tasks, tools, trials, judging, aggregation, and the cohort-symmetry rule. Plus the interaction protocol between operator and tool.
Pipeline reference →
Per-tool: upstream, version, mechanism (skills / hooks / sub-agents), exact invocation, per-task transcript notes, strengths, and failure modes. Grounded in the session logs.
Tool profiles →
The feature-cohort write-up: where the top cluster separates, where it ties, and what the session transcripts show about the planning/orchestration patterns that drove it.
Feature-cohort analysis →
Output tokens per score point and per line, joined from the session JSONL audit. ecc clears 25 tok/line; superpower's subagent skill burns 6,353 tok/pt with no measurable lift over the bare baseline.
Skill cost efficiency →
The scaffolding flow, the plan-mode-vs-native-planning decision matrix, the cohort-symmetry obligation, and two PR checklists.
Extending guide →
"Why is pure rank-1 on refactor?": step-by-step walkthroughs that re-compute headline claims from the committed artifacts.
Verification guide →
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging. Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes; any review that reads it during scoring is invalid by protocol.
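The relabeling step can be pictured with a small sketch. This is not the repository's script: the function name, the seed parameter, and the shuffle-based assignment are all assumptions made for illustration; only the NATO codenames and the sealed-mapping idea come from the text above.

```python
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"]
TOOLS = ["bmad", "claudekit", "compound", "ecc", "gstack", "omc", "pure", "superpower"]

def blind_labels(tools, seed):
    """Assign each tool a NATO codename in shuffled order (hypothetical helper).

    The returned label -> tool mapping is what would be sealed (for example
    written to .mapping-DO-NOT-OPEN.json) and not read until scoring finishes.
    """
    if len(tools) > len(NATO):
        raise ValueError("not enough codenames for the cohort")
    rng = random.Random(seed)   # local RNG so the global random state is untouched
    shuffled = list(tools)
    rng.shuffle(shuffled)
    return dict(zip(NATO, shuffled))

mapping = blind_labels(TOOLS, seed=42)
# Judges only ever see the keys (Alpha, Bravo, ...), never the values.
print(sorted(mapping))
```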
Claude Opus 4.7 (Anthropic), GPT-5.4 (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 is identical under both rules on every task, and top-3 is identical on bugfix but reorders on feature and at rank-3 on refactor under equal weighting (see PAPER §5).
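A minimal sketch of the two combination rules, assuming one mean score per judge for a given diff. The judge keys and the example scores are hypothetical; only the 3 / 2 / 1 / 1 / 1 weights come from the text.

```python
# Pre-registered weights from the text; the short judge keys are illustrative.
WEIGHTS = {"opus": 3, "gpt": 2, "grok": 1, "glm": 1, "mimo": 1}

def weighted_mean(per_judge_mean, weights=WEIGHTS):
    """Headline score: judge means combined under the 3/2/1/1/1 weighting."""
    total_weight = sum(weights[judge] for judge in per_judge_mean)
    return sum(weights[judge] * score for judge, score in per_judge_mean.items()) / total_weight

def equal_weight_mean(per_judge_mean):
    """Comparator score: every judge counts once."""
    return sum(per_judge_mean.values()) / len(per_judge_mean)

# Hypothetical per-judge means (out of 200) for one blinded diff.
example = {"opus": 150.0, "gpt": 140.0, "grok": 160.0, "glm": 130.0, "mimo": 170.0}
print(weighted_mean(example), equal_weight_mean(example))
```

Comparing the two outputs per tool is what shows whether a rank depends on the weighting choice.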
Trial inputs: the exact task PRD fed to every tool: feature (omitted from public release) · bugfix (omitted from public release) · refactor (omitted from public release), plus the per-tool prompt prefix in scripts/manual-bench.sh.
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Judge inputs: the verbatim request payload sent to each of the 5 judges per label per round, e.g. Alpha/round1/*-judge.json.request.json (omitted from public release), built from the judge-prompt template (omitted from public release) over the blinded diff. Per label: the diff the judges saw and all five judges' raw JSON outputs (with scores_pre_r1 snapshot for the R1 audit trail).
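The R1 audit trail can be illustrated with a small sketch of an idempotent override. The JSON shape (a `scores` dict keyed by rubric item) and the function name are assumptions for illustration; only the `scores_pre_r1` snapshot key is taken from the text.

```python
import copy

def apply_r1_override(judge_output, deterministic_scores):
    """Overwrite locked rubric items with auto-derived scores (hypothetical schema).

    The judge's original scores are snapshotted under scores_pre_r1 the first
    time the override runs and never touched again, so re-running is a no-op.
    """
    out = copy.deepcopy(judge_output)
    if "scores_pre_r1" not in out:
        out["scores_pre_r1"] = dict(out["scores"])   # first run only
    out["scores"].update(deterministic_scores)
    return out

# Hypothetical judge output and auto-metric-derived score for one locked item.
raw = {"scores": {"tsc": 4, "design": 8}}
once = apply_r1_override(raw, {"tsc": 10})
twice = apply_r1_override(once, {"tsc": 10})
print(once == twice)
```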
Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.
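One plausible reading of a cohort-symmetry check is that every (tool, task) cell must contain the same number of committed trials, so that no setup benefits from extra or dropped runs. The sketch below implements only that reading; it is an assumption about the audit's intent, not a description of what audit-cohort-symmetry.py actually does, and the function name is hypothetical.

```python
from collections import Counter

def asymmetric_cells(trial_cells, tools, tasks, expected):
    """Return {(tool, task): count} for every cell whose trial count != expected.

    trial_cells is an iterable with one (tool, task) pair per committed trial.
    Cells with no trials at all are reported with a count of 0.
    """
    counts = Counter(trial_cells)
    return {
        (tool, task): counts[(tool, task)]
        for tool in tools
        for task in tasks
        if counts[(tool, task)] != expected
    }

tools = ["bmad", "claudekit", "compound", "ecc", "gstack", "omc", "pure", "superpower"]
tasks = ["feature", "bugfix", "refactor"]

# A symmetric cohort: 5 trials in each of the 24 cells.
cohort = [(tool, task) for tool in tools for task in tasks for _ in range(5)]
violations = asymmetric_cells(cohort, tools, tasks, expected=5)
print(violations)
```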
The ranks above are point-estimate only. The strongest claims this design supports are the negative result (no rank-1 lead clears MDE on any task — every top-cluster gap is a statistical tie) and the calibration finding (Krippendorff α = 0.124 on feature — LLM judges fundamentally disagree on absolute scores under this rubric). Tasks, the 20-item rubric, and the judge panel were operator-iterative — only weights and the rerun protocol are pre-registered (see PAPER §4). The value is methodological: a published 1800-judgment corpus and an honest threats-to-validity audit, not a tool leaderboard.
Every per-task report now publishes Krippendorff α (feature 0.124 · bugfix 0.284 · refactor 0.626), the MDE for tool-vs-tool comparison (19.33 / 22.17 / 44.02 pts at the cohort's n=5, for feature / bugfix / refactor), and a per-judge z-normalized sensitivity column to verify rank stability under lenience normalization. Judges disagree on absolute scores due to lenience drift, even where orderings concur; refactor (0.626) sits just below the tentative band. These α values are an upper bound: they are computed on round-averaged scores, so true per-round agreement is lower. See any report's Power analysis and Inter-rater agreement sections, e.g. bugfix MDE. Computed by compute-krippendorff.py + compute-power-analysis.py (auto-run before each aggregation; outputs in results/).
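To make the MDE reading concrete, the sketch below checks point-estimate gaps from this page against the published per-task MDE values. It also includes a textbook normal-approximation formula for a two-sample MDE at α = 0.05 and 80% power; that formula is an assumption shown for orientation, and compute-power-analysis.py may use a different method.

```python
from math import sqrt
from statistics import NormalDist

# Published per-task MDE values (weighted points, cohort n=5).
PUBLISHED_MDE = {"feature": 19.33, "bugfix": 22.17, "refactor": 44.02}

def clears_mde(gap, task):
    """True when a tool-vs-tool gap exceeds the task's published MDE."""
    return abs(gap) > PUBLISHED_MDE[task]

def mde_two_sample(sigma, n, alpha=0.05, power=0.80):
    """Textbook two-sample MDE: (z_{1-alpha/2} + z_{power}) * sigma * sqrt(2/n).

    Assumes equal group sizes and a common standard deviation sigma.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sigma * sqrt(2 / n)

# Gaps quoted in the per-task observations on this page.
feature_rank1_lead = 153.30 - 143.13    # ecc over pure
feature_top_bottom = 153.30 - 131.98    # ecc over gstack
bugfix_lead_over_pure = 178.93 - 169.53 # claudekit over pure

print(clears_mde(feature_rank1_lead, "feature"),
      clears_mde(feature_top_bottom, "feature"),
      clears_mde(bugfix_lead_over_pure, "bugfix"))
```

Under this formula, n=5 gives an MDE of roughly 1.77 × σ, which is why only very large gaps separate at this sample size.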
These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.
claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2; see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
/v1/responses do not expose temperature/seed.
versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
versions.lock.json. Re-run for current-version claims.
Nothing is hidden. Run the three aggregation scripts and diff against the committed report: output should byte-match (seed 42, stdlib + numpy only, no network).
Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.
TASK=feature ./scripts/aggregate-results.sh
Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.
TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
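After running the scripts above, the byte-match claim can be checked by hashing the regenerated report against the committed copy. This is a minimal sketch: the helper names are hypothetical, and the caller supplies the two file paths (none are assumed here).

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def byte_match(regenerated, committed):
    """True when the two files are byte-for-byte identical."""
    return sha256_of(regenerated) == sha256_of(committed)
```

Usage would look like `byte_match(path_to_regenerated_report, path_to_committed_report)`; any `False` result means the regenerated numbers differ from what was published.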
Trial inputs: feature PRD (omitted from public release) · bugfix PRD (omitted from public release) · refactor PRD (omitted from public release).
Judge inputs: sample request payloads (omitted from public release) · prompt template (omitted from public release).
Reports: feature · bugfix · refactor. The per-task reports below are the authoritative source for every number on this page; they are rendered in full as on-site pages.
Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md