Open benchmark · 2026-05 snapshot

Eight setups. Three tasks. Five judges. No single winner.

A blind, multi-judge evaluation of the Claude Code ecosystem (plugins, skill packs, hook kits, and a no-addon baseline) on feature, bugfix, and refactor work in a real TypeScript monorepo. 1800 judgments: 5 trials and 75 judgments per (tool, task) cell. No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.

8
Claude Code setups (plugins, skill packs, hook kits, and a no-addon baseline), all on claude-opus-4-7
3
Tasks: feature · bugfix · refactor (TypeScript NX monorepo)
5
Judges (weighted): Opus 4.7 ×3 · GPT-5.4 ×2 · Grok-4.20 ×1 · GLM-5.1 ×1 · MiMo-2.5-pro ×1
1800
Blind-labeled judgments: 3 tasks × 8 tools × 5 trials × 5 judges × 3 rounds (N=75 per cell; every trial ran 3 rounds, so the cohort is symmetric)
120
Tool trials (8 tools × 5 trials × 3 tasks), every clone pinned to the task's base SHA
40
NATO-codename blind labels per task (8 tools × 5 trials) — diff scrubbed for tool-state directories, mapping sealed until aggregation
4 / 2 / 2
R1 mechanical-fact items locked per task from auto-metrics.json (feature locks 4: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5)
3 / 2 / 1
Judge weights pre-registered in versions.lock.json: Anthropic ×3 · OpenAI ×2 · xAI / Z.ai / Xiaomi ×1

How a trial flows to a score

Same PRD, same base commit, isolated worktrees produce the artifacts. Then judging runs in two parallel modes: a 5-judge panel scores each diff independently (canonical, weighted mean), and a comparative-rank sidecar has a single judge rank all 8 diffs head-to-head, run as two independent model lanes (validity probe). Their Spearman ρ measures whether the rank order survives a different judging regime.

① Trial generation · same setup, every tool
1 Setup · Worktree clone: pinned base SHA · isolated HOME · PRD prepared
2 Install · Tool config: skills · agents · hooks · MCP dropped into HOME
3 Execute · Vibecode mode: fire slash command · approve-only · tool implements the task
4 Collect · Capture metrics: auto-metrics · diff · session-logs · time · tokens
5 Archive · Commit artifact: committed under results/<tool>/t<N>/

8 tools × 5 trials × 3 tasks = 120 artifacts on disk — every commit fed into the two judging modes below
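
The counts quoted on this page all follow from five design constants. A minimal sketch of that arithmetic (the variable names are illustrative, not taken from the repository):

```python
# Design constants as stated on this page.
TOOLS, TASKS, TRIALS, JUDGES, ROUNDS = 8, 3, 5, 5, 3

tool_trials = TOOLS * TRIALS * TASKS            # artifacts on disk
labels_per_task = TOOLS * TRIALS                # blind labels per task
judgments_per_cell = TRIALS * JUDGES * ROUNDS   # one (tool, task) cell
judgments_per_task = judgments_per_cell * TOOLS
total_judgments = judgments_per_task * TASKS

print(tool_trials, labels_per_task, judgments_per_cell,
      judgments_per_task, total_judgments)
# prints: 120 40 75 600 1800
```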
② Judging · two parallel modes
Mode A · canonical

Panel · absolute scoring

Each blinded diff scored independently by 5 judges against a 20-item / 200-pt rubric. Weighted mean is the headline number.

20-item rubric · 5 trials × 5 judges × 3 rounds = 75 / cell · 600 / task
Mode B · sidecar

Comparative · relative ranking

All 8 blinded diffs ranked 1–8 head-to-head by a single judge in one prompt — run as two independent lanes (Opus 4.7 1M + GPT-5.4) and triangulated against the panel. Produces a relative order; never enters the weighted mean.

5 rounds × 5 trials = 25 cells / (task, model) · _comparative-eval/_aggregate.md (omitted from public release)
Convergent validity: Spearman ρ between the two modes (per task). Does the rank order survive a different judging regime? Cross-model ρ separates regime drift from vendor bias.
2 comparative lanes · n = 25 / task / lane · Spearman ρ across panel ↔ opus-comp ↔ gpt-comp · triangulation & per-tool Δ → (omitted from public release)
Task      panel ↔ opus  panel ↔ gpt  opus ↔ gpt  Reading
feature   +0.571        +0.833       +0.857      vendor bias: opus-comp diverges, gpt-comp tracks panel
bugfix    −0.405        +0.167       +0.667      mixed: both comp lanes diverge from panel, agree with each other (regime gap)
refactor  +0.310        +0.810       +0.500      vendor bias: opus-comp diverges, gpt-comp tracks panel
Headline implication: the two comparative lanes are internally consistent (opus ↔ gpt ρ is at least +0.500 on every task). Where comparative disagrees with the panel, the gap is concentrated in the opus-comp lane on feature and refactor (vendor bias); only bugfix is a true cross-vendor regime gap, where both lanes prefer surgical fixes over the panel's high-completeness picks. Read mid-pack ranks as an operational tie cluster; rank-1 / rank-8 anchors are the strongest claims. See PAPER §2.5 for the full table and tool-level Δ.
Comparative-only ranking parallel signal · NOT canonical

Built purely from the head-to-head lanes — the canonical benchmark ranking is the 5-judge panel weighted mean (per-task final-report.md); this never enters it. One judgment = one model ranking all 8 tools 1–8 in one prompt; per (task) = 50 pooled observations (Opus-1M + GPT-5.4, 25 cells each, equal weight). Score = mean pooled rank, lower = better. Overall = mean of the 3 per-task mean ranks.

#  Tool        Comp. mean rank  feature  bugfix  refactor
1  pure        3.03             3.32     3.86    1.92
2  claudekit   4.27             3.06     5.42    4.34
3  superpower  4.45             4.38     4.20    4.76
4  compound    4.48             5.84     2.62    4.98
5  bmad        4.71             5.64     4.10    4.38
6  ecc         4.79             2.54     5.72    6.10
7  omc         5.07             4.78     5.48    4.96
8  gstack      5.20             6.44     4.60    4.56
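
The overall column can be re-derived from the three per-task columns. The sketch below copies the per-task mean pooled ranks from the table and recomputes the overall mean and ordering. The column-sum check rests on one observation: each judgment ranks all 8 tools 1 to 8, so the mean ranks in a task column should sum to 36. The variable names are illustrative.

```python
from statistics import mean

# Per-task mean pooled ranks (feature, bugfix, refactor), copied from the table.
per_task_rank = {
    "pure":       (3.32, 3.86, 1.92),
    "claudekit":  (3.06, 5.42, 4.34),
    "superpower": (4.38, 4.20, 4.76),
    "compound":   (5.84, 2.62, 4.98),
    "bmad":       (5.64, 4.10, 4.38),
    "ecc":        (2.54, 5.72, 6.10),
    "omc":        (4.78, 5.48, 4.96),
    "gstack":     (6.44, 4.60, 4.56),
}

# Overall = mean of the three per-task mean ranks; lower is better.
overall = {tool: mean(r) for tool, r in per_task_rank.items()}
ordering = sorted(overall, key=overall.get)
for position, tool in enumerate(ordering, 1):
    print(position, tool, f"{overall[tool]:.2f}")

# Each column's mean ranks should sum to 1 + 2 + ... + 8 = 36.
column_sums = [round(sum(r[i] for r in per_task_rank.values()), 2)
               for i in range(3)]
print(column_sums)  # prints: [36.0, 36.0, 36.0]
```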

This ordering diverges from the canonical panel (panel per-task rank-1s: ecc / claudekit / pure). For example, ecc is feature-1 head-to-head but ranks last on both bugfix (5.72) and refactor (6.10), which pulls it to overall rank-6: the regime drift §2.5 quantifies. Per-task tables + σ + per-lane means: _comparative-ranking.md → (omitted from public release)
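
To put a number on that divergence for one task, the sketch below computes Spearman ρ between the panel's feature ranking and the pooled comparative feature ranking. It uses the tie-free formula ρ = 1 − 6Σd² / (n(n² − 1)). The panel scores are the rounded feature weighted means from the cost table on this page. The result is a panel ↔ pooled-comparative figure computed here for illustration; it is not one of the per-lane values reported in the convergent-validity table.

```python
def ranks(values, higher_is_better):
    """Map each tool to its 1-based rank; assumes no tied values."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {tool: i for i, tool in enumerate(ordered, 1)}

def spearman_rho(rank_a, rank_b):
    """Spearman rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((rank_a[t] - rank_b[t]) ** 2 for t in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Feature task: panel weighted means (higher is better).
panel_score = {"ecc": 153.3, "pure": 143.1, "bmad": 141.3, "superpower": 140.2,
               "omc": 139.5, "claudekit": 135.0, "compound": 134.7, "gstack": 132.0}
# Feature task: pooled comparative mean ranks (lower is better).
comp_mean_rank = {"ecc": 2.54, "claudekit": 3.06, "pure": 3.32, "superpower": 4.38,
                  "omc": 4.78, "bmad": 5.64, "compound": 5.84, "gstack": 6.44}

rho = spearman_rho(ranks(panel_score, True), ranks(comp_mean_rank, False))
print(round(rho, 3))  # prints: 0.69
```

Only three tools change position between the two rankings (pure, bmad, claudekit), which gives Σd² = 26 and ρ ≈ 0.69.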

Per-task score intervals — the canonical view

Rank-1 by task is the canonical claim of this benchmark. Each panel plots the weighted mean (200 max) with a mean ± standard-error envelope (N=75 judgments per cell: 5 trials × 5 judges × 3 rounds — every trial ran 3 rounds, the canonical run plus two added stability rounds, so 15 judgments per judge per cell). The dashed line is the cohort mean. Where the horizontal bars overlap, the pair should be read as a tie at this sample size.

Cross-task — informational only

z̄ is the equal-weight mean of per-task z-scores (each task's z computed against its 8-tool cohort mean / stdev). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline (see caveat 08); read the per-task panels above for the canonical claim. Shown here as a visual aid only; collapsing three tasks into one number masks the task-specific specialisation that is the main finding.
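
A minimal sketch of the per-task z computation, using the rounded feature weighted means from the cost table on this page. It assumes the population standard deviation (`statistics.pstdev`). That choice is an inference rather than something the page states: it reproduces the z = −1.28 quoted for gstack on feature, which the sample standard deviation does not. The helper names are hypothetical.

```python
from statistics import mean, pstdev

def cohort_z(scores):
    """z-score of each tool against its task cohort (population stdev assumed)."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {tool: (s - mu) / sigma for tool, s in scores.items()}

def z_bar(per_task_z, tool):
    """Equal-weight mean of one tool's per-task z-scores."""
    return mean(z[tool] for z in per_task_z)

feature_scores = {"bmad": 141.3, "compound": 134.7, "gstack": 132.0, "pure": 143.1,
                  "ecc": 153.3, "superpower": 140.2, "claudekit": 135.0, "omc": 139.5}

feature_z = cohort_z(feature_scores)
print(round(feature_z["gstack"], 2))  # prints: -1.28
print(round(feature_z["ecc"], 2))     # prints: 2.17
```

Passing the three per-task z dictionaries to `z_bar` would give the cross-task figure for one tool.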

Chips show per-task z-scores. ★ Orange = tool's best task · dashed = tool's worst task. Click a row for the tool's transcript-grounded profile: mechanism, invocation, observed behaviors, failure modes.

What we learned

No setup is top-2 on all three tasks. The right framing is "best by task", not "best overall". Three task-rooted observations; read the per-task reports for the full per-tool breakdown.

01 · Feature

ecc is point-estimate rank-1; the top cluster is a statistical tie.

On the greenfield Mode-2 CD Batch feature, ecc (153.30) leads pure (143.13, rank-2) and bmad (141.33, rank-3), but the rank-1 lead (10.17 pts) is within the 19.33-pt feature MDE — a statistical tie, like every per-task rank-1 lead. The only separation that clears MDE anywhere in the corpus is ecc − gstack on feature (21.3 > 19.33); gstack at 131.98 (z = −1.28) is the lowest. On the per-judge z-normalized sensitivity, ecc is the only setup cleanly outside the pack (z = +2.17 vs next +0.23).

Read the feature report →
02 · Bugfix

claudekit and ecc lead by point estimate; statistically tied.

On a near-maturity filter bugfix, claudekit (178.93) and ecc (172.31) sit ≈ 3–9 pts above the rest by point estimate. pure (baseline) lands rank-3 at 169.53. The formal MDE at the cohort's n=5 is 22.17 pts (bugfix) — claudekit's 9.4-pt lead over pure is well below detection threshold, so the strict "tools add no value over the bare CLI" null is not rejected at α=0.05 / 80% power. An earlier reading rejected it against a heuristic ~5-pt tie envelope; the power analysis retracts that claim.

Read the bugfix report →
03 · Refactor

pure (no addons) is rank-1.

On the aggregate-ownership refactor, the bare CLI takes rank-1 (180.19) over a 3-way tied cluster: claudekit (178.04), bmad (177.74), and superpower (177.56). The top-4 span is 2.6 weighted pts — well within the between-judge σ envelope — so this is "tools do not outperform baseline" rather than "pure dominates". Across the corpus pure is the only setup that lands top-3 on all three tasks (feature rank-2, bugfix rank-3, refactor rank-1) — the strongest counter-claim to the "you need addons" prior.

Read the refactor report →

Source artifacts per task — what the tool saw, what the judges saw, the aggregated report.

Task PRD Blind labels & judge requests Report Equal-weight
feature PRD → (omitted from public release) labels & requests → (omitted from public release) aggregated → equal-weight →
bugfix PRD → (omitted from public release) labels & requests → (omitted from public release) aggregated → equal-weight →
refactor PRD → (omitted from public release) labels & requests → (omitted from public release) aggregated → equal-weight →

Behavioral fingerprints (transcript-mined)

Mean across t1–t3 per (tool, task) cell, mined from the raw session JSONL. (t4–t5 session-audit re-run pending — the score panel above is already at n=5.) Surfaces the most surprising finding: most setups' multi-agent architecture either does not fire on this corpus, or fires inconsistently. Generated by scripts/audit-sessions.py from session-audit.

Sub-agent dispatches per trial
Tool        feature  bugfix  refactor
bmad        1.0      0.0     1.3
claudekit   2.3      0.7     0.3
compound    1.7      0.0     0.0
ecc         1.7      1.0     1.0
gstack      2.7      0.0     2.7
omc         10.7     3.7     12.3
pure        1.0      0.3     1.3
superpower  18.3     0.0     0.7

compound dispatches 0 sub-agents on bugfix and refactor; bmad dispatches 0 on bugfix but 1.3 on refactor. superpower fan-out is 18.3 on feature, 0 on bugfix. "Multi-agent" is task-conditional, not architectural.

Tool-config reads per trial ("setup tax")
Tool        feature  bugfix  refactor
bmad        6.0      2.7     4.7
claudekit   0.0      0.0     0.0
compound    0.0      0.0     0.0
ecc         0.0      0.0     0.0
gstack      0.0      0.0     0.0
omc         17.3     0.3     4.7
pure        0.0      0.0     0.0
superpower  1.7      0.0     0.0

"Setup tax": reads of the tool's own scaffolding during execution. omc dominates this metric (~17 per feature trial); bmad is the only other tool that re-reads its scaffolding on every task (2.7–6.0 per trial). Of the other six setups, superpower re-reads only on feature (1.7 per trial) and the remaining five record zero reads: whatever scaffolding they have is loaded once into the prompt and not read again.

Token spend & cost per trial — ranked most-efficient first

Mean per trial across the full n=5 cohort (24 (tool, task) cells × 5 trials = 120 trials). Each table is ordered by $ cost ascending so the row at the top is the cheapest setup for that task. Input tokens include cache-creation and cache-read volume — Anthropic prices cache reads at 0.1× input, so dollar cost is the better cross-tool axis than raw input volume. Score column is the canonical n=5 weighted-mean per task (opus×3, gpt54pro×2, others×1). The $/pt column is cost-efficiency: dollars per rubric point. Token columns use SI scale: K = thousand, M = million (so 15.9M = 15.9 million input tokens per trial).

feature — ranked by $ cost (n=5)
Tool        $ cost   Out tok  In tok  Score  $/pt
bmad        $38.92   72.6K    15.9M   141.3  0.275
compound    $52.15   87.1K    24.0M   134.7  0.387
gstack      $68.69   137.2K   30.3M   132.0  0.520
pure        $70.98   101.5K   36.0M   143.1  0.496
ecc         $118.23  132.2K   57.6M   153.3  0.771
superpower  $127.51  223.6K   47.8M   140.2  0.910
claudekit   $132.47  175.4K   63.3M   135.0  0.981
omc         $417.92  422.9K   148.0M  139.5  2.996

bmad is the cheapest setup on feature (~$39/trial) and the most cost-efficient at $0.275/pt despite a mid-pack score. omc spends 10.7× as many dollars per trial as bmad for a comparable score, driven by heavy sub-agent fan-out (10.7 dispatches per trial, second only to superpower's 18.3) and ~17 setup-tax re-reads per trial (see Behavioral fingerprints above). ecc takes rank-1 by score and still comes in under $120.

bugfix — ranked by $ cost (n=5)
Tool        $ cost  Out tok  In tok  Score  $/pt
ecc         $18.27  27.6K    7.5M    172.3  0.106
pure        $20.43  38.2K    8.3M    169.5  0.121
bmad        $21.14  45.2K    8.3M    165.7  0.128
compound    $22.41  35.9K    9.7M    166.2  0.135
claudekit   $24.00  39.9K    10.9M   178.9  0.134
superpower  $27.18  42.5K    12.0M   166.4  0.163
gstack      $27.32  48.6K    11.1M   160.0  0.171
omc         $59.97  85.0K    17.7M   164.8  0.364

Cost spread compresses on bugfix: the seven cheapest setups fit in $18–$27, all within ~1.5× of each other. ecc is both rank-2 by score and rank-1 by efficiency at $0.106/pt. omc is the lone outlier: 2.2× the cost of the next-most-expensive setup (about 2.6× the median trial cost) for a rank-7 score.

refactor — ranked by $ cost (n=5)
Tool        $ cost   Out tok  In tok  Score  $/pt
ecc         $50.26   80.5K    21.9M   173.6  0.290
pure        $52.14   109.5K   23.6M   180.2  0.289
superpower  $54.02   103.7K   19.8M   177.6  0.304
bmad        $56.03   114.7K   22.1M   177.7  0.315
claudekit   $56.70   91.2K    27.5M   178.0  0.318
compound    $61.45   101.3K   30.6M   174.4  0.352
gstack      $95.87   147.4K   45.8M   144.9  0.662
omc         $191.23  216.7K   60.6M   170.1  1.124

On refactor the top six tools cluster tightly ($50–$62, all under $0.36/pt). pure (no-addons baseline) is rank-1 by score at near-cheapest cost — the strongest "addons don't pay for themselves" signal in the corpus. gstack and omc separate into a high-cost tail; gstack also drops to rank-8 by score.

Read this column-by-column, not row-aligned across tables. The same tool can be cheap on one task and a high spender on another (compare omc's $/pt of 2.996 on feature vs 0.364 on bugfix — an 8× swing). Source: per-trial phase1-metrics.json (cost_usd Anthropic-billing-derived) aggregated by scripts/audit-sessions.py; scores from the per-task reports rendered below.
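
The $/pt column and the ratios quoted in this section can be re-derived from the cost and score columns. A small sketch using three rows copied from the feature and bugfix tables (the function name is illustrative):

```python
def dollars_per_point(cost_usd, score):
    """Cost-efficiency: dollars spent per rubric point earned."""
    return cost_usd / score

# (mean $ cost per trial, weighted-mean score), copied from the tables above.
feature = {"bmad": (38.92, 141.3), "omc": (417.92, 139.5)}
bugfix = {"omc": (59.97, 164.8)}

bmad_feature = dollars_per_point(*feature["bmad"])
omc_feature = dollars_per_point(*feature["omc"])
omc_bugfix = dollars_per_point(*bugfix["omc"])

print(round(bmad_feature, 3))                            # prints: 0.275
print(round(omc_feature, 3))                             # prints: 2.996
print(round(omc_bugfix, 3))                              # prints: 0.364
print(round(feature["omc"][0] / feature["bmad"][0], 1))  # prints: 10.7
print(round(omc_feature / omc_bugfix, 1))                # prints: 8.2
```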

Explore the docs

The repository's docs/ folder is organized by reader intent. Four routes in, depending on what you want.

How we kept the measurement honest

A benchmark is only as credible as its protocol. Four commitments you can audit in results/:

Blind evaluation

Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging. Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
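
One way such a relabeling could work is sketched below. This is an illustration under assumptions, not the repository's implementation: the function name, the seed handling, and the numeric suffix used once the 26 codenames run out (40 labels are needed per task) are all hypothetical.

```python
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
        "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
        "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
        "Victor", "Whiskey", "X-ray", "Yankee", "Zulu"]

def blind_labels(tools, trials_per_tool, seed):
    """Return a sealed mapping: codename -> (tool, trial number)."""
    artifacts = [(tool, trial) for tool in tools
                 for trial in range(1, trials_per_tool + 1)]
    random.Random(seed).shuffle(artifacts)  # decouple label order from tool order
    mapping = {}
    for i, artifact in enumerate(artifacts):
        word, cycle = NATO[i % len(NATO)], i // len(NATO)
        # Hypothetical scheme: reuse codenames with a numeric suffix after Zulu.
        label = word if cycle == 0 else f"{word}{cycle + 1}"
        mapping[label] = artifact
    return mapping

tools = ["bmad", "claudekit", "compound", "ecc", "gstack", "omc", "pure", "superpower"]
mapping = blind_labels(tools, trials_per_tool=5, seed=0)  # seed value is arbitrary
print(len(mapping))  # prints: 40
```

Judges would see only the keys of `mapping`; the values stay sealed until aggregation.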

Five-judge weighted panel

Claude Opus 4.7 (Anthropic), GPT-5.4 (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 is identical under both rules on every task, and top-3 is identical on bugfix but reorders on feature and at rank-3 on refactor under equal weighting (see PAPER §5).
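
A minimal sketch of the two combination rules for a single (tool, task) cell. The judge keys and the per-judge means are hypothetical illustration values; only the 3 / 2 / 1 / 1 / 1 weights come from this page.

```python
# Pre-registered weights; the dictionary keys are illustrative identifiers.
WEIGHTS = {"opus": 3, "gpt": 2, "grok": 1, "glm": 1, "mimo": 1}

def weighted_mean(per_judge_mean, weights):
    """Combine per-judge means under the given judge weights."""
    total_weight = sum(weights[j] for j in per_judge_mean)
    return sum(weights[j] * s for j, s in per_judge_mean.items()) / total_weight

def equal_weight_mean(per_judge_mean):
    """Comparator: every judge counts once."""
    return sum(per_judge_mean.values()) / len(per_judge_mean)

# Hypothetical per-judge means on the 200-pt rubric (not real benchmark data).
cell = {"opus": 170.0, "gpt": 150.0, "grok": 175.0, "glm": 180.0, "mimo": 185.0}

print(weighted_mean(cell, WEIGHTS))  # prints: 168.75
print(equal_weight_mean(cell))       # prints: 172.0
```

With a harsh double-weighted judge, as in this made-up cell, the weighted figure sits below the equal-weight one, which is how the two rules can reorder closely spaced tools.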

Everything committed

Trial inputs: the exact task PRD fed to every tool — feature (omitted from public release) · bugfix (omitted from public release) · refactor (omitted from public release) — plus per-tool prompt prefix in scripts/manual-bench.sh. Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates. Judge inputs: the verbatim request payload sent to each of the 5 judges per label per round — e.g. Alpha/round1/*-judge.json.request.json (omitted from public release) — built from the judge-prompt template (omitted from public release) over the blinded diff. Per label: the diff the judges saw and all five judges' raw JSON outputs (with scores_pre_r1 snapshot for the R1 audit trail).

Deterministic aggregation

Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.

Read this as a calibration study, not a leaderboard

The ranks above are point-estimate only. The strongest claims this design supports are the negative result (no rank-1 lead clears MDE on any task — every top-cluster gap is a statistical tie) and the calibration finding (Krippendorff α = 0.124 on feature — LLM judges fundamentally disagree on absolute scores under this rubric). Tasks, the 20-item rubric, and the judge panel were operator-iterative — only weights and the rerun protocol are pre-registered (see PAPER §4). The value is methodological: a published 1800-judgment corpus and an honest threats-to-validity audit, not a tool leaderboard.

Statistical honesty (added 2026-05)

Every per-task report now publishes three statistics. First, Krippendorff α (feature 0.124 · bugfix 0.284 · refactor 0.626): judges disagree on absolute scores due to lenience drift, even where orderings concur; refactor (0.626) sits just below the tentative band; these α are an upper bound, computed on round-averaged scores, so true per-round agreement is lower. Second, the MDE for tool-vs-tool comparison (19.33 / 22.17 / 44.02 pts at the cohort's n=5, for feature / bugfix / refactor). Third, a per-judge z-normalized sensitivity column to verify rank stability under lenience normalization. See any report's Power analysis and Inter-rater agreement sections, e.g. bugfix MDE. Computed by compute-krippendorff.py + compute-power-analysis.py (auto-run before each aggregation; outputs in results/).
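
The MDE figures can be applied directly to the score tables on this page. The sketch below lists every tool pair whose gap in weighted-mean score exceeds the task's MDE. It uses the rounded scores from the cost tables, so gaps are accurate to about 0.1 pt, and the function name is illustrative.

```python
from itertools import combinations

MDE = {"feature": 19.33, "bugfix": 22.17, "refactor": 44.02}

# Weighted-mean scores per task, copied from the cost tables on this page.
SCORES = {
    "feature": {"bmad": 141.3, "compound": 134.7, "gstack": 132.0, "pure": 143.1,
                "ecc": 153.3, "superpower": 140.2, "claudekit": 135.0, "omc": 139.5},
    "bugfix": {"ecc": 172.3, "pure": 169.5, "bmad": 165.7, "compound": 166.2,
               "claudekit": 178.9, "superpower": 166.4, "gstack": 160.0, "omc": 164.8},
    "refactor": {"ecc": 173.6, "pure": 180.2, "superpower": 177.6, "bmad": 177.7,
                 "claudekit": 178.0, "compound": 174.4, "gstack": 144.9, "omc": 170.1},
}

def gaps_clearing_mde(scores, mde):
    """Return (higher tool, lower tool, gap) for every pair with gap > MDE."""
    cleared = []
    for a, b in combinations(scores, 2):
        gap = abs(scores[a] - scores[b])
        if gap > mde:
            high, low = (a, b) if scores[a] > scores[b] else (b, a)
            cleared.append((high, low, round(gap, 1)))
    return cleared

results = {task: gaps_clearing_mde(SCORES[task], MDE[task]) for task in MDE}
for task, pairs in results.items():
    print(task, pairs)
# prints:
# feature [('ecc', 'gstack', 21.3)]
# bugfix []
# refactor []
```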

Caveats, in plain English

These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.

!
Approve-only "vibecode" execution. Trials are run by an operator who accepts whatever each tool proposes — no mid-flight steering, no plan rejection, no "try a different approach". The per-tool slash command is fired once and the operator only clicks through permission prompts. This measures autonomous one-shot capability under the pinned base model. Setups that depend on iterative human feedback (plan revision, rejecting a sub-task, mid-edit course correction) will rank lower here than in interactive pair-programming use. Read rankings as "best when the operator just keeps approving", not as effectiveness in a human-in-the-loop workflow.
01
Single codebase, single language — TypeScript. Don't assume the rankings carry to Python, Go, or Rust.
02
Single executor base model: claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
03
LLM judges diverge 32–42 pts (full panel spread). GPT-5.4 is consistently the harshest (19–27 pts below the per-task panel mean), MiMo-2.5-pro the most lenient. The 5-judge weighted panel and the within / between σ split are the intended mitigations — they expose drift rather than hide it.
04
Pure (no addons) is rank-1 on refactor, rank-3 on bugfix, and rank-2 on feature — top-3 on every task in this corpus. The per-task reports now compute the MDE at the cohort's actual n=5 trials per arm: 19.33 pts (feature) · 22.17 pts (bugfix) · 44.02 pts (refactor) (σ_pool 9.72 / 11.14 / 22.13). Counter-intuitively the n=3 → n=5 expansion did not cut MDE the way 1/√n predicts — σ_pool rose on every task as more trials exposed true trial-to-trial variance (feature 7.87→9.72, bugfix 9.99→11.14, refactor 6.07→22.13), driven on refactor by gstack's trial-4 refactor diff scoring ≈36/200 against ~178 on its other four (a mechanically clean run, valid under the pre-registered no-selective-rerun rule). Every rank-1 lead falls below MDE (feature 10.17, bugfix 6.62, refactor 2.15); an earlier reading rejected the strict null on bugfix via claudekit's lead over pure against a heuristic ~5-pt tie envelope — it does not survive the formal power analysis (9.4 ≪ 22.17). The only separation that clears MDE anywhere in the 1800-judgment corpus is ecc − gstack on feature (≈21.3 > 19.33); no separation clears MDE on bugfix or refactor. Treat all top-cluster rank-1 claims as "point-estimate first, statistically tied"; use cost, speed, and DX as the real differentiator. See the next-cohort improvement plan.
05
R1 mechanical-fact override is post-hoc. Deterministic items are rewritten from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
06
Judge weights pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 scheme reflects operator trust in the Anthropic and OpenAI judges; equal-weight aggregations are emitted alongside as sensitivity. Rank-1 is identical under both rules on every task; top-3 is identical on bugfix, reorders on feature (weighted ecc / pure / bmad → equal-weight ecc / bmad / pure), and swaps at rank-3 on refactor (weighted bmad → equal-weight superpower).
07
Self-preference is not identified by this design. Every executor uses a Claude base model, so judge-family favoritism cannot be isolated. A proper audit needs a non-Anthropic executor as control.
08
Cross-task synthesis is informational only. A single cross-task leaderboard is sensitive to weighting and noisy at this sample size — read the per-task reports together. The leaderboard above is a visual aid, not a ranking.
09
Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go); Claude CLI and OpenAI /v1/responses do not expose temperature/seed.
10
Not preregistered. Tasks, rubric, judge panel, weight scheme, and R1 lock list were chosen iteratively. The weight scheme is committed to versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
11
Tool-version snapshot, 2026-05. Versions captured in versions.lock.json. Re-run for current-version claims.

Reproduce every number in one command

Nothing is hidden. Run the three scripts below and diff against the committed report; output should byte-match (seed 42, stdlib + numpy only, no network).

Per-task aggregation

Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.

TASK=feature ./scripts/aggregate-results.sh

R1 mechanical-fact override

Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.

TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
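
The idempotence and snapshot properties described above can be illustrated with a small sketch. This is not the repository's script: the `scores` key, the item ids, and the function name are assumptions made for the example; only the `scores_pre_r1` field name comes from this page.

```python
import copy

def apply_r1_override(judge_output, locked_items):
    """Overwrite locked rubric items; keep the first pre-override snapshot."""
    result = copy.deepcopy(judge_output)
    # Snapshot only if absent, so a second run never overwrites the original scores.
    result.setdefault("scores_pre_r1", dict(result["scores"]))
    result["scores"].update(locked_items)
    return result

# Hypothetical judge output and mechanically derived scores for two locked items.
judge_output = {"scores": {"item_01": 8, "item_02": 6, "item_03": 10}}
locked = {"item_01": 10, "item_02": 0}

once = apply_r1_override(judge_output, locked)
twice = apply_r1_override(once, locked)

print(once["scores"])         # prints: {'item_01': 10, 'item_02': 0, 'item_03': 10}
print(once["scores_pre_r1"])  # prints: {'item_01': 8, 'item_02': 6, 'item_03': 10}
print(once == twice)          # prints: True
```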

Cohort symmetry

Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.

python3 scripts/audit-cohort-symmetry.py
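
A sketch of the kind of check a symmetry audit could perform, assuming each judgment can be reduced to a (tool, task, trial, judge, round) tuple. The record shape, function name, and placeholder tool names are hypothetical; the actual script's checks are not described on this page beyond its purpose.

```python
from collections import Counter
from itertools import product

def symmetry_violations(judgments, tools, tasks, expected_per_cell):
    """List duplicated judgments and (tool, task) cells with the wrong count."""
    seen = Counter(judgments)
    per_cell = Counter((tool, task) for tool, task, *_ in seen)
    problems = [f"duplicate judgment: {rec}" for rec, n in seen.items() if n > 1]
    for cell in product(tools, tasks):
        if per_cell[cell] != expected_per_cell:
            problems.append(
                f"cell {cell}: {per_cell[cell]} distinct judgments, "
                f"expected {expected_per_cell}")
    return problems

tools = [f"tool{i}" for i in range(1, 9)]  # placeholder names
tasks = ["feature", "bugfix", "refactor"]
# Full symmetric grid: 5 trials x 5 judges x 3 rounds in every cell.
full = list(product(tools, tasks, range(1, 6), range(1, 6), range(1, 4)))

print(len(full), symmetry_violations(full, tools, tasks, 75))  # prints: 1800 []
short = symmetry_violations(full[:-1], tools, tasks, 75)       # drop one judgment
print(len(short))                                              # prints: 1
```

A command-line wrapper could exit non-zero whenever the returned list is non-empty.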

Open every artifact

Trial inputs: feature PRD (omitted from public release) · bugfix PRD (omitted from public release) · refactor PRD (omitted from public release). Judge inputs: sample request payloads (omitted from public release) · prompt template (omitted from public release). Reports: feature · bugfix · refactor. The per-task reports below are the authoritative source for every number on this page; they are rendered in full as on-site pages.

browse the rendered reports below ↓

Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md