Open benchmark · 2026-05 snapshot
Eight Claude Code setups, three tasks, five judges — and task-specific specialisation.
A blind, multi-judge evaluation of the Claude Code ecosystem — plugins, skill packs, hook kits, and a no-addon baseline — on feature, bugfix, and refactor work in a real TypeScript monorepo. 360 judgments. No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.
The cross-task leaderboard
z̄ is the equal-weight mean of per-task z-scores, each task's z computed against its 8-tool cohort mean and stdev (sketched below). The cross-task z̄ ordering is shown only as a visual aid, not as a headline claim; rank-1 by task, in the per-task panels below, is the canonical result.
Chips show per-task z-scores. Orange = tool's best task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.
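As a concrete reading of the z̄ definition above, here is a minimal sketch, assuming the per-task weighted means are already collected into an 8 × 3 array; the values are placeholders, not the published scores.

import numpy as np

# scores[i, j] = weighted mean of tool i on task j (8 tools x 3 tasks).
# Placeholder numbers; the published values live in each task's final-report.md.
rng = np.random.default_rng(42)
scores = rng.uniform(120, 185, size=(8, 3))

# Per-task z: standardise each column against its own 8-tool cohort
# (population stdev assumed here; the aggregation script defines the real choice).
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Cross-task z-bar: equal-weight mean of the three per-task z-scores.
z_bar = z.mean(axis=1)

# Ordering by z-bar is the visual aid only; rank-1 by task stays the canonical claim.
order = np.argsort(-z_bar)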
Per-task score intervals
Each panel plots the weighted mean (200 max) with a mean ± standard-error envelope (N=15 judgments per cell: 3 trials × 5 judges). The dashed line is the cohort mean. Where the horizontal bars overlap, the pair should be read as a tie at this sample size.
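A minimal sketch of one cell's envelope, assuming its 15 judgments are available as a flat array; the values are placeholders, and the plain mean here stands in for the published weighted mean described under the judging protocol below.

import numpy as np

# One (tool, task) cell: 3 trials x 5 judges = 15 judgments on the 200-point scale.
# Placeholder values; the committed per-judge JSON under results/ holds the real ones.
judgments = np.array([172, 168, 175, 181, 170,
                      177, 165, 179, 183, 174,
                      169, 171, 176, 180, 173], dtype=float)

mean = judgments.mean()                                # unweighted stand-in for the point estimate
se = judgments.std(ddof=1) / np.sqrt(judgments.size)   # standard error of the mean

print(f"{mean:.2f} ± {se:.2f}")  # plotted as the mean ± SE envelope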
What we learned
No setup is top-2 on all three tasks. The right framing is "best by task", not "best overall". Three task-rooted observations; read the per-task reports for the full per-tool breakdown.
Feature work: ecc wins, gstack is the only outlier.
On the greenfield Mode-2 CD Batch feature, ecc (149.17) leads compound (144.58) and pure (144.25) by ~5 weighted pts — within the between-judge σ envelope. gstack at 121.17 is the only setup cleanly below the cohort; every other tool is within 14 pts of rank-1.
Read the feature report →
Bugfix: claudekit and ecc separate; pure is rank-3.
On a near-maturity filter bugfix, claudekit (184.33) and ecc (181.38) sit ≈ 7 pts above the rest. pure (baseline) lands rank-3 at 175.00 — the null "tools add no value over the bare CLI" is not rejected on this task.
Read the bugfix report →
Refactor: pure (no addons) is rank-1.
On the aggregate-ownership refactor, the bare CLI takes rank-1 (182.46) over bmad (181.29) and superpower (180.04). The top-5 span is 4.2 weighted pts — within the between-judge σ envelope — so this is "tools do not outperform baseline" rather than "pure dominates".
Read the refactor report →
Explore the docs
The repository's docs/ folder is organized by reader intent.
Four routes in, depending on what you want.
Run one trial in ~10 minutes
Clone → pick → tool run → judge → aggregate. Minimum viable reproduction path with the full command sequence.
Quickstart →
Pipeline end-to-end
The canonical flow: tasks, tools, trials, judging, aggregation, and the cohort-symmetry rule. Plus the interaction protocol between operator and tool.
Pipeline reference →
Eight setups, profiled
Per-tool: upstream, version, mechanism (skills / hooks / sub-agents), exact invocation, per-task transcript notes, strengths, and failure modes. Grounded in the session logs.
Tool profiles →
What did we learn?
The feature-cohort write-up: where the top cluster separates, where it ties, and what the session transcripts show about the planning/orchestration patterns that drove it.
Feature-cohort analysis →
Add a tool or judge
The scaffolding flow, the plan-mode-vs-native-planning decision matrix, the cohort-symmetry obligation, and two PR checklists.
Extending guide →
Reproduce a specific claim
"Why is pure rank-1 on refactor?" — step-by-step walkthroughs that re-compute headline claims from the committed artifacts.
Verification guide →
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging.
Markdown plan files are stripped. The label-to-tool mapping
(.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review
that reads it during scoring is invalid by protocol.
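A minimal sketch of that relabeling step, assuming one diff directory per tool; the directory layout, helper name, and the markdown-stripping rule are hypothetical, and only .mapping-DO-NOT-OPEN.json and the NATO codenames come from the protocol itself.

import json, random, shutil
from pathlib import Path

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"]

def blind_relabel(tool_dirs, out_dir="results/_blind-eval", seed=42):
    """Copy each tool's diff under a NATO codename and seal the mapping (sketch only)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    labels = NATO[:len(tool_dirs)]
    random.Random(seed).shuffle(labels)

    mapping = {}
    for label, src in zip(labels, tool_dirs):
        dst = out / label
        # Stripping *.md approximates "markdown plan files are stripped".
        shutil.copytree(src, dst, ignore=shutil.ignore_patterns("*.md"))
        mapping[label] = str(src)

    # Sealed until scoring finishes; reading it during scoring invalidates the review.
    (out / ".mapping-DO-NOT-OPEN.json").write_text(json.dumps(mapping, indent=2))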
Five-judge weighted panel
Claude Opus 4.7 (Anthropic), GPT-5.4-pro (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro
(Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the
pre-registered 3 / 2 / 1 / 1 / 1 weighting in
versions.lock.json. An equal-weight comparator is emitted
alongside every report; rank-1 and top-3 are identical under both rules on every task in this corpus.
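A minimal sketch of that combination, assuming the five per-judge means for one blind label are already computed; the numbers are placeholders and the real weights are read from versions.lock.json.

import numpy as np

# Per-judge mean scores for one blind label, in judge order:
# Opus 4.7, GPT-5.4-pro, Grok-4.20, GLM-5.1, MiMo-2.5-pro. Placeholder values.
judge_means = np.array([176.0, 171.5, 168.0, 173.0, 170.5])

weights = np.array([3, 2, 1, 1, 1])                  # pre-registered 3/2/1/1/1 weighting
weighted = np.average(judge_means, weights=weights)  # headline score
equal = judge_means.mean()                           # equal-weight comparator

print(f"weighted={weighted:.2f}  equal-weight={equal:.2f}")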
Everything committed
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt,
wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Per label: the diff the judges saw and all five judges' raw JSON outputs (with scores_pre_r1 snapshot for the R1 audit trail).
Deterministic aggregation
Two scripts re-generate every number: aggregate-results.sh
(R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and
audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.
Caveats, in plain English
These limitations are why we publish all artifacts and avoid citing rank positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.
All trials were run with claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
pure (no addons) is top-3 on bugfix and rank-1 on refactor. Use cost, speed, and DX as the real differentiator — not the score column.
The R1 override rewrites deterministic rubric items from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
Judge endpoints served through /v1/responses do not expose temperature/seed.
The judge weighting was locked in versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
This is a 2026-05 snapshot with tool and judge versions pinned in versions.lock.json. Re-run for current-version claims.
Re-derive every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Per-task aggregation
Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.
TASK=feature ./scripts/aggregate-results.sh
R1 mechanical-fact override
Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.
TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
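The script's on-disk formats are not documented on this page, so the sketch below only illustrates the contract stated above — a per-task lock list, idempotency, and the scores_pre_r1 snapshot — with hypothetical file layouts and field names.

import json
from pathlib import Path

# Hypothetical lock list for the feature task; the real list lives in PAPER §1.5
# and in apply-r1-override.py itself.
LOCKED_ITEMS = {"tsc_errors", "eslint_errors", "core_test_failures", "lines_removed"}

def apply_r1_override(label_dir: str) -> None:
    """Overwrite judge scores for mechanically measurable rubric items (sketch only)."""
    label = Path(label_dir)
    auto = json.loads((label / "auto-metrics.json").read_text())

    for judge_file in sorted(label.glob("judge-*.json")):   # hypothetical per-judge files
        doc = json.loads(judge_file.read_text())
        # Idempotent: the pre-override snapshot is written once and never overwritten.
        doc.setdefault("scores_pre_r1", dict(doc["scores"]))
        for item in LOCKED_ITEMS & doc["scores"].keys():
            # A real implementation maps the measured value onto the rubric's point
            # scale; here the measurement simply replaces the judge's opinion.
            doc["scores"][item] = auto[item]
        judge_file.write_text(json.dumps(doc, indent=2))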
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
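As a sketch of what the symmetry rule checks, assuming a results/<task>/<tool>/trial-*/ layout; the layout is hypothetical, and the committed audit script defines the real structure and the hard-violation rules.

import sys
from collections import Counter
from pathlib import Path

def audit(task_dir: str) -> bool:
    """Every tool in a task cohort must have the same number of committed trials."""
    counts = Counter()
    for tool_dir in Path(task_dir).iterdir():
        if tool_dir.is_dir():
            counts[tool_dir.name] = len(list(tool_dir.glob("trial-*")))
    if len(set(counts.values())) > 1:
        print(f"asymmetric trial counts in {task_dir}: {dict(counts)}")
        return False
    return True

if __name__ == "__main__":
    ok = all(audit(f"results/{t}") for t in ("feature", "bugfix", "refactor"))
    sys.exit(0 if ok else 1)   # non-zero on violation, mirroring the real audit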
Open every artifact
Per-task final-report.md, per-label blind-eval dirs, per-trial session logs.
open results/final-report.md
Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md