Open benchmark · 2026-05 snapshot
Eight setups. Three tasks. Five judges. No single winner.
A blind, multi-judge evaluation of the Claude Code ecosystem — plugins, skill packs, hook kits, and a no-addon baseline — on feature, bugfix, and refactor work in a real TypeScript monorepo. 1,080 judgments (3 rounds per artifact). No setup is top-2 on all three tasks, and every prompt, transcript, diff, and judge score is checked in for independent re-analysis.
Per-task score intervals — the canonical view
Rank-1 by task is the canonical claim of this benchmark. Each panel plots the weighted mean (200 max) with a mean ± standard-error envelope (N=45 judgments per cell: 3 trials × 5 judges × 3 rounds). The dashed line is the cohort mean. Where the horizontal bars overlap, the pair should be read as a tie at this sample size.
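For concreteness, a minimal sketch of the interval and the overlap-tie reading — the flat list of 45 raw judgment scores per cell and the unweighted mean are simplifying assumptions; the published envelope comes from the judge-weighted pipeline:

```python
import math
import statistics

def envelope(judgments: list[float]) -> tuple[float, float]:
    """Mean +/- one standard error for one (tool, task) cell (N = 45)."""
    mean = statistics.mean(judgments)
    se = statistics.stdev(judgments) / math.sqrt(len(judgments))
    return mean - se, mean + se

def reads_as_tie(cell_a: list[float], cell_b: list[float]) -> bool:
    """Two setups read as a tie at this sample size when their envelopes overlap."""
    lo_a, hi_a = envelope(cell_a)
    lo_b, hi_b = envelope(cell_b)
    return lo_a <= hi_b and lo_b <= hi_a
```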
Cross-task z̄ — informational only
z̄ is the equal-weight mean of per-task z-scores (each task's z computed against its 8-tool cohort mean / stdev). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline (see caveat 08) — read the per-task panels above for the canonical claim. Shown here as a visual aid only; collapsing three tasks into one number masks the task-specific specialisation that is the main finding.
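In sketch form (the sample stdev below is an assumption; the pipeline may use the population form):

```python
import statistics

def zbar(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[task][tool] = weighted mean; returns each tool's equal-weight
    mean of per-task z-scores against the 8-tool cohort."""
    per_tool: dict[str, list[float]] = {}
    for by_tool in scores.values():
        mu = statistics.mean(by_tool.values())    # cohort mean for this task
        sd = statistics.stdev(by_tool.values())   # cohort stdev for this task
        for tool, s in by_tool.items():
            per_tool.setdefault(tool, []).append((s - mu) / sd)
    return {tool: statistics.mean(zs) for tool, zs in per_tool.items()}
```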
Chips show per-task z-scores. ★ orange = tool's best task · dashed = tool's worst task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.
What we learned
No setup is top-2 on all three tasks. The right framing is "best by task", not "best overall". Three task-rooted observations; read the per-task reports for the full per-tool breakdown.
ecc leads; gstack is the only outlier.
On the greenfield Mode-2 CD Batch feature, ecc (148.97) leads superpower (143.83) and pure (142.17, rank-3) by ~5–7 weighted pts — within the between-judge σ envelope. gstack at 121.51 is the only setup cleanly below the cohort (z = −2.0); every non-gstack tool sits within 17 pts of rank-1.
Read the feature report →
claudekit and ecc separate; pure is rank-3.
On a near-maturity filter bugfix, claudekit (183.67) and ecc (179.08) sit ≈ 4–9 pts above the rest. pure (baseline) lands rank-3 at 175.10 — close to the pack, but claudekit's 8.6-pt lead over pure sits outside the operational ~5-pt tie envelope (PAPER §4), so the strict "tools add no value over the bare CLI" null is rejected by the claudekit–pure pair on this task.
Read the bugfix report →
pure (no addons) is rank-1.
On the aggregate-ownership refactor, the bare CLI takes rank-1 (180.97) over superpower (179.43) and bmad (178.96). The top-5 span is 3.6 weighted pts — well within the between-judge σ envelope — so this is "tools do not outperform baseline" rather than "pure dominates". Across the corpus pure is the only setup that lands top-3 on all three tasks (feature rank-3, bugfix rank-3, refactor rank-1) — the strongest counter-claim to the "you need addons" prior.
Read the refactor report →
Source artifacts per task — what the tool saw, what the judges saw, the aggregated report.
| Task | PRD | Blind labels & judge requests | Report | Equal-weight |
|---|---|---|---|---|
| feature | PRD → | labels & requests → | aggregated → | equal-weight → |
| bugfix | PRD → | labels & requests → | aggregated → | equal-weight → |
| refactor | PRD → | labels & requests → | aggregated → | equal-weight → |
Behavioral fingerprints (transcript-mined)
Mean across 3 trials per (tool, task) cell, mined from the raw session JSONL.
Surfaces the most surprising finding: most setups' multi-agent architecture either does not fire on this corpus or fires inconsistently. Generated by scripts/audit-sessions.py from results/_audits/session-audit.md.
Sub-agent dispatches per (tool, task) cell — mean per trial

| Tool | feature | bugfix | refactor |
|---|---|---|---|
| bmad | 1.0 | 0.0 | 1.3 |
| claudekit | 2.3 | 0.7 | 0.3 |
| compound | 1.7 | 0.0 | 0.0 |
| ecc | 1.7 | 1.0 | 1.0 |
| gstack | 2.7 | 0.0 | 2.7 |
| omc | 10.7 | 3.7 | 12.3 |
| pure | 1.0 | 0.3 | 1.3 |
| superpower | 18.3 | 0.0 | 0.7 |
compound and bmad dispatch ~0 sub-agents on bugfix/refactor. superpower fan-out is 18.3 on feature, ~0 on bugfix. "Multi-agent" is task-conditional, not architectural.
Scaffolding self-reads per (tool, task) cell — mean per trial

| Tool | feature | bugfix | refactor |
|---|---|---|---|
| bmad | 6.0 | 2.7 | 4.7 |
| claudekit | 0.0 | 0.0 | 0.0 |
| compound | 0.0 | 0.0 | 0.0 |
| ecc | 0.0 | 0.0 | 0.0 |
| gstack | 0.0 | 0.0 | 0.0 |
| omc | 17.3 | 0.3 | 4.7 |
| pure | 0.0 | 0.0 | 0.0 |
| superpower | 1.7 | 0.0 | 0.0 |
"Setup tax": reads of the tool's own scaffolding during execution. omc dominates this metric (~17 per feature trial); bmad is the only other tool that re-reads its scaffolding on every task (2.7–6.0 per trial). The other six tools load their scaffolding once into the prompt and are done with it.
Explore the docs
The repository's docs/ folder is organized by reader intent.
Seven routes in, depending on what you want.
Run one trial in ~10 minutes
Clone → pick a task → run the tool → judge → aggregate. The minimum viable reproduction path, with the full command sequence.
Quickstart →
Pipeline end-to-end
The canonical flow: tasks, tools, trials, judging, aggregation, and the cohort-symmetry rule. Plus the interaction protocol between operator and tool.
Pipeline reference →
Eight setups, profiled
Per-tool: upstream, version, mechanism (skills / hooks / sub-agents), exact invocation, per-task transcript notes, strengths, and failure modes. Grounded in the session logs.
Tool profiles →
What did we learn?
The feature-cohort write-up: where the top cluster separates, where it ties, and what the session transcripts show about the planning/orchestration patterns that drove it.
Feature-cohort analysis →
What does each skill cost?
Output tokens per score point and per line, joined from the session JSONL audit. ecc clears 25 tok/line; superpower's subagent skill burns 6,091 tok/pt with no measurable lift over the bare baseline.
Skill cost efficiency →
Add a tool or judge
The scaffolding flow, the plan-mode-vs-native-planning decision matrix, the cohort-symmetry obligation, and two PR checklists.
Extending guide →
Reproduce a specific claim
"Why is pure rank-1 on refactor?" — step-by-step walkthroughs that re-compute headline claims from the committed artifacts.
Verification guide →
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging.
Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
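A sketch of the relabeling step under a seeded shuffle — the exact sealing procedure is the repo's, not this snippet's; only the unopened-until-scored contract is load-bearing:

```python
import json
import random
from pathlib import Path

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"]

def seal_mapping(tools: list[str], out_dir: Path, seed: int = 42) -> None:
    """Assign each setup a codename; write the sealed label-to-tool mapping."""
    labels = NATO[: len(tools)]
    random.Random(seed).shuffle(labels)
    mapping = dict(zip(labels, tools))    # judges only ever see the codenames
    (out_dir / ".mapping-DO-NOT-OPEN.json").write_text(json.dumps(mapping, indent=2))
```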
Five-judge weighted panel
Claude Opus 4.7 (Anthropic), GPT-5.4-pro (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 is identical under both rules on every task, and top-3 is identical on bugfix and refactor (feature top-3 reorders under equal weighting — see PAPER §5).
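The combination rule itself is small enough to show inline; the judge keys below are placeholders for the identifiers in versions.lock.json:

```python
JUDGE_WEIGHTS = {"opus": 3, "gpt": 2, "grok": 1, "glm": 1, "mimo": 1}  # placeholder keys

def combine(per_judge_means: dict[str, float]) -> tuple[float, float]:
    """Weighted panel mean under the pre-registered 3/2/1/1/1 rule, plus the
    equal-weight comparator emitted alongside every report."""
    total = sum(JUDGE_WEIGHTS.values())
    weighted = sum(JUDGE_WEIGHTS[j] * m for j, m in per_judge_means.items()) / total
    equal = sum(per_judge_means.values()) / len(per_judge_means)
    return weighted, equal
```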
Everything committed
Trial inputs: the exact task PRD fed to every tool — feature · bugfix · refactor — plus the per-tool prompt prefix in scripts/manual-bench.sh.
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Judge inputs: the verbatim request payload sent to each of the 5 judges per label per round — e.g. Alpha/round1/*-judge.json.request.json — built from the judge-prompt template over the blinded diff. Per label: the diff the judges saw and all five judges' raw JSON outputs (with a scores_pre_r1 snapshot for the R1 audit trail).
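Because the raw outputs are committed, a label's score can be recomputed without touching the pipeline. A sketch, with the glob pattern and the rubric JSON shape as assumptions:

```python
import json
from glob import glob
from statistics import mean

def label_mean(label: str, round_: str = "round1") -> float:
    """Recompute one blinded label's unweighted mean from raw judge JSON."""
    totals = []
    for path in glob(f"results/_blind-eval/{label}/{round_}/*-judge.json"):
        with open(path) as f:
            judge_out = json.load(f)
        totals.append(sum(item["score"] for item in judge_out["rubric"]))  # assumed shape
    return mean(totals)
```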
Deterministic aggregation
Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.
Caveats, in plain English
These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.
- All trials ran on a single driver model, claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
- Deterministic rubric items are overridden from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
- Judge requests via /v1/responses do not expose temperature/seed.
- Judge weights were locked in versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
- Tool versions are pinned in versions.lock.json. Re-run for current-version claims.

Reproduce every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Per-task aggregation
Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.
TASK=feature ./scripts/aggregate-results.sh
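One way to check the byte-match claim after re-running into a scratch checkout (the /tmp/rerun path is illustrative):

```python
import hashlib
from pathlib import Path

def digest(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# The regenerated report must hash identically to the committed one.
assert digest("results/final-report.md") == digest("/tmp/rerun/results/final-report.md")
```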
R1 mechanical-fact override
Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.
TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
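The idempotency contract, sketched — the judge's value is snapshotted into scores_pre_r1 on first touch only, so re-runs never erase the audit trail (dict shapes assumed; the committed script is authoritative):

```python
def apply_override(label_scores: dict, auto_metrics: dict) -> dict:
    """Overwrite locked rubric items with mechanical facts from auto-metrics.json."""
    out = dict(label_scores)
    pre = out.setdefault("scores_pre_r1", {})   # survives re-runs unchanged
    for item, fact in auto_metrics.items():     # e.g. tsc / eslint / core-test failures
        pre.setdefault(item, out.get(item))     # snapshot the judge's value once
        out[item] = fact                        # deterministic fact wins
    return out
```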
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
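The rule it enforces, in sketch form (the results layout below is an assumption): every (tool, task) cell must carry exactly the same number of committed trials.

```python
import sys
from collections import Counter
from glob import glob
from pathlib import Path

def audit_symmetry(tools: list[str], tasks: list[str], trials: int = 3) -> None:
    """Exit non-zero if any (tool, task) cell deviates from the fixed trial count."""
    counts: Counter = Counter()
    for path in glob("results/*/*/session-logs/*.jsonl"):  # assumed results/<tool>/<task>/... layout
        tool, task = Path(path).parts[1:3]
        counts[(tool, task)] += 1
    bad = [(t, k) for t in tools for k in tasks if counts[(t, k)] != trials]
    if bad:
        sys.exit(f"cohort asymmetry (possible cherry-pick): {bad}")
```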
Open every artifact
Trial inputs: feature PRD · bugfix PRD · refactor PRD. Judge inputs: sample request payloads · prompt template. Reports: feature · bugfix · refactor.
open results/final-report.md
Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md