Open benchmark · 2026-05 snapshot

Eight Claude Code setups, three tasks, five judges — and task-specific specialisation.

A blind, multi-judge evaluation of the Claude Code ecosystem — plugins, skill packs, hook kits, and a no-addon baseline — on feature, bugfix, and refactor work in a real TypeScript monorepo. 1080 judgments (3 rounds per artifact). No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.

8 Claude Code setups (plugins, skill packs, hook kits, and a no-addon baseline), all on claude-opus-4-7
3 tasks: feature · bugfix · refactor (TypeScript NX monorepo)
5 judges (weighted): Opus 4.7 ×3 · GPT-5.4-pro ×2 · Grok-4.20 ×1 · GLM-5.1 ×1 · MiMo-2.5-pro ×1
1080 blind-labeled judgments: 3 tasks × 8 tools × 3 trials × 5 judges × 3 rounds
72 tool trials (8 tools × 3 trials × 3 tasks), every clone pinned to the task's base SHA
24 NATO-letter blind labels per task — diff scrubbed for tool-state directories, mapping sealed until aggregation
4 / 2 / 2 R1 mechanical-fact items locked per task from auto-metrics.json (feature locks 4: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5)
3 / 2 / 1 judge weights pre-registered in versions.lock.json: Anthropic ×3 · OpenAI ×2 · xAI / Z.ai / Xiaomi ×1

Per-task score intervals — the canonical view

Rank-1 by task is the canonical claim of this benchmark. Each panel plots the weighted mean (200 max) with a mean ± standard-error envelope (N=45 judgments per cell: 3 trials × 5 judges × 3 rounds). The dashed line is the cohort mean. Where the horizontal bars overlap, the pair should be read as a tie at this sample size.
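The envelope is cheap to re-derive from the committed per-judgment scores. A minimal numpy sketch, assuming each (tool, task) cell has already been reduced to its 45 raw judgment scores; the committed aggregate-results.sh additionally applies the judge weighting and σ decomposition on top of this:

import numpy as np

def cell_envelope(scores: np.ndarray) -> tuple[float, float, float]:
    # `scores`: the 45 per-judgment scores for one (tool, task) cell
    # (3 trials x 5 judges x 3 rounds), each out of 200.
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(scores.size)
    return mean - se, mean, mean + se

def reads_as_tie(a: tuple, b: tuple) -> bool:
    # Overlapping mean +/- SE envelopes are read as a tie at this sample size.
    return a[0] <= b[2] and b[0] <= a[2]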

Cross-task z̄ — informational only

z̄ is the equal-weight mean of per-task z-scores (each task's z computed against its 8-tool cohort mean / stdev). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline (see caveat 08) — read the per-task panels above for the canonical claim. Shown here as a visual aid only; collapsing three tasks into one number masks the task-specific specialisation that is the main finding.

Chips show per-task z-scores. Orange = tool's best task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.
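For readers who want to reproduce the chips, a hedged sketch of the z̄ computation described above (function and variable names are illustrative; whether the cohort stdev is sample or population is a detail the committed script decides):

import numpy as np

def zbar(per_task_means: dict[str, dict[str, float]]) -> dict[str, float]:
    # per_task_means[task][tool] = the tool's weighted mean on that task.
    # Each task's z is taken against its own 8-tool cohort mean / stdev,
    # then the three tasks are averaged with equal weight.
    tools = sorted(next(iter(per_task_means.values())))
    z = {t: [] for t in tools}
    for task_scores in per_task_means.values():
        vals = np.array([task_scores[t] for t in tools])
        mu, sigma = vals.mean(), vals.std(ddof=1)
        for tool, v in zip(tools, vals):
            z[tool].append((v - mu) / sigma)
    return {tool: float(np.mean(zs)) for tool, zs in z.items()}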

What we learned

No setup is top-2 on all three tasks. The right framing is "best by task", not "best overall". Three task-rooted observations; read the per-task reports for the full per-tool breakdown.

Source artifacts per task (what the tool saw · what the judges saw · the aggregated report):
feature: PRD · blind-eval labels & judge requests · aggregated report · equal-weight
bugfix: PRD · blind-eval labels & judge requests · aggregated report · equal-weight
refactor: PRD · blind-eval labels & judge requests · aggregated report · equal-weight

Behavioral fingerprints (transcript-mined)

Mean across 3 trials per (tool, task) cell, mined from the raw session JSONL. Surfaces the most surprising finding: most setups' multi-agent architecture either does not fire on this corpus, or fires inconsistently. Generated by scripts/audit-sessions.py from results/_audits/session-audit.md.

Sub-agent dispatches per trial
Tool         feature   bugfix   refactor
bmad         1.0       0.0      1.3
claudekit    2.3       0.7      0.3
compound     1.7       0.0      0.0
ecc          1.7       1.0      1.0
gstack       2.7       0.0      2.7
omc          10.7      3.7      12.3
pure         1.0       0.3      1.3
superpower   18.3      0.0      0.7
compound and bmad dispatch ~0 sub-agents on bugfix/refactor. superpower fan-out is 18.3 on feature, ~0 on bugfix. "Multi-agent" is task-conditional, not architectural.
Tool-config reads per trial ("setup tax")
Tool         feature   bugfix   refactor
bmad         6.0       2.7      4.7
claudekit    0.0       0.0      0.0
compound     0.0       0.0      0.0
ecc          0.0       0.0      0.0
gstack       0.0       0.0      0.0
omc          17.3      0.3      4.7
pure         0.0       0.0      0.0
superpower   1.7       0.0      0.0
"Setup tax": reads of the tool's own scaffolding during execution. omc dominates this metric (~17 per feature trial); bmad is the only other tool that re-reads its scaffolding on every task (2.7–6.0 per trial). The other six tools load their scaffolding once into the prompt and are done with it.

Explore the docs

The repository's docs/ folder is organized by reader intent. Four routes in, depending on what you want.

How we kept the measurement honest

A benchmark is only as credible as its protocol. Four commitments you can audit in results/:

Blind evaluation

Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging. Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
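The sealing step is mechanical. A hedged sketch of how the relabeling could be produced (file layout and function name are illustrative; only the .mapping-DO-NOT-OPEN.json name comes from the protocol above):

import json, random
from pathlib import Path

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
        "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
        "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey", "Xray"]

def seal_labels(artifacts: list[str], out_dir: Path, seed: int = 42) -> None:
    # One NATO codename per blinded artifact (24 per task in this benchmark).
    rng = random.Random(seed)
    labels = NATO[:len(artifacts)]
    rng.shuffle(labels)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Judges only ever see the label directories; this file stays unread
    # until scoring finishes.
    (out_dir / ".mapping-DO-NOT-OPEN.json").write_text(
        json.dumps(dict(zip(labels, artifacts)), indent=2))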

Five-judge weighted panel

Claude Opus 4.7 (Anthropic), GPT-5.4-pro (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 is identical under both rules on every task, and top-3 is identical on bugfix and refactor (feature top-3 reorders under equal weighting — see PAPER §5).
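In code, the combination rule is a one-liner. A sketch under the pre-registered weights (judge keys are illustrative; versions.lock.json is the source of truth):

import numpy as np

# Pre-registered 3 / 2 / 1 / 1 / 1 weighting.
WEIGHTS = {"opus-4.7": 3, "gpt-5.4-pro": 2, "grok-4.20": 1, "glm-5.1": 1, "mimo-2.5-pro": 1}

def panel_score(per_judge_means: dict[str, float]) -> float:
    # Weighted combination of per-judge mean scores for one label.
    judges = list(per_judge_means)
    return float(np.average([per_judge_means[j] for j in judges],
                            weights=[WEIGHTS[j] for j in judges]))

def equal_weight_score(per_judge_means: dict[str, float]) -> float:
    # Comparator emitted alongside every report.
    return float(np.mean(list(per_judge_means.values())))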

Everything committed

Trial inputs: the exact task PRD fed to every tool — feature · bugfix · refactor — plus the per-tool prompt prefix in scripts/manual-bench.sh.
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Judge inputs: the verbatim request payload sent to each of the 5 judges per label per round — e.g. Alpha/round1/*-judge.json.request.json — built from the judge-prompt template over the blinded diff.
Per label: the diff the judges saw and all five judges' raw JSON outputs (with the scores_pre_r1 snapshot for the R1 audit trail).

Deterministic aggregation

Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.
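The σ decomposition is the only non-obvious step. A minimal sketch of one plausible split, assuming between-judge vs. within-judge components over a cell's score matrix (the committed script defines the real decomposition):

import numpy as np

def sigma_split(scores: np.ndarray) -> tuple[float, float]:
    # `scores` shape: (judges, observations), e.g. (5, 9) for
    # 5 judges x (3 trials x 3 rounds) in one (tool, task) cell.
    judge_means = scores.mean(axis=1)
    between = judge_means.std(ddof=1)            # drift across judges
    within = scores.std(axis=1, ddof=1).mean()   # noise within each judge
    return float(between), float(within)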

Caveats, in plain English

These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.

Approve-only "vibecode" execution. Trials are run by an operator who accepts whatever each tool proposes — no mid-flight steering, no plan rejection, no "try a different approach". The per-tool slash command is fired once and the operator only clicks through permission prompts. This measures autonomous one-shot capability under the pinned base model. Setups that depend on iterative human feedback (plan revision, rejecting a sub-task, mid-edit course correction) will rank lower here than in interactive pair-programming use. Read rankings as "best when the operator just keeps approving", not as effectiveness in a human-in-the-loop workflow.
01
Single codebase, single language — TypeScript. Don't assume the rankings carry to Python, Go, or Rust.
02
Single executor base model: claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
03
LLM judges diverge 31–44 pts (full panel spread). GPT-5.4-pro is consistently the harshest (19–29 pts below the per-task panel mean), MiMo-2.5-pro the most lenient. The 5-judge weighted panel and the within / between σ split are the intended mitigations — they expose drift rather than hide it.
04
Pure (no addons) is rank-1 on refactor, rank-3 on bugfix, and rank-3 on feature — top-3 on every task in this corpus. On refactor the top-5 sit within 3.6 weighted pts, so no addon meaningfully exceeds the bare CLI under the operational ~5-pt tie envelope. On bugfix the picture is split: claudekit (rank-1) and ecc (rank-2) both sit above pure, and claudekit's 8.6-pt lead is outside the tie envelope — the strict null is rejected on bugfix by that one pair. On feature, ecc (rank-1) and superpower (rank-2) lead pure by ~5–7 pts. Use cost, speed, and DX as the real differentiator on refactor, and read addon-vs-baseline gaps elsewhere against the tie envelope before claiming lift.
05
R1 mechanical-fact override is post-hoc. Deterministic items are rewritten from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
06
Judge weights pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 scheme reflects operator trust in the Anthropic and OpenAI judges; equal-weight aggregations are emitted alongside as sensitivity. Rank-1 is identical under both rules on every task; top-3 is identical on bugfix and refactor, but feature top-3 reorders under equal weighting (weighted: ecc/superpower/pure; equal-weight: ecc/compound/bmad).
07
Self-preference is not identified by this design. Every executor uses a Claude base model, so judge-family favoritism cannot be isolated. A proper audit needs a non-Anthropic executor as control.
08
Cross-task synthesis is informational only. A single cross-task z̄ leaderboard is sensitive to weighting and noisy at this sample size — read the per-task reports together. The leaderboard above is a visual aid, not a ranking.
09
Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go); Claude CLI and OpenAI /v1/responses do not expose temperature/seed.
10
Not preregistered. Tasks, rubric, judge panel, weight scheme, and R1 lock list were chosen iteratively. The weight scheme is committed to versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
11
Tool-version snapshot, 2026-05. Versions captured in versions.lock.json. Re-run for current-version claims.

Re-derive every number in one command

Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).

Per-task aggregation

Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.

TASK=feature ./scripts/aggregate-results.sh

R1 mechanical-fact override

Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.

TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
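For intuition, a hedged sketch of the override's shape (field names and the lock-list format are illustrative; scripts/apply-r1-override.py and caveat 05 define the real behavior):

import json
from pathlib import Path

def apply_r1_override(score_file: Path, auto_metrics: dict, lock_list: dict) -> None:
    # lock_list maps rubric item id -> auto-metrics key: 4 items on feature,
    # 2 on bugfix, 2 on refactor. Idempotent: the pre-override scores are
    # snapshotted once under scores_pre_r1.
    doc = json.loads(score_file.read_text())
    doc.setdefault("scores_pre_r1", dict(doc["items"]))
    for item_id, metric_key in lock_list.items():
        doc["items"][item_id] = auto_metrics[metric_key]
    score_file.write_text(json.dumps(doc, indent=2))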

Cohort symmetry

Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.

python3 scripts/audit-cohort-symmetry.py
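The invariant it enforces is simple to state. A minimal sketch, assuming a results/<task>/<tool>/trial-* layout (the committed scripts/audit-cohort-symmetry.py is authoritative and checks more than trial counts):

import sys
from collections import Counter
from pathlib import Path

def audit(results_root: Path, tools: int = 8, trials: int = 3) -> int:
    # Every (tool, task) cell must hold exactly the same number of trials;
    # anything else looks like a cherry-picked rerun.
    violations = []
    for task_dir in sorted(p for p in results_root.iterdir()
                           if p.is_dir() and not p.name.startswith("_")):
        counts = Counter(t.parent.name for t in task_dir.glob("*/trial-*"))
        if len(counts) != tools or any(c != trials for c in counts.values()):
            violations.append(task_dir.name)
    if violations:
        print("cohort asymmetry in: " + ", ".join(violations), file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(audit(Path("results")))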

Open every artifact

Trial inputs: feature PRD · bugfix PRD · refactor PRD. Judge inputs: sample request payloads · prompt template. Reports: feature · bugfix · refactor.

open results/final-report.md

Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md