Open benchmark · 2026-05 snapshot

Eight Claude Code setups, three tasks, five judges, and task-specific specialization.

A blind, multi-judge evaluation of the Claude Code ecosystem (plugins, skill packs, hook kits, and a no-addon baseline) on feature, bugfix, and refactor work in a real TypeScript monorepo. 360 judgments. No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.

8
Claude Code setups (plugins, skill packs, hook kits, and a no-addon baseline), all on claude-opus-4-7
3
Tasks: feature · bugfix · refactor (TypeScript NX monorepo)
5
Judges (weighted): Opus 4.7 ×3 · GPT-5.4-pro ×2 · Grok-4.20 ×1 · GLM-5.1 ×1 · MiMo-2.5-pro ×1
360
Blind-labeled judgments: 3 tasks × 8 tools × 3 trials × 5 judges
72
Tool trials (8 tools × 3 trials × 3 tasks), every clone pinned to the task's base SHA
24
NATO-letter blind labels per task — diff scrubbed for tool-state directories, mapping sealed until aggregation
4 / 2 / 2
R1 mechanical-fact items locked per task from auto-metrics.json (feature locks 4: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5)
3 / 2 / 1
Judge weights pre-registered in versions.lock.json: Anthropic ×3 · OpenAI ×2 · xAI / Z.ai / Xiaomi ×1

The cross-task leaderboard

z̄ is the equal-weight mean of per-task z-scores (each task's z is computed against its 8-tool cohort mean and standard deviation). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline claim; the table is shown only as a visual aid, and rank-1 by task is the canonical result. Read the per-task panels below instead.
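For concreteness, a minimal sketch of that computation, assuming each tool's weighted mean per task is already loaded; variable names are illustrative, and the committed aggregate-results.sh remains the authoritative version.

```python
# Illustrative sketch of the cross-task z-bar, not the committed script.
# `scores[task][tool]` is assumed to hold each tool's weighted mean for that task.
import statistics

def cross_task_zbar(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    per_tool_z: dict[str, list[float]] = {}
    for task, by_tool in scores.items():
        cohort = list(by_tool.values())            # the 8-tool cohort for this task
        mu = statistics.mean(cohort)
        sigma = statistics.pstdev(cohort)          # the repo may use the sample stdev instead
        for tool, score in by_tool.items():
            per_tool_z.setdefault(tool, []).append((score - mu) / sigma)
    # Equal-weight mean of the per-task z-scores (one per task)
    return {tool: statistics.mean(zs) for tool, zs in per_tool_z.items()}
```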

Chips show per-task z-scores. Orange = tool's best task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.

Per-task score intervals

Each panel plots the weighted mean score (out of 200) with a ± standard-error envelope (N=15 judgments per cell: 3 trials × 5 judges). The dashed line is the cohort mean. Where the horizontal bars overlap, read the pair as a tie at this sample size.
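A minimal sketch of one cell's envelope, assuming an unweighted standard error over the 15 judgments; the committed aggregation may treat judge weights differently.

```python
# Sketch of one panel cell: mean and standard-error envelope over the
# N = 15 blind judgments (3 trials x 5 judges). Unweighted SE is assumed here.
import statistics

def cell_envelope(judgments: list[float]) -> tuple[float, float, float]:
    assert len(judgments) == 15                    # 3 trials x 5 judges
    mean = statistics.mean(judgments)
    se = statistics.stdev(judgments) / len(judgments) ** 0.5
    return mean, mean - se, mean + se              # centre and the plotted envelope
```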

How we kept the measurement honest

A benchmark is only as credible as its protocol. Four commitments you can audit in results/:

Blind evaluation

Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging. Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
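Conceptually, the blinding step looks roughly like the sketch below. This is an illustration, not the repository's code; the seed and the mapping file name are taken from the protocol described above.

```python
# Illustration of the blinding step, not the repository's implementation:
# shuffle the trial diffs, assign NATO codenames, and seal the mapping in a
# file that must not be opened until scoring finishes.
import json
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
        "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
        "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey", "Xray"]

def seal_labels(trial_dirs: list[str], seed: int = 42) -> dict[str, str]:
    shuffled = list(trial_dirs)
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(NATO, shuffled))            # blind label -> real trial directory
    with open(".mapping-DO-NOT-OPEN.json", "w") as fh:
        json.dump(mapping, fh, indent=2)           # sealed until aggregation
    return mapping
```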

Five-judge weighted panel

Claude Opus 4.7 (Anthropic), GPT-5.4-pro (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 and top-3 are identical under both rules on every task in this corpus.
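A minimal sketch of that aggregation under the pre-registered weights, with the equal-weight comparator alongside; the judge keys are shorthand, not the repository's field names.

```python
# Sketch of the panel aggregation under the pre-registered 3/2/1/1/1 weights,
# plus the equal-weight comparator emitted alongside every report.
JUDGE_WEIGHTS = {"opus": 3, "gpt": 2, "grok": 1, "glm": 1, "mimo": 1}

def weighted_panel(per_judge_mean: dict[str, float]) -> float:
    total = sum(JUDGE_WEIGHTS.values())
    return sum(per_judge_mean[j] * w for j, w in JUDGE_WEIGHTS.items()) / total

def equal_weight_panel(per_judge_mean: dict[str, float]) -> float:
    return sum(per_judge_mean.values()) / len(per_judge_mean)
```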

Everything committed

Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates. Per label: the diff the judges saw and all five judges' raw JSON outputs (with scores_pre_r1 snapshot for the R1 audit trail).

Deterministic aggregation

Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.

Caveats, in plain English

These limitations are why we publish every artifact and decline to cite rank positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.

01
Single codebase, single language — TypeScript. Don't assume the rankings carry to Python, Go, or Rust.
02
Single executor base model: claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
03
LLM judges diverge 31–44 pts (full panel spread). GPT-5.4-pro is consistently the harshest (19–29 pts below the per-task panel mean), MiMo-2.5-pro the most lenient. The 5-judge weighted panel and the within / between σ split (sketched after this list) are the intended mitigations: they expose drift rather than hide it.
04
The null is not rejected on 2 of 3 tasks. pure (no addons) is top-3 on bugfix and rank-1 on refactor. Use cost, speed, and DX as the real differentiators, not the score column.
05
R1 mechanical-fact override is post-hoc. Deterministic items are rewritten from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
06
Judge weights pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 scheme reflects operator trust in the Anthropic and OpenAI judges; equal-weight aggregations are emitted alongside as sensitivity. Rank-1 and top-3 are identical under both rules.
07
Self-preference is not identified by this design. Every executor uses a Claude base model, so judge-family favoritism cannot be isolated. A proper audit needs a non-Anthropic executor as control.
08
Cross-task synthesis is informational only. A single cross-task z̄ leaderboard is sensitive to weighting and noisy at this sample size — read the per-task reports together. The leaderboard above is a visual aid, not a ranking.
09
Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go); Claude CLI and OpenAI /v1/responses do not expose temperature/seed.
10
Not preregistered. Tasks, rubric, judge panel, weight scheme, and R1 lock list were chosen iteratively. The weight scheme is committed to versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
11
Tool-version snapshot, 2026-05. Versions captured in versions.lock.json. Re-run for current-version claims.
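As referenced in caveat 03, here is a minimal sketch of one way to compute the within / between σ split; it is not the repository's script, and the grouping is illustrative.

```python
# Sketch of the within / between sigma split from caveat 03: how much spread
# comes from judges disagreeing on the same diff (within) vs. the diffs
# actually differing from each other (between).
import statistics

def sigma_split(scores_by_label: dict[str, list[float]]) -> tuple[float, float]:
    # scores_by_label: blind label -> the five judges' scores for that diff
    within = statistics.mean([statistics.pstdev(v) for v in scores_by_label.values()])
    between = statistics.pstdev([statistics.mean(v) for v in scores_by_label.values()])
    return within, between
```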

Re-derive every number in one command

Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).

Per-task aggregation

Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.

TASK=feature ./scripts/aggregate-results.sh

R1 mechanical-fact override

Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.

TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
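Conceptually, the override does something like the sketch below. The item-to-metric mapping and the all-or-nothing scoring rule are illustrative only; the committed apply-r1-override.py is authoritative.

```python
# Conceptual sketch only: rewrite locked rubric items from auto-metrics.json
# and keep the pre-override scores under scores_pre_r1. The lock list and the
# 10-or-0 rule here are hypothetical, not the committed behaviour.
import json

def apply_r1(judge_json: dict, auto_metrics: dict, locked: dict[str, str]) -> dict:
    out = json.loads(json.dumps(judge_json))            # deep copy of the judge's output
    out["scores_pre_r1"] = dict(judge_json["scores"])   # preserve the audit trail
    for item_id, metric_key in locked.items():          # e.g. "tsc" -> "tsc_errors" (hypothetical)
        out["scores"][item_id] = 10 if auto_metrics.get(metric_key, 1) == 0 else 0
    return out
```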

Cohort symmetry

Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.

python3 scripts/audit-cohort-symmetry.py
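Its core invariant can be sketched as follows; the committed script checks more than this, so treat it as an illustration only.

```python
# Sketch of the audit's core invariant: every (task, tool) pair contributes
# exactly the same number of trials, so no rerun or dropped trial slips through.
from collections import Counter

def symmetry_violations(trials: list[tuple[str, str]], expected: int = 3) -> list[str]:
    counts = Counter(trials)                    # trials: (task, tool) pairs found under results/
    return [f"{task}/{tool}: {n} trials, expected {expected}"
            for (task, tool), n in sorted(counts.items()) if n != expected]
```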

Open every artifact

Per-task final-report.md, per-label blind-eval dirs, per-trial session logs.

open results/final-report.md

Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md