Open benchmark · 2026-05 snapshot

Eight Claude Code setups, three tasks, five judges, and task-specific specialization.

A blind, multi-judge evaluation of the Claude Code ecosystem (plugins, skill packs, hook kits, and a no-addon baseline) on feature, bugfix, and refactor work in a real TypeScript monorepo. 360 judgments. No setup is top-2 on all three tasks. Every prompt, transcript, diff, and judge score is checked in for independent re-analysis.

8
Claude Code setups (plugins, skill packs, hook kits, and a no-addon baseline), all on claude-opus-4-7
3
Tasks: feature · bugfix · refactor (TypeScript NX monorepo)
5
Judges (weighted): Opus 4.7 ×3 · GPT-5.4-pro ×2 · Grok-4.20 ×1 · GLM-5.1 ×1 · MiMo-2.5-pro ×1
360
Blind-labeled judgments: 3 tasks × 8 tools × 3 trials × 5 judges
72
Tool trials (8 tools × 3 trials × 3 tasks), every clone pinned to the task's base SHA
24
NATO-letter blind labels per task — diff scrubbed for tool-state directories, mapping sealed until aggregation
4 / 2 / 2
R1 mechanical-fact items locked per task from auto-metrics.json (feature locks 4: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5)
3 / 2 / 1
Judge weights pre-registered in versions.lock.json: Anthropic ×3 · OpenAI ×2 · xAI / Z.ai / Xiaomi ×1

The cross-task leaderboard

z̄ is the equal-weight mean of per-task z-scores (each task's z is computed against its 8-tool cohort mean and standard deviation). This benchmark deliberately does not publish a cross-task z̄ leaderboard as a headline claim; the table is shown only as a visual aid, and rank-1 by task is the canonical result. Read the per-task panels below instead.
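For concreteness, a minimal sketch of that computation, assuming each tool's weighted mean per task is already loaded; variable names are illustrative, and the committed aggregate-results.sh remains the authoritative version.

```python
# Illustrative sketch of the cross-task z-bar, not the committed script.
# `scores[task][tool]` is assumed to hold each tool's weighted mean for that task.
import statistics

def cross_task_zbar(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    per_tool_z: dict[str, list[float]] = {}
    for task, by_tool in scores.items():
        cohort = list(by_tool.values())            # the 8-tool cohort for this task
        mu = statistics.mean(cohort)
        sigma = statistics.pstdev(cohort)          # the repo may use the sample stdev instead
        for tool, score in by_tool.items():
            per_tool_z.setdefault(tool, []).append((score - mu) / sigma)
    # Equal-weight mean of the per-task z-scores (one per task)
    return {tool: statistics.mean(zs) for tool, zs in per_tool_z.items()}
```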

Chips show per-task z-scores. Orange = tool's best task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.

Per-task score intervals

Each panel plots the weighted mean score (out of 200) with a ± standard-error envelope (N=15 judgments per cell: 3 trials × 5 judges). The dashed line is the cohort mean. Where the horizontal bars overlap, read the pair as a tie at this sample size.
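A minimal sketch of one cell's envelope, assuming an unweighted standard error over the 15 judgments; the committed aggregation may treat judge weights differently.

```python
# Sketch of one panel cell: mean and standard-error envelope over the
# N = 15 blind judgments (3 trials x 5 judges). Unweighted SE is assumed here.
import statistics

def cell_envelope(judgments: list[float]) -> tuple[float, float, float]:
    assert len(judgments) == 15                    # 3 trials x 5 judges
    mean = statistics.mean(judgments)
    se = statistics.stdev(judgments) / len(judgments) ** 0.5
    return mean, mean - se, mean + se              # centre and the plotted envelope
```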

How we kept the measurement honest

A benchmark is only as credible as its protocol. Four commitments you can audit in results/:

Blind evaluation

Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging. Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
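Conceptually, the blinding step looks roughly like the sketch below. This is an illustration, not the repository's code; the seed and the mapping file name are taken from the protocol described above.

```python
# Illustration of the blinding step, not the repository's implementation:
# shuffle the trial diffs, assign NATO codenames, and seal the mapping in a
# file that must not be opened until scoring finishes.
import json
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
        "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
        "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey", "Xray"]

def seal_labels(trial_dirs: list[str], seed: int = 42) -> dict[str, str]:
    shuffled = list(trial_dirs)
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(NATO, shuffled))            # blind label -> real trial directory
    with open(".mapping-DO-NOT-OPEN.json", "w") as fh:
        json.dump(mapping, fh, indent=2)           # sealed until aggregation
    return mapping
```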

Five-judge weighted panel

Claude Opus 4.7 (Anthropic), GPT-5.4-pro (OpenAI), Grok-4.20 (xAI), GLM-5.1 (Z.ai), and MiMo-2.5-pro (Xiaomi) each score the same 20-item rubric independently. Per-judge means are combined under the pre-registered 3 / 2 / 1 / 1 / 1 weighting in versions.lock.json. An equal-weight comparator is emitted alongside every report; rank-1 and top-3 are identical under both rules on every task in this corpus.
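A minimal sketch of that aggregation under the pre-registered weights, with the equal-weight comparator alongside; the judge keys are shorthand, not the repository's field names.

```python
# Sketch of the panel aggregation under the pre-registered 3/2/1/1/1 weights,
# plus the equal-weight comparator emitted alongside every report.
JUDGE_WEIGHTS = {"opus": 3, "gpt": 2, "grok": 1, "glm": 1, "mimo": 1}

def weighted_panel(per_judge_mean: dict[str, float]) -> float:
    total = sum(JUDGE_WEIGHTS.values())
    return sum(per_judge_mean[j] * w for j, w in JUDGE_WEIGHTS.items()) / total

def equal_weight_panel(per_judge_mean: dict[str, float]) -> float:
    return sum(per_judge_mean.values()) / len(per_judge_mean)
```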

Everything committed

Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates. Per label: the diff the judges saw and all five judges' raw JSON outputs (with scores_pre_r1 snapshot for the R1 audit trail).

Deterministic aggregation

Two scripts re-generate every number: aggregate-results.sh (R1 sweep → weighted mean → equal-weight comparator → σ decomposition) and audit-cohort-symmetry.py (no-cherry-picking audit). No network, no private state.

Caveats, in plain English

These limitations are why we publish every artifact and decline to cite rank positions within the top cluster. Read PAPER §4 for the full threats-to-validity list.

01
Single codebase, single language — TypeScript. Don't assume the rankings carry to Python, Go, or Rust.
02
Single executor base model: claude-opus-4-7. A setup tuned for Sonnet or Haiku could rank differently.
03
LLM judges diverge 31–44 pts (full panel spread). GPT-5.4-pro is consistently the harshest (19–29 pts below the per-task panel mean), MiMo-2.5-pro the most lenient. The 5-judge weighted panel and the within / between σ split (sketched after this list) are the intended mitigations: they expose drift rather than hide it.
04
The null is not rejected on 2 of 3 tasks. pure (no addons) is top-3 on bugfix and rank-1 on refactor. Use cost, speed, and DX as the real differentiators, not the score column.
05
R1 mechanical-fact override is post-hoc. Deterministic items are rewritten from auto-metrics.json; the lock list varies per task (feature locks 4 items: tsc / eslint / core-test failures / lines removed; bugfix locks 2; refactor locks 2 — see PAPER §1.5). Pre-override scores are preserved per-file under scores_pre_r1.
06
Judge weights pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 scheme reflects operator trust in the Anthropic and OpenAI judges; equal-weight aggregations are emitted alongside as sensitivity. Rank-1 and top-3 are identical under both rules.
07
Self-preference is not identified by this design. Every executor uses a Claude base model, so judge-family favoritism cannot be isolated. A proper audit needs a non-Anthropic executor as control.
08
Cross-task synthesis is informational only. A single cross-task z̄ leaderboard is sensitive to weighting and noisy at this sample size — read the per-task reports together. The leaderboard above is a visual aid, not a ranking.
09
Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go); Claude CLI and OpenAI /v1/responses do not expose temperature/seed.
10
Not preregistered. Tasks, rubric, judge panel, weight scheme, and R1 lock list were chosen iteratively. The weight scheme is committed to versions.lock.json before aggregation; earlier choices of tasks and rubric items are not preregistered.
11
Tool-version snapshot, 2026-05. Versions captured in versions.lock.json. Re-run for current-version claims.
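As referenced in caveat 03, here is a minimal sketch of one way to compute the within / between σ split; it is not the repository's script, and the grouping is illustrative.

```python
# Sketch of the within / between sigma split from caveat 03: how much spread
# comes from judges disagreeing on the same diff (within) vs. the diffs
# actually differing from each other (between).
import statistics

def sigma_split(scores_by_label: dict[str, list[float]]) -> tuple[float, float]:
    # scores_by_label: blind label -> the five judges' scores for that diff
    within = statistics.mean([statistics.pstdev(v) for v in scores_by_label.values()])
    between = statistics.pstdev([statistics.mean(v) for v in scores_by_label.values()])
    return within, between
```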

Re-derive every number in one command

Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).

Per-task aggregation

Runs the R1 sweep, weighted mean, equal-weight comparator, and σ decomposition. Writes final-report.md + final-report.equal-weight.md.

TASK=feature ./scripts/aggregate-results.sh

R1 mechanical-fact override

Rewrites deterministic rubric items from auto-metrics.json; idempotent; preserves scores_pre_r1.

TASK=feature python3 scripts/apply-r1-override.py results/_blind-eval/Alpha
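Conceptually, the override does something like the sketch below. The item-to-metric mapping and the all-or-nothing scoring rule are illustrative only; the committed apply-r1-override.py is authoritative.

```python
# Conceptual sketch only: rewrite locked rubric items from auto-metrics.json
# and keep the pre-override scores under scores_pre_r1. The lock list and the
# 10-or-0 rule here are hypothetical, not the committed behaviour.
import json

def apply_r1(judge_json: dict, auto_metrics: dict, locked: dict[str, str]) -> dict:
    out = json.loads(json.dumps(judge_json))            # deep copy of the judge's output
    out["scores_pre_r1"] = dict(judge_json["scores"])   # preserve the audit trail
    for item_id, metric_key in locked.items():          # e.g. "tsc" -> "tsc_errors" (hypothetical)
        out["scores"][item_id] = 10 if auto_metrics.get(metric_key, 1) == 0 else 0
    return out
```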

Cohort symmetry

Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.

python3 scripts/audit-cohort-symmetry.py
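Its core invariant can be sketched as follows; the committed script checks more than this, so treat it as an illustration only.

```python
# Sketch of the audit's core invariant: every (task, tool) pair contributes
# exactly the same number of trials, so no rerun or dropped trial slips through.
from collections import Counter

def symmetry_violations(trials: list[tuple[str, str]], expected: int = 3) -> list[str]:
    counts = Counter(trials)                    # trials: (task, tool) pairs found under results/
    return [f"{task}/{tool}: {n} trials, expected {expected}"
            for (task, tool), n in sorted(counts.items()) if n != expected]
```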

Open every artifact

Per-task final-report.md, per-label blind-eval dirs, per-trial session logs.

open results/final-report.md

Full walkthrough: Verification guide · Pre-publish runbook: RERUN-PRE-PUBLISH · Paper: PAPER.md