Open benchmark · 2026-04 snapshot
Nine Claude Code setups, three tasks, three judges — and a top-four statistical tie.
A blind, multi-judge evaluation of the Claude Code ecosystem — plugins, skill packs, hook kits, and a no-addon baseline — on feature, bugfix, and refactor work in a real TypeScript monorepo. 864 judgments. Every prompt, transcript, diff, and judge score is checked in — read the intervals, not the ranks.
The cross-task leaderboard
z̄ is the equal-weight mean of per-task z-scores (balanced across three judges). The top-four pairwise 95% bootstrap CIs overlap on every task — treat their ordering as a cluster, not a ranking.
Chips show per-task z-scores. Orange = tool's best task. The mindful row's feature chip and the superpower row's bugfix chip each mark that tool's weakest task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.
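For concreteness, a minimal sketch of the z̄ aggregation described above; the dict shapes are illustrative, not the repo's actual schema (scripts/cross-task-analysis.py is the authoritative implementation):

```python
import numpy as np

def zbar(per_task_scores):
    """per_task_scores: {task: {tool: judge-balanced mean score}}.
    Returns {tool: equal-weight mean of per-task z-scores}."""
    z_by_tool = {}
    for task, scores in per_task_scores.items():
        vals = np.array(list(scores.values()), dtype=float)
        mu, sigma = vals.mean(), vals.std()
        for tool, s in scores.items():
            # z-score each tool against the task cohort
            z_by_tool.setdefault(tool, []).append((s - mu) / sigma)
    return {tool: float(np.mean(zs)) for tool, zs in z_by_tool.items()}
```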
Per-task bootstrap intervals
Each panel plots mean score (20 items × 0–10 = 200 max) with 95% bootstrap CIs (10,000 resamples, seed 42, stratified by judge). The dashed line is the cohort mean. Where horizontal bars overlap, the pair is not separable at α=0.05.
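A sketch of how a judge-stratified percentile bootstrap with those parameters can be computed; the input layout is an assumption:

```python
import numpy as np

def stratified_bootstrap_ci(scores_by_judge, n_boot=10_000, seed=42, alpha=0.05):
    """scores_by_judge: one array of 20 item scores (0-10) per judge.
    Returns the (lo, hi) percentile CI of the judge-averaged total (max 200)."""
    rng = np.random.default_rng(seed)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Resample items within each judge so the judge mix stays fixed.
        totals = [rng.choice(j, size=len(j), replace=True).sum()
                  for j in scores_by_judge]
        boot[b] = np.mean(totals)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```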
What we learned
Rank-order alone is low-signal once the top-4 CIs overlap. Three cross-cutting observations that hold up when you read the transcripts, not just the leaderboard.
The top-4 is a statistical tie — and vanilla Claude Code is inside it.
bmad, ecc, pure, gstack land in CIs that overlap on every task. The only common factor: each of them forces a plan-before-code step. Architecture (multi-agent, phase-gated, skill-pack, baseline) did not correlate with landing in the top cluster. Planning discipline did.
Read the analysis →
Effective planning commands share five wording patterns.
A content-level comparison of the actual slash-command and skill markdown that drove the bugfix top-5. Explicit checklists beat prose; task-shape awareness beats generic templates; forcing functions beat suggestions; specific domain helpers beat abstract principles; hard-gated reviewers beat deferred verification.
Read the wording analysis →
Seven mechanism→outcome patterns across 27 transcript cells.
Among them: --auto gate-suppression in claudekit; the setup-turn tax on mindful and omc; multi-agent overhead in compound that needs task size to amortize; hook-based re-anchoring in mindful that is narrow-purpose by design.
Observations, not laws — the benchmark has no signal on MCP servers, plugin size, or CLAUDE.md length.
Explore the docs
The repository's docs/ folder is organized by reader intent.
Four routes in, depending on what you want.
Run one trial in ~10 minutes
Clone → pick a task → run a tool → judge → aggregate. The minimum viable reproduction path, with the full command sequence.
Quickstart →
Pipeline end-to-end
The canonical flow: tasks, tools, trials, judging, aggregation, and the cohort-symmetry rule. Plus the interaction protocol between operator and tool.
Pipeline reference →
Nine setups, profiled
Per-tool: upstream, version, mechanism (skills / hooks / sub-agents), exact invocation, per-task transcript notes, strengths, and failure modes. Grounded in the session logs.
Tool profiles →
What did we learn?
Three analysis pages: the top-4 tie explained, a wording-pattern study of the planning commands that drove the top cluster, and seven mechanism→outcome patterns drawn from all 27 (tool × task) transcript cells.
Analysis index →
Add a tool or judge
The scaffolding flow, the plan-mode-vs-native-planning decision matrix, the cohort-symmetry obligation, and two PR checklists.
Extending guide →
Reproduce a specific claim
"Why is superpower rank 9?" — eight step-by-step walkthroughs that re-compute headline claims from the committed artifacts.
Verification guide →
What each tool actually did, per trial
Auto-extracted event timelines for every (task, tool, trial) cell — skill activations, plugin/skill files read, subagents dispatched, code mutations, Bash usage. Aggregate counts plus 27 per-tool pages.
Trial timelines →
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging, and markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
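A minimal sketch of that relabeling step; the sealed filename comes from the protocol above, but the paths and diff layout are assumptions:

```python
import json
import random
from pathlib import Path

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo",
        "Foxtrot", "Golf", "Hotel", "India"]

def blind(diff_paths, out_dir):
    """Copy each tool's diff under a shuffled codename; seal the mapping."""
    out = Path(out_dir)
    labels = NATO[:len(diff_paths)]
    random.shuffle(labels)  # unseeded on purpose: a fixed seed would leak the mapping
    mapping = {}
    for label, src in zip(labels, map(Path, diff_paths)):
        (out / f"{label}.diff").write_bytes(src.read_bytes())
        mapping[label] = src.stem  # tool identity, unread until scoring ends
    (out / ".mapping-DO-NOT-OPEN.json").write_text(json.dumps(mapping, indent=2))
```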
Three-judge panel
Claude Opus 4.7, GPT-5.4 high-reasoning, and Qwen3.6-plus high-reasoning each score the same 20-item rubric independently, then the per-judge means are averaged with equal weight. Prior five-judge runs (glm / kimi / gemini / opus / codex) were retired for calibration drift — see pipeline.md §1 for the retirement rationale. Three-judge averaging is mitigation, not a silver bullet.
Everything committed
Per trial: the full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates. Per label: the diff the judges saw and all three judges' raw JSON outputs.
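Because the transcripts are plain JSONL, a first-pass audit takes a few lines; the "type" field here is an assumed schema detail (results/README.md documents the real one):

```python
import json
from collections import Counter
from pathlib import Path

def event_counts(log_path):
    """Tally event types in one committed session transcript."""
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        if line.strip():
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts
```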
Deterministic aggregation
Three Python scripts re-generate every number: cross-task-analysis.py (ranks + CIs), krippendorff-alpha.py (inter-rater α), audit-cohort-symmetry.py (no-cherry-picking audit). All seed 42, no network, no private state.
Eleven caveats, in plain English
These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §7 for the full threats-to-validity list.
Single-model scope: all results are for claude-opus-4-6. A setup tuned for Sonnet or Haiku could rank differently.
Re-derive every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Cross-task stats
Bootstrap CIs, pairwise tiers, ranking sensitivity, calibration.
python3 scripts/cross-task-analysis.py
Inter-rater reliability
Krippendorff α per task, pairwise judge α, per-item + totals.
python3 scripts/krippendorff-alpha.py
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
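One way to check the byte-match claim end to end; the report filename is an assumption (results/README.md indexes the real artifacts):

```python
import hashlib
import subprocess

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

committed = sha256("results/cross-task-report.json")  # assumed report path
subprocess.run(["python3", "scripts/cross-task-analysis.py"], check=True)
assert sha256("results/cross-task-report.json") == committed, \
    "regenerated output diverges from the committed report"
```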
Open every artifact
Folder index and file schemas for every transcript, diff, and judge file.
open results/README.md
Full walkthrough: Verification guide · Pipeline end-to-end: Pipeline reference · Paper: PAPER.md