Open benchmark · 2026-04 snapshot
Nine Claude Code setups, three tasks, three judges — and a top-four statistical tie.
A blind, multi-judge evaluation of the Claude Code ecosystem — plugins, skill packs, hook kits, and a no-addon baseline — on feature, bugfix, and refactor work in a real TypeScript monorepo. 864 judgments. Every prompt, transcript, diff, and judge score is checked in — read the intervals, not the ranks.
The cross-task leaderboard
z̄ is the equal-weight mean of per-task z-scores (balanced across three judges). The top-four pairwise 95% bootstrap CIs overlap on every task — treat their ordering as a cluster, not a ranking.
Chips show per-task z-scores. Orange = tool's best task. The mindful row's feature chip and the superpower row's bugfix chip each mark that tool's weakest task. Click a row for the tool's transcript-grounded profile — mechanism, invocation, observed behaviors, failure modes.
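For concreteness, a minimal sketch of the z̄ aggregation described above; the dict shapes are illustrative, not the repo's actual schema (scripts/cross-task-analysis.py is the authoritative implementation):

```python
import numpy as np

def zbar(per_task_scores):
    """per_task_scores: {task: {tool: judge-balanced mean score}}.
    Returns {tool: equal-weight mean of per-task z-scores}."""
    z_by_tool = {}
    for task, scores in per_task_scores.items():
        vals = np.array(list(scores.values()), dtype=float)
        mu, sigma = vals.mean(), vals.std()
        for tool, s in scores.items():
            # z-score each tool against the task cohort
            z_by_tool.setdefault(tool, []).append((s - mu) / sigma)
    return {tool: float(np.mean(zs)) for tool, zs in z_by_tool.items()}
```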
Per-task bootstrap intervals
Each panel plots mean score (20 items × 0–10 = 200 max) with 95% bootstrap CIs (10,000 resamples, seed 42, stratified by judge). The dashed line is the cohort mean. Where horizontal bars overlap, the pair is not separable at α=0.05.
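A sketch of how a judge-stratified percentile bootstrap with those parameters can be computed; the input layout is an assumption:

```python
import numpy as np

def stratified_bootstrap_ci(scores_by_judge, n_boot=10_000, seed=42, alpha=0.05):
    """scores_by_judge: one array of 20 item scores (0-10) per judge.
    Returns the (lo, hi) percentile CI of the judge-averaged total (max 200)."""
    rng = np.random.default_rng(seed)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Resample items within each judge so the judge mix stays fixed.
        totals = [rng.choice(j, size=len(j), replace=True).sum()
                  for j in scores_by_judge]
        boot[b] = np.mean(totals)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```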
What we learned
Rank-order alone is low-signal once the top-4 CIs overlap. Three cross-cutting observations that hold up when you read the transcripts, not just the leaderboard.
The top-4 is a statistical tie — and vanilla Claude Code is inside it.
bmad, ecc, pure, gstack land in CIs that overlap on every task. The only common factor: each of them forces a plan-before-code step. Architecture (multi-agent, phase-gated, skill-pack, baseline) did not correlate with landing in the top cluster. Planning discipline did.
Read the analysis →
Effective planning commands share five wording patterns.
A content-level comparison of the actual slash-command and skill markdown that drove the bugfix top-5. Explicit checklists beat prose; task-shape awareness beats generic templates; forcing functions beat suggestions; specific domain helpers beat abstract principles; hard-gated reviewers beat deferred verification.
Read the wording analysis →
Seven mechanism→outcome patterns across 27 transcript cells.
Among them: --auto gate-suppression in claudekit; the setup-turn tax on mindful and omc; multi-agent overhead in compound that needs task size to amortize; hook-based re-anchoring in mindful that is narrow-purpose by design.
Observations, not laws — the benchmark has no signal on MCP servers, plugin size, or CLAUDE.md length.
Explore the docs
The repository's docs/ folder is organized by reader intent.
Four routes in, depending on what you want.
Run one trial in ~10 minutes
Clone → pick a task → run a tool → judge → aggregate. The minimum viable reproduction path, with the full command sequence.
Quickstart →
Pipeline end-to-end
The canonical flow: tasks, tools, trials, judging, aggregation, and the cohort-symmetry rule. Plus the interaction protocol between operator and tool.
Pipeline reference →
Nine setups, profiled
Per-tool: upstream, version, mechanism (skills / hooks / sub-agents), exact invocation, per-task transcript notes, strengths, and failure modes. Grounded in the session logs.
Tool profiles →
What did we learn?
Three analysis pages: the top-4 tie explained, a wording-pattern study of the planning commands that drove the top cluster, and seven mechanism→outcome patterns drawn from all 27 (tool × task) transcript cells.
Analysis index →
Add a tool or judge
The scaffolding flow, the plan-mode-vs-native-planning decision matrix, the cohort-symmetry obligation, and two PR checklists.
Extending guide →
Reproduce a specific claim
"Why is superpower rank 9?" — eight step-by-step walkthroughs that re-compute headline claims from the committed artifacts.
Verification guide →
What each tool actually did, per trial
Auto-extracted event timelines for every (task, tool, trial) cell — skill activations, plugin/skill files read, subagents dispatched, code mutations, Bash usage. Aggregate counts plus 27 per-tool pages.
Trial timelines →
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging, and markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes — any review that reads it during scoring is invalid by protocol.
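A minimal sketch of that relabeling step; the sealed filename comes from the protocol above, but the paths and diff layout are assumptions:

```python
import json
import random
from pathlib import Path

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo",
        "Foxtrot", "Golf", "Hotel", "India"]

def blind(diff_paths, out_dir):
    """Copy each tool's diff under a shuffled codename; seal the mapping."""
    out = Path(out_dir)
    labels = NATO[:len(diff_paths)]
    random.shuffle(labels)  # unseeded on purpose: a fixed seed would leak the mapping
    mapping = {}
    for label, src in zip(labels, map(Path, diff_paths)):
        (out / f"{label}.diff").write_bytes(src.read_bytes())
        mapping[label] = src.stem  # tool identity, unread until scoring ends
    (out / ".mapping-DO-NOT-OPEN.json").write_text(json.dumps(mapping, indent=2))
```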
Three-judge panel
Claude Opus 4.7, GPT-5.4 high-reasoning, and Qwen3.6-plus high-reasoning each score the same 20-item rubric independently, then the per-judge means are averaged with equal weight. Prior five-judge runs (glm / kimi / gemini / opus / codex) were retired for calibration drift — see pipeline.md §1 for the retirement rationale. Three-judge averaging is mitigation, not a silver bullet.
Everything committed
Per trial: the full session transcript (session-logs/*.jsonl), the byte-exact prompt, wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates. Per label: the diff the judges saw and all three judges' raw JSON outputs.
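Because the transcripts are plain JSONL, a first-pass audit takes a few lines; the "type" field here is an assumed schema detail (results/README.md documents the real one):

```python
import json
from collections import Counter
from pathlib import Path

def event_counts(log_path):
    """Tally event types in one committed session transcript."""
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        if line.strip():
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts
```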
Deterministic aggregation
Three Python scripts re-generate every number: cross-task-analysis.py (ranks + CIs), krippendorff-alpha.py (inter-rater α), audit-cohort-symmetry.py (no-cherry-picking audit). All seed 42, no network, no private state.
Eleven caveats, in plain English
These limitations are why we publish all artifacts and refuse to cite rank-positions within the top cluster. Read PAPER §7 for the full threats-to-validity list.
Single-model scope: all results are for claude-opus-4-6. A setup tuned for Sonnet or Haiku could rank differently.
Re-derive every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Cross-task stats
Bootstrap CIs, pairwise tiers, ranking sensitivity, calibration.
python3 scripts/cross-task-analysis.py
Inter-rater reliability
Krippendorff α per task, pairwise judge α, per-item + totals.
python3 scripts/krippendorff-alpha.py
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
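One way to check the byte-match claim end to end; the report filename is an assumption (results/README.md indexes the real artifacts):

```python
import hashlib
import subprocess

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

committed = sha256("results/cross-task-report.json")  # assumed report path
subprocess.run(["python3", "scripts/cross-task-analysis.py"], check=True)
assert sha256("results/cross-task-report.json") == committed, \
    "regenerated output diverges from the committed report"
```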
Open every artifact
Folder index and file schemas for every transcript, diff, and judge file.
open results/README.md
Full walkthrough: Verification guide · Pipeline end-to-end: Pipeline reference · Paper: PAPER.md