Open benchmark · 2026-04 snapshot
Nine Claude Code frameworks, three tasks, three judges — and a top-four statistical tie.
A blind, multi-judge evaluation of the Claude Code ecosystem on feature, bugfix, and refactor work in a real TypeScript monorepo. 756 judgments. Every prompt, transcript, diff, and judge score is checked in — read the intervals, not the ranks.
The cross-task leaderboard
z̄ is the equal-weight mean of per-task z-scores (balanced across three judges). The top-four pairwise 95% bootstrap CIs overlap on every task — treat their ordering as a cluster, not a ranking.
Chips show per-task z-scores. Orange marks a tool's best task. The mindful row's feature chip and the superpower row's bugfix chip are each that tool's weakest task.
Per-task bootstrap intervals
Each panel plots mean score (20 items × 0–10 = 200 max) with 95% bootstrap CIs (10,000 resamples, seed 42, stratified by judge). The dashed line is the cohort mean. Where horizontal bars overlap, the pair is not separable at α=0.05.
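A minimal sketch of the stratified percentile bootstrap described above. The real computation lives in cross-task-analysis.py; here the input shape (one score list per judge) and the function name are assumptions:

```python
import random

def bootstrap_ci(scores_by_judge, n_boot=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for a tool's mean score.
    Stratified by judge: items are resampled with replacement
    *within* each judge, so judge balance is preserved."""
    rng = random.Random(seed)          # fixed seed -> reproducible interval
    means = []
    for _ in range(n_boot):
        total, n = 0.0, 0
        for judge in scores_by_judge:  # one list of item scores per judge
            draw = [rng.choice(judge) for _ in judge]
            total += sum(draw)
            n += len(draw)
        means.append(total / n)
    means.sort()
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])
```

Two tools are "not separable" when these intervals overlap, which is exactly how the top-four cluster is declared.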
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging.
Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes; any review that reads it during scoring is invalid by protocol.
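The relabeling step amounts to a seeded shuffle plus a sealed file. A hypothetical helper, purely illustrative (seal_labels and its signature are not the repo's actual code):

```python
import json
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo",
        "Foxtrot", "Golf", "Hotel", "India"]

def seal_labels(tools, seed, path=".mapping-DO-NOT-OPEN.json"):
    """Randomly assign each tool a NATO codename and seal the
    label-to-tool mapping to disk. Judges see only the codenames."""
    rng = random.Random(seed)
    codenames = rng.sample(NATO, len(tools))   # distinct labels, no reuse
    mapping = dict(zip(codenames, tools))
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
    return mapping
```

Because the assignment is seeded, the mapping can be regenerated after scoring to prove the sealed file was never edited.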
Three-judge panel
Claude Opus 4.7, GPT-5.4 high-reasoning, and Qwen3.6-plus high-reasoning each score the same 20-item rubric independently; the per-judge means are then averaged with equal weight. Earlier five-judge runs were retired for calibration drift (σ = 9–13, 57-point gap): the panel mitigates single-judge bias, it does not eliminate it.
Everything committed
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt,
wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Per label: the diff the judges saw and all three judges' raw JSON outputs.
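The task-specific hard gates reduce to "every check must exit 0 or the trial doesn't count". A hedged sketch; the gate commands below are assumed for a TypeScript monorepo, not copied from the repo:

```python
import subprocess

# Assumed gate commands for a TypeScript monorepo (illustrative only).
DEFAULT_GATES = [["npx", "tsc", "--noEmit"],
                 ["npx", "eslint", "."],
                 ["npx", "jest", "--ci"]]

def passes_hard_gates(cwd=".", gates=None):
    """A trial only counts if every gate command exits 0."""
    gates = gates or DEFAULT_GATES
    return all(subprocess.run(cmd, cwd=cwd).returncode == 0
               for cmd in gates)
```

Gating on exit codes rather than judge opinion keeps the compile/lint/test floor objective and machine-checkable.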
Deterministic aggregation
Three Python scripts re-generate every number: cross-task-analysis.py (ranks + CIs),
krippendorff-alpha.py (inter-rater α), audit-cohort-symmetry.py
(no-cherry-picking audit). All seed 42, no network, no private state.
Twelve caveats, in plain English
These limitations are why we publish every artifact and decline to cite rank positions within the top cluster. Read PAPER §7 for the full threats-to-validity list.
Single-model scope: every trial runs on claude-opus-4-6. A framework tuned for Sonnet or Haiku could rank differently.
Self-benchmark: the maintainers of one entrant (omc) designed the benchmark. It lands at rank 7–8 across weighting schemes, which is at least self-critical; we publish all artifacts so any methodology or task-selection bias is auditable.
Re-derive every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Cross-task stats
Bootstrap CIs, pairwise tiers, ranking sensitivity, calibration.
python3 scripts/cross-task-analysis.py
Inter-rater reliability
Krippendorff α per task, pairwise judge α, per-item + totals.
python3 scripts/krippendorff-alpha.py
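For reference, Krippendorff's α for interval data is one observed-vs-expected disagreement ratio. A stdlib-only sketch under simplifying assumptions (no missing values; the committed script additionally reports pairwise and per-item figures):

```python
def krippendorff_alpha_interval(scores):
    """Krippendorff's alpha, interval metric, no missing data.
    scores: one row per judge, one column per item."""
    m = len(scores)               # judges
    n_items = len(scores[0])      # items (units)
    values = [v for row in scores for v in row]
    n = len(values)               # total pairable values
    # Observed disagreement: mean squared difference within each item.
    d_o = 0.0
    for u in range(n_items):
        col = [scores[j][u] for j in range(m)]
        for i in range(m):
            for k in range(m):
                if i != k:
                    d_o += (col[i] - col[k]) ** 2
    d_o /= n * (m - 1)
    # Expected disagreement: mean squared difference over all value pairs.
    s, ss = sum(values), sum(v * v for v in values)
    d_e = (2 * n * ss - 2 * s * s) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0
```

α = 1 means perfect agreement; α near 0 means agreement no better than chance; systematic disagreement drives it negative.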
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
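The symmetry check boils down to "every observed (tool, task) cell has the same trial count". A minimal illustrative version; the committed script also covers reruns and exits non-zero on violations:

```python
from collections import Counter

def symmetric(trials):
    """True iff every (tool, task) cell seen in the data has the same
    number of trials, i.e. no trial was selectively kept or dropped."""
    counts = Counter((t["tool"], t["task"]) for t in trials)
    return len(set(counts.values())) == 1
```

An extra rerun for one tool on one task immediately breaks the count invariant, which is what makes cherry-picking detectable.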
Open every artifact
Folder index and file schemas for every transcript, diff, and judge file.
open results/README.md
Full walkthrough: docs/VERIFICATION-GUIDE.md · Pipeline end-to-end: docs/pipeline.md · Paper: PAPER.md