Open benchmark · 2026-04 snapshot
Nine Claude Code frameworks, three tasks, three judges — and a top-four statistical tie.
A blind, multi-judge evaluation of the Claude Code ecosystem on feature, bugfix, and refactor work in a real TypeScript monorepo. 756 judgments. Every prompt, transcript, diff, and judge score is checked in — read the intervals, not the ranks.
The cross-task leaderboard
z̄ is the equal-weight mean of per-task z-scores (balanced across three judges). The top-four pairwise 95% bootstrap CIs overlap on every task — treat their ordering as a cluster, not a ranking.
Chips show per-task z-scores. Orange marks a tool's best task. The mindful row's feature chip and the superpower row's bugfix chip are each that tool's weakest task.
Per-task bootstrap intervals
Each panel plots mean score (20 items × 0–10 = 200 max) with 95% bootstrap CIs (10,000 resamples, seed 42, stratified by judge). The dashed line is the cohort mean. Where horizontal bars overlap, the pair is not separable at α=0.05.
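A minimal sketch of the stratified percentile bootstrap described above. The real computation lives in cross-task-analysis.py; here the input shape (one score list per judge) and the function name are assumptions:

```python
import random

def bootstrap_ci(scores_by_judge, n_boot=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for a tool's mean score.
    Stratified by judge: items are resampled with replacement
    *within* each judge, so judge balance is preserved."""
    rng = random.Random(seed)          # fixed seed -> reproducible interval
    means = []
    for _ in range(n_boot):
        total, n = 0.0, 0
        for judge in scores_by_judge:  # one list of item scores per judge
            draw = [rng.choice(judge) for _ in judge]
            total += sum(draw)
            n += len(draw)
        means.append(total / n)
    means.sort()
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])
```

Two tools are "not separable" when these intervals overlap, which is exactly how the top-four cluster is declared.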
How we kept the measurement honest
A benchmark is only as credible as its protocol. Four commitments you can audit in results/:
Blind evaluation
Every diff is relabeled with a NATO codename (Alpha, Bravo, Charlie…) before judging.
Markdown plan files are stripped. The label-to-tool mapping (.mapping-DO-NOT-OPEN.json) is sealed until scoring finishes; any review that reads it during scoring is invalid by protocol.
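The relabeling step amounts to a seeded shuffle plus a sealed file. A hypothetical helper, purely illustrative (seal_labels and its signature are not the repo's actual code):

```python
import json
import random

NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo",
        "Foxtrot", "Golf", "Hotel", "India"]

def seal_labels(tools, seed, path=".mapping-DO-NOT-OPEN.json"):
    """Randomly assign each tool a NATO codename and seal the
    label-to-tool mapping to disk. Judges see only the codenames."""
    rng = random.Random(seed)
    codenames = rng.sample(NATO, len(tools))   # distinct labels, no reuse
    mapping = dict(zip(codenames, tools))
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
    return mapping
```

Because the assignment is seeded, the mapping can be regenerated after scoring to prove the sealed file was never edited.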
Three-judge panel
Claude Opus 4.7, GPT-5.4 high-reasoning, and Qwen3.6-plus high-reasoning each score the same 20-item rubric independently; the per-judge means are then averaged with equal weight. Earlier five-judge runs were retired for calibration drift (σ = 9–13, 57-point gap): the panel mitigates single-judge bias, it does not eliminate it.
Everything committed
Per trial: full session transcript (session-logs/*.jsonl), the byte-exact prompt,
wall-clock + token metrics, TSC / ESLint / Jest output, diff stats, and task-specific hard gates.
Per label: the diff the judges saw and all three judges' raw JSON outputs.
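The task-specific hard gates reduce to "every check must exit 0 or the trial doesn't count". A hedged sketch; the gate commands below are assumed for a TypeScript monorepo, not copied from the repo:

```python
import subprocess

# Assumed gate commands for a TypeScript monorepo (illustrative only).
DEFAULT_GATES = [["npx", "tsc", "--noEmit"],
                 ["npx", "eslint", "."],
                 ["npx", "jest", "--ci"]]

def passes_hard_gates(cwd=".", gates=None):
    """A trial only counts if every gate command exits 0."""
    gates = gates or DEFAULT_GATES
    return all(subprocess.run(cmd, cwd=cwd).returncode == 0
               for cmd in gates)
```

Gating on exit codes rather than judge opinion keeps the compile/lint/test floor objective and machine-checkable.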
Deterministic aggregation
Three Python scripts re-generate every number: cross-task-analysis.py (ranks + CIs),
krippendorff-alpha.py (inter-rater α), audit-cohort-symmetry.py
(no-cherry-picking audit). All seed 42, no network, no private state.
Twelve caveats, in plain English
These limitations are why we publish every artifact and decline to cite rank positions within the top cluster. Read PAPER §7 for the full threats-to-validity list.
Single-model scope: every trial runs on claude-opus-4-6. A framework tuned for Sonnet or Haiku could rank differently.
Self-benchmark: the maintainers of one entrant (omc) designed the benchmark. It lands at rank 7–8 across weighting schemes, which is at least self-critical; we publish all artifacts so any methodology or task-selection bias is auditable.
Re-derive every number in one command
Nothing is hidden. Run the three aggregation scripts and diff against the committed report — output should byte-match (seed 42, stdlib + numpy only, no network).
Cross-task stats
Bootstrap CIs, pairwise tiers, ranking sensitivity, calibration.
python3 scripts/cross-task-analysis.py
Inter-rater reliability
Krippendorff α per task, pairwise judge α, per-item + totals.
python3 scripts/krippendorff-alpha.py
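For reference, Krippendorff's α for interval data is one observed-vs-expected disagreement ratio. A stdlib-only sketch under simplifying assumptions (no missing values; the committed script additionally reports pairwise and per-item figures):

```python
def krippendorff_alpha_interval(scores):
    """Krippendorff's alpha, interval metric, no missing data.
    scores: one row per judge, one column per item."""
    m = len(scores)               # judges
    n_items = len(scores[0])      # items (units)
    values = [v for row in scores for v in row]
    n = len(values)               # total pairable values
    # Observed disagreement: mean squared difference within each item.
    d_o = 0.0
    for u in range(n_items):
        col = [scores[j][u] for j in range(m)]
        for i in range(m):
            for k in range(m):
                if i != k:
                    d_o += (col[i] - col[k]) ** 2
    d_o /= n * (m - 1)
    # Expected disagreement: mean squared difference over all value pairs.
    s, ss = sum(values), sum(v * v for v in values)
    d_e = (2 * n * ss - 2 * s * s) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0
```

α = 1 means perfect agreement; α near 0 means agreement no better than chance; systematic disagreement drives it negative.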
Cohort symmetry
Verifies no trial or rerun was cherry-picked. Exits non-zero on hard violations.
python3 scripts/audit-cohort-symmetry.py
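The symmetry check boils down to "every observed (tool, task) cell has the same trial count". A minimal illustrative version; the committed script also covers reruns and exits non-zero on violations:

```python
from collections import Counter

def symmetric(trials):
    """True iff every (tool, task) cell seen in the data has the same
    number of trials, i.e. no trial was selectively kept or dropped."""
    counts = Counter((t["tool"], t["task"]) for t in trials)
    return len(set(counts.values())) == 1
```

An extra rerun for one tool on one task immediately breaks the count invariant, which is what makes cherry-picking detectable.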
Open every artifact
Folder index and file schemas for every transcript, diff, and judge file.
open results/README.md
Full walkthrough: docs/VERIFICATION-GUIDE.md · Pipeline end-to-end: docs/pipeline.md · Paper: PAPER.md