AI Coding Tool Benchmark: A Multi-Task, Multi-Judge Evaluation of Nine Claude Code Setups
Author: Randy Tran (randytran8800@gmail.com)
Date: 2026-04-22
Status: v1.1 — regenerated from 864-judgment corpus (adds round-3 for bugfix/refactor)
Repository: infina-pfa/claude-tool-benchmark
Abstract
We benchmark nine Claude Code setups — plugins, skill packs, hook kits, and a no-addon baseline — by having every setup implement the same three software-engineering tasks from the same production TypeScript codebase (the RealStake/infina-partner-sdk monorepo) under an identical base-model pin (claude-opus-4-6; per-CLI sampler defaults, see §8). Candidate diffs are scored on a 20-item, 200-point rubric by three LLM judges drawn from different model families (claude-opus-4-7, openai/gpt-5.4, qwen3.6-plus) under blind NATO-letter labels. Across 864 judgments spanning 3 tasks × 9 setups × 2–4 trials × {5 rounds (feature) or 3 rounds (bugfix/refactor)}, we find that the top four setups (ecc +0.273 z̄, bmad +0.270, pure +0.175, gstack +0.077) cluster within 0.20 z-score — their pairwise 95% bootstrap CIs overlap on every task, so their ordering is a statistical tie. superpower at z̄ = −0.315 is the only setup separated from the rest by non-overlapping CIs on at least one task (bugfix) — reported under a forced-activation harness applied to superpower only on bugfix (see §5). On the refactor task inter-judge Spearman ρ is not distinguishable from zero (CIs cross 0) — rankings on that task alone should not be cited. Judge calibrations differ systematically by ±25 points and Krippendorff α on 200-point totals is negative on feature/refactor (judges disagree on absolute scale), though α at the per-item level is moderate on feature and bugfix (+0.566, +0.629) and poor on refactor (+0.149): the three-judge balanced mean is the intended mitigation, and rankings reflect rank-order agreement (Spearman ρ) rather than absolute-scale consensus. We also run a judge calibration asymmetry check (opus vs. non-Anthropic judges); the drift is small and tool-invariant, but because every executor uses a Claude base model this design does not identify family-level self-preference — we note this as a limitation rather than a null result. Equal-weight cross-task z̄ is one of several valid summaries (judgment-count-weighted and rank-sum give different middle-tier orderings, with up to 4-position swings for individual tools); per-task CI tables are the primary deliverable and the ordered leaderboard should not be cited as a ranking. See the credibility review for known open issues, including pseudoreplication in CI estimation, non-preregistered analysis, and absence of a rubric-weight sensitivity sweep.
1. Introduction
Claude Code setups — plugins, skill packs, memory systems, multi-agent orchestrators layered on top of the base CLI — have proliferated in 2025–2026. Most are evaluated by vendors in aggregate terms (“30% faster”, “higher quality”) on single tasks against single baselines. This paper proposes a reproducible, multi-task, multi-judge benchmark that:
- Uses the same base model for every tool run, isolating the tool contribution from model-capability drift.
- Uses multiple distinct tasks (feature build, bugfix, refactor) to surface task-type specialization vs. broad competence.
- Uses a three-judge rotation drawn from three different model families, reporting per-judge bias rather than hiding it.
- Uses blind evaluation so judges never see the tool identity.
- Runs each (tool, task, trial) artifact through multiple independent judgment rounds and reports round-to-round stability.
Our contributions are (a) a protocol specification, (b) an 864-judgment dataset, (c) a full reproduction pipeline (scripts + blind-eval mappings), and (d) per-task mean/CI tables with explicit weighting-sensitivity.
2. Related Work
LLM-as-judge benchmarks are now common (e.g., MT-Bench, AlpacaEval, Chatbot Arena). Zheng et al. (2023) report Krippendorff α in the 0.62–0.80 range for judged conversational quality. Our prior-session work on this benchmark observed α = +0.75 at the rubric-item level but α ≈ 0 on 200-point totals when the cohort is compressed — the rubric instrument is reliable, but cohort spread falls below the judges’ aggregate noise floor. Here we switch from round-aggregation to an explicit 3-judge panel and verify that per-judge drift is large (±25 pts) but ranking-preserving on two of the three tasks (§4.4).
Diff-size bias (LLM-as-judge rewarding breadth over correctness) was documented in our earlier methodology work. We partially mitigate by including a scope-discipline category in the rubric, and we report the full per-item breakdown so readers can re-weight as they wish.
3. Methodology
3.1 Task Design
All three tasks are drawn from the production TypeScript monorepo at RealStake/infina-partner-sdk (a mid-sized financial-services NX workspace), pinned to a fixed base commit per task:
| Task | Type | Judgments/judge | Rounds |
|---|---|---|---|
| feature — TD-CD Mode 2 CD Batch | Greenfield feature build from PRD | 180 | 5 |
| bugfix — near-maturity filter | Bugfix from QA report | 54 | 3 |
| refactor — aggregate-ownership refactor | Scoped refactor from design doc | 54 | 3 |
Each task ships to the tool as docs/benchmark/TASK.md in the cloned RealStake/infina-partner-sdk repository. Planning prompts are identical across tools (scripts/manual-bench.sh); tools are free to invoke their own sub-pipelines (planning phase, TDD, self-review, etc.).
3.2 Setups Evaluated
Nine Claude Code setups, all layering on the same pinned base model (claude-opus-4-6):
| Setup | Approach |
|---|---|
| pure | Baseline — Claude Code with no additions |
| superpower | Skill-pack library |
| claudekit | Curated hook + command kit |
| bmad | BMAD method: business-modeling, multi-phase agents |
| mindful | Self-reflective planning loop |
| gstack | Opinionated stack + guardrails |
| compound | Compound Engineering multi-agent |
| ecc | Everything-Claude-Code skill registry |
| omc | Oh-My-Claudecode orchestration layer |
3.3 Trial Execution
Per (task, tool) pair we run 2–4 independent trials. Each trial is a fresh clone of the task’s base repository with the tool’s configuration installed into an isolated HOME. The tool implements the task, runs its own test/build commands, and commits. We capture: implementation diff, test/type-check/lint output, session transcripts, wall time, and token usage.
3.4 Scoring Rubric
A 20-item rubric with four categories totaling 200 points:
| Category | Max | Items cover |
|---|---|---|
| Correctness of the fix | 70 | Behavior matches spec, edge-case handling, domain-helper reuse |
| Tests | 50 | Coverage of spec branches, assertion quality, test independence |
| Code quality | 40 | Readability, naming, complexity, safe refactors |
| Scope discipline | 40 | No unrelated changes, no new config surface, respects module boundaries |
Judges output a strict JSON object {"scores": {"1": 0–10, ..., "20": 0–10}, "total": int}. In our original corpus the total field was computed independently by the judge LLM; 239 of 1,166 historical records had off-by-one-to-off-by-ten discrepancies between the total field and sum(scores). We fix these in-place using sum(scores) as source-of-truth — the item-wise ranked sums are what the judge actually evaluated. The pre-correction originals are preserved in-tree as <judge>-judge.pre-correction-20260421.json siblings of the mutated files (135 files in the canonical corpus; 26 historical pilot records dropped with their parent dirs). A rubric-score sensitivity analysis under total-as-source-of-truth is an open item (see credibility-review-20260422.md finding #8).
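The correction rule is mechanical; a minimal sketch, assuming the JSON shape above and the pre-correction sibling naming from this section (the repository's actual fix-up code may differ):

```python
import json
from pathlib import Path

def corrected_total(judge_file: Path) -> int:
    """Recompute one judgment's total as sum(scores), ignoring the judge-emitted 'total' field."""
    record = json.loads(judge_file.read_text())
    item_sum = sum(int(v) for v in record["scores"].values())
    if record.get("total") != item_sum:
        # Preserve the original alongside the corrected file before mutating in place.
        backup = judge_file.with_name(judge_file.stem + ".pre-correction-20260421.json")
        backup.write_text(judge_file.read_text())
        record["total"] = item_sum
        judge_file.write_text(json.dumps(record, indent=2))
    return item_sum
```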
3.5 Judge Panel
Three judges drawn from three distinct model families:
| Judge | Model ID | Reasoning config |
|---|---|---|
| opus | claude-opus-4-7 | default extended thinking |
| codex | openai/gpt-5.4 | --variant high (high reasoning effort) |
| qwen | opencode-go/qwen3.6-plus | --variant high |
Each artifact (tool × trial) is judged in 2–5 independent rounds. Judges are stateless between rounds (fresh context each call).
Labels are NATO-letter pseudonyms (Alpha, Bravo, …, Helix, …) with the mapping {label → (tool, trial)} stored in .mapping-DO-NOT-OPEN.json and read only by the aggregator. Judges never see tool names or directory paths indicative of the tool.
3.6 Statistical Methods
We report six ranking lenses, followed by three reliability and design checks:
- Balanced tool mean — mean of the three per-judge means for a (task, tool) cell. When per-judge n is equal (always true in the current corpus) this equals the pooled mean, but the form is preserved so that asymmetric re-runs cancel judge drift correctly.
- 95% bootstrap CI on the tool mean — 10,000 resamples, stratified by judge (each bootstrap draws the same number of per-judge samples as the original cell, then recomputes the balanced mean). Seed = 42.
- Combined z-score (ranking) — for each task, convert each tool’s balanced mean to a z-score against the task’s cohort mean and stdev (pooled across all judgments); average the three per-task z-scores with equal weight.
- Rank-sum (tie-break) — sum of the tool’s per-task rank. Independent of z-score magnitudes; robust to one-task outliers.
- Stability — (a) per-tool round-to-round σ = stdev of per-round mean scores (tool reproducibility); (b) within-round judge spread = mean of max−min across the three judges per round (per-artifact judge disagreement).
- Tier grouping (pairwise-overlap, complete linkage) — a candidate tool joins the current tier only when its 95% CI overlaps with every existing member of that tier. This makes “same tier” logically equivalent to “pairwise CIs all overlap” — the transitive interpretation expected of an “indistinguishable cluster.” Greedy single-linkage adjacency (new tier iff `prev.lo > curr.hi`) produces tiers that are intransitive and can merge tools whose CIs do not pairwise overlap, so we do not use it (a minimal sketch of the rule follows this list). Tier grouping is not a family-wise-error-rate (FWER) controlling procedure; no Bonferroni/Holm correction is applied across the 108 cross-task cell comparisons. Only individually cited pairwise separations (§4.3 “pairwise-disjoint” lists) carry statistical weight. We also report rank-sum and judgment-count-weighted z̄ as sensitivity checks: the three cross-task summaries agree on the top cluster and the bottom outlier but disagree on middle ordering — rankings within the middle should not be cited as leaderboard positions.
- Inter-rater agreement (two metrics) — Spearman ρ per judge pair on artifact-level totals, with 2,000-resample bootstrap 95% CI (tests rank-order agreement). Krippendorff α (interval level) per task on (a) per-item scores (rubric-item × label × judge) and (b) totals (label × judge), reported via `scripts/krippendorff-alpha.py` (tests absolute-scale agreement; sensitive to calibration drift in ways Spearman ρ is not).
- Judge calibration asymmetry check — `opus_mean − mean(codex_mean, qwen_mean)` per (task, tool), reported with both the mean Δ across tools and the within-range spread (does the drift vary by tool?). Identification note: every executor uses the same Anthropic base (claude-opus-4-6), so this statistic cannot distinguish family-level self-preference from judge-calibration drift. A uniform Δ is consistent with either. A true self-preference test would require non-Anthropic-base executor runs.
- Cohort rerun symmetry not formally audited — `CLAUDE.md` specifies that if any trial T<N> is rerun, T<N> must be rerun for all 9 tools. `scripts/audit-cohort-symmetry.py` reports observed per-trial timestamp spreads, base-commit divergences, and archived reruns so this can be verified after the fact; see its output before citing per-trial columns in comparison.
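A minimal sketch of the pairwise-overlap complete-linkage rule, assuming tools arrive sorted by descending balanced mean with (lo, hi) CI tuples (illustrative shapes; scripts/cross-task-analysis.py may structure this differently):

```python
def cis_overlap(a, b):
    """True when two (lo, hi) 95% CIs overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def assign_tiers(tools):
    """tools: list of (name, (ci_lo, ci_hi)) sorted by descending balanced mean.

    A candidate joins the current tier only when its CI overlaps the CI of every
    existing member (complete linkage), so "same tier" always means all pairwise
    CIs overlap; otherwise it starts a new tier.
    """
    tiers = []
    for name, ci in tools:
        if tiers and all(cis_overlap(ci, member_ci) for _, member_ci in tiers[-1]):
            tiers[-1].append((name, ci))
        else:
            tiers.append([(name, ci)])
    return tiers
```

Feeding in the bugfix CIs from §4.3 reproduces the {ecc, bmad} › {pure, mindful, gstack} › {claudekit, compound, omc, superpower} split.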
All statistics are produced by two scripts: per-task balanced means and methodology tables by scripts/aggregate-results.sh; CIs, tier grouping, Spearman CIs, α, sensitivity analysis, and calibration asymmetry by scripts/cross-task-analysis.py. Inputs are restricted to directories matching ^round[0-9]+$ — pilot and sample dirs (roundcotpilot, roundcotsample*) are excluded so the corpus size is deterministic.
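A compact sketch of the balanced mean, judge-stratified bootstrap, and per-task z-score specified above, assuming scores have already been grouped into a dict of per-judge lists for one (task, tool) cell (names are illustrative, not the script's actual interface):

```python
import numpy as np

JUDGES = ("opus", "codex", "qwen")

def balanced_mean(cell):
    """cell: dict judge -> list of per-judgment totals for one (task, tool).
    Equal-weight mean of per-judge means; equals the pooled mean when per-judge n is equal."""
    return float(np.mean([np.mean(cell[j]) for j in JUDGES]))

def bootstrap_ci(cell, n_boot=10_000, seed=42, alpha=0.05):
    """95% percentile CI on the balanced mean, resampling within each judge stratum."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {j: rng.choice(cell[j], size=len(cell[j]), replace=True) for j in JUDGES}
        stats.append(balanced_mean(resampled))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

def task_z(tool_mean, cohort_scores):
    """Per-task z-score of a tool's balanced mean against the pooled cohort mean/stdev."""
    return (tool_mean - np.mean(cohort_scores)) / np.std(cohort_scores)
```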
4. Results
Corpus: 864 judgments total (540 feature + 162 bugfix + 162 refactor). Cohort mean ± stdev: feature = 119.54 ± 26.01; bugfix = 168.24 ± 16.48; refactor = 159.04 ± 25.50.
4.1 Combined Ranking (Mean Z-Score)
| Rank | Tool | z̄ | feature z | bugfix z | refactor z |
|---|---|---|---|---|---|
| 1 | ecc | +0.273 | +0.091 | +0.767 | −0.041 |
| 2 | bmad | +0.270 | +0.215 | +0.582 | +0.012 |
| 3 | pure | +0.175 | +0.077 | +0.397 | +0.051 |
| 4 | gstack | +0.077 | +0.158 | +0.188 | −0.115 |
| 5 | mindful | −0.006 | −0.388 | +0.272 | +0.097 |
| 6 | claudekit | −0.123 | −0.113 | −0.318 | +0.062 |
| 7 | compound | −0.144 | −0.082 | −0.443 | +0.092 |
| 8 | omc | −0.205 | −0.108 | −0.672 | +0.164 |
| 9 | superpower | −0.315 | +0.149 | −0.773 | −0.322 |
The top-4 cluster (ecc/bmad/pure/gstack) spans 0.196 z-score. Their per-task 95% bootstrap CIs overlap on every task (see §4.3), so their relative ordering is not statistically distinguishable at the current sample size. superpower at z̄ = −0.315 is the only tool whose combined score is separated from the rest on at least one task by non-overlapping CIs (its bugfix CI [150.6, 160.3] is the lowest in the cohort and sits fully below the CI of every tool outside the bottom tier; see §4.3). This outlier is conditional on the forced-activation bugfix harness (§5); the raw-prompt score from the pre-rewrite superpower trials, archived at results/bugfix/superpower/archive-activation-20260421/, is not in the leaderboard shown here.
4.2 Rank-Sum
| Rank | Tool | rank-sum | feature | bugfix | refactor |
|---|---|---|---|---|---|
| 1 | bmad | 9 | 1 | 2 | 6 |
| 2 | ecc | 12 | 4 | 1 | 7 |
| 3 | pure | 13 | 5 | 3 | 5 |
| 4 | gstack | 15 | 2 | 5 | 8 |
| 5 | mindful | 15 | 9 | 4 | 2 |
| 6 | compound | 16 | 6 | 7 | 3 |
| 7 | omc | 16 | 7 | 8 | 1 |
| 8 | claudekit | 18 | 8 | 6 | 4 |
| 9 | superpower | 21 | 3 | 9 | 9 |
Rank-sum broadly agrees with z-score on the top cluster and bottom outlier but disagrees in the middle — e.g., mindful and omc move up under rank-sum because their per-task weak positions are penalized only once, not magnified by the task’s higher cohort stdev. Since most per-task rank positions are indistinguishable from adjacent positions at 95% confidence (see tiering below), rank-sum should be read as a summary of ordinal positions, not as a second ranking.
4.3 Per-Task Detail
All tables include 95% bootstrap CIs on the balanced tool mean (10,000 resamples, stratified by judge, seed 42). Tiers are assigned by the pairwise-overlap complete-linkage rule in §3.6 (membership ⇔ pairwise CI overlap with all tier-mates). FINAL-REPORT also lists explicit pairwise-disjoint tool pairs per task for transitivity-free comparison.
feature
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | bmad | 125.13 | [122.7, 127.4] | 25.95 | 60 |
| T1 | gstack | 123.65 | [120.5, 126.8] | 27.24 | 60 |
| T1 | superpower | 123.42 | [120.6, 126.2] | 26.75 | 60 |
| T1 | ecc | 121.90 | [119.2, 124.7] | 26.13 | 60 |
| T1 | pure | 121.53 | [118.0, 124.9] | 27.91 | 60 |
| T2 | compound | 117.40 | [114.0, 120.7] | 25.57 | 60 |
| T2 | omc | 116.73 | [114.2, 119.4] | 20.78 | 60 |
| T2 | claudekit | 116.60 | [113.4, 119.7] | 25.33 | 60 |
| T3 | mindful | 109.45 | [105.8, 113.2] | 25.70 | 60 |
Three pairwise-overlap tiers on feature: T1 = {bmad, gstack, superpower, ecc, pure}; T2 = {compound, omc, claudekit}; T3 = {mindful}. The T1/T2 boundary formally separates the top-5 from the middle-3; 15 of 36 tool-pairs on feature have fully disjoint 95% CIs (see FINAL-REPORT §3). Within each tier, pairwise CI overlap means the ordering is not statistically distinguishable.
bugfix
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | ecc | 180.89 | [178.9, 182.7] | 11.69 | 18 |
| T1 | bmad | 177.83 | [175.2, 180.3] | 13.83 | 18 |
| T2 | pure | 174.78 | [171.5, 178.1] | 13.07 | 18 |
| T2 | mindful | 172.72 | [167.2, 177.8] | 16.36 | 18 |
| T2 | gstack | 171.33 | [166.4, 176.0] | 13.24 | 18 |
| T3 | claudekit | 163.00 | [160.2, 165.8] | 11.27 | 18 |
| T3 | compound | 160.94 | [157.4, 164.4] | 13.91 | 18 |
| T3 | omc | 157.17 | [151.2, 162.6] | 17.96 | 18 |
| T3 | superpower | 155.50 | [150.6, 160.3] | 16.06 | 18 |
Three pairwise-overlap tiers on bugfix: {ecc, bmad} › {pure, mindful, gstack} › {claudekit, compound, omc, superpower}. superpower trials fired 2 Skill calls each (the harness gates completion on /superpowers:systematic-debugging invocation at session start and /superpowers:verification-before-completion at exit); hard gates 5/5 PASS; scope files 2. Explicitly triggering the skills is the operating condition under which these scores hold; see §5 for the mechanism. 23 of 36 pairs have disjoint 95% CIs on bugfix.
refactor
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | omc | 163.22 | [161.1, 165.4] | 24.74 | 18 |
| T1 | mindful | 161.50 | [159.0, 163.8] | 26.14 | 18 |
| T1 | compound | 161.39 | [158.5, 164.3] | 27.60 | 18 |
| T1 | claudekit | 160.61 | [157.7, 164.2] | 25.46 | 18 |
| T1 | pure | 160.33 | [157.8, 162.9] | 27.17 | 18 |
| T1 | bmad | 159.33 | [156.5, 162.0] | 26.61 | 18 |
| T2 | ecc | 158.00 | [156.1, 160.0] | 24.03 | 18 |
| T2 | gstack | 156.11 | [152.8, 159.4] | 23.13 | 18 |
| T3 | superpower | 150.83 | [145.5, 155.2] | 27.85 | 18 |
Three pairwise-overlap tiers on refactor: {omc, mindful, compound, claudekit, pure, bmad} › {ecc, gstack} › {superpower}. Only 9 of 36 pairs have disjoint CIs, mostly involving superpower against the top tier. Combined with near-zero inter-judge Spearman ρ (§4.4) and negative Krippendorff α on totals (§4.4b), per-task refactor rankings are noise-dominated and should not be cited in isolation.
4.4 Judge Calibration
Pooled mean ± stdev, across all nine tools, per (task, judge):
| Task | opus | codex | qwen |
|---|---|---|---|
| feature | 115.4 ± 12.8 | 94.4 ± 8.9 | 148.8 ± 16.7 |
| bugfix | 171.6 ± 16.8 | 154.0 ± 18.0 | 179.1 ± 16.0 |
| refactor | 164.6 ± 9.9 | 127.0 ± 8.5 | 185.4 ± 7.5 |
Drift from the three-judge mean (positive = generous, negative = harsh):
| Task | Δ opus | Δ codex | Δ qwen |
|---|---|---|---|
| feature | -4.1 | -25.1 | +29.2 |
| bugfix | +3.3 | -14.2 | +10.9 |
| refactor | +5.6 | -32.0 | +26.4 |
codex is systematically harsh (Δ ≈ −14 to −32), qwen is systematically generous (Δ ≈ +11 to +29), and opus is the neutral anchor (within ±6). This is not a ranking contamination: the three judges’ per-artifact rank orders are Spearman-correlated:
| Task | ρ(opus, codex) [95% CI] | ρ(opus, qwen) [95% CI] | ρ(codex, qwen) [95% CI] | n pairs |
|---|---|---|---|---|
| feature | +0.26 [+0.11, +0.38] | +0.57 [+0.46, +0.66] | +0.22 [+0.06, +0.36] | 180 |
| bugfix | +0.79 [+0.64, +0.88] | +0.75 [+0.61, +0.84] | +0.74 [+0.60, +0.82] | 54 |
| refactor | +0.14 [−0.14, +0.41] | +0.08 [−0.21, +0.37] | +0.21 [−0.06, +0.48] | 54 |
On feature and bugfix every judge-pair CI excludes zero — inter-judge rank-order agreement is significantly positive. On refactor every judge-pair CI straddles zero — rank agreement is not distinguishable from chance. The most likely cause is cohort compression (all 9 tools within 14 points on a 200-point scale) leaving judges in the noise regime.
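The per-pair ρ values above follow the recipe in §3.6; a minimal sketch, assuming two aligned vectors of artifact-level totals for a judge pair (2,000 percentile-bootstrap resamples; seed and names are illustrative, and the repository script may differ in detail):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_ci(x, y, n_boot=2_000, seed=42):
    """Point estimate and 95% percentile CI for Spearman rho between two judges'
    artifact-level totals (x[i] and y[i] score the same blind artifact)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rho = spearmanr(x, y).correlation
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample artifacts with replacement
        boot.append(spearmanr(x[idx], y[idx]).correlation)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return rho, (lo, hi)
```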
4.4b Krippendorff α — absolute-scale agreement
Spearman ρ measures rank-order agreement but is insensitive to absolute-scale drift. Krippendorff α (interval level) tests whether judges assign similar absolute scores, not just similar rankings. Running scripts/krippendorff-alpha.py on the current corpus:
| Task | α (per-item, 20 items × labels) | α (totals, per label) | Pair opus/codex | Pair opus/qwen | Pair codex/qwen |
|---|---|---|---|---|---|
| feature | +0.566 | −0.286 | −0.342 | −0.323 | −0.808 |
| bugfix | +0.629 | +0.287 | +0.317 | +0.685 | −0.144 |
| refactor | +0.149 | −0.425 | −0.797 | −0.620 | −0.908 |
Per-item α is materially higher than totals α on every task: judges agree on relative rubric-item weighting (which items matter more) but disagree on absolute scale (the summed total). On feature and refactor, α on totals is negative — observed inter-judge disagreement on the summed-total scale exceeds what chance alone would produce, consistent with the ±25-pt calibration drift reported above. Spearman ρ was positive on feature and bugfix (totals level) even though α on totals was negative on feature, because judges still ranked artifacts similarly even when their absolute scales diverged.
Implication: the three-judge balanced mean (equal-weight mean of per-judge means) is the intended mitigation given this disagreement shape — it cancels additive per-judge bias by giving each judge’s mean equal weight rather than pooling raw scores. It does not estimate or remove all calibration structure (e.g., non-linear scale differences or judge-specific variance). A single-judge result on this corpus would diverge from the panel result by more than a within-judge replicate suggests, so citing single-judge leaderboards under-reports uncertainty accordingly.
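For reference, interval-level α reduces to a short function; a sketch assuming a units × coders matrix with np.nan for missing entries (this is the textbook 1 − D_o/D_e form; scripts/krippendorff-alpha.py may organize the computation differently):

```python
import numpy as np

def krippendorff_alpha_interval(data):
    """Interval-level Krippendorff alpha.

    data: 2-D array-like of shape (units, coders), np.nan for missing values.
    Returns 1 - D_o / D_e using the squared-difference distance."""
    data = np.asarray(data, dtype=float)
    # Keep only units with at least two non-missing values (pairable units).
    units = [row[~np.isnan(row)] for row in data]
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)            # total number of pairable values
    if n <= 1:
        return float("nan")

    # Observed disagreement: within-unit squared differences, each unit weighted by 1/(m_u - 1).
    d_o = 0.0
    for u in units:
        diffs = (u[:, None] - u[None, :]) ** 2   # ordered pairs; i == j terms are zero
        d_o += diffs.sum() / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences over all value pairs pooled across units.
    pooled = np.concatenate(units)
    diffs_all = (pooled[:, None] - pooled[None, :]) ** 2
    d_e = diffs_all.sum() / (n * (n - 1))

    return 1.0 - d_o / d_e if d_e > 0 else 1.0
```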
4.5 Judge Calibration Asymmetry (opus vs. non-Anthropic judges)
A common concern with LLM-as-judge panels that mix model families is that the judge from the same family as the executor’s base model may systematically inflate scores (e.g., opus judging claude-opus-produced diffs). We compute a candidate diagnostic — per (task, tool), opus_mean − mean(codex_mean, qwen_mean) — and report mean and within-range spread across the 9 tools:
| Task | Mean Δ(opus − others) | Range across 9 tools | Within-range spread |
|---|---|---|---|
| feature | −6.22 | [−9.90, −0.42] | 9.47 |
| bugfix | +4.97 | [+1.75, +8.25] | 6.50 |
| refactor | +8.39 | [+1.50, +14.50] | 13.00 |
The Δ is small (single-digit per task) and approximately uniform across tools (within-range spread ≈6–13 pts). That shape is consistent with calibration drift — opus is a slightly different absolute scorer than the non-Anthropic pair on each task, but does not inflate any specific tool relative to the others.
Identification limitation (important). This test cannot distinguish family-level self-preference from simple judge-calibration drift, because every executor in this study uses the same Anthropic base model (claude-opus-4-6). A uniform opus offset could equally reflect (a) opus’s intrinsic scoring conservatism on long-artifact tasks, or (b) a family-preference effect applied identically to all (Anthropic-base) runs. We cannot disentangle these without non-Anthropic-base executor runs as a control condition. We therefore report this section as a calibration asymmetry check, not a self-preference audit: absence of tool-specific inflation is what the data show, and the absence of family-level inflation is not identified by this design.
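The diagnostic itself is one subtraction per (task, tool) cell; a sketch with illustrative names:

```python
def opus_asymmetry(judge_means):
    """judge_means: dict judge -> mean score for one (task, tool) cell.
    Positive values mean opus scores the cell above the non-Anthropic pair."""
    return judge_means["opus"] - (judge_means["codex"] + judge_means["qwen"]) / 2

# Per task we then report the mean of this delta across the 9 tools and its min-max range;
# a roughly tool-invariant delta is what distinguishes calibration drift from tool-specific inflation.
```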
4.6 Stability
Two complementary stability measures:
- Per-round σ — stdev of a tool’s round-level mean scores across the 2–5 independent judgment rounds (how much would the tool’s cell mean move if we re-ran the judges?).
- Inter-judge spread — mean within-round max-min across the three judges for each (label, round) artifact (how much do the three judges disagree on a single artifact?).
| Tool | feature σ | bugfix σ | refactor σ |
|---|---|---|---|
| ecc | 1.6 | 3.0 | 0.6 |
| bmad | 0.9 | 1.9 | 0.8 |
| pure | 1.1 | 3.4 | 1.8 |
| gstack | 3.3 | 0.7 | 0.4 |
| mindful | 1.4 | 2.7 | 2.7 |
| claudekit | 0.6 | 0.8 | 3.4 |
| compound | 2.5 | 1.7 | 2.0 |
| omc | 1.8 | 2.2 | 2.0 |
| superpower | 2.2 | 1.3 | 4.6 |
Per-round σ is ≤ 5 pts for every (tool, task) cell — re-running the judges would move a tool’s mean by only a few points in expectation. At n=3 rounds per (tool, judge) for bugfix/refactor the σ estimator has 2 degrees of freedom and very wide actual uncertainty, so these should be read as rough stability indicators, not precise estimates. Inter-judge spread dominates the uncertainty envelope at the per-artifact level (30–65 pts across the three judges on a typical feature artifact), which is why the 3-judge panel and bootstrap CIs are essential — individual judgments are noisy, but averaged panels converge.
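A sketch of both measures, assuming scores for one (task, tool) are grouped by round and per-artifact scores by judge (illustrative shapes, not the aggregator's real data model):

```python
import numpy as np

def per_round_sigma(round_scores):
    """round_scores: dict round -> list of all judge scores for that round (one tool, one task).
    Stdev of the per-round means: how much the cell mean would move across judging rounds."""
    round_means = [np.mean(v) for v in round_scores.values()]
    return float(np.std(round_means, ddof=1))

def inter_judge_spread(artifact_scores):
    """artifact_scores: list of per-artifact dicts judge -> score for one (label, round).
    Mean within-round max-minus-min across the three judges."""
    return float(np.mean([max(a.values()) - min(a.values()) for a in artifact_scores]))
```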
5. Case Study: superpower on bugfix — skill-activation is the operating condition
superpower posts the lowest bugfix mean in the cohort (155.50, CI [150.6, 160.3]), sitting at the bottom of the T3 tier alongside claudekit, compound, and omc, with a CI fully disjoint from every T1/T2 tool's. The separation is not a broad capability gap — on feature the setup lands inside T1 (z = +0.149) — and skills were only consulted on this task when the harness explicitly invoked them.
Forcing activation. The harness pins the bugfix prompt to two slash-command triggers: /superpowers:systematic-debugging at session start and /superpowers:verification-before-completion as a completion gate. With these triggers in place, two fresh trials (t1, t2) run against the same base commit and TASK.md and three judgment rounds produced: mean 155.50, z = −0.773, CI [150.6, 160.3], T3 (with {claudekit, compound, omc}). Note: this forced-activation harness was applied only to superpower’s bugfix branch; no other tool’s bugfix branch received an analogous intervention. The pre-rewrite “raw-prompt” superpower trials are archived at results/bugfix/superpower/archive-activation-20260421/. Under the cohort-rerun-symmetry rule in CLAUDE.md, a cleaner design is either (a) publish the raw-prompt score as the honest measurement, or (b) apply a matched forced-activation harness to all 9 tools; the current corpus does neither, which the credibility review flags as a methodological confound.
Interpretation. Even when the harness explicitly invokes them, the superpower skills produce bottom-tier (T3) output on bugfix, well short of T1, and they are not invoked by default on a bare task prompt without the slash-command trigger. Any claim about superpower-on-bugfix score therefore has an “under the activation protocol” qualifier attached; the corpus reported in this paper uses the forced-activation protocol for bugfix.
The case study illustrates two design decisions in the methodology:
- Scope-discipline rubric items (e.g., #18 “no new config surface”) are load-bearing on the bugfix task and visibly penalize setups that add configuration to fix a filter.
- Multi-task benchmarks are essential: a single-task evaluation would have shown superpower either fine (feature, refactor) or separated (bugfix). Only the three-task cross-section exposes the pattern as task-sensitivity rather than a broad capability gap.
6. Discussion
What the ranking supports. On 3 tasks × 9 tools × 3 judges × 2–5 rounds, 8 of 9 tools land within a 0.4 z-score band. The top-4 cluster (bmad, ecc, pure, gstack) has overlapping per-task 95% CIs on every task — their relative ordering is not statistically distinguishable at this sample size. At current precision, claims of “tool X is the best” are unsupported; the only claim that survives the CI test is “tool Y (superpower) has a task-specific quality gap on task Z (bugfix) under the forced-activation protocol” — its bugfix CI is fully separated from the CIs of all tools outside the bottom tier.
What the judges say. LLM judges from three model families agree on rank order on 2 of 3 tasks but disagree on absolute calibration by ±25 pts. On refactor they disagree on rank order too (Spearman CI straddles zero). This should be the default expectation for LLM-as-judge benchmarks: a three-judge panel is necessary to cancel absolute drift, but not sufficient to disambiguate tools when the cohort is compressed (tools all within ~14 pts on a 200-pt scale). Single-judge benchmarks under-report uncertainty by roughly an order of magnitude.
Task-type matters. The cohort mean is 119.5 on feature (feature build from PRD), 168.2 on bugfix, and 159.0 on refactor. Feature build is the hardest task-type in our sample; bugfix and refactor are markedly easier, and the refactor task in particular compressed the cohort to the point where inter-judge agreement collapsed. A benchmark that used only a single easy task would lose most of the between-tool signal.
Why pure (baseline) is top-4. Our strongest null hypothesis is “a Claude Code setup adds no value over the bare CLI.” pure lands at rank 3 by z̄ (+0.175) and rank 3 by rank-sum, inside the top cluster, with CIs that overlap bmad/ecc/gstack on every task. This does not reject the null at current precision. Tools may still add value on dimensions not captured by the rubric — developer experience, cost, speed, debugging ergonomics — or at larger sample sizes; we simply cannot distinguish their code-quality output from baseline from 162–540 judgments per task. A reference implementation in results/_human-reference/ scored ~24.95 above the top tool, confirming the ceiling is well above pure’s score but that the tool-vs-pure gap is small. The _human-reference is a single hand-authored artifact (n=1); the 24.95-pt ceiling gap has no error bar and should be treated as a single reading, not a distributional estimate.
Critic-review track record. Earlier review passes flagged: (i) the aggregator silently ingested pilot/sample round dirs (inflating per-tool n); (ii) “top-4 tie” was asserted without a CI computation; (iii) report numbers differed by ≈0.01 z between files. Those fixes — canonical round filter (^round[0-9]+$ only), 10,000-resample stratified bootstrap CIs, tier grouping by non-overlapping CIs, and single-source regeneration from scripts/cross-task-analysis.py — are in place. A subsequent external skeptical-reader audit (docs/analysis/credibility-review-20260422.md) identified 25 open issues against this release, ~8 of which (pseudoreplication in bootstrap CIs, non-uniform bugfix harness on superpower, post-hoc round-3 collection that changed the leader, in-place mutation of 239 judge JSONs without originals preserved, absence of rubric-weight sensitivity run, n=1 _human-reference baseline, no multiplicity correction across 108 cross-task cell comparisons, and tier-algorithm edit after the data existed) remain unaddressed in this version. Read this paper alongside that review. The corrected headline between prior draft (v1.0, 2-round bugfix/refactor) and this release (v1.1, 3-round) flipped the #1 and #2 positions (bmad ↔ ecc) and moved z̄ values by up to 0.05 — a larger change than the earlier “≤0.02 z” wording implied.
7. Limitations and Threats to Validity
- Single-repository, single-language. All tasks are TypeScript in one internal NX monorepo. Generalization to Python, Go, Rust, or polyglot codebases is untested.
- Single executor base model. All tool runs use `claude-opus-4-6`. A tool that specifically targets weaknesses of a different model family might rank differently on sonnet, haiku, gpt, or gemini base models.
- Self-preference not identified. Because every executor uses the same Anthropic base model, §4.5 cannot distinguish family-level self-preference from uniform judge-calibration drift. The data show no tool-specific opus inflation; they are silent on family-level favoritism. A proper self-preference audit would require non-Anthropic-base runs as a control — future work.
- LLM-as-judge systematic biases. We partially mitigate with three judges from three model families and rubric categories that penalize breadth-only padding (scope discipline). We cannot rule out biases shared across all three judge families, nor prompt-sensitivity effects inherited from our single prompt template.
- Pseudoreplication in CI estimation. The stratified bootstrap treats each (trial, round, judge) score as an independent draw within its judge stratum, but rounds re-judge the same artifact. This inflates apparent precision relative to a trial-clustered resampling scheme. The qualitative separations we report (superpower/bugfix at −1.8σ; feature T1/T2/T3 boundaries) are robust to this; close pairwise calls (e.g., within-tier ordering in §4.3) should not be cited as “statistically significant.”
- Sample size per cell. feature has n=60 per tool — 95% CI half-width ≈ 3 pts; within-tier tools (within ≈6 pts) are indistinguishable. bugfix and refactor have n=18 per cell (= 2 trials × 3 rounds × 3 judges), 6 observations per judge stratum; percentile-bootstrap coverage at n=6/stratum is approximate. Minimum detectable effect at 80% power on the small-n tasks is ≈15–20 pts (a back-of-envelope check follows this list); to separate the current top-4 on feature at 80% power would require roughly 3× the judgment budget.
- Judge sampling not pinned. The CLIs used for each judge (`claude`, `codex`, `opencode`) do not expose temperature, top-p, or sampler seed. Round-to-round σ partially reflects sampler variance rather than reasoning variance; the three-judge panel and round averaging are the intended mitigations. See also the reproducibility gap note in §8.
- Equal-weight cross-task z̄ is a design choice, not a neutral estimator. Tasks contribute 540/162/162 judgments but are weighted equally. Judgment-count-weighted z̄ and rank-sum reorder the middle tiers (see FINAL-REPORT §2). The per-task CI tables are the primary deliverable; cross-task summaries should be read as multiple lenses, not a single leaderboard.
- Not preregistered. Tasks, rubric, judge panel, and analysis script were chosen iteratively by the benchmark author. Future iterations should publish these decisions with commit timestamps before running any tools.
- Tool version snapshot. Each tool was run at its 2026-04 release. Subsequent versions may change the ranking.
- Task selection bias. We wrote the task briefs. A different author writing a different set of three tasks would produce different rankings. We release task briefs and artifacts so the bias is auditable, not mitigated.
- Cohort rerun symmetry not formally audited. `CLAUDE.md` specifies that if any trial T<N> is rerun, T<N> must be rerun for all 9 tools. `scripts/audit-cohort-symmetry.py` reports observed per-trial timestamp spreads, base-commit divergences, and archived reruns so this can be verified after the fact; see its output before citing per-trial columns in comparison.
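A back-of-envelope check of the minimum-detectable-effect figure quoted in the sample-size bullet above, using a two-sample normal approximation and treating each judgment as independent (which, per the pseudoreplication caveat, makes it optimistic):

```python
from math import sqrt
from scipy.stats import norm

def mde(sigma, n, alpha=0.05, power=0.80):
    """Two-sample minimum detectable mean difference at the given power,
    equal n per group, common stdev sigma, normal approximation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sigma * sqrt(2 / n)

# bugfix/refactor cells have n = 18 judgments and per-cell stdevs of roughly 11-28 pts:
print(round(mde(16, 18), 1))   # 14.9
print(round(mde(25, 18), 1))   # 23.3  -> an MDE on the order of 15-23 pts
```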
8. Reproducibility
The full pipeline is reproducible from infina-pfa/claude-tool-benchmark. Set BENCH_REPO to a clone URL of your target repository (this paper’s corpus uses RealStake/infina-partner-sdk), then:
# 1. Create a fresh clone of the base repo for (task, trial):
TASK=refactor ./scripts/create-clones.sh 1 2
# 2. Execute the tool on the task (per trial):
TASK=refactor ./scripts/manual-bench.sh bmad 1
# 3. Generate blind-eval labels + mapping:
TASK=refactor ./scripts/blind-eval-setup.sh
# 4. Judge a single label (per judge, per round):
TASK=refactor ROUND=1 ./scripts/judge-opus.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-codex.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh Alpha
# 5. Per-task aggregation (balanced mean, 3-judge panel):
TASK=refactor ./scripts/aggregate-results.sh
# 6. Inter-rater reliability (Krippendorff α per task, pairwise):
python3 scripts/krippendorff-alpha.py
# 7. Cross-task statistics (bootstrap CIs, pairwise tiers, sensitivity, calibration):
python3 scripts/cross-task-analysis.py
# 8. Cohort-rerun symmetry audit (validates CLAUDE.md rerun-protocol):
python3 scripts/audit-cohort-symmetry.py
Canonical aggregation rules (enforced in both scripts):
- Round filter: dirs matching `^round[0-9]+$` only. Pilot/sample dirs (`roundcotpilot`, `roundcotsample*`) are excluded so the corpus size is deterministic.
- Score per judge file: `sum(scores.values())` (the `total` field is ignored — 239 historical records had off-by-one-to-off-by-ten drift).
- Tool mean: balanced mean of per-judge means (equals the pooled mean when per-judge n is equal).
- Bootstrap: 10,000 resamples stratified by judge, seed = 42.
- Judges: `JUDGES = ('opus', 'codex', 'qwen')`.
All raw judge JSONs are committed under results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json, alongside judge-prompt.md (the full prompt including rubric) and implementation-diff.patch (the artifact being judged). Label → (tool, trial) mapping is at .mapping-DO-NOT-OPEN.json in each _blind-eval/. Any additional judge model can be run against the committed prompts.
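Independent re-scoring only needs the canonical rules above; a sketch of a loader over the committed layout (paths per Appendix A; function name illustrative):

```python
import json, re
from pathlib import Path

ROUND_RE = re.compile(r"^round[0-9]+$")        # excludes roundcotpilot / roundcotsample*
JUDGES = ("opus", "codex", "qwen")

def load_scores(task_dir: Path):
    """Yield (label, round, judge, score) for every canonical judgment under
    results/<task>/_blind-eval/, scoring each file as sum(scores.values())."""
    for label_dir in sorted((task_dir / "_blind-eval").iterdir()):
        if not label_dir.is_dir():
            continue
        for round_dir in sorted(label_dir.iterdir()):
            if not (round_dir.is_dir() and ROUND_RE.match(round_dir.name)):
                continue
            for judge in JUDGES:
                f = round_dir / f"{judge}-judge.json"
                if f.exists():
                    record = json.loads(f.read_text())
                    yield label_dir.name, round_dir.name, judge, sum(record["scores"].values())
```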
Model CLI invocations (version-pinned where the CLI exposes it):
claude --model claude-opus-4-6 ... # tool executor
claude --model claude-opus-4-7 ... # opus judge (extended thinking default)
codex --model openai/gpt-5.4 --variant high ... # codex judge
opencode --model opencode-go/qwen3.6-plus --variant high ... # qwen judge
Temperature/top-p/seed are each tool’s CLI defaults; we report the CLI incantation rather than the sampling parameters because the CLIs abstract them. This is a reproducibility gap we note in §7.
9. Conclusion
The strongest single claim this data supports is: among nine 2026-04 Claude Code setups on three mid-sized TypeScript tasks, no single setup separates from the top-4 cluster (ecc, bmad, pure, gstack) at 95% CI on any single task, and only one setup (superpower) shows a task-specific quality gap on the bugfix task with non-overlapping CIs — under a forced-activation harness applied to that tool only. Inter-judge rank-order agreement is significantly positive on the feature-build and bugfix tasks (Spearman ρ 0.22–0.79) but collapses on the refactor task (ρ CIs straddle zero, Krippendorff α on totals is negative). Absolute-scale judge calibration varies by ±25 pts and α on 200-point totals is negative on feature and refactor — multi-judge panels and explicit uncertainty reporting are necessary, and single-judge leaderboards on this corpus would under-report uncertainty by roughly an order of magnitude. We also run a judge calibration asymmetry check of the Anthropic-family judge vs. the non-Anthropic pair: the drift is small and tool-invariant, consistent with benign calibration drift — but because every executor uses an Anthropic base model, this design cannot identify family-level self-preference, which we note as a limitation rather than a null result. We publish the full judgment corpus, judge prompts, tool artifacts, and the bootstrap/tier/Spearman/α/sensitivity scripts for independent re-scoring and re-analysis.
Appendix A — Data Files
| Path | Contents |
|---|---|
| `results/FINAL-REPORT-3JUDGE-20260422.md` | Tabular summary (shorter form of this paper) |
| `results/final-report.md` | feature per-trial detail |
| `results/bugfix/final-report.md` | bugfix per-trial detail |
| `results/refactor/final-report.md` | refactor per-trial detail |
| `results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json` | Raw per-judgment JSON (scores dict + sum-valid total) |
| `results/<task>/_blind-eval/<LABEL>/judge-prompt.md` | Full judge prompt including PRD, context, artifact, and 20-item rubric |
| `results/<task>/_blind-eval/<LABEL>/implementation-diff.patch` | The artifact being judged |
| `results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json` | label → (tool, trial) mapping |
| `results/<tool>/t<N>/` | Per-trial artifacts: session logs, diff stats, eslint/tsc output, metrics |
| `docs/analysis/trial-timelines/` | Per-trial event timelines (skill activations, plugin/skill files read, subagents dispatched, code mutations, Bash usage) auto-extracted from every session-logs/*.jsonl; one file per (task, tool) with sections per trial |
| `docs/analysis/trial-timelines/aggregate.md` | Per-(tool, task) aggregate table (mean/min/max for subagents, skill files, Bash, tests, etc.) — canonical source for cross-tool count claims; regenerated by scripts/extract-trial-timeline.py |
| `results/_human-reference/` | Hand-authored reference implementation (methodology anchor) |
Appendix B — Code-Quality Metrics Captured (not scored)
Every trial additionally produces (not consumed by the rubric, but available at results/<tool>/t<N>/):
- `auto-metrics.json` — wall-time, token counts, cost, session count
- `diff-stats.txt` — LOC added/removed, files touched
- `eslint-output.txt` — lint warnings and errors
- `tsc-output.txt` — TypeScript compiler output
- `test-output.txt` — test runner output
- `commits.txt` — SHAs captured at baseline and post-implementation
These can be used for cost/speed analysis, hard-gate filtering, or independent scoring.
Comments, corrections, and independent re-analyses welcome — file an Issue on the repo.