AI Coding Tool Benchmark: A Multi-Task, Multi-Judge Evaluation of Nine Claude Code Setups

Author: Randy Tran (randytran8800@gmail.com)
Date: 2026-04-22
Status: v1.1 — regenerated from 864-judgment corpus (adds round-3 for bugfix/refactor)
Repository: infina-pfa/claude-tool-benchmark


Abstract

We benchmark nine Claude Code setups — plugins, skill packs, hook kits, and a no-addon baseline — by having every setup implement the same three software-engineering tasks from the same production TypeScript codebase (the RealStake/infina-partner-sdk monorepo) under an identical base-model pin (claude-opus-4-6; per-CLI sampler defaults, see §8). Candidate diffs are scored on a 20-item, 200-point rubric by three LLM judges drawn from different model families (claude-opus-4-7, openai/gpt-5.4, qwen3.6-plus) under blind NATO-letter labels. Across 864 judgments spanning 3 tasks × 9 setups × 2–4 trials × {5 rounds (feature) or 3 rounds (bugfix/refactor)}, we find that the top four setups (ecc +0.273 z̄, bmad +0.270, pure +0.175, gstack +0.077) cluster within 0.20 z-score — their pairwise 95% bootstrap CIs overlap on every task, so their ordering is a statistical tie. superpower at z̄ = −0.315 is the only setup whose CI on at least one task (bugfix) falls entirely below the CIs of the tiers above it — reported under a forced-activation harness applied to superpower only on bugfix (see §5). On the refactor task inter-judge Spearman ρ is not distinguishable from zero (CIs cross 0) — rankings on that task alone should not be cited. Judge calibrations differ systematically by ±25 points and Krippendorff α on 200-point totals is negative on feature/refactor (judges disagree on absolute scale), though α at the per-item level is moderate on feature and bugfix (+0.566, +0.629) and poor on refactor (+0.149): the three-judge balanced mean is the intended mitigation, and rankings reflect rank-order agreement (Spearman ρ) rather than absolute-scale consensus. We also run a judge calibration asymmetry check (opus vs. non-Anthropic judges); the drift is small and tool-invariant, but because every executor uses a Claude base model this design does not identify family-level self-preference — we note this as a limitation rather than a null result. Equal-weight cross-task z̄ is one of several valid summaries (judgment-count-weighted and rank-sum give different middle-tier orderings, with up to 4-position swings for individual tools); per-task CI tables are the primary deliverable and the ordered leaderboard should not be cited as a ranking. See the credibility review for known open issues, including pseudoreplication in CI estimation, non-preregistered analysis, and absence of a rubric-weight sensitivity sweep.

1. Introduction

Claude Code setups — plugins, skill packs, memory systems, multi-agent orchestrators layered on top of the base CLI — have proliferated in 2025–2026. Most are evaluated by vendors in aggregate terms (“30% faster”, “higher quality”) on single tasks against single baselines. This paper proposes a reproducible, multi-task, multi-judge benchmark that:

  1. Uses the same base model for every tool run, isolating the tool contribution from model-capability drift.
  2. Uses multiple distinct tasks (feature build, bugfix, refactor) to surface task-type specialization vs. broad competence.
  3. Uses a three-judge rotation drawn from three different model families, reporting per-judge bias rather than hiding it.
  4. Uses blind evaluation so judges never see the tool identity.
  5. Runs each (tool, task, trial) artifact through multiple independent judgment rounds and reports round-to-round stability.

Our contributions are (a) a protocol specification, (b) an 864-judgment dataset, (c) a full reproduction pipeline (scripts + blind-eval mappings), and (d) per-task mean/CI tables with explicit weighting-sensitivity.

2. Related Work

LLM-as-judge benchmarks are now common (e.g., MT-Bench, AlpacaEval, Chatbot Arena). Zheng et al. (2023) report Krippendorff α in the 0.62–0.80 range for judged conversational quality. Our prior-session work on this benchmark observed α = +0.75 at the rubric-item level but α ≈ 0 on 200-point totals when the cohort is compressed — the rubric instrument is reliable, but cohort spread falls below the judges’ aggregate noise floor. Here we switch from round-aggregation to an explicit 3-judge panel and verify that per-judge drift is large (±25 pts) but ranking-preserving.

Diff-size bias (LLM-as-judge rewarding breadth over correctness) was documented in our earlier methodology work. We partially mitigate by including a scope-discipline category in the rubric, and we report the full per-item breakdown so readers can re-weight as they wish.

3. Methodology

3.1 Task Design

All three tasks are drawn from the production TypeScript monorepo at RealStake/infina-partner-sdk (a mid-sized financial-services NX workspace), pinned to a fixed base commit per task:

feature (TD-CD Mode 2 CD Batch): greenfield feature build from PRD; 180 judgments/judge, 5 rounds
bugfix (near-maturity filter): bugfix from a QA report; 54 judgments/judge, 3 rounds
refactor (aggregate-ownership refactor): scoped refactor from a design doc; 54 judgments/judge, 3 rounds

Each task ships to the tool as docs/benchmark/TASK.md in the cloned RealStake/infina-partner-sdk repository. Planning prompts are identical across tools (scripts/manual-bench.sh); tools are free to invoke their own sub-pipelines (planning phase, TDD, self-review, etc.).

3.2 Setups Evaluated

Nine Claude Code setups, all layering on the same pinned base model (claude-opus-4-6):

Setup Approach
pure Baseline — Claude Code with no additions
superpower Skill-pack library
claudekit Curated hook + command kit
bmad BMAD method: business-modeling, multi-phase agents
mindful Self-reflective planning loop
gstack Opinionated stack + guardrails
compound Compound Engineering multi-agent
ecc Everything-Claude-Code skill registry
omc Oh-My-Claudecode orchestration layer

3.3 Trial Execution

Per (task, tool) pair we run 2–4 independent trials. Each trial is a fresh clone of the task’s base repository with the tool’s configuration installed into an isolated HOME. The tool implements the task, runs its own test/build commands, and commits. We capture: implementation diff, test/type-check/lint output, session transcripts, wall time, and token usage.

3.4 Scoring Rubric

A 20-item rubric with four categories totaling 200 points:

Correctness of the fix (70 pts): behavior matches spec, edge-case handling, domain-helper reuse
Tests (50 pts): coverage of spec branches, assertion quality, test independence
Code quality (40 pts): readability, naming, complexity, safe refactors
Scope discipline (40 pts): no unrelated changes, no new config surface, respects module boundaries

Judges output a strict JSON object {"scores": {"1": 0–10, ..., "20": 0–10}, "total": int}. In our original corpus the total field was computed independently by the judge LLM; 239 of 1,166 historical records had off-by-one-to-off-by-ten discrepancies between the total field and sum(scores). We fix these in-place using sum(scores) as source-of-truth — the item-wise ranked sums are what the judge actually evaluated. The pre-correction originals are preserved in-tree as <judge>-judge.pre-correction-20260421.json siblings of the mutated files (135 files in the canonical corpus; 26 historical pilot records dropped with their parent dirs). A rubric-score sensitivity analysis under total-as-source-of-truth is an open item (see credibility-review-20260422.md finding #8).
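
As a concrete illustration of the repair rule (sum(scores) as source of truth, pre-correction original preserved as a dated sibling), a minimal sketch follows; paths follow Appendix A, and the exact in-repo tooling may differ.

# Illustrative sketch of the total-field repair: sum(scores) wins, the original
# record is kept as a dated .pre-correction sibling before mutation.
import json, pathlib, shutil

def repair_judgment(path: pathlib.Path, date_tag: str = "20260421") -> bool:
    record = json.loads(path.read_text())
    correct_total = sum(record["scores"].values())
    if record.get("total") == correct_total:
        return False                                   # already consistent
    backup = path.with_name(path.stem + f".pre-correction-{date_tag}.json")
    if not backup.exists():
        shutil.copy2(path, backup)                     # preserve the original in-tree
    record["total"] = correct_total                    # item-wise sum is source of truth
    path.write_text(json.dumps(record, indent=2))
    return True

changed = [p for p in pathlib.Path("results").glob("*/_blind-eval/*/round*/*-judge.json")
           if repair_judgment(p)]
print(f"corrected {len(changed)} records")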

3.5 Judge Panel

Three judges drawn from three distinct model families:

Judge Model ID Reasoning config
opus claude-opus-4-7 default extended thinking
codex openai/gpt-5.4 --variant high (high reasoning effort)
qwen opencode-go/qwen3.6-plus --variant high

Each artifact (tool × trial) is judged in 3–5 independent rounds (5 for feature, 3 for bugfix/refactor in the canonical corpus). Judges are stateless between rounds (fresh context each call).

Labels are NATO-letter pseudonyms (Alpha, Bravo, …, Helix, …) with the mapping {label → (tool, trial)} stored in .mapping-DO-NOT-OPEN.json and read only by the aggregator. Judges never see tool names or directory paths indicative of the tool.
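
For concreteness, a sketch of how a label mapping of this shape could be generated; scripts/blind-eval-setup.sh is the canonical implementation, and the label list, shuffle, and output path below are illustrative.

# Illustrative sketch of blind-label assignment (scripts/blind-eval-setup.sh is canonical).
import json, random

LABELS = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
          "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
          "Quebec", "Romeo"]   # enough for 9 tools x 2 trials on bugfix/refactor

def build_mapping(tools, trials, seed=None):
    artifacts = [(tool, t) for tool in tools for t in range(1, trials + 1)]
    random.Random(seed).shuffle(artifacts)             # judges never see this ordering
    return {label: {"tool": tool, "trial": trial}
            for label, (tool, trial) in zip(LABELS, artifacts)}

mapping = build_mapping(
    ["pure", "superpower", "claudekit", "bmad", "mindful",
     "gstack", "compound", "ecc", "omc"], trials=2)
with open("results/bugfix/_blind-eval/.mapping-DO-NOT-OPEN.json", "w") as f:
    json.dump(mapping, f, indent=2)                    # read only by the aggregator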

3.6 Statistical Methods

We report six lenses, one per results subsection (§4): the combined mean z-score (§4.1), rank-sum (§4.2), per-task balanced means with 95% bootstrap CIs and tier grouping (§4.3), judge calibration via inter-judge Spearman ρ and Krippendorff α (§4.4, §4.4b), a judge calibration asymmetry check (§4.5), and round-to-round stability (§4.6).

All statistics are produced by two scripts: per-task balanced means and methodology tables by scripts/aggregate-results.sh; CIs, tier grouping, Spearman CIs, α, sensitivity analysis, and calibration asymmetry by scripts/cross-task-analysis.py. Inputs are restricted to directories matching ^round[0-9]+$ — pilot and sample dirs (roundcotpilot, roundcotsample*) are excluded so the corpus size is deterministic.
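
A sketch of the judge-stratified bootstrap behind the per-task CIs (10,000 resamples, seed 42) and of the balanced tool mean it resamples; scripts/cross-task-analysis.py is canonical, and the resampling unit shown here (individual judgments within each judge stratum) is our reading of it.

# Sketch of the judge-stratified bootstrap CI on the balanced tool mean.
import numpy as np

def balanced_mean(scores_by_judge):
    # Equal-weight mean of per-judge means: cancels additive per-judge bias.
    return np.mean([np.mean(v) for v in scores_by_judge.values()])

def bootstrap_ci(scores_by_judge, n_boot=10_000, seed=42, alpha=0.05):
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        resampled = {j: rng.choice(v, size=len(v), replace=True)   # resample within each judge stratum
                     for j, v in scores_by_judge.items()}
        stats[b] = balanced_mean(resampled)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return balanced_mean(scores_by_judge), (lo, hi)

# Example shape: per-judge lists of 200-point totals for one (task, tool) cell.
scores = {"opus": [168, 172, 175, 171, 166, 174],
          "codex": [151, 149, 158, 155, 150, 153],
          "qwen": [181, 178, 184, 180, 177, 182]}
mean, (lo, hi) = bootstrap_ci(scores)
print(f"balanced mean {mean:.2f}, 95% CI [{lo:.1f}, {hi:.1f}]")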

4. Results

Corpus: 864 judgments total (540 feature + 162 bugfix + 162 refactor). Cohort mean ± stdev: feature = 119.54 ± 26.01; bugfix = 168.24 ± 16.48; refactor = 159.04 ± 25.50.

4.1 Combined Ranking (Mean Z-Score)

Rank Tool z̄ feature z bugfix z refactor z
1 ecc +0.273 +0.091 +0.767 −0.041
2 bmad +0.270 +0.215 +0.582 +0.012
3 pure +0.175 +0.077 +0.397 +0.051
4 gstack +0.077 +0.158 +0.188 −0.115
5 mindful −0.006 −0.388 +0.272 +0.097
6 claudekit −0.123 −0.113 −0.318 +0.062
7 compound −0.144 −0.082 −0.443 +0.092
8 omc −0.205 −0.108 −0.672 +0.164
9 superpower −0.315 +0.149 −0.773 −0.322

The top-4 cluster (ecc/bmad/pure/gstack) spans 0.196 z-score. Their per-task 95% bootstrap CIs overlap on every task (see §4.3), so their relative ordering is not statistically distinguishable at the current sample size. superpower at z̄ = −0.315 is the only tool whose score on at least one task is separated from the tiers above it by non-overlapping CIs (its bugfix CI [150.6, 160.3] sits entirely below the CIs of the top two bugfix tiers; see §4.3). This outlier is conditional on the forced-activation bugfix harness (§5); the raw-prompt score from the pre-rewrite superpower trials, archived at results/bugfix/superpower/archive-activation-20260421/, is not in the leaderboard shown here.

4.2 Rank-Sum

Rank Tool rank-sum feature bugfix refactor
1 bmad 9 1 2 6
2 ecc 12 4 1 7
3 pure 13 5 3 5
4 gstack 15 2 5 8
5 mindful 15 9 4 2
6 compound 16 6 7 3
7 omc 16 7 8 1
8 claudekit 18 8 6 4
9 superpower 21 3 9 9

Rank-sum broadly agrees with z-score on the top cluster and bottom outlier but disagrees in the middle — e.g., mindful and omc move up under rank-sum because a weak per-task position counts only as one ordinal step, whereas z̄ also weights how far below the cohort mean the tool fell. Since most per-task rank positions are indistinguishable from adjacent positions at 95% confidence (see tiering below), rank-sum should be read as a summary of ordinal positions, not as a second ranking.

4.3 Per-Task Detail

All tables include 95% bootstrap CIs on the balanced tool mean (10,000 resamples, stratified by judge, seed 42). Tiers are assigned by the pairwise-overlap complete-linkage rule in §3.6 (membership ⇔ pairwise CI overlap with all tier-mates). FINAL-REPORT also lists explicit pairwise-disjoint tool pairs per task for transitivity-free comparison.
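
The tier rule can be read as a greedy pass down the mean-sorted list in which a tool joins the current tier only if its CI overlaps every current tier-mate's CI; a sketch follows. The canonical implementation is scripts/cross-task-analysis.py, and this reading is an assumption — it does reproduce the bugfix tiers reported later in this subsection.

# Sketch of the pairwise-overlap complete-linkage tiering rule:
# a tool joins the current tier only if its CI overlaps every tier-mate's CI.
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def assign_tiers(tools):
    # tools: list of (name, mean, (lo, hi)); sorted best-first before grouping.
    tools = sorted(tools, key=lambda t: t[1], reverse=True)
    tiers, current = [], []
    for name, mean, ci in tools:
        if current and not all(overlaps(ci, member_ci) for _, _, member_ci in current):
            tiers.append(current)       # CI fails to overlap some tier-mate: close the tier
            current = []
        current.append((name, mean, ci))
    tiers.append(current)
    return [[name for name, _, _ in tier] for tier in tiers]

# bugfix example (means and 95% CIs from the table in this subsection)
bugfix = [("ecc", 180.89, (178.9, 182.7)), ("bmad", 177.83, (175.2, 180.3)),
          ("pure", 174.78, (171.5, 178.1)), ("mindful", 172.72, (167.2, 177.8)),
          ("gstack", 171.33, (166.4, 176.0)), ("claudekit", 163.00, (160.2, 165.8)),
          ("compound", 160.94, (157.4, 164.4)), ("omc", 157.17, (151.2, 162.6)),
          ("superpower", 155.50, (150.6, 160.3))]
print(assign_tiers(bugfix))   # [['ecc', 'bmad'], ['pure', 'mindful', 'gstack'], ['claudekit', ...]]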

feature

Tier Tool Mean /200 95% CI σ n
T1 bmad 125.13 [122.7, 127.4] 25.95 60
T1 gstack 123.65 [120.5, 126.8] 27.24 60
T1 superpower 123.42 [120.6, 126.2] 26.75 60
T1 ecc 121.90 [119.2, 124.7] 26.13 60
T1 pure 121.53 [118.0, 124.9] 27.91 60
T2 compound 117.40 [114.0, 120.7] 25.57 60
T2 omc 116.73 [114.2, 119.4] 20.78 60
T2 claudekit 116.60 [113.4, 119.7] 25.33 60
T3 mindful 109.45 [105.8, 113.2] 25.70 60

Three pairwise-overlap tiers on feature: T1 = {bmad, gstack, superpower, ecc, pure}; T2 = {compound, omc, claudekit}; T3 = {mindful}. The T1/T2 boundary separates the top-5 from the middle-3 under the tier rule, though not every cross-boundary pair is pairwise disjoint; 15 of 36 tool-pairs on feature have fully disjoint 95% CIs (see FINAL-REPORT §3). Within each tier, pairwise CI overlap means the ordering is not statistically distinguishable.

bugfix

Tier Tool Mean /200 95% CI σ n
T1 ecc 180.89 [178.9, 182.7] 11.69 18
T1 bmad 177.83 [175.2, 180.3] 13.83 18
T2 pure 174.78 [171.5, 178.1] 13.07 18
T2 mindful 172.72 [167.2, 177.8] 16.36 18
T2 gstack 171.33 [166.4, 176.0] 13.24 18
T3 claudekit 163.00 [160.2, 165.8] 11.27 18
T3 compound 160.94 [157.4, 164.4] 13.91 18
T3 omc 157.17 [151.2, 162.6] 17.96 18
T3 superpower 155.50 [150.6, 160.3] 16.06 18

Three pairwise-overlap tiers on bugfix: {ecc, bmad} › {pure, mindful, gstack} › {claudekit, compound, omc, superpower}. superpower trials each fired two Skill calls (the harness gates completion on /superpowers:systematic-debugging invocation at session start and /superpowers:verification-before-completion at exit); all five hard gates passed and two files were in scope. Explicitly triggering the skills is the operating condition under which these scores hold; see §5 for the mechanism. 23 of 36 pairs have disjoint 95% CIs on bugfix.

refactor

Tier Tool Mean /200 95% CI σ n
T1 omc 163.22 [161.1, 165.4] 24.74 18
T1 mindful 161.50 [159.0, 163.8] 26.14 18
T1 compound 161.39 [158.5, 164.3] 27.60 18
T1 claudekit 160.61 [157.7, 164.2] 25.46 18
T1 pure 160.33 [157.8, 162.9] 27.17 18
T1 bmad 159.33 [156.5, 162.0] 26.61 18
T2 ecc 158.00 [156.1, 160.0] 24.03 18
T2 gstack 156.11 [152.8, 159.4] 23.13 18
T3 superpower 150.83 [145.5, 155.2] 27.85 18

Three pairwise-overlap tiers on refactor: {omc, mindful, compound, claudekit, pure, bmad} › {ecc, gstack} › {superpower}. Only 9 of 36 pairs have disjoint CIs, mostly involving superpower against the top tier. Combined with near-zero inter-judge Spearman ρ (§4.4) and negative Krippendorff α on totals (§4.4b), per-task refactor rankings are noise-dominated and should not be cited in isolation.

4.4 Judge Calibration

Pooled mean ± stdev, across all nine tools, per (task, judge):

Task opus codex qwen
feature 115.4 ± 12.8 94.4 ± 8.9 148.8 ± 16.7
bugfix 171.6 ± 16.8 154.0 ± 18.0 179.1 ± 16.0
refactor 164.6 ± 9.9 127.0 ± 8.5 185.4 ± 7.5

Drift from the three-judge mean (positive = generous, negative = harsh):

Task Δ opus Δ codex Δ qwen
feature -4.1 -25.1 +29.2
bugfix +3.3 -14.2 +10.9
refactor +5.6 -32.0 +26.4

codex is systematically harsh (Δ = −14 to −32), qwen is systematically generous (Δ = +11 to +29), and opus is the neutral anchor (within ±6). This is not a ranking contamination: the three judges’ per-artifact rank orders are Spearman-correlated:

Task ρ(opus, codex) [95% CI] ρ(opus, qwen) [95% CI] ρ(codex, qwen) [95% CI] n pairs
feature +0.26 [+0.11, +0.38] +0.57 [+0.46, +0.66] +0.22 [+0.06, +0.36] 180
bugfix +0.79 [+0.64, +0.88] +0.75 [+0.61, +0.84] +0.74 [+0.60, +0.82] 54
refactor +0.14 [−0.14, +0.41] +0.08 [−0.21, +0.37] +0.21 [−0.06, +0.48] 54

On feature and bugfix every judge-pair CI excludes zero — inter-judge rank-order agreement is significantly positive. On refactor every judge-pair CI straddles zero — rank agreement is not distinguishable from chance. The most likely cause is cohort compression (all 9 tools within 14 points on a 200-point scale) leaving judges in the noise regime.
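
Both diagnostics in this subsection reduce to a few lines; a sketch (point estimates only — the CIs in the table come from the bootstrap in scripts/cross-task-analysis.py, and the artifact alignment here is assumed):

# Sketch: per-judge drift from the three-judge mean, and pairwise Spearman rho
# over per-artifact totals.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def judge_drift(per_judge_totals):
    # per_judge_totals: {judge: array of 200-point totals, aligned by artifact}
    panel_mean = np.mean([np.mean(v) for v in per_judge_totals.values()])
    return {j: round(float(np.mean(v) - panel_mean), 1) for j, v in per_judge_totals.items()}

def pairwise_spearman(per_judge_totals):
    return {(a, b): round(float(spearmanr(per_judge_totals[a], per_judge_totals[b])[0]), 2)
            for a, b in combinations(per_judge_totals, 2)}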

4.4b Krippendorff α — absolute-scale agreement

Spearman ρ measures rank-order agreement but is insensitive to absolute-scale drift. Krippendorff α (interval level) tests whether judges assign similar absolute scores, not just similar rankings. Running scripts/krippendorff-alpha.py on the current corpus:

Task α (per-item, 20 items × labels) α (totals, per label) Pair opus/codex Pair opus/qwen Pair codex/qwen
feature +0.566 −0.286 −0.342 −0.323 −0.808
bugfix +0.629 +0.287 +0.317 +0.685 −0.144
refactor +0.149 −0.425 −0.797 −0.620 −0.908

Per-item α is materially higher than totals α on every task: judges agree on relative rubric-item weighting (which items matter more) but disagree on absolute scale (the summed total). On feature and refactor, α on totals is negative — observed inter-judge disagreement on the summed-total scale exceeds what chance alone would produce, consistent with the ±25-pt calibration drift reported above. Spearman ρ was positive on feature and bugfix (totals level) even though α on totals was negative on feature, because judges still ranked artifacts similarly even when their absolute scales diverged.
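
For reference, the totals-level α for one task can be recomputed with the open-source krippendorff package (interval level); scripts/krippendorff-alpha.py is the canonical implementation and also computes the per-item variant. Treating each blind label as a unit and each judge's mean total per label as the value is our assumption about the aggregation.

# Sketch: interval-level Krippendorff alpha on 200-point totals for one task.
# Rows = judges (coders), columns = blind labels (units).
import numpy as np
import krippendorff   # pip install krippendorff

def totals_alpha(per_judge_totals):
    # per_judge_totals: {judge: [mean total per label, aligned by label]}
    reliability_data = np.array([per_judge_totals[j] for j in sorted(per_judge_totals)],
                                dtype=float)          # shape (n_judges, n_labels)
    return krippendorff.alpha(reliability_data=reliability_data,
                              level_of_measurement="interval")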

Implication: the three-judge balanced mean (equal-weight mean of per-judge means) is the intended mitigation given this disagreement shape — it cancels additive per-judge bias by giving each judge’s mean equal weight rather than pooling raw scores. It does not estimate or remove all calibration structure (e.g., non-linear scale differences or judge-specific variance). A single-judge result on this corpus would diverge from the panel result by more than a within-judge replicate suggests, so citing single-judge leaderboards under-reports uncertainty accordingly.

4.5 Judge Calibration Asymmetry (opus vs. non-Anthropic judges)

A common concern with LLM-as-judge panels that mix model families is that the judge from the same family as the executor’s base model may systematically inflate scores (e.g., opus judging claude-opus-produced diffs). We compute a candidate diagnostic — per (task, tool), opus_mean − mean(codex_mean, qwen_mean) — and report mean and within-range spread across the 9 tools:

Task Mean Δ(opus − others) Range across 9 tools Within-range spread
feature −6.22 [−9.90, −0.42] 9.47
bugfix +4.97 [+1.75, +8.25] 6.50
refactor +8.39 [+1.50, +14.50] 13.00

The Δ is small (single-digit per task) and approximately uniform across tools (within-range spread ≈6–13 pts). That shape is consistent with calibration drift — opus is a slightly different absolute scorer than the non-Anthropic pair on each task, but does not inflate any specific tool relative to the others.
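
A sketch of the diagnostic and the three summary columns above (mean Δ, range across tools, within-range spread); the per-tool judge means are assumed to be the per-judge means for that (task, tool) cell.

# Sketch: calibration-asymmetry diagnostic per tool on one task.
# delta = opus mean minus the mean of the two non-Anthropic judge means.
import numpy as np

def asymmetry_summary(per_tool_judge_means):
    # per_tool_judge_means: {tool: {"opus": m, "codex": m, "qwen": m}}
    deltas = {tool: m["opus"] - np.mean([m["codex"], m["qwen"]])
              for tool, m in per_tool_judge_means.items()}
    values = np.array(list(deltas.values()))
    return {"mean_delta": float(values.mean()),
            "range": (float(values.min()), float(values.max())),
            "spread": float(values.max() - values.min())}   # "within-range spread" column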

Identification limitation (important). This test cannot distinguish family-level self-preference from simple judge-calibration drift, because every executor in this study uses the same Anthropic base model (claude-opus-4-6). A uniform opus offset could equally reflect (a) opus’s intrinsic scoring conservatism on long-artifact tasks, or (b) a family-preference effect applied identically to all (Anthropic-base) runs. We cannot disentangle these without non-Anthropic-base executor runs as a control condition. We therefore report this section as a calibration asymmetry check, not a self-preference audit: absence of tool-specific inflation is what the data show, and the absence of family-level inflation is not identified by this design.

4.6 Stability

Two complementary stability measures: (i) the round-to-round σ of each (tool, task) cell's per-round mean, tabulated below, and (ii) the per-artifact inter-judge spread, discussed after the table.

Tool feature σ bugfix σ refactor σ
ecc 1.6 3.0 0.6
bmad 0.9 1.9 0.8
pure 1.1 3.4 1.8
gstack 3.3 0.7 0.4
mindful 1.4 2.7 2.7
claudekit 0.6 0.8 3.4
compound 2.5 1.7 2.0
omc 1.8 2.2 2.0
superpower 2.2 1.3 4.6

Per-round σ is ≤ 5 pts for every (tool, task) cell — re-running the judges would move a tool’s mean by only a few points in expectation. At n=3 rounds per (tool, judge) for bugfix/refactor the σ estimator has 2 degrees of freedom and very wide actual uncertainty, so these should be read as rough stability indicators, not precise estimates. Inter-judge spread dominates the uncertainty envelope at the per-artifact level (30–65 pts across the three judges on a typical feature artifact), which is why the 3-judge panel and bootstrap CIs are essential — individual judgments are noisy, but averaged panels converge.
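
A sketch of the per-round σ as we read it — the standard deviation, across judgment rounds, of the round-level balanced tool mean. The exact definition lives in the aggregation scripts, so this formulation is an assumption.

# Sketch of the per-round sigma stability measure (assumed definition):
# sigma across rounds of the round-level balanced tool mean.
import numpy as np

def per_round_sigma(round_scores):
    # round_scores: {round: {judge: [totals for this tool's trials in that round]}}
    round_means = [np.mean([np.mean(v) for v in judges.values()])   # balanced mean per round
                   for judges in round_scores.values()]
    return float(np.std(round_means, ddof=1))   # n=3 rounds -> 2 degrees of freedom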

5. Case Study: superpower on bugfix — skill-activation is the operating condition

superpower is the only setup whose bugfix CI is fully disjoint from the cluster above it (it shares tier T3 with claudekit, compound, and omc, and its CI [150.6, 160.3] sits below the CI of every non-T3 tool). The separation is not a broad capability gap — on feature the setup lands inside T1 (z = +0.149) — and skills were only consulted on this task when the harness explicitly invoked them.

Forcing activation. The harness pins the bugfix prompt to two slash-command triggers: /superpowers:systematic-debugging at session start and /superpowers:verification-before-completion as a completion gate. With these triggers in place, two fresh trials (t1, t2) were run against the same base commit and TASK.md; three judgment rounds produced a mean of 155.50, z = −0.773, CI [150.6, 160.3], tier T3 (with claudekit, compound, and omc). Note: this forced-activation harness was applied only to superpower’s bugfix branch; no other tool’s bugfix branch received an analogous intervention. The pre-rewrite “raw-prompt” superpower trials are archived at results/bugfix/superpower/archive-activation-20260421/. Under the cohort-rerun-symmetry rule in CLAUDE.md, a cleaner design is either (a) publish the raw-prompt score as the honest measurement, or (b) apply a matched forced-activation harness to all 9 tools; the current corpus does neither, which the credibility review flags as a methodological confound.

Interpretation. When the harness explicitly invokes them, the superpower skills produce T2-level output on bugfix. They do not produce T1-level output, and they are not invoked by default on a bare task prompt without the slash-command trigger. Any claim about superpower-on-bugfix score therefore has an “under the activation protocol” qualifier attached; the corpus reported in this paper uses the forced-activation protocol for bugfix.

The case study illustrates two design decisions in the methodology:

  1. Scope-discipline rubric items (e.g., #18 “no new config surface”) are load-bearing on the bugfix task and visibly penalize setups that add configuration to fix a filter.
  2. Multi-task benchmarks are essential: a single-task evaluation would have shown superpower either fine (feature, refactor) or separated (bugfix). Only the three-task cross-section exposes the pattern as task-sensitivity rather than a broad capability gap.

6. Discussion

What the ranking supports. On 3 tasks × 9 tools × 3 judges × 2–4 trials × 3–5 rounds, 8 of 9 tools land within a 0.5 z-score band. The top-4 cluster (bmad, ecc, pure, gstack) has overlapping per-task 95% CIs on every task — their relative ordering is not statistically distinguishable at this sample size. At current precision, claims of “tool X is the best” are unsupported; the only claim that survives the CI test is “tool Y (superpower) has a task-specific quality gap on task Z (bugfix) under the forced-activation protocol” — its bugfix CI sits entirely below the CIs of the two tiers above it.

What the judges say. LLM judges from three model families agree on rank order on 2 of 3 tasks but disagree on absolute calibration by ±25 pts. On refactor they disagree on rank order too (Spearman CI straddles zero). This should be the default expectation for LLM-as-judge benchmarks: a three-judge panel is necessary to cancel absolute drift, but not sufficient to disambiguate tools when the cohort is compressed (tools all within ~14 pts on a 200-pt scale). Single-judge benchmarks under-report uncertainty by roughly an order of magnitude.

Task-type matters. The cohort mean on the feature task (greenfield build from PRD) is 119.5, on bugfix 168.2, and on refactor 159.0. Feature build is the hardest task-type in our sample; bugfix and refactor are markedly easier, and the refactor task in particular compressed the cohort to the point where inter-judge agreement collapsed. A benchmark that used only a single easy task would lose most of the between-tool signal.

Why pure (baseline) is top-4. Our strongest null hypothesis is “a Claude Code setup adds no value over the bare CLI.” pure lands at rank 3 by z̄ (+0.175) and rank 3 by rank-sum, inside the top cluster, with CIs that overlap bmad/ecc/gstack on every task. This does not reject the null at current precision. Tools may still add value on dimensions not captured by the rubric — developer experience, cost, speed, debugging ergonomics — or at larger sample sizes; we simply cannot distinguish their code-quality output from the baseline with 162–540 judgments per task. A reference implementation in results/_human-reference/ scored ~24.95 points above the top tool, confirming the ceiling is well above pure’s score but that the tool-vs-pure gap is small. The _human-reference is a single hand-authored artifact (n=1); the 24.95-pt ceiling gap has no error bar and should be treated as a single reading, not a distributional estimate.

Critic-review track record. Earlier review passes flagged: (i) the aggregator silently ingested pilot/sample round dirs (inflating per-tool n); (ii) “top-4 tie” was asserted without a CI computation; (iii) report numbers differed by ≈0.01 z between files. Those fixes — canonical round filter (^round[0-9]+$ only), 10,000-resample stratified bootstrap CIs, tier grouping by non-overlapping CIs, and single-source regeneration from scripts/cross-task-analysis.py — are in place. A subsequent external skeptical-reader audit (docs/analysis/credibility-review-20260422.md) identified 25 open issues against this release, ~8 of which (pseudoreplication in bootstrap CIs, non-uniform bugfix harness on superpower, post-hoc round-3 collection that changed the leader, in-place mutation of 239 judge JSONs without originals preserved, absence of rubric-weight sensitivity run, n=1 _human-reference baseline, no multiplicity correction across 108 cross-task cell comparisons, and tier-algorithm edit after the data existed) remain unaddressed in this version. Read this paper alongside that review. The corrected headline between prior draft (v1.0, 2-round bugfix/refactor) and this release (v1.1, 3-round) flipped the #1 and #2 positions (bmad ↔ ecc) and moved z̄ values by up to 0.05 — a larger change than the earlier “≤0.02 z” wording implied.

7. Limitations and Threats to Validity

8. Reproducibility

The full pipeline is reproducible from infina-pfa/claude-tool-benchmark. Set BENCH_REPO to a clone URL of your target repository (this paper’s corpus uses RealStake/infina-partner-sdk), then:

# 1. Create a fresh clone of the base repo for (task, trial):
TASK=refactor ./scripts/create-clones.sh 1 2

# 2. Execute the tool on the task (per trial):
TASK=refactor ./scripts/manual-bench.sh bmad 1

# 3. Generate blind-eval labels + mapping:
TASK=refactor ./scripts/blind-eval-setup.sh

# 4. Judge a single label (per judge, per round):
TASK=refactor ROUND=1 ./scripts/judge-opus.sh  Alpha
TASK=refactor ROUND=1 ./scripts/judge-codex.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh  Alpha

# 5. Per-task aggregation (balanced mean, 3-judge panel):
TASK=refactor ./scripts/aggregate-results.sh

# 6. Inter-rater reliability (Krippendorff α per task, pairwise):
python3 scripts/krippendorff-alpha.py

# 7. Cross-task statistics (bootstrap CIs, pairwise tiers, sensitivity, calibration):
python3 scripts/cross-task-analysis.py

# 8. Cohort-rerun symmetry audit (validates CLAUDE.md rerun-protocol):
python3 scripts/audit-cohort-symmetry.py

Canonical aggregation rules (enforced in both scripts): only round directories matching ^round[0-9]+$ are ingested (§3.6); per-judgment totals are recomputed as sum(scores) (§3.4); tool means are balanced across the three judges, with each judge's mean given equal weight (§4.4b); and bootstrap CIs use 10,000 resamples stratified by judge with seed 42 (§4.3).

All raw judge JSONs are committed under results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json, alongside judge-prompt.md (the full prompt including rubric) and implementation-diff.patch (the artifact being judged). Label → (tool, trial) mapping is at .mapping-DO-NOT-OPEN.json in each _blind-eval/. Any additional judge model can be run against the committed prompts.

Model CLI invocations (version-pinned where the CLI exposes it):

claude --model claude-opus-4-6 ...                     # tool executor
claude --model claude-opus-4-7 ...                     # opus judge (extended thinking default)
codex --model openai/gpt-5.4 --variant high ...        # codex judge
opencode --model opencode-go/qwen3.6-plus --variant high ...  # qwen judge

Temperature/top-p/seed are each tool’s CLI defaults; we report the CLI incantation rather than the sampling parameters because the CLIs abstract them. This is a reproducibility gap we note in §7.

9. Conclusion

The strongest single claim this data supports is: among nine 2026-04 Claude Code setups on three mid-sized TypeScript tasks, no setup pulls ahead of the top-4 cluster (ecc, bmad, pure, gstack) at 95% CI on any single task, and only one setup (superpower) shows a task-specific quality gap on the bugfix task, with a CI entirely below those of the tiers above it — under a forced-activation harness applied to that tool only. Inter-judge rank-order agreement is significantly positive on the feature-build and bugfix tasks (Spearman ρ 0.22–0.79) but collapses on the refactor task (ρ CIs straddle zero, Krippendorff α on totals is negative). Absolute-scale judge calibration varies by ±25 pts and α on 200-point totals is negative on feature and refactor — multi-judge panels and explicit uncertainty reporting are necessary, and single-judge leaderboards on this corpus would under-report uncertainty by roughly an order of magnitude. We also run a judge calibration asymmetry check of the Anthropic-family judge vs. the non-Anthropic pair: the drift is small and tool-invariant, consistent with benign calibration drift — but because every executor uses an Anthropic base model, this design cannot identify family-level self-preference, which we note as a limitation rather than a null result. We publish the full judgment corpus, judge prompts, tool artifacts, and the bootstrap/tier/Spearman/α/sensitivity scripts for independent re-scoring and re-analysis.

Appendix A — Data Files

Path Contents
results/FINAL-REPORT-3JUDGE-20260422.md Tabular summary (shorter form of this paper)
results/final-report.md feature per-trial detail
results/bugfix/final-report.md bugfix per-trial detail
results/refactor/final-report.md refactor per-trial detail
results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json Raw per-judgment JSON (scores dict + sum-valid total)
results/<task>/_blind-eval/<LABEL>/judge-prompt.md Full judge prompt including PRD, context, artifact, and 20-item rubric
results/<task>/_blind-eval/<LABEL>/implementation-diff.patch The artifact being judged
results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json label → (tool, trial) mapping
results/<tool>/t<N>/ Per-trial artifacts: session logs, diff stats, eslint/tsc output, metrics
docs/analysis/trial-timelines/ Per-trial event timelines (skill activations, plugin/skill files read, subagents dispatched, code mutations, Bash usage) auto-extracted from every session-logs/*.jsonl. One file per (task, tool) with sections per trial.
docs/analysis/trial-timelines/aggregate.md Per-(tool, task) aggregate table (mean/min/max for subagents, skill files, Bash, tests, etc.) — canonical source for cross-tool count claims. Regenerated by scripts/extract-trial-timeline.py.
results/_human-reference/ Hand-authored reference implementation (methodology anchor)

Appendix B — Code-Quality Metrics Captured (not scored)

Every trial additionally produces (not consumed by the rubric, but available at results/<tool>/t<N>/): the implementation diff and diff stats, eslint/tsc output, test results, session transcripts, wall time, and token usage (see §3.3).

These can be used for cost/speed analysis, hard-gate filtering, or independent scoring.


Comments, corrections, and independent re-analyses welcome — file an Issue on the repo.