AI Coding Tool Benchmark: A Multi-Task, Multi-Judge Evaluation of Nine Claude Code Setups
Author: Randy Tran (randytran8800@gmail.com)
Date: 2026-04-22
Status: v1.1 — regenerated from 864-judgment corpus (adds round-3 for bugfix/refactor)
Repository: infina-pfa/claude-tool-benchmark
Abstract
We benchmark nine Claude Code setups — plugins, skill packs, hook kits, and a no-addon baseline — by having every setup implement the same three software-engineering tasks from the same production TypeScript codebase (the RealStake/infina-partner-sdk monorepo) under an identical base-model pin (claude-opus-4-6; per-CLI sampler defaults, see §8). Candidate diffs are scored on a 20-item, 200-point rubric by three LLM judges drawn from different model families (claude-opus-4-7, openai/gpt-5.4, qwen3.6-plus) under blind NATO-letter labels. Across 864 judgments spanning 3 tasks × 9 setups × 2–4 trials × {5 rounds (feature) or 3 rounds (bugfix/refactor)}, we find that the top four setups (ecc +0.273 z̄, bmad +0.270, pure +0.175, gstack +0.077) cluster within 0.20 z-score — their pairwise 95% bootstrap CIs overlap on every task, so their ordering is a statistical tie. superpower at z̄ = −0.315 is the only setup separated from the rest by non-overlapping CIs on at least one task (bugfix) — reported under a forced-activation harness applied to superpower only on bugfix (see §5). On the refactor task inter-judge Spearman ρ is not distinguishable from zero (CIs cross 0) — rankings on that task alone should not be cited. Judge calibrations differ systematically by ±25 points and Krippendorff α on 200-point totals is negative on feature/refactor (judges disagree on absolute scale), though α at the per-item level is moderate on feature and bugfix (+0.566, +0.629) and poor on refactor (+0.149): the three-judge balanced mean is the intended mitigation, and rankings reflect rank-order agreement (Spearman ρ) rather than absolute-scale consensus. We also run a judge calibration asymmetry check (opus vs. non-Anthropic judges); the drift is small and tool-invariant, but because every executor uses a Claude base model this design does not identify family-level self-preference — we note this as a limitation rather than a null result. Equal-weight cross-task z̄ is one of several valid summaries (judgment-count-weighted and rank-sum give different middle-tier orderings, with up to 4-position swings for individual tools); per-task CI tables are the primary deliverable and the ordered leaderboard should not be cited as a ranking. See the credibility review for known open issues, including pseudoreplication in CI estimation, non-preregistered analysis, and absence of a rubric-weight sensitivity sweep.
1. Introduction
Claude Code setups — plugins, skill packs, memory systems, multi-agent orchestrators layered on top of the base CLI — have proliferated in 2025–2026. Most are evaluated by vendors in aggregate terms (“30% faster”, “higher quality”) on single tasks against single baselines. This paper proposes a reproducible, multi-task, multi-judge benchmark that:
- Uses the same base model for every tool run, isolating the tool contribution from model-capability drift.
- Uses multiple distinct tasks (feature build, bugfix, refactor) to surface task-type specialization vs. broad competence.
- Uses a three-judge rotation drawn from three different model families, reporting per-judge bias rather than hiding it.
- Uses blind evaluation so judges never see the tool identity.
- Runs each (tool, task, trial) artifact through multiple independent judgment rounds and reports round-to-round stability.
Our contributions are (a) a protocol specification, (b) an 864-judgment dataset, (c) a full reproduction pipeline (scripts + blind-eval mappings), and (d) per-task mean/CI tables with explicit weighting-sensitivity.
2. Related Work
LLM-as-judge benchmarks are now common (e.g., MT-Bench, AlpacaEval, Chatbot Arena). Zheng et al. (2023) report Krippendorff α in the 0.62–0.80 range for judged conversational quality. Our prior-session work on this benchmark observed α = +0.75 at the rubric-item level but α ≈ 0 on 200-point totals when the cohort is compressed — the rubric instrument is reliable, but cohort spread falls below the judges’ aggregate noise floor. Here we switch from round-aggregation to an explicit 3-judge panel and verify that per-judge drift is large (±25 pts) but ranking-preserving on two of the three tasks (§4.4).
Diff-size bias (LLM-as-judge rewarding breadth over correctness) was documented in our earlier methodology work. We partially mitigate by including a scope-discipline category in the rubric, and we report the full per-item breakdown so readers can re-weight as they wish.
3. Methodology
3.1 Task Design
All three tasks are drawn from the production TypeScript monorepo at RealStake/infina-partner-sdk (a mid-sized financial-services NX workspace), pinned to a fixed base commit per task:
| Task | Type | Judgments/judge | Rounds |
|---|---|---|---|
| feature — TD-CD Mode 2 CD Batch | Greenfield feature build from PRD | 180 | 5 |
| bugfix — near-maturity filter | Bugfix from QA report | 54 | 3 |
| refactor — aggregate-ownership refactor | Scoped refactor from design doc | 54 | 3 |
Each task ships to the tool as docs/benchmark/TASK.md in the cloned RealStake/infina-partner-sdk repository. Planning prompts are identical across tools (scripts/manual-bench.sh); tools are free to invoke their own sub-pipelines (planning phase, TDD, self-review, etc.).
3.2 Setups Evaluated
Nine Claude Code setups, all layering on the same pinned base model (claude-opus-4-6):
| Setup | Approach |
|---|---|
| pure | Baseline — Claude Code with no additions |
| superpower | Skill-pack library |
| claudekit | Curated hook + command kit |
| bmad | BMAD method: business-modeling, multi-phase agents |
| mindful | Self-reflective planning loop |
| gstack | Opinionated stack + guardrails |
| compound | Compound Engineering multi-agent |
| ecc | Everything-Claude-Code skill registry |
| omc | Oh-My-Claudecode orchestration layer |
3.3 Trial Execution
Per (task, tool) pair we run 2–4 independent trials. Each trial is a fresh clone of the task’s base repository with the tool’s configuration installed into an isolated HOME. The tool implements the task, runs its own test/build commands, and commits. We capture: implementation diff, test/type-check/lint output, session transcripts, wall time, and token usage.
3.4 Scoring Rubric
A 20-item rubric with four categories totaling 200 points:
| Category | Max | Items cover |
|---|---|---|
| Correctness of the fix | 70 | Behavior matches spec, edge-case handling, domain-helper reuse |
| Tests | 50 | Coverage of spec branches, assertion quality, test independence |
| Code quality | 40 | Readability, naming, complexity, safe refactors |
| Scope discipline | 40 | No unrelated changes, no new config surface, respects module boundaries |
Judges output a strict JSON object {"scores": {"1": 0–10, ..., "20": 0–10}, "total": int}. In our original corpus the total field was computed independently by the judge LLM; 239 of 1,166 historical records had off-by-one-to-off-by-ten discrepancies between the total field and sum(scores). We fix these in-place using sum(scores) as source-of-truth — the item-wise ranked sums are what the judge actually evaluated. The pre-correction originals are preserved in-tree as <judge>-judge.pre-correction-20260421.json siblings of the mutated files (135 files in the canonical corpus; 26 historical pilot records dropped with their parent dirs). A rubric-score sensitivity analysis under total-as-source-of-truth is an open item (see credibility-review-20260422.md finding #8).
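The correction rule is mechanical; a minimal sketch, assuming the JSON shape above and the pre-correction sibling naming from this section (the repository's actual fix-up code may differ):

```python
import json
from pathlib import Path

def corrected_total(judge_file: Path) -> int:
    """Recompute one judgment's total as sum(scores), ignoring the judge-emitted 'total' field."""
    record = json.loads(judge_file.read_text())
    item_sum = sum(int(v) for v in record["scores"].values())
    if record.get("total") != item_sum:
        # Preserve the original alongside the corrected file before mutating in place.
        backup = judge_file.with_name(judge_file.stem + ".pre-correction-20260421.json")
        backup.write_text(judge_file.read_text())
        record["total"] = item_sum
        judge_file.write_text(json.dumps(record, indent=2))
    return item_sum
```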
3.5 Judge Panel
Three judges drawn from three distinct model families:
| Judge | Model ID | Reasoning config |
|---|---|---|
| opus | claude-opus-4-7 | default extended thinking |
| codex | openai/gpt-5.4 | --variant high (high reasoning effort) |
| qwen | opencode-go/qwen3.6-plus | --variant high |
Each artifact (tool × trial) is judged in 2–5 independent rounds. Judges are stateless between rounds (fresh context each call).
Labels are NATO-letter pseudonyms (Alpha, Bravo, …, Helix, …) with the mapping {label → (tool, trial)} stored in .mapping-DO-NOT-OPEN.json and read only by the aggregator. Judges never see tool names or directory paths indicative of the tool.
3.6 Statistical Methods
We report six ranking lenses, followed by three reliability and design checks:
- Balanced tool mean — mean of the three per-judge means for a (task, tool) cell. When per-judge n is equal (always true in the current corpus) this equals the pooled mean, but the form is preserved so that asymmetric re-runs cancel judge drift correctly.
- 95% bootstrap CI on the tool mean — 10,000 resamples, stratified by judge (each bootstrap draws the same number of per-judge samples as the original cell, then recomputes the balanced mean). Seed = 42.
- Combined z-score (ranking) — for each task, convert each tool’s balanced mean to a z-score against the task’s cohort mean and stdev (pooled across all judgments); average the three per-task z-scores with equal weight.
- Rank-sum (tie-break) — sum of the tool’s per-task rank. Independent of z-score magnitudes; robust to one-task outliers.
- Stability — (a) per-tool round-to-round σ = stdev of per-round mean scores (tool reproducibility); (b) within-round judge spread = mean of max−min across the three judges per round (per-artifact judge disagreement).
- Tier grouping (pairwise-overlap, complete linkage) — a candidate tool joins the current tier only when its 95% CI overlaps with every existing member of that tier. This makes “same tier” logically equivalent to “pairwise CIs all overlap” — the transitive interpretation expected of an “indistinguishable cluster.” Greedy single-linkage adjacency (new tier iff `prev.lo > curr.hi`) produces tiers that are intransitive and can merge tools whose CIs do not pairwise overlap, so we do not use it (a minimal sketch of the rule follows this list). Tier grouping is not a family-wise-error-rate (FWER) controlling procedure; no Bonferroni/Holm correction is applied across the 108 cross-task cell comparisons. Only individually cited pairwise separations (§4.3 “pairwise-disjoint” lists) carry statistical weight. We also report rank-sum and judgment-count-weighted z̄ as sensitivity checks: the three cross-task summaries agree on the top cluster and the bottom outlier but disagree on middle ordering — rankings within the middle should not be cited as leaderboard positions.
- Inter-rater agreement (two metrics) — Spearman ρ per judge pair on artifact-level totals, with 2,000-resample bootstrap 95% CI (tests rank-order agreement). Krippendorff α (interval level) per task on (a) per-item scores (rubric-item × label × judge) and (b) totals (label × judge), reported via `scripts/krippendorff-alpha.py` (tests absolute-scale agreement; sensitive to calibration drift in ways Spearman ρ is not).
- Judge calibration asymmetry check — `opus_mean − mean(codex_mean, qwen_mean)` per (task, tool), reported with both the mean Δ across tools and the within-range spread (does the drift vary by tool?). Identification note: every executor uses the same Anthropic base (claude-opus-4-6), so this statistic cannot distinguish family-level self-preference from judge-calibration drift. A uniform Δ is consistent with either. A true self-preference test would require non-Anthropic-base executor runs.
- Cohort rerun symmetry not formally audited — `CLAUDE.md` specifies that if any trial T<N> is rerun, T<N> must be rerun for all 9 tools. `scripts/audit-cohort-symmetry.py` reports observed per-trial timestamp spreads, base-commit divergences, and archived reruns so this can be verified after the fact; see its output before citing per-trial columns in comparison.
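A minimal sketch of the pairwise-overlap complete-linkage rule, assuming tools arrive sorted by descending balanced mean with (lo, hi) CI tuples (illustrative shapes; scripts/cross-task-analysis.py may structure this differently):

```python
def cis_overlap(a, b):
    """True when two (lo, hi) 95% CIs overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def assign_tiers(tools):
    """tools: list of (name, (ci_lo, ci_hi)) sorted by descending balanced mean.

    A candidate joins the current tier only when its CI overlaps the CI of every
    existing member (complete linkage), so "same tier" always means all pairwise
    CIs overlap; otherwise it starts a new tier.
    """
    tiers = []
    for name, ci in tools:
        if tiers and all(cis_overlap(ci, member_ci) for _, member_ci in tiers[-1]):
            tiers[-1].append((name, ci))
        else:
            tiers.append([(name, ci)])
    return tiers
```

Feeding in the bugfix CIs from §4.3 reproduces the {ecc, bmad} › {pure, mindful, gstack} › {claudekit, compound, omc, superpower} split.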
All statistics are produced by two scripts: per-task balanced means and methodology tables by scripts/aggregate-results.sh; CIs, tier grouping, Spearman CIs, α, sensitivity analysis, and calibration asymmetry by scripts/cross-task-analysis.py. Inputs are restricted to directories matching ^round[0-9]+$ — pilot and sample dirs (roundcotpilot, roundcotsample*) are excluded so the corpus size is deterministic.
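A compact sketch of the balanced mean, judge-stratified bootstrap, and per-task z-score specified above, assuming scores have already been grouped into a dict of per-judge lists for one (task, tool) cell (names are illustrative, not the script's actual interface):

```python
import numpy as np

JUDGES = ("opus", "codex", "qwen")

def balanced_mean(cell):
    """cell: dict judge -> list of per-judgment totals for one (task, tool).
    Equal-weight mean of per-judge means; equals the pooled mean when per-judge n is equal."""
    return float(np.mean([np.mean(cell[j]) for j in JUDGES]))

def bootstrap_ci(cell, n_boot=10_000, seed=42, alpha=0.05):
    """95% percentile CI on the balanced mean, resampling within each judge stratum."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {j: rng.choice(cell[j], size=len(cell[j]), replace=True) for j in JUDGES}
        stats.append(balanced_mean(resampled))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

def task_z(tool_mean, cohort_scores):
    """Per-task z-score of a tool's balanced mean against the pooled cohort mean/stdev."""
    return (tool_mean - np.mean(cohort_scores)) / np.std(cohort_scores)
```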
4. Results
Corpus: 864 judgments total (540 feature + 162 bugfix + 162 refactor). Cohort mean ± stdev: feature = 119.54 ± 26.01; bugfix = 168.24 ± 16.48; refactor = 159.04 ± 25.50.
4.1 Combined Ranking (Mean Z-Score)
| Rank | Tool | z̄ | feature z | bugfix z | refactor z |
|---|---|---|---|---|---|
| 1 | ecc | +0.273 | +0.091 | +0.767 | −0.041 |
| 2 | bmad | +0.270 | +0.215 | +0.582 | +0.012 |
| 3 | pure | +0.175 | +0.077 | +0.397 | +0.051 |
| 4 | gstack | +0.077 | +0.158 | +0.188 | −0.115 |
| 5 | mindful | −0.006 | −0.388 | +0.272 | +0.097 |
| 6 | claudekit | −0.123 | −0.113 | −0.318 | +0.062 |
| 7 | compound | −0.144 | −0.082 | −0.443 | +0.092 |
| 8 | omc | −0.205 | −0.108 | −0.672 | +0.164 |
| 9 | superpower | −0.315 | +0.149 | −0.773 | −0.322 |
The top-4 cluster (ecc/bmad/pure/gstack) spans 0.196 z-score. Their per-task 95% bootstrap CIs overlap on every task (see §4.3), so their relative ordering is not statistically distinguishable at the current sample size. superpower at z̄ = −0.315 is the only tool whose combined score is separated from the rest on at least one task by non-overlapping CIs (its bugfix CI [150.6, 160.3] is the lowest in the cohort and sits fully below the CI of every tool outside the bottom tier; see §4.3). This outlier is conditional on the forced-activation bugfix harness (§5); the raw-prompt score from the pre-rewrite superpower trials, archived at results/bugfix/superpower/archive-activation-20260421/, is not in the leaderboard shown here.
4.2 Rank-Sum
| Rank | Tool | rank-sum | feature | bugfix | refactor |
|---|---|---|---|---|---|
| 1 | bmad | 9 | 1 | 2 | 6 |
| 2 | ecc | 12 | 4 | 1 | 7 |
| 3 | pure | 13 | 5 | 3 | 5 |
| 4 | gstack | 15 | 2 | 5 | 8 |
| 5 | mindful | 15 | 9 | 4 | 2 |
| 6 | compound | 16 | 6 | 7 | 3 |
| 7 | omc | 16 | 7 | 8 | 1 |
| 8 | claudekit | 18 | 8 | 6 | 4 |
| 9 | superpower | 21 | 3 | 9 | 9 |
Rank-sum broadly agrees with z-score on the top cluster and bottom outlier but disagrees in the middle — e.g., mindful and omc move up under rank-sum because their per-task weak positions are penalized only once, not magnified by the task’s higher cohort stdev. Since most per-task rank positions are indistinguishable from adjacent positions at 95% confidence (see tiering below), rank-sum should be read as a summary of ordinal positions, not as a second ranking.
4.3 Per-Task Detail
All tables include 95% bootstrap CIs on the balanced tool mean (10,000 resamples, stratified by judge, seed 42). Tiers are assigned by the pairwise-overlap complete-linkage rule in §3.6 (membership ⇔ pairwise CI overlap with all tier-mates). FINAL-REPORT also lists explicit pairwise-disjoint tool pairs per task for transitivity-free comparison.
feature
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | bmad | 125.13 | [122.7, 127.4] | 25.95 | 60 |
| T1 | gstack | 123.65 | [120.5, 126.8] | 27.24 | 60 |
| T1 | superpower | 123.42 | [120.6, 126.2] | 26.75 | 60 |
| T1 | ecc | 121.90 | [119.2, 124.7] | 26.13 | 60 |
| T1 | pure | 121.53 | [118.0, 124.9] | 27.91 | 60 |
| T2 | compound | 117.40 | [114.0, 120.7] | 25.57 | 60 |
| T2 | omc | 116.73 | [114.2, 119.4] | 20.78 | 60 |
| T2 | claudekit | 116.60 | [113.4, 119.7] | 25.33 | 60 |
| T3 | mindful | 109.45 | [105.8, 113.2] | 25.70 | 60 |
Three pairwise-overlap tiers on feature: T1 = {bmad, gstack, superpower, ecc, pure}; T2 = {compound, omc, claudekit}; T3 = {mindful}. The T1/T2 boundary formally separates the top-5 from the middle-3; 15 of 36 tool-pairs on feature have fully disjoint 95% CIs (see FINAL-REPORT §3). Within each tier, pairwise CI overlap means the ordering is not statistically distinguishable.
bugfix
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | ecc | 180.89 | [178.9, 182.7] | 11.69 | 18 |
| T1 | bmad | 177.83 | [175.2, 180.3] | 13.83 | 18 |
| T2 | pure | 174.78 | [171.5, 178.1] | 13.07 | 18 |
| T2 | mindful | 172.72 | [167.2, 177.8] | 16.36 | 18 |
| T2 | gstack | 171.33 | [166.4, 176.0] | 13.24 | 18 |
| T3 | claudekit | 163.00 | [160.2, 165.8] | 11.27 | 18 |
| T3 | compound | 160.94 | [157.4, 164.4] | 13.91 | 18 |
| T3 | omc | 157.17 | [151.2, 162.6] | 17.96 | 18 |
| T3 | superpower | 155.50 | [150.6, 160.3] | 16.06 | 18 |
Three pairwise-overlap tiers on bugfix: {ecc, bmad} › {pure, mindful, gstack} › {claudekit, compound, omc, superpower}. superpower trials fired 2 Skill calls each (the harness gates completion on /superpowers:systematic-debugging invocation at session start and /superpowers:verification-before-completion at exit); hard gates 5/5 PASS; scope files 2. Explicitly triggering the skills is the operating condition under which these scores hold; see §5 for the mechanism. 23 of 36 pairs have disjoint 95% CIs on bugfix.
refactor
| Tier | Tool | Mean /200 | 95% CI | σ | n |
|---|---|---|---|---|---|
| T1 | omc | 163.22 | [161.1, 165.4] | 24.74 | 18 |
| T1 | mindful | 161.50 | [159.0, 163.8] | 26.14 | 18 |
| T1 | compound | 161.39 | [158.5, 164.3] | 27.60 | 18 |
| T1 | claudekit | 160.61 | [157.7, 164.2] | 25.46 | 18 |
| T1 | pure | 160.33 | [157.8, 162.9] | 27.17 | 18 |
| T1 | bmad | 159.33 | [156.5, 162.0] | 26.61 | 18 |
| T2 | ecc | 158.00 | [156.1, 160.0] | 24.03 | 18 |
| T2 | gstack | 156.11 | [152.8, 159.4] | 23.13 | 18 |
| T3 | superpower | 150.83 | [145.5, 155.2] | 27.85 | 18 |
Three pairwise-overlap tiers on refactor: {omc, mindful, compound, claudekit, pure, bmad} › {ecc, gstack} › {superpower}. Only 9 of 36 pairs have disjoint CIs, mostly involving superpower against the top tier. Combined with near-zero inter-judge Spearman ρ (§4.4) and negative Krippendorff α on totals (§4.4b), per-task refactor rankings are noise-dominated and should not be cited in isolation.
4.4 Judge Calibration
Pooled mean ± stdev, across all nine tools, per (task, judge):
| Task | opus | codex | qwen |
|---|---|---|---|
| feature | 115.4 ± 12.8 | 94.4 ± 8.9 | 148.8 ± 16.7 |
| bugfix | 171.6 ± 16.8 | 154.0 ± 18.0 | 179.1 ± 16.0 |
| refactor | 164.6 ± 9.9 | 127.0 ± 8.5 | 185.4 ± 7.5 |
Drift from the three-judge mean (positive = generous, negative = harsh):
| Task | Δ opus | Δ codex | Δ qwen |
|---|---|---|---|
| feature | -4.1 | -25.1 | +29.2 |
| bugfix | +3.3 | -14.2 | +10.9 |
| refactor | +5.6 | -32.0 | +26.4 |
codex is systematically harsh (Δ ≈ −14 to −32), qwen is systematically generous (Δ ≈ +11 to +29), and opus is the neutral anchor (within ±6). This is not a ranking contamination: the three judges’ per-artifact rank orders are Spearman-correlated:
| Task | ρ(opus, codex) [95% CI] | ρ(opus, qwen) [95% CI] | ρ(codex, qwen) [95% CI] | n pairs |
|---|---|---|---|---|
| feature | +0.26 [+0.11, +0.38] | +0.57 [+0.46, +0.66] | +0.22 [+0.06, +0.36] | 180 |
| bugfix | +0.79 [+0.64, +0.88] | +0.75 [+0.61, +0.84] | +0.74 [+0.60, +0.82] | 54 |
| refactor | +0.14 [−0.14, +0.41] | +0.08 [−0.21, +0.37] | +0.21 [−0.06, +0.48] | 54 |
On feature and bugfix every judge-pair CI excludes zero — inter-judge rank-order agreement is significantly positive. On refactor every judge-pair CI straddles zero — rank agreement is not distinguishable from chance. The most likely cause is cohort compression (all 9 tools within 14 points on a 200-point scale) leaving judges in the noise regime.
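The per-pair ρ values above follow the recipe in §3.6; a minimal sketch, assuming two aligned vectors of artifact-level totals for a judge pair (2,000 percentile-bootstrap resamples; seed and names are illustrative, and the repository script may differ in detail):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_ci(x, y, n_boot=2_000, seed=42):
    """Point estimate and 95% percentile CI for Spearman rho between two judges'
    artifact-level totals (x[i] and y[i] score the same blind artifact)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rho = spearmanr(x, y).correlation
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample artifacts with replacement
        boot.append(spearmanr(x[idx], y[idx]).correlation)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return rho, (lo, hi)
```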
4.4b Krippendorff α — absolute-scale agreement
Spearman ρ measures rank-order agreement but is insensitive to absolute-scale drift. Krippendorff α (interval level) tests whether judges assign similar absolute scores, not just similar rankings. Running scripts/krippendorff-alpha.py on the current corpus:
| Task | α (per-item, 20 items × labels) | α (totals, per label) | Pair opus/codex | Pair opus/qwen | Pair codex/qwen |
|---|---|---|---|---|---|
| feature | +0.566 | −0.286 | −0.342 | −0.323 | −0.808 |
| bugfix | +0.629 | +0.287 | +0.317 | +0.685 | −0.144 |
| refactor | +0.149 | −0.425 | −0.797 | −0.620 | −0.908 |
Per-item α is materially higher than totals α on every task: judges agree on relative rubric-item weighting (which items matter more) but disagree on absolute scale (the summed total). On feature and refactor, α on totals is negative — observed inter-judge disagreement on the summed-total scale exceeds what chance alone would produce, consistent with the ±25-pt calibration drift reported above. Spearman ρ was positive on feature and bugfix (totals level) even though α on totals was negative on feature, because judges still ranked artifacts similarly even when their absolute scales diverged.
Implication: the three-judge balanced mean (equal-weight mean of per-judge means) is the intended mitigation given this disagreement shape — it cancels additive per-judge bias by giving each judge’s mean equal weight rather than pooling raw scores. It does not estimate or remove all calibration structure (e.g., non-linear scale differences or judge-specific variance). A single-judge result on this corpus would diverge from the panel result by more than a within-judge replicate suggests, so citing single-judge leaderboards under-reports uncertainty accordingly.
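For reference, interval-level α reduces to a short function; a sketch assuming a units × coders matrix with np.nan for missing entries (this is the textbook 1 − D_o/D_e form; scripts/krippendorff-alpha.py may organize the computation differently):

```python
import numpy as np

def krippendorff_alpha_interval(data):
    """Interval-level Krippendorff alpha.

    data: 2-D array-like of shape (units, coders), np.nan for missing values.
    Returns 1 - D_o / D_e using the squared-difference distance."""
    data = np.asarray(data, dtype=float)
    # Keep only units with at least two non-missing values (pairable units).
    units = [row[~np.isnan(row)] for row in data]
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)            # total number of pairable values
    if n <= 1:
        return float("nan")

    # Observed disagreement: within-unit squared differences, each unit weighted by 1/(m_u - 1).
    d_o = 0.0
    for u in units:
        diffs = (u[:, None] - u[None, :]) ** 2   # ordered pairs; i == j terms are zero
        d_o += diffs.sum() / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences over all value pairs pooled across units.
    pooled = np.concatenate(units)
    diffs_all = (pooled[:, None] - pooled[None, :]) ** 2
    d_e = diffs_all.sum() / (n * (n - 1))

    return 1.0 - d_o / d_e if d_e > 0 else 1.0
```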
4.5 Judge Calibration Asymmetry (opus vs. non-Anthropic judges)
A common concern with LLM-as-judge panels that mix model families is that the judge from the same family as the executor’s base model may systematically inflate scores (e.g., opus judging claude-opus-produced diffs). We compute a candidate diagnostic — per (task, tool), opus_mean − mean(codex_mean, qwen_mean) — and report mean and within-range spread across the 9 tools:
| Task | Mean Δ(opus − others) | Range across 9 tools | Within-range spread |
|---|---|---|---|
| feature | −6.22 | [−9.90, −0.42] | 9.47 |
| bugfix | +4.97 | [+1.75, +8.25] | 6.50 |
| refactor | +8.39 | [+1.50, +14.50] | 13.00 |
The Δ is small (single-digit per task) and approximately uniform across tools (within-range spread ≈6–13 pts). That shape is consistent with calibration drift — opus is a slightly different absolute scorer than the non-Anthropic pair on each task, but does not inflate any specific tool relative to the others.
Identification limitation (important). This test cannot distinguish family-level self-preference from simple judge-calibration drift, because every executor in this study uses the same Anthropic base model (claude-opus-4-6). A uniform opus offset could equally reflect (a) opus’s intrinsic scoring conservatism on long-artifact tasks, or (b) a family-preference effect applied identically to all (Anthropic-base) runs. We cannot disentangle these without non-Anthropic-base executor runs as a control condition. We therefore report this section as a calibration asymmetry check, not a self-preference audit: absence of tool-specific inflation is what the data show, and the absence of family-level inflation is not identified by this design.
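The diagnostic itself is one subtraction per (task, tool) cell; a sketch with illustrative names:

```python
def opus_asymmetry(judge_means):
    """judge_means: dict judge -> mean score for one (task, tool) cell.
    Positive values mean opus scores the cell above the non-Anthropic pair."""
    return judge_means["opus"] - (judge_means["codex"] + judge_means["qwen"]) / 2

# Per task we then report the mean of this delta across the 9 tools and its min-max range;
# a roughly tool-invariant delta is what distinguishes calibration drift from tool-specific inflation.
```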
4.6 Stability
Two complementary stability measures:
- Per-round σ — stdev of a tool’s round-level mean scores across the 2–5 independent judgment rounds (how much would the tool’s cell mean move if we re-ran the judges?).
- Inter-judge spread — mean within-round max-min across the three judges for each (label, round) artifact (how much do the three judges disagree on a single artifact?).
| Tool | feature σ | bugfix σ | refactor σ |
|---|---|---|---|
| ecc | 1.6 | 3.0 | 0.6 |
| bmad | 0.9 | 1.9 | 0.8 |
| pure | 1.1 | 3.4 | 1.8 |
| gstack | 3.3 | 0.7 | 0.4 |
| mindful | 1.4 | 2.7 | 2.7 |
| claudekit | 0.6 | 0.8 | 3.4 |
| compound | 2.5 | 1.7 | 2.0 |
| omc | 1.8 | 2.2 | 2.0 |
| superpower | 2.2 | 1.3 | 4.6 |
Per-round σ is ≤ 5 pts for every (tool, task) cell — re-running the judges would move a tool’s mean by only a few points in expectation. At n=3 rounds per (tool, judge) for bugfix/refactor the σ estimator has 2 degrees of freedom and very wide actual uncertainty, so these should be read as rough stability indicators, not precise estimates. Inter-judge spread dominates the uncertainty envelope at the per-artifact level (30–65 pts across the three judges on a typical feature artifact), which is why the 3-judge panel and bootstrap CIs are essential — individual judgments are noisy, but averaged panels converge.
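A sketch of both measures, assuming scores for one (task, tool) are grouped by round and per-artifact scores by judge (illustrative shapes, not the aggregator's real data model):

```python
import numpy as np

def per_round_sigma(round_scores):
    """round_scores: dict round -> list of all judge scores for that round (one tool, one task).
    Stdev of the per-round means: how much the cell mean would move across judging rounds."""
    round_means = [np.mean(v) for v in round_scores.values()]
    return float(np.std(round_means, ddof=1))

def inter_judge_spread(artifact_scores):
    """artifact_scores: list of per-artifact dicts judge -> score for one (label, round).
    Mean within-round max-minus-min across the three judges."""
    return float(np.mean([max(a.values()) - min(a.values()) for a in artifact_scores]))
```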
5. Case Study: superpower on bugfix — skill-activation is the operating condition
superpower posts the lowest bugfix mean in the cohort (155.50, CI [150.6, 160.3]), sitting at the bottom of the T3 tier alongside claudekit, compound, and omc, with a CI fully disjoint from every T1/T2 tool's. The separation is not a broad capability gap — on feature the setup lands inside T1 (z = +0.149) — and skills were only consulted on this task when the harness explicitly invoked them.
Forcing activation. The harness pins the bugfix prompt to two slash-command triggers: /superpowers:systematic-debugging at session start and /superpowers:verification-before-completion as a completion gate. With these triggers in place, two fresh trials (t1, t2) run against the same base commit and TASK.md and three judgment rounds produced: mean 155.50, z = −0.773, CI [150.6, 160.3], T3 (with {claudekit, compound, omc}). Note: this forced-activation harness was applied only to superpower’s bugfix branch; no other tool’s bugfix branch received an analogous intervention. The pre-rewrite “raw-prompt” superpower trials are archived at results/bugfix/superpower/archive-activation-20260421/. Under the cohort-rerun-symmetry rule in CLAUDE.md, a cleaner design is either (a) publish the raw-prompt score as the honest measurement, or (b) apply a matched forced-activation harness to all 9 tools; the current corpus does neither, which the credibility review flags as a methodological confound.
Interpretation. Even when the harness explicitly invokes them, the superpower skills produce bottom-tier (T3) output on bugfix, well short of T1, and they are not invoked by default on a bare task prompt without the slash-command trigger. Any claim about superpower-on-bugfix score therefore has an “under the activation protocol” qualifier attached; the corpus reported in this paper uses the forced-activation protocol for bugfix.
The case study illustrates two design decisions in the methodology:
- Scope-discipline rubric items (e.g., #18 “no new config surface”) are load-bearing on the bugfix task and visibly penalize setups that add configuration to fix a filter.
- Multi-task benchmarks are essential: a single-task evaluation would have shown superpower either fine (feature, refactor) or separated (bugfix). Only the three-task cross-section exposes the pattern as task-sensitivity rather than a broad capability gap.
6. Discussion
What the ranking supports. On 3 tasks × 9 tools × 3 judges × 2–5 rounds, 8 of 9 tools land within a 0.4 z-score band. The top-4 cluster (bmad, ecc, pure, gstack) has overlapping per-task 95% CIs on every task — their relative ordering is not statistically distinguishable at this sample size. At current precision, claims of “tool X is the best” are unsupported; the only claim that survives the CI test is “tool Y (superpower) has a task-specific quality gap on task Z (bugfix) under the forced-activation protocol” — its bugfix CI is fully separated from the CIs of all tools outside the bottom tier.
What the judges say. LLM judges from three model families agree on rank order on 2 of 3 tasks but disagree on absolute calibration by ±25 pts. On refactor they disagree on rank order too (Spearman CI straddles zero). This should be the default expectation for LLM-as-judge benchmarks: a three-judge panel is necessary to cancel absolute drift, but not sufficient to disambiguate tools when the cohort is compressed (tools all within ~14 pts on a 200-pt scale). Single-judge benchmarks under-report uncertainty by roughly an order of magnitude.
Task-type matters. The cohort mean is 119.5 on feature (feature build from PRD), 168.2 on bugfix, and 159.0 on refactor. Feature build is the hardest task-type in our sample; bugfix and refactor are markedly easier, and the refactor task in particular compressed the cohort to the point where inter-judge agreement collapsed. A benchmark that used only a single easy task would lose most of the between-tool signal.
Why pure (baseline) is top-4. Our strongest null hypothesis is “a Claude Code setup adds no value over the bare CLI.” pure lands at rank 3 by z̄ (+0.175) and rank 3 by rank-sum, inside the top cluster, with CIs that overlap bmad/ecc/gstack on every task. This does not reject the null at current precision. Tools may still add value on dimensions not captured by the rubric — developer experience, cost, speed, debugging ergonomics — or at larger sample sizes; we simply cannot distinguish their code-quality output from baseline from 162–540 judgments per task. A reference implementation in results/_human-reference/ scored ~24.95 above the top tool, confirming the ceiling is well above pure’s score but that the tool-vs-pure gap is small. The _human-reference is a single hand-authored artifact (n=1); the 24.95-pt ceiling gap has no error bar and should be treated as a single reading, not a distributional estimate.
Critic-review track record. Earlier review passes flagged: (i) the aggregator silently ingested pilot/sample round dirs (inflating per-tool n); (ii) “top-4 tie” was asserted without a CI computation; (iii) report numbers differed by ≈0.01 z between files. Those fixes — canonical round filter (^round[0-9]+$ only), 10,000-resample stratified bootstrap CIs, tier grouping by non-overlapping CIs, and single-source regeneration from scripts/cross-task-analysis.py — are in place. A subsequent external skeptical-reader audit (docs/analysis/credibility-review-20260422.md) identified 25 open issues against this release, ~8 of which (pseudoreplication in bootstrap CIs, non-uniform bugfix harness on superpower, post-hoc round-3 collection that changed the leader, in-place mutation of 239 judge JSONs without originals preserved, absence of rubric-weight sensitivity run, n=1 _human-reference baseline, no multiplicity correction across 108 cross-task cell comparisons, and tier-algorithm edit after the data existed) remain unaddressed in this version. Read this paper alongside that review. The corrected headline between prior draft (v1.0, 2-round bugfix/refactor) and this release (v1.1, 3-round) flipped the #1 and #2 positions (bmad ↔ ecc) and moved z̄ values by up to 0.05 — a larger change than the earlier “≤0.02 z” wording implied.
7. Limitations and Threats to Validity
- Single-repository, single-language. All tasks are TypeScript in one internal NX monorepo. Generalization to Python, Go, Rust, or polyglot codebases is untested.
- Single executor base model. All tool runs use `claude-opus-4-6`. A tool that specifically targets weaknesses of a different model family might rank differently on sonnet, haiku, gpt, or gemini base models.
- Self-preference not identified. Because every executor uses the same Anthropic base model, §4.5 cannot distinguish family-level self-preference from uniform judge-calibration drift. The data show no tool-specific opus inflation; they are silent on family-level favoritism. A proper self-preference audit would require non-Anthropic-base runs as a control — future work.
- LLM-as-judge systematic biases. We partially mitigate with three judges from three model families and rubric categories that penalize breadth-only padding (scope discipline). We cannot rule out biases shared across all three judge families, nor prompt-sensitivity effects inherited from our single prompt template.
- Pseudoreplication in CI estimation. The stratified bootstrap treats each (trial, round, judge) score as an independent draw within its judge stratum, but rounds re-judge the same artifact. This inflates apparent precision relative to a trial-clustered resampling scheme. The qualitative separations we report (superpower/bugfix at −1.8σ; feature T1/T2/T3 boundaries) are robust to this; close pairwise calls (e.g., within-tier ordering in §4.3) should not be cited as “statistically significant.”
- Sample size per cell. feature has n=60 per tool — 95% CI half-width ≈ 3 pts; within-tier tools (within ≈6 pts) are indistinguishable. bugfix and refactor have n=18 per cell (= 2 trials × 3 rounds × 3 judges), 6 observations per judge stratum; percentile-bootstrap coverage at n=6/stratum is approximate. Minimum detectable effect at 80% power on the small-n tasks is ≈15–20 pts (a back-of-envelope check follows this list); to separate the current top-4 on feature at 80% power would require roughly 3× the judgment budget.
- Judge sampling not pinned. The CLIs used for each judge (`claude`, `codex`, `opencode`) do not expose temperature, top-p, or sampler seed. Round-to-round σ partially reflects sampler variance rather than reasoning variance; the three-judge panel and round averaging are the intended mitigations. See also the reproducibility gap note in §8.
- Equal-weight cross-task z̄ is a design choice, not a neutral estimator. Tasks contribute 540/162/162 judgments but are weighted equally. Judgment-count-weighted z̄ and rank-sum reorder the middle tiers (see FINAL-REPORT §2). The per-task CI tables are the primary deliverable; cross-task summaries should be read as multiple lenses, not a single leaderboard.
- Not preregistered. Tasks, rubric, judge panel, and analysis script were chosen iteratively by the benchmark author. Future iterations should publish these decisions with commit timestamps before running any tools.
- Tool version snapshot. Each tool was run at its 2026-04 release. Subsequent versions may change the ranking.
- Task selection bias. We wrote the task briefs. A different author writing a different set of three tasks would produce different rankings. We release task briefs and artifacts so the bias is auditable, not mitigated.
- Cohort rerun symmetry not formally audited. `CLAUDE.md` specifies that if any trial T<N> is rerun, T<N> must be rerun for all 9 tools. `scripts/audit-cohort-symmetry.py` reports observed per-trial timestamp spreads, base-commit divergences, and archived reruns so this can be verified after the fact; see its output before citing per-trial columns in comparison.
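A back-of-envelope check of the minimum-detectable-effect figure quoted in the sample-size bullet above, using a two-sample normal approximation and treating each judgment as independent (which, per the pseudoreplication caveat, makes it optimistic):

```python
from math import sqrt
from scipy.stats import norm

def mde(sigma, n, alpha=0.05, power=0.80):
    """Two-sample minimum detectable mean difference at the given power,
    equal n per group, common stdev sigma, normal approximation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sigma * sqrt(2 / n)

# bugfix/refactor cells have n = 18 judgments and per-cell stdevs of roughly 11-28 pts:
print(round(mde(16, 18), 1))   # 14.9
print(round(mde(25, 18), 1))   # 23.3  -> an MDE on the order of 15-23 pts
```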
8. Reproducibility
The full pipeline is reproducible from infina-pfa/claude-tool-benchmark. Set BENCH_REPO to a clone URL of your target repository (this paper’s corpus uses RealStake/infina-partner-sdk), then:
# 1. Create a fresh clone of the base repo for (task, trial):
TASK=refactor ./scripts/create-clones.sh 1 2
# 2. Execute the tool on the task (per trial):
TASK=refactor ./scripts/manual-bench.sh bmad 1
# 3. Generate blind-eval labels + mapping:
TASK=refactor ./scripts/blind-eval-setup.sh
# 4. Judge a single label (per judge, per round):
TASK=refactor ROUND=1 ./scripts/judge-opus.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-codex.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh Alpha
# 5. Per-task aggregation (balanced mean, 3-judge panel):
TASK=refactor ./scripts/aggregate-results.sh
# 6. Inter-rater reliability (Krippendorff α per task, pairwise):
python3 scripts/krippendorff-alpha.py
# 7. Cross-task statistics (bootstrap CIs, pairwise tiers, sensitivity, calibration):
python3 scripts/cross-task-analysis.py
# 8. Cohort-rerun symmetry audit (validates CLAUDE.md rerun-protocol):
python3 scripts/audit-cohort-symmetry.py
Canonical aggregation rules (enforced in both scripts):
- Round filter: dirs matching `^round[0-9]+$` only. Pilot/sample dirs (`roundcotpilot`, `roundcotsample*`) are excluded so the corpus size is deterministic.
- Score per judge file: `sum(scores.values())` (the `total` field is ignored — 239 historical records had off-by-one-to-off-by-ten drift).
- Tool mean: balanced mean of per-judge means (equals the pooled mean when per-judge n is equal).
- Bootstrap: 10,000 resamples stratified by judge, seed = 42.
- Judges: `JUDGES = ('opus', 'codex', 'qwen')`.
All raw judge JSONs are committed under results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json, alongside judge-prompt.md (the full prompt including rubric) and implementation-diff.patch (the artifact being judged). Label → (tool, trial) mapping is at .mapping-DO-NOT-OPEN.json in each _blind-eval/. Any additional judge model can be run against the committed prompts.
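Independent re-scoring only needs the canonical rules above; a sketch of a loader over the committed layout (paths per Appendix A; function name illustrative):

```python
import json, re
from pathlib import Path

ROUND_RE = re.compile(r"^round[0-9]+$")        # excludes roundcotpilot / roundcotsample*
JUDGES = ("opus", "codex", "qwen")

def load_scores(task_dir: Path):
    """Yield (label, round, judge, score) for every canonical judgment under
    results/<task>/_blind-eval/, scoring each file as sum(scores.values())."""
    for label_dir in sorted((task_dir / "_blind-eval").iterdir()):
        if not label_dir.is_dir():
            continue
        for round_dir in sorted(label_dir.iterdir()):
            if not (round_dir.is_dir() and ROUND_RE.match(round_dir.name)):
                continue
            for judge in JUDGES:
                f = round_dir / f"{judge}-judge.json"
                if f.exists():
                    record = json.loads(f.read_text())
                    yield label_dir.name, round_dir.name, judge, sum(record["scores"].values())
```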
Model CLI invocations (version-pinned where the CLI exposes it):
claude --model claude-opus-4-6 ... # tool executor
claude --model claude-opus-4-7 ... # opus judge (extended thinking default)
codex --model openai/gpt-5.4 --variant high ... # codex judge
opencode --model opencode-go/qwen3.6-plus --variant high ... # qwen judge
Temperature/top-p/seed are each tool’s CLI defaults; we report the CLI incantation rather than the sampling parameters because the CLIs abstract them. This is a reproducibility gap we note in §7.
9. Conclusion
The strongest single claim this data supports is: among nine 2026-04 Claude Code setups on three mid-sized TypeScript tasks, no single setup separates from the top-4 cluster (ecc, bmad, pure, gstack) at 95% CI on any single task, and only one setup (superpower) shows a task-specific quality gap on the bugfix task with non-overlapping CIs — under a forced-activation harness applied to that tool only. Inter-judge rank-order agreement is significantly positive on the feature-build and bugfix tasks (Spearman ρ 0.22–0.79) but collapses on the refactor task (ρ CIs straddle zero, Krippendorff α on totals is negative). Absolute-scale judge calibration varies by ±25 pts and α on 200-point totals is negative on feature and refactor — multi-judge panels and explicit uncertainty reporting are necessary, and single-judge leaderboards on this corpus would under-report uncertainty by roughly an order of magnitude. We also run a judge calibration asymmetry check of the Anthropic-family judge vs. the non-Anthropic pair: the drift is small and tool-invariant, consistent with benign calibration drift — but because every executor uses an Anthropic base model, this design cannot identify family-level self-preference, which we note as a limitation rather than a null result. We publish the full judgment corpus, judge prompts, tool artifacts, and the bootstrap/tier/Spearman/α/sensitivity scripts for independent re-scoring and re-analysis.
Appendix A — Data Files
| Path | Contents |
|---|---|
| `results/FINAL-REPORT-3JUDGE-20260422.md` | Tabular summary (shorter form of this paper) |
| `results/final-report.md` | feature per-trial detail |
| `results/bugfix/final-report.md` | bugfix per-trial detail |
| `results/refactor/final-report.md` | refactor per-trial detail |
| `results/<task>/_blind-eval/<LABEL>/round<N>/<judge>-judge.json` | Raw per-judgment JSON (scores dict + sum-valid total) |
| `results/<task>/_blind-eval/<LABEL>/judge-prompt.md` | Full judge prompt including PRD, context, artifact, and 20-item rubric |
| `results/<task>/_blind-eval/<LABEL>/implementation-diff.patch` | The artifact being judged |
| `results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json` | label → (tool, trial) mapping |
| `results/<tool>/t<N>/` | Per-trial artifacts: session logs, diff stats, eslint/tsc output, metrics |
| `docs/analysis/trial-timelines/` | Per-trial event timelines (skill activations, plugin/skill files read, subagents dispatched, code mutations, Bash usage) auto-extracted from every session-logs/*.jsonl; one file per (task, tool) with sections per trial |
| `docs/analysis/trial-timelines/aggregate.md` | Per-(tool, task) aggregate table (mean/min/max for subagents, skill files, Bash, tests, etc.) — canonical source for cross-tool count claims; regenerated by scripts/extract-trial-timeline.py |
| `results/_human-reference/` | Hand-authored reference implementation (methodology anchor) |
Appendix B — Code-Quality Metrics Captured (not scored)
Every trial additionally produces (not consumed by the rubric, but available at results/<tool>/t<N>/):
- `auto-metrics.json` — wall-time, token counts, cost, session count
- `diff-stats.txt` — LOC added/removed, files touched
- `eslint-output.txt` — lint warnings and errors
- `tsc-output.txt` — TypeScript compiler output
- `test-output.txt` — test runner output
- `commits.txt` — SHAs captured at baseline and post-implementation
These can be used for cost/speed analysis, hard-gate filtering, or independent scoring.
Comments, corrections, and independent re-analyses welcome — file an Issue on the repo.