AI Coding Tool Benchmark

A multi-task, multi-judge evaluation of 9 Claude Code setups (plugins, skill packs, hook kits, and a no-addon baseline) on 3 real-world software-engineering tasks from the RealStake/infina-partner-sdk monorepo.


TL;DR

Rank  Tool        z̄ (equal-weight)  Note
1     ecc         +0.273            top-4 cluster (overlapping CIs on ≥2/3 tasks)
2     bmad        +0.270            top-4 cluster
3     pure        +0.175            top-4 cluster — baseline performs within cluster
4     gstack      +0.077            top-4 cluster
5     mindful     −0.006
6     claudekit   −0.123
7     compound    −0.144
8     omc         −0.205
9     superpower  −0.315            outlier on bugfix — under forced-activation harness; see §5 of PAPER.md

864 judgments across 3 tasks (540 feature + 162 bugfix + 162 refactor): 9 setups × 2–4 trials × {5 rounds (feature) or 3 rounds (bugfix/refactor)} × 3 judges (opus / codex / qwen). Top-4 CIs overlap on every task, so their ordering is not statistically distinguishable; do not cite positions within the top 4 as a ranking. The equal-weight z̄ displayed above disagrees with count-weighted z̄ and rank-sum on the middle of the table, with rank swings of up to 4 positions (see FINAL-REPORT §2). The null hypothesis that tools add no value over the pure baseline is not rejected at this precision. On refactor, inter-judge Spearman ρ CIs straddle zero and per-item Krippendorff α = +0.149, so rankings on that task alone are noise-dominated. A hand-authored _human-reference scores ≈25 pts above the top tool (n=1, no error bar); the headline z̄ values are therefore tool-relative, not a measure of absolute quality. See PAPER.md §4.3–4.4 and the credibility review for details.
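
How the leaderboard column is formed matters more than the column itself. Below is a minimal sketch of equal-weight vs count-weighted cross-task aggregation, assuming a long-format judgments file with task / tool / score columns; the file path and column names are illustrative, and the authoritative implementation is scripts/cross-task-analysis.py.

# Sketch only: assumes one row per judgment with columns task, tool, judge, score.
import pandas as pd

df = pd.read_csv("results/all-judgments.csv")   # hypothetical path

# Per-task z-score so tasks with different score scales contribute comparably.
df["z"] = df.groupby("task")["score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0))

per_cell = df.groupby(["tool", "task"])["z"].agg(["mean", "count"])

# Equal-weight z̄: every task counts the same, regardless of judgment count.
equal_weight = per_cell["mean"].groupby(level="tool").mean()

# Count-weighted z̄: tasks with more judgments pull harder on the average.
count_weighted = ((per_cell["mean"] * per_cell["count"]).groupby(level="tool").sum()
                  / per_cell["count"].groupby(level="tool").sum())

print(pd.concat({"equal": equal_weight, "weighted": count_weighted}, axis=1)
        .sort_values("equal", ascending=False))

Because feature contributes 540 of the 864 judgments, the count-weighted variant lets that task dominate; that mismatch is one source of the middle-of-table rank swings noted above.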


Reproduce

# One (task, tool, trial) run:
TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/manual-bench.sh bmad 1

# Blind labels + mapping:
TASK=refactor ./scripts/blind-eval-setup.sh

# Judge (per-judge, per-round):
TASK=refactor ROUND=1 ./scripts/judge-opus.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-codex.sh Alpha
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh Alpha

# Aggregate per-task (balanced mean, 3-judge panel):
TASK=refactor ./scripts/aggregate-results.sh

# Inter-rater reliability (Krippendorff α, per task + pairwise):
python3 scripts/krippendorff-alpha.py

# Cross-task stats (bootstrap CIs, pairwise tiers, ranking sensitivity, calibration):
python3 scripts/cross-task-analysis.py

# Cohort-rerun symmetry audit (verifies rerun protocol):
python3 scripts/audit-cohort-symmetry.py

See PAPER.md §8 for the full pipeline and judge-prompt locations. All script sources live under scripts/.
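
One reasonable reading of "balanced mean, 3-judge panel" is: collapse each judge's rounds first, then give the three judges equal weight, then average trials. The sketch below follows that assumption; the column names and path are made up, and scripts/aggregate-results.sh remains the source of truth.

# Sketch of a balanced 3-judge mean; columns and path are assumptions.
import pandas as pd

df = pd.read_csv("results/refactor/judgments.csv")   # hypothetical path

# 1. Collapse judgment rounds within each (tool, trial, judge) to one score.
per_judge = df.groupby(["tool", "trial", "judge"])["score"].mean()

# 2. Average the three judges with equal weight (the "balanced" part).
per_trial = per_judge.groupby(level=["tool", "trial"]).mean()

# 3. Per-tool task score is the mean over trials.
print(per_trial.groupby(level="tool").mean().sort_values(ascending=False))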


Layout

scripts/                 — pipeline (create-clones, manual-bench, judge-*, aggregate)
docs/                    — task briefs, pipeline notes
config/                  — per-tool config templates
results/
  FINAL-REPORT-*.md      — cross-task summary report
  final-report.md        — feature per-trial detail
  _blind-eval/           — feature judged artifacts (opus/codex/qwen × 5 rounds)
  _human-reference/      — hand-authored reference implementation (anchor)
  bugfix/, refactor/     — task-scoped results (per-task final-report.md inside)
  <tool>/t<N>/           — per-trial execution artifacts
CLAUDE.md                — internal operator notes

Browse the tree on GitHub: infina-pfa/claude-tool-benchmark.


Caveats

  1. Single codebase, single language (TypeScript NX monorepo). Don’t assume these rankings generalize to Python/Go/Rust.
  2. Single executor base model (claude-opus-4-6). A setup that specializes in sonnet/haiku may rank differently.
  3. LLM judges diverge by ±25 pts in absolute score. Rank-order agreement (Spearman ρ) holds on feature and bugfix (ρ 0.22–0.88) but collapses on refactor (CIs straddle 0). Krippendorff α on 200-point totals is negative on feature and refactor, i.e. judges disagree on the absolute scale; per-item α is moderate (0.15–0.66). The 3-judge balanced mean averages out this drift by design, but single-judge claims on this corpus under-report uncertainty by roughly an order of magnitude. A sketch of the pairwise rank-agreement check appears after this list.
  4. Top-4 is a statistical tie — top-4 pairwise 95% bootstrap CIs overlap on every task. Use cost/speed/DX as the real differentiator, not score.
  5. Ranking depends on the weighting scheme. Equal-weight z̄, judgment-count-weighted z̄, and rank-sum agree on the top cluster and bottom outlier but disagree on middle ordering (≥2 rank positions of movement for claudekit/mindful/superpower). Read the per-task CI tables, not a single cross-task leaderboard.
  6. Tier groupings are by pairwise-overlap complete-linkage (descriptive), not a FWER-controlling procedure. No Bonferroni/Holm applied across the 108 cross-task cell comparisons. Only individually cited pairwise separations (FINAL-REPORT §3) carry statistical weight.
  7. Self-preference is NOT audited — it’s not identified by this design. Every executor uses a Claude base model, so the §4.5 check in the paper is a judge-calibration asymmetry measurement, not a family-favoritism test. A true self-preference audit would need non-Anthropic-base executor runs as a control.
  8. Pseudoreplication caveat. Bootstrap CIs are stratified by judge but treat multiple judgment rounds of the same artifact as independent draws, which inflates apparent precision. Close pairwise calls within tiers should not be cited as statistically significant; see the bootstrap sketch below.
  9. Judge sampling not pinned. Claude / OpenCode CLIs don’t expose temperature or sampler seed. Round-to-round σ reflects sampler variance; three-judge averaging is the intended mitigation, not a fix.
  10. Not preregistered. Tasks, rubric, and judge panel chosen iteratively by the benchmark author.
  11. Tool-version snapshot, 2026-04. Re-run for current-version claims.
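
As referenced in caveat 3, here is a minimal sketch of the pairwise rank-agreement check between judges, assuming a long-format judgments frame. The shipped implementations are scripts/krippendorff-alpha.py and scripts/cross-task-analysis.py; nothing below is taken from them.

# Sketch: pairwise Spearman ρ between judges on per-artifact mean scores.
# Assumes columns task, tool, trial, judge, score; no missing cells.
from itertools import combinations

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("results/all-judgments.csv")   # hypothetical path

for task, sub in df.groupby("task"):
    # One row per (tool, trial) artifact, one column per judge.
    wide = (sub.groupby(["tool", "trial", "judge"])["score"].mean()
               .unstack("judge"))
    for a, b in combinations(wide.columns, 2):
        rho, p = spearmanr(wide[a], wide[b])
        print(f"{task}: {a} vs {b}  rho={rho:+.2f}  p={p:.3f}")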

See PAPER.md §7 for the full threats-to-validity list.
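
To make caveat 8 concrete, the toy comparison below resamples one (task, tool) cell two ways: treating every judgment round as independent, and resampling whole (trial, judge) blocks so repeated rounds of the same artifact move together. The data layout and cluster unit are assumptions; this illustrates the precision-inflation effect and is not the benchmark's actual bootstrap.

# Toy illustration of caveat 8: round-level vs artifact-level resampling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.read_csv("results/all-judgments.csv")            # hypothetical path
cell = df[(df.task == "feature") & (df.tool == "ecc")]   # one (task, tool) cell

def ci95(samples):
    return tuple(np.round(np.percentile(samples, [2.5, 97.5]), 2))

# Round-level bootstrap: every judgment round treated as an independent draw.
vals = cell["score"].to_numpy()
naive = [rng.choice(vals, size=vals.size).mean() for _ in range(2000)]

# Cluster bootstrap: resample whole (trial, judge) blocks so repeated rounds
# of the same artifact stay together.
blocks = [g.to_numpy() for _, g in cell.groupby(["trial", "judge"])["score"]]
clustered = [np.concatenate([blocks[i] for i in rng.integers(len(blocks), size=len(blocks))]).mean()
             for _ in range(2000)]

print("round-level 95% CI   :", ci95(naive))       # typically too narrow
print("artifact-level 95% CI:", ci95(clustered))   # wider, the honest one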


License

Results and methodology published openly for reference and independent re-analysis. See individual files for upstream tool licenses. Source repository: infina-pfa/claude-tool-benchmark.