Verification Guide — Reproducing the Benchmark Claims
Guide for a reader who wants to independently verify any claim made in PAPER.md, README.md, or results/FINAL-REPORT-3JUDGE-20260422.md. Everything needed is checked into results/ — no private state, no network access.
0. Setup
```sh
git clone <this-repo>
cd ai-tool-benchmark
python3 -m pip install --user --break-system-packages numpy krippendorff
```
Only `numpy` and `krippendorff` are needed for re-computation; beyond those, the scripts use only the stdlib — no framework.
1. “Where does a tool’s score come from?”
Claim example: “bmad feature-task z̄ = +0.270, bootstrap 95% CI [a, b]” (from FINAL-REPORT-3JUDGE-20260422.md).
Chain of evidence:
- Raw judge files — `results/_blind-eval/<label>/round<N>/{opus,codex,qwen}-judge.json`, where `<label>` is whichever NATO letter maps to `bmad` in `results/_blind-eval/.mapping-DO-NOT-OPEN.json`.
- Canonical score per file — `sum(scores.values())` (authoritative; do not use the stored `total` field).
- Canonical round filter — only dirs matching `^round[0-9]+$`. Pilot/sample dirs are excluded.
- Aggregation — `scripts/cross-task-analysis.py` reads these files, computes the balanced mean (equal-weight average of per-judge means), the z-score within task, and a stratified bootstrap 95% CI (10,000 resamples, seed 42, stratified by judge).
- Output — `results/cross-task-stats.json` + `results/FINAL-REPORT-3JUDGE-20260422.md`.
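The scoring and filtering conventions above are small enough to sketch directly (a minimal sketch; the directory layout and the `scores`/`total` field names are taken from the description above, the function names are illustrative):

```python
import json
import re
from pathlib import Path

def canonical_score(judge_file: Path) -> float:
    """Canonical score: sum of the rubric item scores.
    The stored 'total' field is deliberately ignored."""
    data = json.loads(judge_file.read_text())
    return sum(data['scores'].values())

def round_dirs(label_dir: Path) -> list[Path]:
    """Canonical round filter: only dirs matching ^round[0-9]+$,
    so pilot/sample dirs never enter the aggregate."""
    pat = re.compile(r'^round[0-9]+$')
    return [d for d in label_dir.iterdir() if d.is_dir() and pat.match(d.name)]
```

Anything the aggregation consumes should be reproducible with these two rules alone.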
To verify:
```sh
# Snapshot the current committed report + stats, then regenerate and compare.
cp results/FINAL-REPORT-3JUDGE-20260422.md /tmp/final-report.before.md
cp results/cross-task-stats.json /tmp/cross-task-stats.before.json
python3 scripts/cross-task-analysis.py   # rewrites FINAL-REPORT + cross-task-stats.json in place
diff /tmp/final-report.before.md results/FINAL-REPORT-3JUDGE-20260422.md
diff /tmp/cross-task-stats.before.json results/cross-task-stats.json
```
The script is deterministic (seed 42). Both diffs should be empty — if not, either your dependency versions differ or the committed artifacts have drifted from their generating source.
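The stratified bootstrap behind the CIs is also easy to sketch. This is a minimal illustration of the described procedure (resample within each judge's stratum, then equal-weight the per-judge means, seed 42); the function and variable names are illustrative, not the script's:

```python
import numpy as np

def stratified_bootstrap_ci(per_judge_scores, n_resamples=10_000, seed=42):
    """95% CI of the balanced mean: resample scores within each judge's
    stratum, average the per-judge means with equal weight per judge."""
    rng = np.random.default_rng(seed)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        judge_means = [
            rng.choice(scores, size=len(scores), replace=True).mean()
            for scores in per_judge_scores.values()
        ]
        means[i] = np.mean(judge_means)
    return np.percentile(means, [2.5, 97.5])
```

Because the generator is seeded, two runs over the same inputs return bit-identical endpoints — which is why the diffs above can be expected to be empty.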
2. “Why is superpower ranked 9th on bugfix?”
Claim example: “superpower’s bugfix CI [151.2, 161.0] sits below the next-lowest tool’s CI and does not overlap T1, even under the forced-activation protocol (/superpowers:systematic-debugging at session start, /superpowers:verification-before-completion at exit). The T2 output is a skill-content ceiling on this task, not an activation failure.”
Walk the chain:
- `results/bugfix/_blind-eval/.mapping-DO-NOT-OPEN.json` — find which labels are `superpower t1` and `superpower t2`.
- `results/bugfix/_blind-eval/<label>/implementation-diff.patch` — inspect the actual code diff the judges saw.
- `results/bugfix/_blind-eval/<label>/round1/{opus,codex,qwen}-judge.json` — read the `notes` field and rubric scores; compare to the T1 cells.
- `results/bugfix/superpower/t1/session-logs/<uuid>.jsonl` — the full session transcript. Confirm 2 `Skill` calls fired (systematic-debugging, verification-before-completion).
- `results/bugfix/superpower/t1/hard-gates.json` — automated scope/regression gates (5/5 PASS).
Cross-reference: PAPER.md §5 narrates this chain.
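To confirm the two forced skill activations without depending on the transcript's exact schema, a conservative substring scan over the JSONL is enough (a sketch that counts lines mentioning the skill names; a quick check, not a parser):

```python
from pathlib import Path

def skill_mentions(transcript: Path,
                   skills=('systematic-debugging',
                           'verification-before-completion')) -> dict:
    """Count transcript lines mentioning each expected skill name."""
    counts = {s: 0 for s in skills}
    for line in transcript.read_text().splitlines():
        for s in skills:
            if s in line:
                counts[s] += 1
    return counts
```

A nonzero count for both names is consistent with the forced-activation protocol having fired; reading the surrounding transcript lines confirms the calls were real `Skill` invocations and not prompt echoes.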
3. “Krippendorff α on totals is negative for feature and refactor”
Claim example: “Judges disagree on absolute scale more than chance would predict on feature (α = −0.286) and refactor (α = −0.406).”
Verify:
```sh
python3 scripts/krippendorff-alpha.py
cat results/krippendorff-alpha.json
```
The JSON has `totals` (α on summed scores) and `per_item` (α averaged across the 20 rubric items) for each of the 3 tasks, plus pairwise judge-pair α. Compare against PAPER §4.4b.
Interpretation note: negative α means observed disagreement exceeds chance-expected disagreement on the measured (interval) scale. It does not mean judges disagree on ranking — rank-order agreement is a separate Spearman ρ computation.
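For intuition on how α goes negative here, a toy, hand-rolled interval-scale α for complete data (the repo's script presumably relies on the `krippendorff` package; this is just the textbook formula α = 1 − D_o/D_e, applied to three hypothetical judges who agree perfectly on ranking but sit on different absolute scales):

```python
import numpy as np

def interval_alpha(scores: np.ndarray) -> float:
    """Krippendorff's alpha, interval metric, complete data.
    scores has shape (n_judges, n_items); alpha = 1 - D_o / D_e."""
    m, n = scores.shape
    # D_o: mean squared difference over all within-item judge pairs
    d_o = np.mean([
        (scores[i, u] - scores[j, u]) ** 2
        for u in range(n) for i in range(m) for j in range(m) if i != j
    ])
    # D_e: mean squared difference over all pairs of values, any item
    flat = scores.ravel()
    N = flat.size
    d_e = sum((a - b) ** 2 for a in flat for b in flat) / (N * (N - 1))
    return float(1 - d_o / d_e)

# Identical ranking, different absolute scales:
judges = np.array([[10, 11, 12],
                   [20, 21, 22],
                   [30, 31, 32]], dtype=float)
```

`interval_alpha(judges)` comes out around −0.32: the scale offsets make observed within-item disagreement exceed the chance-expected disagreement, even though Spearman ρ between any two of these judges would be 1.0 — exactly the totals-vs-ranking distinction in the interpretation note above.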
4. “Top-4 is a statistical tie”
Claim example: “bmad, ecc, pure, gstack are not separable — all pairwise 95% CIs overlap on every task.”
Verify:
Open `results/cross-task-stats.json`. Each task has a `tiers` array (complete-linkage: a tool joins a tier only if its CI overlaps with every existing tier member) and a `pairwise_disjoint` list (transitivity-free pairs whose CIs actually don't overlap). For the top-4 claim:
```python
import json

s = json.load(open('results/cross-task-stats.json'))
top4 = {'bmad', 'ecc', 'pure', 'gstack'}
for task in ('feature', 'bugfix', 'refactor'):
    pairs = s['per_task'][task]['pairwise_disjoint']
    print(task, [p for p in pairs if p[0] in top4 and p[1] in top4])
```
Expected: empty list for every task.
5. “No hidden sampling of trials / no cherry-picking reruns”
Claim example: The cohort-symmetry rule in docs/methodology/pipeline.md requires that if trial t<N> is rerun for one tool, it’s rerun for all 9 before the trial is used in comparison.
Verify:
```sh
python3 scripts/audit-cohort-symmetry.py
```
Reads `results/<task>/<tool>/t<N>/sessions/*.meta.json` across all 9 tools, groups by trial index, and reports:
- Missing trials per tool (hard violation → non-zero exit)
- Base-commit divergence within a trial (hard violation → non-zero exit)
- >24 h timestamp span within a trial (soft warning)
- Any archived reruns under `results/<task>/<tool>/archive-t<N>-<date>/`
The script currently reports 8 base-commit divergences on feature t1 (documented in PAPER.md §7) and 3 archived reruns (claudekit / ecc / omc t1–t2). Those are disclosed, not hidden.
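The two hard violations are straightforward to recompute by hand. A sketch, with the input shape (`{(tool, trial_index): base_commit}`) chosen for illustration rather than taken from the real `*.meta.json` schema:

```python
from collections import defaultdict

def audit_symmetry(trials: dict) -> tuple[dict, dict]:
    """trials maps (tool, trial_index) -> base_commit.
    Returns (missing, diverged): trials lacking some tool, and trials
    whose members started from different base commits."""
    by_trial = defaultdict(dict)
    for (tool, idx), commit in trials.items():
        by_trial[idx][tool] = commit
    all_tools = {tool for tool, _ in trials}
    missing = {idx: sorted(all_tools - set(tools))
               for idx, tools in by_trial.items()
               if set(tools) != all_tools}
    diverged = {idx: sorted(set(commits.values()))
                for idx, commits in by_trial.items()
                if len(set(commits.values())) > 1}
    return missing, diverged
```

Both dicts empty means the cohort-symmetry rule held for every trial that enters a comparison.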
6. “The judge didn’t see the tool identity”
Verify:
- Open any `results/_blind-eval/<label>/judge-prompt.md`. Confirm nothing in the prompt names the tool.
- Open `implementation-diff.patch` in the same dir. Confirm the diff uses paths from the actual repo (no tool-specific markers like `_bmad/` or `claudekit/`).
- Open `.mapping-DO-NOT-OPEN.json` only after you've formed your own expectation.
The judge receives: PRD + reference codebase markdown + the diff patch + the 20-item rubric + JSON schema. Nothing else.
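A blunt way to double-check the blinding yourself is to scan everything in a label directory for tool-identifying strings (the marker list below is illustrative; extend it to all 9 tools):

```python
from pathlib import Path

# Illustrative markers only -- not the repo's canonical list.
TOOL_MARKERS = ('_bmad/', 'claudekit/', 'superpowers:')

def leaked_markers(blind_eval_dir: Path) -> list:
    """Return (file, marker) pairs where a tool-identifying string
    appears in anything the judges were shown."""
    hits = []
    for f in blind_eval_dir.rglob('*'):
        # The mapping file is allowed to name tools; nothing else is.
        if f.name == '.mapping-DO-NOT-OPEN.json' or not f.is_file():
            continue
        text = f.read_text(errors='ignore')
        hits.extend((f, m) for m in TOOL_MARKERS if m in text)
    return hits
```

An empty result is necessary but not sufficient — the manual read of `judge-prompt.md` and the patch is still the real check.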
7. “The judges use the canonical model settings”
Claim example: “All judges run at their provider’s default sampling temperature; seed not exposed by either CLI.”
Verify:
```sh
claude --help | grep -iE 'temp|seed|sampl'        # Claude CLI — no results
opencode run --help | grep -iE 'temp|seed|sampl'  # OpenCode CLI — no results
```
Comment headers in scripts/judge-{opus,codex,qwen}.sh document this. Round-to-round variance reported in FINAL-REPORT §7 absorbs the sampler noise; three-judge averaging is the mitigation, not a fix.
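The "mitigation, not a fix" point is just the 1/√n reduction in sampler noise from averaging independent judges; a quick numpy check with a toy noise model (all numbers invented, not the benchmark's):

```python
import numpy as np

rng = np.random.default_rng(0)
true_score, sigma, n_rounds = 150.0, 6.0, 100_000

# One judge per round vs. the mean of a 3-judge panel per round.
single = rng.normal(true_score, sigma, size=n_rounds)
panel = rng.normal(true_score, sigma, size=(n_rounds, 3)).mean(axis=1)
print(single.std(), panel.std())  # panel std is roughly sigma / sqrt(3)
```

Averaging shrinks the spread by ~1.7×; it cannot remove it, which is why round-to-round variance still has to be reported.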
8. “The ranking doesn’t flip under different weighting”
Claim example: “Equal-weight z̄, count-weighted z̄, and rank-sum agree on the top cluster and bottom outlier but disagree on middle ordering.”
Verify:
Open `results/cross-task-stats.json` → the `ranking_sensitivity` section. It lists each tool's rank under all three schemes and flags tools with ≥2 rank-positions of movement (claudekit, mindful, superpower as of 20260422).
Also rendered in FINAL-REPORT-3JUDGE-20260422.md §2.
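The ≥2-positions flag is easy to recompute from the three rank columns. Since the exact JSON shape of `ranking_sensitivity` isn't pinned down here, this sketch takes a plain dict:

```python
def unstable_tools(ranks: dict, threshold: int = 2) -> list[str]:
    """ranks: {tool: {scheme: rank}} across the three weighting schemes.
    Flags tools whose rank moves by >= threshold positions."""
    return sorted(
        tool for tool, by_scheme in ranks.items()
        if max(by_scheme.values()) - min(by_scheme.values()) >= threshold
    )
```

If the flagged set matches the committed one (claudekit, mindful, superpower), the middle-ordering caveat is reproduced.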
9. Reproducing the pipeline end-to-end
If you want to re-run the benchmark (not just re-aggregate):
See …/methodology/pipeline.md for the full clone → execute → judge → aggregate flow.
Minimum re-run for one (task, tool, trial):
```sh
TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh bmad 1
TASK=refactor ./scripts/manual-bench.sh bmad 1
# → paste prompt, run tool, exit
# → run the printed one-liner (SHA capture + collect-metrics)
TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-codex.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh <label>
./scripts/aggregate-results.sh
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py
```
10. What this benchmark does not let you verify
Being explicit about what’s outside the scope of the artifact:
- Judge self-preference at the family level — all 9 executors use a Claude base model, so Anthropic-family favoritism is not identified by this design. A true audit would need a non-Anthropic-base executor as control.
- Generalization to other languages / codebases — single TypeScript NX monorepo (`RealStake/infina-partner-sdk`).
- Tool-version drift — results are a 2026-04 snapshot.
- Judge-pool completeness — 3 judges is a minimum for the panel estimator; larger panels would tighten CIs further.
Each limitation is disclosed in PAPER.md §7 and README.md Caveats.
Quick Reference
| Question | File to open |
|---|---|
| What was the exact prompt sent to this tool? | `results/<task>/<tool>/t<N>/phase1-prompt.txt` |
| What did the tool actually do? | `results/<task>/<tool>/t<N>/session-logs/<uuid>.jsonl` |
| What happened in a trial — at a glance? | `docs/analysis/trial-timelines/<task>/<tool>.md` — pre-extracted timeline of skill activations, plugin/skill file reads, subagents, mutations, Bash usage |
| What did the judge see? | `results/<task>/_blind-eval/<label>/judge-prompt.md` + `implementation-diff.patch` |
| What did each judge score? | `results/<task>/_blind-eval/<label>/round<N>/<judge>-judge.json` |
| How is the aggregate computed? | `scripts/cross-task-analysis.py` |
| Which label is which tool? | `results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json` (after judging is done) |
| What are the integrity guarantees? | This file + `results/README.md` |