Verification Guide — Reproducing the Benchmark Claims

Guide for a reader who wants to independently verify any claim made in PAPER.md, README.md, or results/FINAL-REPORT-3JUDGE-20260422.md. Everything needed is checked into the repository (raw artifacts under results/, analysis code under scripts/) — no private state, no network access.


0. Setup

git clone <this-repo>
cd ai-tool-benchmark
python3 -m pip install --user --break-system-packages numpy krippendorff

Only numpy and krippendorff are needed for re-computation; the analysis scripts otherwise use only the standard library — no framework.


1. “Where does a tool’s score come from?”

Claim example: “bmad feature-task z̄ = +0.270, bootstrap 95% CI [a, b]” (from FINAL-REPORT-3JUDGE-20260422.md).

Chain of evidence:

  1. Raw judge files — results/feature/_blind-eval/<label>/round<N>/{opus,codex,qwen}-judge.json, where <label> is whichever NATO letter maps to bmad in results/feature/_blind-eval/.mapping-DO-NOT-OPEN.json.
  2. Canonical score per file — sum(scores.values()) (authoritative; do not use the stored total field — see the sketch after this list).
  3. Canonical round filter — only dirs matching ^round[0-9]+$. Pilot/sample dirs are excluded.
  4. Aggregation — scripts/cross-task-analysis.py reads these files, computes the balanced mean (equal-weight average of per-judge means), the z-score within each task, and a stratified bootstrap 95% CI (10,000 resamples, seed 42, stratified by judge).
  5. Output — results/cross-task-stats.json + results/FINAL-REPORT-3JUDGE-20260422.md.
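
Steps 1–3 can be spot-checked directly. A minimal sketch, assuming the layout above; the label alfa is a placeholder — substitute whatever the mapping file assigns to bmad:

import json, re
from pathlib import Path

LABEL = 'alfa'                            # placeholder — use the mapping's actual label for bmad
ROUND_RE = re.compile(r'^round[0-9]+$')   # canonical round filter

base = Path('results/feature/_blind-eval') / LABEL
for round_dir in sorted(d for d in base.iterdir() if d.is_dir() and ROUND_RE.match(d.name)):
    for judge in ('opus', 'codex', 'qwen'):
        data = json.loads((round_dir / f'{judge}-judge.json').read_text())
        canonical = sum(data['scores'].values())                    # authoritative per-file score
        print(round_dir.name, judge, canonical, data.get('total'))  # stored total shown only for comparison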

To verify:

# Snapshot the current committed report + stats, then regenerate and compare.
cp results/FINAL-REPORT-3JUDGE-20260422.md /tmp/final-report.before.md
cp results/cross-task-stats.json /tmp/cross-task-stats.before.json
python3 scripts/cross-task-analysis.py       # rewrites FINAL-REPORT + cross-task-stats.json in place
diff /tmp/final-report.before.md      results/FINAL-REPORT-3JUDGE-20260422.md
diff /tmp/cross-task-stats.before.json results/cross-task-stats.json

The script is deterministic (seed 42). Both diffs should be empty — if not, either your dependency versions differ or the committed artifacts have drifted from their generating source.
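
If you also want to sanity-check the aggregation in step 4 rather than just diff the outputs, the sketch below mirrors the described procedure — balanced mean of per-judge means with a judge-stratified bootstrap (10,000 resamples, seed 42). It is an illustration of the method, not a copy of scripts/cross-task-analysis.py:

import numpy as np

def balanced_mean_ci(scores_by_judge, n_boot=10_000, seed=42):
    # scores_by_judge: {'opus': [totals...], 'codex': [...], 'qwen': [...]}
    rng = np.random.default_rng(seed)
    judges = sorted(scores_by_judge)
    point = np.mean([np.mean(scores_by_judge[j]) for j in judges])   # balanced mean
    boots = []
    for _ in range(n_boot):
        # stratified: resample with replacement within each judge, then equal-weight the judge means
        means = [rng.choice(scores_by_judge[j], size=len(scores_by_judge[j])).mean()
                 for j in judges]
        boots.append(np.mean(means))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)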


2. “Why is superpower ranked 9th on bugfix?”

Claim example: “superpower’s bugfix CI [151.2, 161.0] sits below the next-lowest tool’s CI and does not overlap T1, even under the forced-activation protocol (/superpowers:systematic-debugging at session start, /superpowers:verification-before-completion at exit). The T2 output is a skill-content ceiling on this task, not an activation failure.”

Walk the chain:

  1. results/bugfix/_blind-eval/.mapping-DO-NOT-OPEN.json — find which labels map to superpower t1 and superpower t2.
  2. results/bugfix/_blind-eval/<label>/implementation-diff.patch — inspect the actual code diff the judges saw.
  3. results/bugfix/_blind-eval/<label>/round1/{opus,codex,qwen}-judge.json — read the notes field and rubric scores; compare to T1 cells.
  4. results/bugfix/superpower/t1/session-logs/<uuid>.jsonl — the full session transcript. Confirm both Skill calls fired (systematic-debugging, verification-before-completion); a scan sketch follows this list.
  5. results/bugfix/superpower/t1/hard-gates.json — automated scope/regression gates (5/5 PASS).
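
A minimal scan for step 4, assuming the two skill names appear verbatim somewhere in the transcript lines (the JSONL schema itself is not assumed here):

from pathlib import Path

log = next(Path('results/bugfix/superpower/t1/session-logs').glob('*.jsonl'))
skills = ('systematic-debugging', 'verification-before-completion')
hits = {s: 0 for s in skills}
for line in log.read_text().splitlines():
    for s in skills:
        if s in line:
            hits[s] += 1
print(hits)   # both counts should be >= 1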

Cross-reference: PAPER.md §5 narrates this chain.


3. “Krippendorff α on totals is negative for feature and refactor”

Claim example: “Judges disagree on absolute scale more than chance would predict on feature (α = −0.286) and refactor (α = −0.406).”

Verify:

python3 scripts/krippendorff-alpha.py
cat results/krippendorff-alpha.json

The JSON has totals (α on summed scores) and per_item (α averaged across the 20 rubric items) for each of the 3 tasks, plus α for each judge pair. Compare against PAPER §4.4b.

Interpretation note: negative α means observed disagreement exceeds chance-expected disagreement on the measured (interval) scale. It does not mean judges disagree on ranking — rank-order agreement is a separate Spearman ρ computation.
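
To recompute the totals α from the raw judge files rather than trusting krippendorff-alpha.py, a sketch using the krippendorff package (same layout assumptions as §1):

import json, re
from pathlib import Path
import krippendorff

TASK = 'feature'
ROUND_RE = re.compile(r'^round[0-9]+$')
judges = ('opus', 'codex', 'qwen')
rows = {j: [] for j in judges}            # one row per judge, one column per (label, round) item

for label_dir in sorted(p for p in Path(f'results/{TASK}/_blind-eval').iterdir() if p.is_dir()):
    for round_dir in sorted(d for d in label_dir.iterdir() if d.is_dir() and ROUND_RE.match(d.name)):
        for j in judges:
            data = json.loads((round_dir / f'{j}-judge.json').read_text())
            rows[j].append(sum(data['scores'].values()))

alpha = krippendorff.alpha(reliability_data=[rows[j] for j in judges],
                           level_of_measurement='interval')
print(TASK, 'totals alpha =', round(alpha, 3))

Swap TASK to bugfix or refactor and compare against results/krippendorff-alpha.json.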


4. “Top-4 is a statistical tie”

Claim example:bmad, ecc, pure, gstack are not separable — all pairwise 95% CIs overlap on every task.”

Verify:

Open results/cross-task-stats.json. Each task has a tiers array (complete-linkage: a tool joins a tier only if its CI overlaps with every existing tier member) and a pairwise_disjoint list (pairs whose CIs genuinely do not overlap, recorded directly rather than inferred through tier transitivity). For the top-4 claim:

import json
s = json.load(open('results/cross-task-stats.json'))
top4 = {'bmad', 'ecc', 'pure', 'gstack'}
for task in ('feature', 'bugfix', 'refactor'):
    pairs = s['per_task'][task]['pairwise_disjoint']
    print(task, [p for p in pairs if p[0] in top4 and p[1] in top4])

Expected: empty list for every task.
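
If you want to rebuild the tiers yourself instead of trusting the stored tiers array, a sketch of the complete-linkage rule described above (the input format is assumed — tuples of tool, CI low, CI high, sorted best-first):

def build_tiers(cis):
    # cis: list of (tool, lo, hi) tuples sorted by point estimate, best first
    def overlaps(a, b):
        return a[1] <= b[2] and b[1] <= a[2]
    tiers = []
    for entry in cis:
        for tier in tiers:
            if all(overlaps(entry, member) for member in tier):   # complete linkage: overlap with every member
                tier.append(entry)
                break
        else:
            tiers.append([entry])                                 # no tier accepts it — start a new one
    return [[t[0] for t in tier] for tier in tiers]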


5. “No hidden sampling of trials / no cherry-picking reruns”

Claim example: The cohort-symmetry rule in docs/methodology/pipeline.md requires that if trial t<N> is rerun for one tool, it’s rerun for all 9 before the trial is used in comparison.

Verify:

python3 scripts/audit-cohort-symmetry.py

Reads results/<task>/<tool>/t<N>/sessions/*.meta.json across all 9 tools, groups the sessions by trial index, and reports per-trial coverage across tools, base-commit consistency, and any archived reruns.

The script currently reports 8 base-commit divergences on feature t1 (documented in PAPER.md §7) and 3 archived reruns (claudekit / ecc / omc t1–t2). Those are disclosed, not hidden.
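
A rough stand-alone version of the same grouping, assuming each *.meta.json carries a base_commit field (that field name is an assumption; audit-cohort-symmetry.py is authoritative):

import json
from collections import defaultdict
from pathlib import Path

TASK = 'feature'
by_trial = defaultdict(dict)    # trial index -> {tool: set of base commits seen}
for meta in Path(f'results/{TASK}').glob('*/t*/sessions/*.meta.json'):
    tool, trial = meta.parts[-4], meta.parts[-3]
    commit = json.loads(meta.read_text()).get('base_commit')      # assumed field name
    by_trial[trial].setdefault(tool, set()).add(commit)

for trial, tools in sorted(by_trial.items()):
    commits = {c for s in tools.values() for c in s}
    print(trial, f'{len(tools)}/9 tools', 'divergent base commits' if len(commits) > 1 else 'ok')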


6. “The judge didn’t see the tool identity”

Verify:

  1. Open any results/<task>/_blind-eval/<label>/judge-prompt.md. Confirm nothing in the prompt names the tool.
  2. Open implementation-diff.patch in the same dir. Confirm the diff uses paths from the actual repo (no tool-specific markers like _bmad/ or claudekit/).
  3. Open .mapping-DO-NOT-OPEN.json only after you’ve formed your own expectation.

The judge receives: PRD + reference codebase markdown + the diff patch + the 20-item rubric + JSON schema. Nothing else.
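
A quick mechanical version of check 2, grepping each committed patch for tool-name strings (the marker list is illustrative, not the definitive set of nine tools, and common substrings can false-positive):

from pathlib import Path

markers = ('bmad', 'ecc', 'gstack', 'claudekit', 'mindful', 'superpower', 'omc')  # illustrative subset
for patch in Path('results/feature/_blind-eval').glob('*/implementation-diff.patch'):
    text = patch.read_text(errors='replace').lower()
    leaks = [m for m in markers if m in text]
    if leaks:
        print(patch.parent.name, 'possible identity leak:', leaks)   # inspect by hand; substrings may be benign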


7. “The judges use the canonical model settings”

Claim example: “All judges run at their provider’s default sampling temperature; seed not exposed by either CLI.”

Verify:

claude --help | grep -iE 'temp|seed|sampl'        # Claude CLI — expect no matches
opencode run --help | grep -iE 'temp|seed|sampl'  # OpenCode CLI — expect no matches

Comment headers in scripts/judge-{opus,codex,qwen}.sh document this. Round-to-round variance reported in FINAL-REPORT §7 absorbs the sampler noise; three-judge averaging is the mitigation, not a fix.


8. “The ranking doesn’t flip under different weighting”

Claim example: “Equal-weight z̄, count-weighted z̄, and rank-sum agree on the top cluster and bottom outlier but disagree on middle ordering.”

Verify:

Open the ranking_sensitivity section of results/cross-task-stats.json. It lists each tool’s rank under all three weighting schemes and flags any tool that moves by two or more rank positions (claudekit, mindful, superpower as of 20260422).
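
A sketch that prints the movers directly — the key layout inside ranking_sensitivity is assumed from the description above, so adjust the accessors if the real JSON differs:

import json

s = json.load(open('results/cross-task-stats.json'))
for tool, ranks in s['ranking_sensitivity'].items():   # assumed: tool -> {scheme: rank} per weighting scheme
    values = list(ranks.values())
    spread = max(values) - min(values)
    if spread >= 2:
        print(tool, ranks, f'(moves {spread} positions)')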

Also rendered in FINAL-REPORT-3JUDGE-20260422.md §2.


9. Reproducing the pipeline end-to-end

If you want to re-run the benchmark (not just re-aggregate):

See docs/methodology/pipeline.md for the full clone → execute → judge → aggregate flow.

Minimum re-run for one (task, tool, trial):

TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh bmad 1
TASK=refactor ./scripts/manual-bench.sh bmad 1
# → paste prompt, run tool, exit
# → run the printed one-liner (SHA capture + collect-metrics)

TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=refactor ROUND=1 ./scripts/judge-codex.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh  <label>

./scripts/aggregate-results.sh
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py

10. What this benchmark does not let you verify

Some claims fall outside the scope of the checked-in artifact and cannot be independently verified from it. Each such limitation is disclosed in PAPER.md §7 and README.md Caveats.


Quick Reference

Each entry pairs a question with the file to open:

What was the exact prompt sent to this tool? — results/<task>/<tool>/t<N>/phase1-prompt.txt
What did the tool actually do? — results/<task>/<tool>/t<N>/session-logs/<uuid>.jsonl
What happened in a trial, at a glance? — docs/analysis/trial-timelines/<task>/<tool>.md (pre-extracted timeline of skill activations, plugin/skill file reads, subagents, mutations, Bash usage)
What did the judge see? — results/<task>/_blind-eval/<label>/judge-prompt.md + implementation-diff.patch
What did each judge score? — results/<task>/_blind-eval/<label>/round<N>/<judge>-judge.json
How is the aggregate computed? — scripts/cross-task-analysis.py
Which label is which tool? — results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json (after judging is done)
What are the integrity guarantees? — this file + results/README.md