Verification Guide — Reproducing the Benchmark Claims

Guide for a reader who wants to independently verify any claim made in PAPER.md, README.md, or results/FINAL-REPORT-3JUDGE-20260422.md. Everything needed is checked into the repository (raw artifacts under results/, analysis code under scripts/) — no private state, no network access.


0. Setup

git clone <this-repo>
cd ai-tool-benchmark
python3 -m pip install --user --break-system-packages numpy krippendorff

Only numpy and krippendorff are needed for re-computation; the analysis scripts otherwise use only the standard library — no framework.


1. “Where does a tool’s score come from?”

Claim example: “bmad feature-task z̄ = +0.270, bootstrap 95% CI [a, b]” (from FINAL-REPORT-3JUDGE-20260422.md).

Chain of evidence:

  1. Raw judge files — results/feature/_blind-eval/<label>/round<N>/{opus,codex,qwen}-judge.json, where <label> is whichever NATO letter maps to bmad in results/feature/_blind-eval/.mapping-DO-NOT-OPEN.json.
  2. Canonical score per file — sum(scores.values()) (authoritative; do not use the stored total field — see the sketch after this list).
  3. Canonical round filter — only dirs matching ^round[0-9]+$. Pilot/sample dirs are excluded.
  4. Aggregation — scripts/cross-task-analysis.py reads these files, computes the balanced mean (equal-weight average of per-judge means), the z-score within each task, and a stratified bootstrap 95% CI (10,000 resamples, seed 42, stratified by judge).
  5. Output — results/cross-task-stats.json + results/FINAL-REPORT-3JUDGE-20260422.md.
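
Steps 1–3 can be spot-checked directly. A minimal sketch, assuming the layout above; the label alfa is a placeholder — substitute whatever the mapping file assigns to bmad:

import json, re
from pathlib import Path

LABEL = 'alfa'                            # placeholder — use the mapping's actual label for bmad
ROUND_RE = re.compile(r'^round[0-9]+$')   # canonical round filter

base = Path('results/feature/_blind-eval') / LABEL
for round_dir in sorted(d for d in base.iterdir() if d.is_dir() and ROUND_RE.match(d.name)):
    for judge in ('opus', 'codex', 'qwen'):
        data = json.loads((round_dir / f'{judge}-judge.json').read_text())
        canonical = sum(data['scores'].values())                    # authoritative per-file score
        print(round_dir.name, judge, canonical, data.get('total'))  # stored total shown only for comparison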

To verify:

# Snapshot the current committed report + stats, then regenerate and compare.
cp results/FINAL-REPORT-3JUDGE-20260422.md /tmp/final-report.before.md
cp results/cross-task-stats.json /tmp/cross-task-stats.before.json
python3 scripts/cross-task-analysis.py       # rewrites FINAL-REPORT + cross-task-stats.json in place
diff /tmp/final-report.before.md      results/FINAL-REPORT-3JUDGE-20260422.md
diff /tmp/cross-task-stats.before.json results/cross-task-stats.json

The script is deterministic (seed 42). Both diffs should be empty — if not, either your dependency versions differ or the committed artifacts have drifted from their generating source.
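
If you also want to sanity-check the aggregation in step 4 rather than just diff the outputs, the sketch below mirrors the described procedure — balanced mean of per-judge means with a judge-stratified bootstrap (10,000 resamples, seed 42). It is an illustration of the method, not a copy of scripts/cross-task-analysis.py:

import numpy as np

def balanced_mean_ci(scores_by_judge, n_boot=10_000, seed=42):
    # scores_by_judge: {'opus': [totals...], 'codex': [...], 'qwen': [...]}
    rng = np.random.default_rng(seed)
    judges = sorted(scores_by_judge)
    point = np.mean([np.mean(scores_by_judge[j]) for j in judges])   # balanced mean
    boots = []
    for _ in range(n_boot):
        # stratified: resample with replacement within each judge, then equal-weight the judge means
        means = [rng.choice(scores_by_judge[j], size=len(scores_by_judge[j])).mean()
                 for j in judges]
        boots.append(np.mean(means))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)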


2. “Why is superpower ranked 9th on bugfix?”

Claim example: “superpower’s bugfix CI [151.2, 161.0] sits below the next-lowest tool’s CI and does not overlap T1, even under the forced-activation protocol (/superpowers:systematic-debugging at session start, /superpowers:verification-before-completion at exit). The T2 output is a skill-content ceiling on this task, not an activation failure.”

Walk the chain:

  1. results/bugfix/_blind-eval/.mapping-DO-NOT-OPEN.json — find which labels map to superpower t1 and superpower t2.
  2. results/bugfix/_blind-eval/<label>/implementation-diff.patch — inspect the actual code diff the judges saw.
  3. results/bugfix/_blind-eval/<label>/round1/{opus,codex,qwen}-judge.json — read the notes field and rubric scores; compare to T1 cells.
  4. results/bugfix/superpower/t1/session-logs/<uuid>.jsonl — the full session transcript. Confirm both Skill calls fired (systematic-debugging, verification-before-completion); a scan sketch follows this list.
  5. results/bugfix/superpower/t1/hard-gates.json — automated scope/regression gates (5/5 PASS).
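
A minimal scan for step 4, assuming the two skill names appear verbatim somewhere in the transcript lines (the JSONL schema itself is not assumed here):

from pathlib import Path

log = next(Path('results/bugfix/superpower/t1/session-logs').glob('*.jsonl'))
skills = ('systematic-debugging', 'verification-before-completion')
hits = {s: 0 for s in skills}
for line in log.read_text().splitlines():
    for s in skills:
        if s in line:
            hits[s] += 1
print(hits)   # both counts should be >= 1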

Cross-reference: PAPER.md §5 narrates this chain.


3. “Krippendorff α on totals is negative for feature and refactor”

Claim example: “Judges disagree on absolute scale more than chance would predict on feature (α = −0.286) and refactor (α = −0.406).”

Verify:

python3 scripts/krippendorff-alpha.py
cat results/krippendorff-alpha.json

The JSON has totals (α on summed scores) and per_item (α averaged across the 20 rubric items) for each of the 3 tasks, plus α for each judge pair. Compare against PAPER §4.4b.

Interpretation note: negative α means observed disagreement exceeds chance-expected disagreement on the measured (interval) scale. It does not mean judges disagree on ranking — rank-order agreement is a separate Spearman ρ computation.
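
To recompute the totals α from the raw judge files rather than trusting krippendorff-alpha.py, a sketch using the krippendorff package (same layout assumptions as §1):

import json, re
from pathlib import Path
import krippendorff

TASK = 'feature'
ROUND_RE = re.compile(r'^round[0-9]+$')
judges = ('opus', 'codex', 'qwen')
rows = {j: [] for j in judges}            # one row per judge, one column per (label, round) item

for label_dir in sorted(p for p in Path(f'results/{TASK}/_blind-eval').iterdir() if p.is_dir()):
    for round_dir in sorted(d for d in label_dir.iterdir() if d.is_dir() and ROUND_RE.match(d.name)):
        for j in judges:
            data = json.loads((round_dir / f'{j}-judge.json').read_text())
            rows[j].append(sum(data['scores'].values()))

alpha = krippendorff.alpha(reliability_data=[rows[j] for j in judges],
                           level_of_measurement='interval')
print(TASK, 'totals alpha =', round(alpha, 3))

Swap TASK to bugfix or refactor and compare against results/krippendorff-alpha.json.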


4. “Top-4 is a statistical tie”

Claim example:bmad, ecc, pure, gstack are not separable — all pairwise 95% CIs overlap on every task.”

Verify:

Open results/cross-task-stats.json. Each task has a tiers array (complete-linkage: a tool joins a tier only if its CI overlaps with every existing tier member) and a pairwise_disjoint list (pairs whose CIs genuinely do not overlap, recorded directly rather than inferred through tier transitivity). For the top-4 claim:

import json
s = json.load(open('results/cross-task-stats.json'))
top4 = {'bmad', 'ecc', 'pure', 'gstack'}
for task in ('feature', 'bugfix', 'refactor'):
    pairs = s['per_task'][task]['pairwise_disjoint']
    print(task, [p for p in pairs if p[0] in top4 and p[1] in top4])

Expected: empty list for every task.
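
If you want to rebuild the tiers yourself instead of trusting the stored tiers array, a sketch of the complete-linkage rule described above (the input format is assumed — tuples of tool, CI low, CI high, sorted best-first):

def build_tiers(cis):
    # cis: list of (tool, lo, hi) tuples sorted by point estimate, best first
    def overlaps(a, b):
        return a[1] <= b[2] and b[1] <= a[2]
    tiers = []
    for entry in cis:
        for tier in tiers:
            if all(overlaps(entry, member) for member in tier):   # complete linkage: overlap with every member
                tier.append(entry)
                break
        else:
            tiers.append([entry])                                 # no tier accepts it — start a new one
    return [[t[0] for t in tier] for tier in tiers]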


5. “No hidden sampling of trials / no cherry-picking reruns”

Claim example: The cohort-symmetry rule in docs/methodology/pipeline.md requires that if trial t<N> is rerun for one tool, it’s rerun for all 9 before the trial is used in comparison.

Verify:

python3 scripts/audit-cohort-symmetry.py

Reads results/<task>/<tool>/t<N>/sessions/*.meta.json across all 9 tools, groups the sessions by trial index, and reports per-trial coverage across tools, base-commit consistency, and any archived reruns.

The script currently reports 8 base-commit divergences on feature t1 (documented in PAPER.md §7) and 3 archived reruns (claudekit / ecc / omc t1–t2). Those are disclosed, not hidden.
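
A rough stand-alone version of the same grouping, assuming each *.meta.json carries a base_commit field (that field name is an assumption; audit-cohort-symmetry.py is authoritative):

import json
from collections import defaultdict
from pathlib import Path

TASK = 'feature'
by_trial = defaultdict(dict)    # trial index -> {tool: set of base commits seen}
for meta in Path(f'results/{TASK}').glob('*/t*/sessions/*.meta.json'):
    tool, trial = meta.parts[-4], meta.parts[-3]
    commit = json.loads(meta.read_text()).get('base_commit')      # assumed field name
    by_trial[trial].setdefault(tool, set()).add(commit)

for trial, tools in sorted(by_trial.items()):
    commits = {c for s in tools.values() for c in s}
    print(trial, f'{len(tools)}/9 tools', 'divergent base commits' if len(commits) > 1 else 'ok')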


6. “The judge didn’t see the tool identity”

Verify:

  1. Open any results/<task>/_blind-eval/<label>/judge-prompt.md. Confirm nothing in the prompt names the tool.
  2. Open implementation-diff.patch in the same dir. Confirm the diff uses paths from the actual repo (no tool-specific markers like _bmad/ or claudekit/).
  3. Open .mapping-DO-NOT-OPEN.json only after you’ve formed your own expectation.

The judge receives: PRD + reference codebase markdown + the diff patch + the 20-item rubric + JSON schema. Nothing else.
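
A quick mechanical version of check 2, grepping each committed patch for tool-name strings (the marker list is illustrative, not the definitive set of nine tools, and common substrings can false-positive):

from pathlib import Path

markers = ('bmad', 'ecc', 'gstack', 'claudekit', 'mindful', 'superpower', 'omc')  # illustrative subset
for patch in Path('results/feature/_blind-eval').glob('*/implementation-diff.patch'):
    text = patch.read_text(errors='replace').lower()
    leaks = [m for m in markers if m in text]
    if leaks:
        print(patch.parent.name, 'possible identity leak:', leaks)   # inspect by hand; substrings may be benign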


7. “The judges use the canonical model settings”

Claim example: “All judges run at their provider’s default sampling temperature; seed not exposed by either CLI.”

Verify:

claude --help | grep -iE 'temp|seed|sampl'        # Claude CLI — expect no matches
opencode run --help | grep -iE 'temp|seed|sampl'  # OpenCode CLI — expect no matches

Comment headers in scripts/judge-{opus,codex,qwen}.sh document this. Round-to-round variance reported in FINAL-REPORT §7 absorbs the sampler noise; three-judge averaging is the mitigation, not a fix.


8. “The ranking doesn’t flip under different weighting”

Claim example: “Equal-weight z̄, count-weighted z̄, and rank-sum agree on the top cluster and bottom outlier but disagree on middle ordering.”

Verify:

Open the ranking_sensitivity section of results/cross-task-stats.json. It lists each tool’s rank under all three weighting schemes and flags any tool that moves by two or more rank positions (claudekit, mindful, superpower as of 20260422).
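
A sketch that prints the movers directly — the key layout inside ranking_sensitivity is assumed from the description above, so adjust the accessors if the real JSON differs:

import json

s = json.load(open('results/cross-task-stats.json'))
for tool, ranks in s['ranking_sensitivity'].items():   # assumed: tool -> {scheme: rank} per weighting scheme
    values = list(ranks.values())
    spread = max(values) - min(values)
    if spread >= 2:
        print(tool, ranks, f'(moves {spread} positions)')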

Also rendered in FINAL-REPORT-3JUDGE-20260422.md §2.


9. Reproducing the pipeline end-to-end

If you want to re-run the benchmark (not just re-aggregate):

See docs/methodology/pipeline.md for the full clone → execute → judge → aggregate flow.

Minimum re-run for one (task, tool, trial):

TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh bmad 1
TASK=refactor ./scripts/manual-bench.sh bmad 1
# → paste prompt, run tool, exit
# → run the printed one-liner (SHA capture + collect-metrics)

TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=refactor ROUND=1 ./scripts/judge-codex.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh  <label>

./scripts/aggregate-results.sh
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py

10. What this benchmark does not let you verify

Some claims fall outside the scope of the checked-in artifact and cannot be independently verified from it. Each such limitation is disclosed in PAPER.md §7 and README.md Caveats.


Quick Reference

Each entry pairs a question with the file to open:

What was the exact prompt sent to this tool? — results/<task>/<tool>/t<N>/phase1-prompt.txt
What did the tool actually do? — results/<task>/<tool>/t<N>/session-logs/<uuid>.jsonl
What happened in a trial, at a glance? — docs/analysis/trial-timelines/<task>/<tool>.md (pre-extracted timeline of skill activations, plugin/skill file reads, subagents, mutations, Bash usage)
What did the judge see? — results/<task>/_blind-eval/<label>/judge-prompt.md + implementation-diff.patch
What did each judge score? — results/<task>/_blind-eval/<label>/round<N>/<judge>-judge.json
How is the aggregate computed? — scripts/cross-task-analysis.py
Which label is which tool? — results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json (after judging is done)
What are the integrity guarantees? — this file + results/README.md