# Verification Guide — Reproducing the Benchmark Claims

This guide shows how to independently verify any claim made in `PAPER.md`, `README.md`, or `results/FINAL-REPORT-3JUDGE-20260421.md`. Everything needed is checked into `results/` — no private state, no network access.

---

## 0. Setup

```bash
git clone <this-repo>
cd ai-tool-benchmark
python3 -m pip install --user --break-system-packages numpy krippendorff
```

Only `numpy` and `krippendorff` are needed for re-computation; the scripts otherwise use only the Python standard library, with no framework dependencies.

---

## 1. "Where does a tool's score come from?"

**Claim example:** "`bmad` feature-task z̄ = +0.297, bootstrap 95% CI [a, b]" (from `FINAL-REPORT-3JUDGE-20260421.md`).

**Chain of evidence:**

1. **Raw judge files** — `results/_blind-eval/<label>/round<N>/{opus,codex,qwen}-judge.json`, where `<label>` is whichever NATO letter maps to `bmad` in `results/_blind-eval/.mapping-DO-NOT-OPEN.json`.
2. **Canonical score per file** — `sum(scores.values())` (authoritative; do not use the stored `total` field).
3. **Canonical round filter** — only dirs matching `^round[0-9]+$`. Pilot/sample dirs are excluded.
4. **Aggregation** — `scripts/cross-task-analysis.py` reads these files, computes balanced mean (equal-weight average of per-judge means), z-score within task, and stratified bootstrap 95% CI (10,000 resamples, seed 42, stratified by judge).
5. **Output** — `results/cross-task-stats.json` + `results/FINAL-REPORT-3JUDGE-20260421.md`.

**To verify:**
```bash
python3 scripts/cross-task-analysis.py
diff <(python3 scripts/cross-task-analysis.py | tail -200) results/FINAL-REPORT-3JUDGE-20260421.md
```

The script is deterministic (seed 42). Output should byte-match the committed report.

---

## 2. "Why is `superpower` ranked 9th on bugfix?"

**Claim example:** "`superpower` is the only outlier, driven by scope-discipline failure on bugfix."

**Walk the chain:**

1. `results/bugfix/_blind-eval/.mapping-DO-NOT-OPEN.json` — find which label is `superpower t1`, `superpower t2`.
2. `results/bugfix/_blind-eval/<label>/implementation-diff.patch` — inspect the actual code diff the judges saw.
3. `results/bugfix/_blind-eval/<label>/round1/{opus,codex,qwen}-judge.json` — read the `notes` field and rubric scores; the low items should cluster in scope/regression categories.
4. `results/bugfix/superpower/t1/session-logs/<uuid>.jsonl` — the **full session transcript**. Grep for tool calls that touched files outside the stated scope.
5. `results/bugfix/superpower/t1/hard-gates.json` — automated scope/regression gates.

**Cross-reference:** `PAPER.md` §5 Case Study narrates this chain.

---

## 3. "Krippendorff α on totals is negative for feature and refactor"

**Claim example:** "Judges disagree on absolute scale more than chance would predict on feature (α = −0.286) and refactor (α = −0.406)."

**Verify:**

```bash
python3 scripts/krippendorff-alpha.py
cat results/krippendorff-alpha.json
```

The JSON has `totals` (α on summed scores) and `per_item` (α averaged across the 20 rubric items) for each of the 3 tasks, plus pairwise judge-pair α. Compare against the paper's §4.4b table.

**Interpretation note:** negative α means **observed disagreement exceeds chance-expected disagreement on the measured (interval) scale**. It does not mean judges disagree on ranking — rank-order agreement is a separate Spearman ρ computation.
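
The interval-scale α behind that note can be reproduced from first principles. Below is a minimal complete-data implementation; the committed `scripts/krippendorff-alpha.py` and the `krippendorff` package additionally handle missing data and other measurement levels:

```python
def interval_alpha(ratings: list[list[float]]) -> float:
    """Krippendorff's alpha for interval data, complete ratings only.
    `ratings` has one row per judge, one column per scored unit."""
    units = list(zip(*ratings))                # regroup values by unit
    n = sum(len(u) for u in units)             # total pairable values
    # Observed disagreement: mean squared difference within units.
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: mean squared difference over pooled values.
    pooled = [v for u in units for v in u]
    d_e = sum((a - b) ** 2 for a in pooled for b in pooled) / (n * (n - 1))
    return 1 - d_o / d_e

# Two judges whose scores invert each other disagree more than chance:
# alpha is negative (-0.75 here), regardless of any rank-order agreement.
print(interval_alpha([[1, 2, 3, 4], [4, 3, 2, 1]]))
```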

---

## 4. "Top-4 is a statistical tie"

**Claim example:** "`bmad`, `ecc`, `pure`, `gstack` are not separable — all pairwise 95% CIs overlap on every task."

**Verify:**

Open `results/cross-task-stats.json`. Each task has a `tiers` array (complete-linkage: a tool joins a tier only if its CI overlaps with every existing tier member) and a `pairwise_disjoint` list (transitivity-free pairs where CIs actually don't overlap). For the top-4 claim:

```python
import json

s = json.load(open('results/cross-task-stats.json'))
top4 = {'bmad', 'ecc', 'pure', 'gstack'}
for task in ('feature', 'bugfix', 'refactor'):
    # pairs whose CIs do NOT overlap; any top-4 pair here breaks the tie claim
    pairs = s['per_task'][task]['pairwise_disjoint']
    print(task, [p for p in pairs if p[0] in top4 and p[1] in top4])
```

Expected: empty list for every task.

---

## 5. "No hidden sampling of trials / no cherry-picking reruns"

**Claim example:** The cohort-symmetry rule in `docs/pipeline.md` requires that if trial `t<N>` is rerun for one tool, it's rerun for all 9 before the trial is used in comparison.

**Verify:**

```bash
python3 scripts/audit-cohort-symmetry.py
```

The script reads `results/<task>/<tool>/t<N>/sessions/*.meta.json` across all 9 tools, groups the sessions by trial index, and reports:
- Missing trials per tool (hard violation → non-zero exit)
- Base-commit divergence within a trial (hard violation → non-zero exit)
- >24h timestamp span within a trial (soft warning)
- Any archived reruns under `results/<task>/<tool>/archive-t<N>-<date>/`

The script currently reports 8 base-commit divergences on feature t1 (documented in `PAPER.md` §7) and 3 archived reruns (claudekit / ecc / omc t1–t2). Those are disclosed, not hidden.

---

## 6. "The judge didn't see the tool identity"

**Verify:**

1. Open any `results/_blind-eval/<label>/judge-prompt.md`. Confirm nothing in the prompt names the tool.
2. Open `implementation-diff.patch` in the same dir. Confirm the diff uses paths from the actual repo (no tool-specific markers like `_bmad/` or `claudekit/`).
3. Open `.mapping-DO-NOT-OPEN.json` only after you've formed your own expectation.

The judge receives: PRD + reference codebase markdown + the diff patch + the 20-item rubric + JSON schema. Nothing else.

---

## 7. "The judges use the canonical model settings"

**Claim example:** "All judges run at their provider's default sampling temperature; seed not exposed by either CLI."

**Verify:**

```bash
claude --help | grep -iE 'temp|seed|sampl'        # Claude CLI — no results
opencode run --help | grep -iE 'temp|seed|sampl'  # OpenCode CLI — no results
```

Comment headers in `scripts/judge-{opus,codex,qwen}.sh` document this. Round-to-round variance reported in `FINAL-REPORT` §7 absorbs the sampler noise; three-judge averaging is the mitigation, not a fix.

---

## 8. "The ranking doesn't flip under different weighting"

**Claim example:** "Equal-weight z̄, count-weighted z̄, and rank-sum agree on the top cluster and bottom outlier but disagree on middle ordering."

**Verify:**

Open `results/cross-task-stats.json` → `ranking_sensitivity` section. It lists each tool's rank under all three schemes and flags tools with ≥2 rank-positions of movement (`mindful`, `compound`, `claudekit` as of 2026-04-21).

Also rendered in `FINAL-REPORT-3JUDGE-20260421.md` §2.
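
The movement flag can be scripted once you know the JSON layout. The shape assumed below (`ranking_sensitivity` as a mapping from scheme name to per-tool ranks) is a guess at the committed structure; adjust the field names to match `results/cross-task-stats.json`:

```python
def unstable_tools(ranks: dict[str, dict[str, int]], threshold: int = 2) -> list[str]:
    """Tools whose rank moves by >= `threshold` positions across schemes.
    `ranks` maps scheme name -> {tool: rank}; this layout is an assumption
    about the ranking_sensitivity section, not a documented schema."""
    tools = next(iter(ranks.values()))
    return sorted(
        tool for tool in tools
        if max(s[tool] for s in ranks.values())
           - min(s[tool] for s in ranks.values()) >= threshold
    )

# e.g. unstable_tools(stats["ranking_sensitivity"]) after loading the JSON.
```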

---

## 9. Reproducing the pipeline end-to-end

If you want to re-run the benchmark (not just re-aggregate):

See **[pipeline.md](pipeline.md)** for the full clone → execute → judge → aggregate flow.

**Minimum re-run for one (task, tool, trial):**

```bash
TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh bmad 1
TASK=refactor ./scripts/manual-bench.sh bmad 1
# → paste prompt, run tool, exit
# → run the printed one-liner (SHA capture + collect-metrics)

TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=refactor ROUND=1 ./scripts/judge-codex.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh  <label>

./scripts/aggregate-results.sh
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py
```

---

## 10. What this benchmark does **not** let you verify

Being explicit about what's outside the scope of the artifact:

- **Judge self-preference at the family level** — all 9 executors use a Claude base model, so Anthropic-family favoritism is not identified by this design. A true audit would need a non-Anthropic-base executor as control.
- **Generalization to other languages / codebases** — single TypeScript NX monorepo.
- **Tool-version drift** — results are a 2026-04 snapshot.
- **Judge-pool completeness** — 3 judges is a minimum for the panel estimator; larger panels would tighten CIs further.

Each limitation is disclosed in `PAPER.md` §7 and `README.md` Caveats.

---

## Quick Reference

| Question | File to open |
|---|---|
| What was the exact prompt sent to this tool? | `results/<task>/<tool>/t<N>/phase1-prompt.txt` |
| What did the tool actually do? | `results/<task>/<tool>/t<N>/session-logs/<uuid>.jsonl` |
| What did the judge see? | `results/<task>/_blind-eval/<label>/judge-prompt.md` + `implementation-diff.patch` |
| What did each judge score? | `results/<task>/_blind-eval/<label>/round<N>/<judge>-judge.json` |
| How is the aggregate computed? | `scripts/cross-task-analysis.py` |
| Which label is which tool? | `results/<task>/_blind-eval/.mapping-DO-NOT-OPEN.json` (after judging is done) |
| What are the integrity guarantees? | This file + `results/README.md` |
