# AI Tool Benchmark — End-to-End Pipeline

Reference for the full flow from clone setup to final judge scores. Reflects the current **one-shot, code-quality-only** methodology.

## 1. Design

- **Tasks (3):** `feature` (greenfield Mode 2 CD Batch from PRD), `bugfix` (fix from QA report), `refactor` (scoped refactor). Each task has its own PRD/brief in `docs/tasks/` or under the task's `_blind-eval/` root.
- **Tools under test (9):** `pure` (plain Claude Code), `claudekit`, `superpower`, `bmad`, `omc` (oh-my-claudecode), `mindful`, `gstack`, `compound`, `ecc`.
- **Base model (all tool runs):** `claude-opus-4-6`. Per-CLI sampler defaults; temperature/seed not exposed by either Claude or OpenCode CLI.
- **Trials per tool × task:** 4 for `feature`, 2 for `bugfix`, 2 for `refactor` → **72 tool runs total** (plus 5 judge rounds on feature, 2 on bugfix, 2 on refactor).
- **Scoring:** Code quality only. 20 rubric items × 0–10 = **max 200 per run**. The planning phase is deliberately *not* scored — only the committed code matters.
- **Judges (3-panel):** `opus` (`claude-opus-4-7`), `codex` (`gpt-5.4`, high reasoning), `qwen` (`qwen3.6-plus`, high reasoning). Earlier 5-judge panel (glm / kimi / gemini / opus / codex) was retired — glm / kimi / gemini had excessive inter-round variance (σ=9–13) and systematic 57-point calibration gaps. The three current judges have published reasoning modes and stable round-to-round behavior. Judge scripts for the retired judges are retained in `scripts/` but unused.
- **Natural invocations:** each tool uses its most natural entry point. Tools without a native planning workflow (`pure`, `mindful`) run with `--permission-mode plan`; tools with native planning (`claudekit`, `bmad`, `omc`, `superpower`) use their own skill. **`gstack` is deliberately excluded from `--permission-mode plan`** across all trials and all tasks — its `/ship` skill runs an eng-review gate as part of the ship workflow, and its skills activate from trigger phrases in the task prose, so forcing Claude's plan mode on top would double-stack planning ceremony.

## 2. Directory Layout

```
ai-tool-benchmark/
├── scripts/              # Pipeline scripts (below)
├── runs/<tool>-t<trial>/ # Isolated git clones with pinned node_modules
├── config/<tool>-t<trial>/ # Isolated CLAUDE_CONFIG_DIR (plugins, sessions)
├── results/<tool>/t<trial>/    # feature task (task root = results/)
│   ├── phase1-metrics.json    # timing/tokens/cost (single phase)
│   ├── phase1-prompt.txt      # exact prompt sent to this trial
│   ├── commits.txt            # line 1 = BASE (last chore: commit), line 2 = IMPL (HEAD)
│   ├── auto-metrics.json      # tsc/eslint/tests/diff stats
│   ├── hard-gates.json        # task-specific hard-gate verdicts
│   ├── session-logs/*.jsonl   # Claude Code session transcripts
│   ├── sessions/session-*.meta.json  # start/end/exit metadata
│   ├── tsc-output.txt / eslint-output.txt / test-output.txt
│   └── diff-stats.txt
├── results/bugfix/<tool>/t<trial>/    # bugfix task (task root = results/bugfix/)
├── results/refactor/<tool>/t<trial>/  # refactor task (task root = results/refactor/)
└── results/[<task>/]_blind-eval/
    ├── .mapping-DO-NOT-OPEN.json  # label ↔ tool/trial truth
    ├── prd.md                      # shared input
    ├── reference-codebase.md       # shared input
    └── <NATO-label>/
        ├── implementation-diff.patch
        ├── judge-prompt.md         # exact prompt sent to each judge
        └── round<N>/               # per-round judge outputs (3-judge panel)
            ├── opus-judge.json + .raw.json
            ├── codex-judge.json + .raw.txt
            └── qwen-judge.json + .raw.txt
```

## 3. Per-Trial Flow

### 3a. Setup (one-time per trial slot)

```bash
./scripts/create-clones.sh                  # clones runs/<tool>-t<trial>/ via APFS COW
./scripts/setup-tool-config.sh <tool> <trial>  # installs plugin/skill into config dir
./scripts/verify-clean.sh <tool> <trial>     # asserts clone is clean at baseline SHA
```

- `create-clones.sh` — pins `BENCH_COMMIT` as baseline. Each clone is a full git working copy with pre-installed `node_modules` (APFS clonefile = ~0 disk cost).
- `setup-tool-config.sh` — per-tool install logic (see `scripts/setup-tool-config.sh`). Installs Claude Code plugins into the isolated `CLAUDE_CONFIG_DIR`. `bmad` also writes `_bmad/` into the clone.
- `verify-clean.sh` — fails if the clone is dirty, missing PRD, or not at `BENCH_COMMIT`.
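The dirty/baseline checks reduce to a small predicate over `git status --porcelain` output plus a SHA comparison. A sketch (the PRD-presence check is omitted, and the function name and signature are illustrative, not the script's actual interface):

```python
def is_clean(porcelain_output: str, head_sha: str, bench_commit: str) -> bool:
    """True iff the clone has no uncommitted changes and sits at the pinned baseline."""
    dirty = any(line.strip() for line in porcelain_output.splitlines())
    return (not dirty) and head_sha == bench_commit

print(is_clean("", "abc123", "abc123"))                  # → True  (clean, at baseline)
print(is_clean(" M src/app.ts\n", "abc123", "abc123"))   # → False (dirty working tree)
print(is_clean("", "abc123", "def456"))                  # → False (not at BENCH_COMMIT)
```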

### 3b. Interactive run

```bash
./scripts/manual-bench.sh <tool> <trial>
```

What it does:
1. Calls `verify-clean.sh`.
2. Builds the exact prompt (shared task block + per-tool prefix) and prints it between `----- COPY THIS PROMPT INTO CLAUDE -----` markers.
3. Saves the prompt to `results/<tool>/t<trial>/phase1-prompt.txt`.
4. Records start metadata (timestamp, base commit, OS, node version, Claude Code version) to `results/<tool>/t<trial>/sessions/session-<ts>.meta.json`.
5. Launches `claude` interactively with `env -i` (clean env), `CLAUDE_CONFIG_DIR=<tool config>`, and `--permission-mode plan` for `pure`/`mindful`.
6. Waits for you to finish and exit (`Ctrl-D` or `/exit`).
7. Copies session JSONL logs from `$CLAUDE_CONFIG_DIR/projects/…/` into `results/<tool>/t<trial>/session-logs/`.
8. Parses session JSONLs (see §4) and writes `phase1-metrics.json`.
9. Prints a one-liner to capture SHA + collect code-quality metrics.

### 3c. Post-run capture

The printed one-liner:

```bash
BASE=$(git -C runs/<tool>-t<trial> log --format='%H %s' | awk '/ chore:/ {print $1; exit}') && \
  IMPL=$(git -C runs/<tool>-t<trial> rev-parse HEAD) && \
  printf '%s\n%s\n' "$BASE" "$IMPL" > results/<tool>/t<trial>/commits.txt && \
  ./scripts/collect-metrics.sh <tool> <trial>
```

- Records BASE (most recent `chore:` setup commit — `pin product-docs` or `install <tool> config`) on line 1 and IMPL (tool's HEAD) on line 2, so `blind-eval-setup.sh` produces a non-empty `phase1..phase2` diff.
- `capture-sha.sh` does the same auto-seeding when called directly (used by the per-phase flow after `manual-bench.sh` sessions).
- `collect-metrics.sh` runs `tsc --noEmit`, ESLint, Jest on `savings-cd`, computes diff stats, and writes `auto-metrics.json`.
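Downstream scripts depend on the two-line `commits.txt` contract (BASE then IMPL, one full SHA per line). A minimal reader of that contract (illustrative Python; the pipeline itself is shell):

```python
import re

def read_commits(text: str) -> tuple[str, str]:
    """Return (BASE, IMPL) from a two-line commits.txt payload."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if len(lines) != 2 or not all(re.fullmatch(r"[0-9a-f]{40}", l) for l in lines):
        raise ValueError("commits.txt must contain exactly two 40-hex SHAs")
    return lines[0], lines[1]

base, impl = read_commits("a" * 40 + "\n" + "b" * 40 + "\n")
print(base == "a" * 40 and impl == "b" * 40)  # → True
```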

## 4. Metrics Schema (per trial)

### `phase1-metrics.json`
```json
{
  "tool": "pure",
  "trial": 1,
  "wall_seconds": 810,              // active work time (matches Claude Code "Worked for" UI)
  "wall_seconds_active": 810,       // session span minus approval-gate idle
  "wall_seconds_span": 1287,        // first → last event in session JSONL
  "wall_seconds_shell": 1469,       // shell start → exit (includes user prep)
  "approval_idle_seconds": 477,     // sum of ExitPlanMode tool_use → tool_result deltas
  "session_start": "…Z",
  "session_end": "…Z",
  "input_tokens": …,
  "output_tokens": …,
  "cache_creation_tokens": …,
  "cache_read_tokens": …,
  "cost_usd": …,                    // often 0 for interactive sessions — cost not always emitted
  "num_turns": …,
  "session_ids": ["…"],
  "exit_code": 0
}
```

**Timing rationale** — Claude Code's TUI shows "Worked for X" which excludes the time the user spends approving plan gates. We replicate this by pairing every `ExitPlanMode` tool_use with its matching `tool_result` (via `tool_use_id`) and subtracting the delta from the session span. Tools without plan mode see `approval_idle = 0` and `active = span`.
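The pairing can be sketched in a few lines of Python. The event shape below is a simplified stand-in for the real session-JSONL schema (the actual files nest content under message envelopes); the pairing logic via `tool_use_id` is the point:

```python
import json
from datetime import datetime

def approval_idle_seconds(jsonl_lines) -> float:
    """Sum of ExitPlanMode tool_use → matching tool_result gaps, paired by tool_use_id."""
    pending = {}  # tool_use_id -> timestamp of the ExitPlanMode tool_use
    idle = 0.0
    for line in jsonl_lines:
        event = json.loads(line)
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        for block in event.get("content", []):
            if block.get("type") == "tool_use" and block.get("name") == "ExitPlanMode":
                pending[block["id"]] = ts
            elif block.get("type") == "tool_result" and block.get("tool_use_id") in pending:
                idle += (ts - pending.pop(block["tool_use_id"])).total_seconds()
    return idle  # active = span - idle

events = [
    json.dumps({"timestamp": "2025-01-01T00:00:00Z",
                "content": [{"type": "tool_use", "name": "ExitPlanMode", "id": "t1"}]}),
    json.dumps({"timestamp": "2025-01-01T00:05:00Z",
                "content": [{"type": "tool_result", "tool_use_id": "t1"}]}),
]
print(approval_idle_seconds(events))  # → 300.0
```

An unmatched `tool_use` (session killed mid-approval) simply contributes nothing, which matches the "tools without plan mode see `approval_idle = 0`" behavior.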

### `auto-metrics.json`
TSC error count, ESLint errors/warnings, Jest pass/fail counts, diff file count and lines, captured SHAs.

## 5. Blind Evaluation Setup

After all trials for a task are committed + captured:

```bash
TASK=feature  ./scripts/blind-eval-setup.sh
TASK=bugfix   ./scripts/blind-eval-setup.sh
TASK=refactor ./scripts/blind-eval-setup.sh
```

- Scans `results/*/t*/commits.txt` and assigns each trial a NATO letter (Alpha–Zulu).
- For each label, extracts `implementation-diff.patch` from `phase1-sha..phase2-sha` with `:(exclude)*.md` so plan documents committed during the impl phase (e.g. `plans/*/plan.md`, `docs/plans/*.md`, `CLAUDE.md`) are stripped out — judges only see code.
- Copies shared `prd.md` and `reference-codebase.md` (static inputs).
- **Incremental** — re-running won't touch already-judged trials.
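The incremental property can be sketched as follows. This shows only the "never touch an already-judged trial" guarantee; the real script presumably randomizes which label a new trial gets (a deterministic alphabetical assignment would leak the mapping), and the spelling `Alpha` follows this doc's labels rather than the official NATO alphabet:

```python
NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
        "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
        "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey",
        "X-ray", "Yankee", "Zulu"]

def assign_labels(trials, existing):
    """Keep existing label assignments; give new trials the next unused NATO letters."""
    mapping = dict(existing)
    free = iter(l for l in NATO if l not in mapping.values())  # re-checks as mapping grows
    for trial in trials:
        if trial not in mapping:
            mapping[trial] = next(free)
    return mapping

m = assign_labels(["pure/t1", "bmad/t1"], {"pure/t1": "Alpha"})
print(m)  # → {'pure/t1': 'Alpha', 'bmad/t1': 'Bravo'}
```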

## 6. Judging

```bash
TASK=feature ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=feature ROUND=1 ./scripts/judge-codex.sh <label>
TASK=feature ROUND=1 ./scripts/judge-qwen.sh  <label>
```

Per label:
1. `generate-judge-prompt-combined.sh <label>` produces the bundled prompt — PRD + reference codebase + impl diff + rubric + JSON schema — and saves it to `results/<task>/_blind-eval/<label>/judge-prompt.md`.
2. `judge-{opus,codex,qwen}.sh` each send the prompt to their respective model CLI, save raw output as `<judge>-judge.raw.txt` / `.raw.json`, and parse it into `<judge>-judge.json`. Retired judge scripts (GLM, Kimi, Gemini) are kept in `scripts/` for historical reproducibility but are not part of the 3-panel; see §1 for the retirement rationale.
3. Judge JSON schema:
   ```json
   {
     "label": "Alpha",
     "judge": "opus",
     "scores": {"1": 8, "2": 9, …, "20": 7},
     "total": 165,
     "max": 200,
     "notes": "…"
   }
   ```
4. **Multi-round** — re-running the three `judge-{opus,codex,qwen}.sh` scripts with a bumped `ROUND=` creates `round<N>/` subdirs for variance smoothing. Aggregation picks up every `^round[0-9]+$` dir.
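The round-discovery rule in step 4 amounts to a strict regex match plus a numeric sort (a sketch; note that `round10` sorts after `round2`, which a plain string sort would get wrong):

```python
import re

def discover_rounds(dirnames):
    """Return dirs matching the aggregator's ^round[0-9]+$ pattern, in numeric order."""
    rounds = [d for d in dirnames if re.fullmatch(r"round[0-9]+", d)]
    return sorted(rounds, key=lambda d: int(d[len("round"):]))

print(discover_rounds(["round2", "round10", "round1", "notes", "round-x"]))
# → ['round1', 'round2', 'round10']
```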

Rubric (20 items × 0–10):
- **Correctness (100)** — buy price formula, aging days, interest on face_value, maturity vs early sell rates, partial/full withdrawal semantics, per-batch inventory, payment scheduling, strategy interface.
- **Code Quality (50)** — pattern adherence, TS errors, lint, hardcoded values, error handling.
- **Integration (50)** — Mode 1 regression, service usage, DB/API conventions, backward compat.

## 7. Aggregation

Per-task aggregation (balanced mean of per-judge means):

```bash
TASK=feature  ./scripts/aggregate-results.sh
TASK=bugfix   ./scripts/aggregate-results.sh
TASK=refactor ./scripts/aggregate-results.sh
```

- Reads mapping + every `<judge>-judge.json` across all `^round[0-9]+$` dirs.
- **Canonical score rule:** `sum(scores.values())` per file, never the stored `total` field.
- **Canonical tool mean:** balanced mean of per-judge means (equal-weight average). Cancels additive judge calibration drift; does not remove all calibration structure.
- Writes `results/<task>/final-report.md` (the feature task's report lives at `results/final-report.md`) with per-tool mean ± stdev, ranking, and per-trial breakdown.
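The two canonical rules can be written out explicitly. A sketch, with score dicts following the judge JSON schema in §6:

```python
from statistics import mean

def canonical_total(judge_json) -> int:
    """Canonical per-file score: sum of rubric item scores, never the stored 'total'."""
    return sum(judge_json["scores"].values())

def balanced_mean(files_by_judge) -> float:
    """Equal-weight average of per-judge means; cancels additive judge calibration drift."""
    return mean(mean(canonical_total(f) for f in files) for files in files_by_judge.values())

runs = {
    "opus":  [{"scores": {"1": 8, "2": 9}}, {"scores": {"1": 7, "2": 8}}],  # means: 16
    "codex": [{"scores": {"1": 6, "2": 7}}],                                # means: 13
}
print(balanced_mean(runs))  # → 14.5
```

The equal weighting matters when judges contribute unequal numbers of rounds: a judge with more files still counts once in the outer mean.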

Cross-task statistics (the numbers in `FINAL-REPORT-3JUDGE-*.md` and `PAPER.md`):

```bash
python3 scripts/cross-task-analysis.py    # bootstrap CIs, tiers, ranking sensitivity, calibration
python3 scripts/krippendorff-alpha.py     # inter-rater α, per-task + pairwise
python3 scripts/audit-cohort-symmetry.py  # rerun-protocol compliance
```

All three are deterministic (bootstrap uses seed 42) and read only the files under `results/`.
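The seeded percentile bootstrap behind those CIs can be illustrated like so. This is a sketch of the general technique under a fixed seed, not the script's exact implementation:

```python
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean; a fixed seed keeps the report deterministic."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values) for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

scores = [152, 160, 148, 171, 166]          # hypothetical per-trial totals
lo, hi = bootstrap_ci(scores)
print(lo <= sum(scores) / len(scores) <= hi)  # → True
```

Re-running with the same seed reproduces the interval bit-for-bit, which is what makes the published CIs verifiable from `results/` alone.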

## 8. Reproducibility Artifacts

For research-paper reproducibility, every trial preserves:
- `phase1-prompt.txt` — the exact instructions sent in.
- `session-logs/*.jsonl` — full transcript with all tool calls, token usage, timestamps.
- `skill-verification.txt` — which skills/agents the tool actually invoked (validates that e.g. `bmad-quick-dev` fired for bmad).
- Implementation diff (base → final SHA).

For every judge decision:
- `judge-prompt.md` — byte-identical prompt used for the eval.
- `<judge>-judge.raw.txt`/`.raw.json` — unparsed model output.
- `<judge>-judge.json` — parsed scores.

## 9. Adding a New Tool

Use `./scripts/add-tool.sh` to onboard a new tool. The script is interactive; pass `--dry-run` to preview the changes without applying them.

The script prompts for:
- **Name** (short, lowercase)
- **Description** (one-line for docs)
- **Install type** — one of:
  - `plugin` — Claude Code plugin from a marketplace (most OMC-ecosystem tools)
  - `clone` — files copied into the run clone (bmad/claudekit style)
  - `npm` — npx-based installer into the clone (bmad-method style)
  - `none` — no install (pure Claude Code baseline)
- **Install details** per type (marketplace URL + plugin id; or source repo + subpath; or npm spec + args)
- **Plan mode** — yes/no, whether to launch with `--permission-mode plan` (for tools without a native planning workflow)
- **Prompt prefix** — optional text prepended to the shared task block (e.g. `Use /newtool:plan then /newtool:run.`)
- **Gitignore patterns** — safety patterns added to `create-clones.sh` benchmark-gitignore block

It then atomically patches, in one Python pass:
- `scripts/env.sh` — appends to the `TOOLS` array
- `scripts/setup-tool-config.sh` — inserts the install case block before the `*)` default
- `scripts/manual-bench.sh` — inserts the per-tool `PROMPT=` case; extends `pure|mindful` plan-mode guard if needed
- `scripts/create-clones.sh` — adds gitignore safety patterns (de-duplicated)
- `docs/plans/one-shot-prompts.md` — appends a playbook section before "Running a Manual Trial"
- `docs/pipeline.md` — bumps "Tools under test (N)" count + list + "N runs total"

After applying, it syntax-checks every modified shell script with `bash -n`, optionally runs `create-clones.sh` to provision T1/T2/T3 clones, and prints the exact next-step commands.

**Example:**
```bash
./scripts/add-tool.sh --dry-run     # preview
./scripts/add-tool.sh               # apply, then:
./scripts/setup-tool-config.sh <newtool> 1
./scripts/manual-bench.sh <newtool> 1
```

**Judging doesn't need changes** — `blind-eval-setup.sh` auto-discovers runs from `results/*/t*/commits.txt`, and the 20-item rubric applies to any implementation regardless of which tool produced it.

## 10. Known Limitations

- **Cost field often 0** — Claude Code doesn't always emit `costUSD`/`total_cost_usd` in interactive session JSONLs. Tokens are reliable; cost must be computed downstream via published pricing if needed.
- **One-shot flow** — `commits.txt` now records BASE (last `chore:` commit) on line 1 and IMPL (tool HEAD) on line 2. `blind-eval-setup.sh` produces `implementation-diff.patch` from BASE→IMPL with `.md` files filtered out, so judges receive code-only diffs even when a tool commits its plan into the repo.
- **LLM non-determinism** — neither Claude nor OpenCode CLI exposes temperature or seed; judge sampling uses provider defaults. Round-to-round σ reflects this; multi-round + multi-judge averaging is the mitigation. See PAPER.md §7 "Judge sampling not pinned".
- **Non-interactive judging** — judges receive a fully-inlined bundled prompt. They have no tool access, cannot fetch additional context, and rely entirely on the PRD + reference codebase + diff baked into `judge-prompt.md`.

## 11. Quick Reference — End-to-End Command Sequence

```bash
# Per trial — set TASK=feature|bugfix|refactor
TASK=refactor ./scripts/create-clones.sh <trial>
TASK=refactor ./scripts/setup-tool-config.sh <tool> <trial>
TASK=refactor ./scripts/manual-bench.sh <tool> <trial>
# → paste prompt, run tool, exit
# → run the printed one-liner (SHA capture + collect-metrics)

# After all trials for a task
TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=refactor ROUND=1 ./scripts/judge-codex.sh <label>
TASK=refactor ROUND=1 ./scripts/judge-qwen.sh  <label>
TASK=refactor ./scripts/aggregate-results.sh

# Cross-task stats (after all 3 tasks are judged)
python3 scripts/cross-task-analysis.py     # → results/FINAL-REPORT-*.md + cross-task-stats.json
python3 scripts/krippendorff-alpha.py      # → results/krippendorff-alpha.json
python3 scripts/audit-cohort-symmetry.py   # non-zero exit on rerun-protocol violations

# Onboarding a new tool (one-time per tool)
./scripts/add-tool.sh --dry-run    # preview
./scripts/add-tool.sh              # apply + optional create-clones
```

## 12. Reader Guides

- **[../results/README.md](../results/README.md)** — folder index + file schemas for `results/`
- **[VERIFICATION-GUIDE.md](VERIFICATION-GUIDE.md)** — how to independently verify any claim
