# Quickstart — Run one trial end-to-end in ~10 minutes
This guide takes you from `git clone` to a scored trial by the shortest path. For the full methodology and reference, see ../methodology/pipeline.md.
## Prerequisites
- macOS or Linux. Windows via WSL is untested.
- git, bash (4.0+), python3 (3.10+, with `numpy` and `scipy` if you want to re-compute stats).
- Claude Code CLI — latest; `claude --version` must succeed. (`pure`, `mindful`, `bmad`, `claudekit`, `superpower`, `compound`, and `ecc` all run inside Claude Code.)
- OpenCode CLI — only if you want to run judges other than `opus`; `opencode --version` must succeed.
- Anthropic API key, for both Claude Code and OpenCode.
- Disk: ~2 GB for the benchmark’s base repo clones and judge artifacts.
The benchmark runs against a real TypeScript NX monorepo that must be cloneable locally. Set `BENCH_REPO` to the clone URL of that monorepo (your fork or the original) — `scripts/env.sh` reads it and derives `BASE_REPO` as the local prepared-clone path (`runs/base-repo`, `runs/base-bugfix`, or `runs/base-refactor`, depending on `TASK`). `create-clones.sh` provisions those base clones on first use; after that it copies them into per-trial working copies.
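The task→path mapping described above can be sketched as follows. This is illustrative only — the real derivation lives in `scripts/env.sh`; the paths are the ones named in the text:

```bash
# Illustrative sketch of how BASE_REPO could be derived from TASK.
# Not the actual contents of scripts/env.sh.
case "${TASK:-feature}" in
  feature)  BASE_REPO=runs/base-repo ;;
  bugfix)   BASE_REPO=runs/base-bugfix ;;
  refactor) BASE_REPO=runs/base-refactor ;;
  *) echo "unknown TASK: $TASK" >&2; exit 1 ;;
esac
echo "BASE_REPO=$BASE_REPO"
```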
## 1. Clone and orient (~1 min)

```bash
git clone git@github.com:infina-pfa/claude-tool-benchmark.git
cd claude-tool-benchmark
export BENCH_REPO="git@github.com:RealStake/infina-partner-sdk.git"  # or your fork
grep -E 'TOOLS|BASE_REPO|BENCH_REPO|TASK' scripts/env.sh             # tool list, base repo paths, task env
```
The benchmark evaluates 9 tools × 3 tasks:
- Tools: `pure`, `superpower`, `claudekit`, `omc`, `bmad`, `mindful`, `gstack`, `compound`, `ecc`
- Tasks: `feature`, `bugfix`, `refactor`
## 2. Pick one (task, tool, trial) (~1 min)

The smallest meaningful unit is one (task, tool, trial). Pick something cheap first:

```bash
export TASK=refactor  # smallest PRD, shortest trials
TOOL=bmad             # top-of-cluster tool — easy signal
TRIAL=1
```
## 3. Clone the base repo into a trial working copy (~2 min)

```bash
TASK=$TASK ./scripts/create-clones.sh $TRIAL
```

First run: clones `$BENCH_REPO` to the task's base path (`runs/base-repo` for feature, `runs/base-bugfix` for bugfix, `runs/base-refactor` for refactor) at the pinned SHA. Subsequent runs: fast-copies that base (APFS clonefile on macOS, `cp -r` elsewhere) into `runs/<task>/<tool>-t<trial>/` — one isolated working copy per (tool, trial). These directories are `.gitignore`d so your tool's commits don't pollute the benchmark repo.
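The fast-copy step can be approximated like this. A sketch of the idea using throwaway temp directories — `create-clones.sh` is the authoritative version:

```bash
# Copy-on-write copy on macOS (APFS clonefile via cp -c), plain recursive
# copy elsewhere — mirrors the behaviour described above.
SRC=$(mktemp -d); DST=$(mktemp -d)/work   # throwaway dirs for the demo
echo hello > "$SRC/file.txt"
if [ "$(uname -s)" = Darwin ]; then
  cp -c -R "$SRC" "$DST"   # near-instant: blocks are shared until modified
else
  cp -r "$SRC" "$DST"
fi
cat "$DST/file.txt"        # → hello
```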
## 4. Prepare the per-tool config (~30 sec)

```bash
TASK=$TASK ./scripts/setup-tool-config.sh $TOOL $TRIAL
```

This provisions `config/<tool>-t<trial>/` — an isolated Claude Code home directory. It installs the tool (plugin, git clone, or marketplace install, depending on the tool — see `scripts/setup-tool-config.sh` for the per-tool recipe), seeds `settings.json`, and writes `.claude.json`.

Isolation matters: the benchmark must not inherit your personal Claude settings, plugins, or MCP servers. Inspect `config/<tool>-t<trial>/` to see what was actually loaded.
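The isolation mechanism boils down to pointing Claude Code at the per-trial directory. A minimal sketch — this is an assumption about what `setup-tool-config.sh` and `manual-bench.sh` arrange between them, not their actual contents:

```bash
# Point Claude Code at the per-trial config dir so nothing from your
# personal ~/.claude leaks into the benchmark session. (Sketch only.)
TOOL=${TOOL:-bmad}; TRIAL=${TRIAL:-1}
export CLAUDE_CONFIG_DIR="$PWD/config/${TOOL}-t${TRIAL}"
mkdir -p "$CLAUDE_CONFIG_DIR"
echo "$CLAUDE_CONFIG_DIR"
# claude   # manual-bench.sh launches the session with this env in place
```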
## 5. Run the tool (~5-15 min depending on tool)

```bash
TASK=$TASK ./scripts/manual-bench.sh $TOOL $TRIAL
```

This prints the exact prompt to paste into Claude Code, then opens a Claude Code session with the right `CLAUDE_CONFIG_DIR`, `--permission-mode` (for `pure`/`mindful`), and model pin (`claude-opus-4-6`).

Follow the on-screen instructions: paste the prompt, let the tool run to its natural stop, then `/exit`. The script prints the SHA-capture + collect-metrics one-liner — run it before anything else so the SHAs and `auto-metrics.json` are captured cleanly.
You now have one trial's worth of artifacts at `results/<task>/<tool>/t<trial>/`:

- `commits.txt` (line 1: BASE SHA, line 2: IMPL SHA)
- `session-logs/*.jsonl` (the full Claude Code transcript)
- `auto-metrics.json`, `diff-stats.txt`, `tsc-output.txt`, `eslint-output.txt`, `test-output.txt`
- `sessions/*.meta.json` (trial metadata, base commit, tool version)
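For a quick sanity check, the two SHAs can be pulled out of `commits.txt` (line 1 = BASE, line 2 = IMPL, per the list above). A sketch using a stand-in fixture — point `COMMITS` at the real file in an actual run:

```bash
# Extract BASE and IMPL SHAs and build a git diff range from them.
# Stand-in fixture here; real path: results/<task>/<tool>/t<trial>/commits.txt
COMMITS=$(mktemp)
printf 'aaa111\nbbb222\n' > "$COMMITS"   # stand-in SHAs
BASE_SHA=$(sed -n 1p "$COMMITS")
IMPL_SHA=$(sed -n 2p "$COMMITS")
echo "diff range: ${BASE_SHA}..${IMPL_SHA}"
# git -C runs/<task>/<tool>-t<trial> diff --stat "$BASE_SHA..$IMPL_SHA"
```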
## 6. Judge it (~2-5 min per judge)

For a minimum-viable judgement, run one judge on your trial:

```bash
TASK=$TASK ./scripts/blind-eval-setup.sh
TASK=$TASK ROUND=1 ./scripts/judge-opus.sh Alpha  # replace Alpha with the blind label printed above
```

`blind-eval-setup.sh` generates a blind label (Alpha/Beta/Gamma/…) per tool and builds `judge-prompt.md` from: the PRD, the reference implementation, the diff patch, the 20-item rubric, and the output JSON schema. Nothing in the prompt names the tool. `.mapping-DO-NOT-OPEN.json` keeps the label→tool mapping.
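When you eventually de-blind (only after all judging is done — the filename is a warning, not a suggestion), resolving a label is a one-liner. A sketch that assumes the mapping file is a flat `{label: tool}` JSON object; check the real file's schema before relying on this:

```bash
# De-blinding sketch. Assumption: the mapping is a flat {label: tool} object.
MAPPING=$(mktemp)
printf '{"Alpha": "bmad", "Beta": "pure"}\n' > "$MAPPING"  # stand-in contents
TOOL_FOR_ALPHA=$(python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["Alpha"])' "$MAPPING")
echo "$TOOL_FOR_ALPHA"   # → bmad
```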
For the full 3-judge protocol (as used in the published report), run all three:

```bash
TASK=$TASK ROUND=1 ./scripts/judge-opus.sh <label>
TASK=$TASK ROUND=1 ./scripts/judge-codex.sh <label>
TASK=$TASK ROUND=1 ./scripts/judge-qwen.sh <label>
```

The published protocol uses 5 rounds on feature and 2 on bugfix and refactor; each round uses fresh sampling.
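Scripted, the multi-round loop might look like this. The round counts come from the paragraph above; the judge scripts are assumed to accept the same arguments as the single-round invocations, and the `echo` is a dry-run placeholder for the real call:

```bash
# Dry-run sketch of the full rounds × judges protocol.
TASK=${TASK:-feature}; LABEL=${LABEL:-Alpha}
case "$TASK" in
  feature) ROUNDS=5 ;;  # 5 rounds on feature
  *)       ROUNDS=2 ;;  # 2 on bugfix and refactor
esac
RUNS=0
for ROUND in $(seq 1 "$ROUNDS"); do
  for JUDGE in opus codex qwen; do
    # swap the echo for: TASK=$TASK ROUND=$ROUND ./scripts/judge-$JUDGE.sh "$LABEL"
    echo "round $ROUND: judge-$JUDGE.sh $LABEL"
    RUNS=$((RUNS + 1))
  done
done
```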
## 7. Aggregate & get the score (~30 sec)

```bash
TASK=$TASK ./scripts/aggregate-results.sh
```

Writes `results/<task>/final-report.md` — per-trial scores, a per-rubric-item breakdown, and a cross-judge comparison.
For cross-task stats (bootstrap CIs, tier groupings, ranking sensitivity, calibration):

```bash
python3 scripts/cross-task-analysis.py     # → results/FINAL-REPORT-*.md + cross-task-stats.json
python3 scripts/krippendorff-alpha.py      # → results/krippendorff-alpha.json
python3 scripts/audit-cohort-symmetry.py   # hard exit if the rerun protocol is violated
```
## 8. Check your numbers against the published benchmark

```bash
diff <(jq '.tasks.refactor.per_tool_ci.bmad' results/cross-task-stats.json) \
     <(curl -s https://raw.githubusercontent.com/infina-pfa/claude-tool-benchmark/main/results/cross-task-stats.json \
        | jq '.tasks.refactor.per_tool_ci.bmad')
```

If your numbers diverge, check: base-repo SHA, tool version, judge model versions, round count, random seed.
## Minimum viable re-run

If all you want is one score on one task with one judge (skipping multi-judge averaging):

```bash
TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh pure 1
TASK=refactor ./scripts/manual-bench.sh pure 1
# paste + run + exit + printed one-liner
TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh Alpha
TASK=refactor ./scripts/aggregate-results.sh
```
## What this quickstart deliberately skips

- Multi-trial per tool. Production numbers use 2-4 trials per (tool, task). See ../methodology/pipeline.md §3.
- Multi-round judging. Production uses 2-5 rounds per judge. See pipeline §6.
- Cohort symmetry. Production re-runs all 9 tools whenever one is re-run. See pipeline §9a for valid rerun triggers and the archival procedure.
- Judge panel rotation. Production rotates opus/codex/qwen. See pipeline §6.
See the verification guide for how to reproduce a specific published claim.
See `extending.md` if you want to add a new tool or judge.