# Quickstart — Run one trial end-to-end in ~10 minutes
This guide takes you from `git clone` to a scored trial by the shortest path. For the full methodology and reference, see ../methodology/pipeline.md.
## Prerequisites
- macOS or Linux. Windows via WSL is untested.
- git, bash (4.0+), python3 (3.10+, with `numpy` and `scipy` if you want to re-compute stats).
- Claude Code CLI — latest; `claude --version` must succeed. (`pure`, `mindful`, `bmad`, `claudekit`, `superpower`, `compound`, and `ecc` all run inside Claude Code.)
- OpenCode CLI — only if you want to run judges other than `opus`; `opencode --version` must succeed.
- Anthropic API key, for both Claude Code and OpenCode.
- Disk: ~2 GB for the benchmark’s base repo clones and judge artifacts.
The benchmark runs against a real TypeScript NX monorepo that must be cloneable locally. Set `BENCH_REPO` to the clone URL of that monorepo (your fork or the original) — `scripts/env.sh` reads it and derives `BASE_REPO` as the local prepared-clone path (`runs/base-repo`, `runs/base-bugfix`, or `runs/base-refactor`, depending on `TASK`). `create-clones.sh` provisions those base clones on first use; after that it copies them into per-trial working copies.
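The task→path mapping described above can be sketched as follows. This is illustrative only — the real derivation lives in `scripts/env.sh`; the paths are the ones named in the text:

```bash
# Illustrative sketch of how BASE_REPO could be derived from TASK.
# Not the actual contents of scripts/env.sh.
case "${TASK:-feature}" in
  feature)  BASE_REPO=runs/base-repo ;;
  bugfix)   BASE_REPO=runs/base-bugfix ;;
  refactor) BASE_REPO=runs/base-refactor ;;
  *) echo "unknown TASK: $TASK" >&2; exit 1 ;;
esac
echo "BASE_REPO=$BASE_REPO"
```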
## 1. Clone and orient (~1 min)

```bash
git clone git@github.com:infina-pfa/claude-tool-benchmark.git
cd claude-tool-benchmark
export BENCH_REPO="git@github.com:RealStake/infina-partner-sdk.git"  # or your fork
grep -E 'TOOLS|BASE_REPO|BENCH_REPO|TASK' scripts/env.sh             # tool list, base repo paths, task env
```
The benchmark evaluates 9 tools × 3 tasks:
- Tools: `pure`, `superpower`, `claudekit`, `omc`, `bmad`, `mindful`, `gstack`, `compound`, `ecc`
- Tasks: `feature`, `bugfix`, `refactor`
## 2. Pick one (task, tool, trial) (~1 min)

The smallest meaningful unit is one (task, tool, trial). Pick something cheap first:

```bash
export TASK=refactor  # smallest PRD, shortest trials
TOOL=bmad             # top-of-cluster tool — easy signal
TRIAL=1
```
## 3. Clone the base repo into a trial working copy (~2 min)

```bash
TASK=$TASK ./scripts/create-clones.sh $TRIAL
```

First run: clones `$BENCH_REPO` to the task's base path (`runs/base-repo` for feature, `runs/base-bugfix` for bugfix, `runs/base-refactor` for refactor) at the pinned SHA. Subsequent runs: fast-copies that base (APFS clonefile on macOS, `cp -r` elsewhere) into `runs/<task>/<tool>-t<trial>/` — one isolated working copy per (tool, trial). These directories are `.gitignore`d so your tool's commits don't pollute the benchmark repo.
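The fast-copy step can be approximated like this. A sketch of the idea using throwaway temp directories — `create-clones.sh` is the authoritative version:

```bash
# Copy-on-write copy on macOS (APFS clonefile via cp -c), plain recursive
# copy elsewhere — mirrors the behaviour described above.
SRC=$(mktemp -d); DST=$(mktemp -d)/work   # throwaway dirs for the demo
echo hello > "$SRC/file.txt"
if [ "$(uname -s)" = Darwin ]; then
  cp -c -R "$SRC" "$DST"   # near-instant: blocks are shared until modified
else
  cp -r "$SRC" "$DST"
fi
cat "$DST/file.txt"        # → hello
```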
## 4. Prepare the per-tool config (~30 sec)

```bash
TASK=$TASK ./scripts/setup-tool-config.sh $TOOL $TRIAL
```

This provisions `config/<tool>-t<trial>/` — an isolated Claude Code home directory. It installs the tool (plugin, git clone, or marketplace install, depending on the tool — see `scripts/setup-tool-config.sh` for the per-tool recipe), seeds `settings.json`, and writes `.claude.json`.

Isolation matters: the benchmark must not inherit your personal Claude settings, plugins, or MCP servers. Inspect `config/<tool>-t<trial>/` to see what was actually loaded.
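The isolation mechanism boils down to pointing Claude Code at the per-trial directory. A minimal sketch — this is an assumption about what `setup-tool-config.sh` and `manual-bench.sh` arrange between them, not their actual contents:

```bash
# Point Claude Code at the per-trial config dir so nothing from your
# personal ~/.claude leaks into the benchmark session. (Sketch only.)
TOOL=${TOOL:-bmad}; TRIAL=${TRIAL:-1}
export CLAUDE_CONFIG_DIR="$PWD/config/${TOOL}-t${TRIAL}"
mkdir -p "$CLAUDE_CONFIG_DIR"
echo "$CLAUDE_CONFIG_DIR"
# claude   # manual-bench.sh launches the session with this env in place
```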
## 5. Run the tool (~5-15 min depending on tool)

```bash
TASK=$TASK ./scripts/manual-bench.sh $TOOL $TRIAL
```

This prints the exact prompt to paste into Claude Code, then opens a Claude Code session with the right `CLAUDE_CONFIG_DIR`, `--permission-mode` (for `pure`/`mindful`), and model pin (`claude-opus-4-6`).

Follow the on-screen instructions: paste the prompt, let the tool run to its natural stop, then `/exit`. The script prints the SHA-capture + collect-metrics one-liner — run it before anything else so the SHAs and `auto-metrics.json` are captured cleanly.
You now have one trial's worth of artifacts at `results/<task>/<tool>/t<trial>/`:

- `commits.txt` (line 1: BASE SHA, line 2: IMPL SHA)
- `session-logs/*.jsonl` (the full Claude Code transcript)
- `auto-metrics.json`, `diff-stats.txt`, `tsc-output.txt`, `eslint-output.txt`, `test-output.txt`
- `sessions/*.meta.json` (trial metadata, base commit, tool version)
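For a quick sanity check, the two SHAs can be pulled out of `commits.txt` (line 1 = BASE, line 2 = IMPL, per the list above). A sketch using a stand-in fixture — point `COMMITS` at the real file in an actual run:

```bash
# Extract BASE and IMPL SHAs and build a git diff range from them.
# Stand-in fixture here; real path: results/<task>/<tool>/t<trial>/commits.txt
COMMITS=$(mktemp)
printf 'aaa111\nbbb222\n' > "$COMMITS"   # stand-in SHAs
BASE_SHA=$(sed -n 1p "$COMMITS")
IMPL_SHA=$(sed -n 2p "$COMMITS")
echo "diff range: ${BASE_SHA}..${IMPL_SHA}"
# git -C runs/<task>/<tool>-t<trial> diff --stat "$BASE_SHA..$IMPL_SHA"
```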
## 6. Judge it (~2-5 min per judge)

For a minimum-viable judgement, run one judge on your trial:

```bash
TASK=$TASK ./scripts/blind-eval-setup.sh
TASK=$TASK ROUND=1 ./scripts/judge-opus.sh Alpha  # replace Alpha with the blind label printed above
```

`blind-eval-setup.sh` generates a blind label (Alpha/Beta/Gamma/…) per tool and builds `judge-prompt.md` from: the PRD, the reference implementation, the diff patch, the 20-item rubric, and the output JSON schema. Nothing in the prompt names the tool. `.mapping-DO-NOT-OPEN.json` keeps the label→tool mapping.
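When you eventually de-blind (only after all judging is done — the filename is a warning, not a suggestion), resolving a label is a one-liner. A sketch that assumes the mapping file is a flat `{label: tool}` JSON object; check the real file's schema before relying on this:

```bash
# De-blinding sketch. Assumption: the mapping is a flat {label: tool} object.
MAPPING=$(mktemp)
printf '{"Alpha": "bmad", "Beta": "pure"}\n' > "$MAPPING"  # stand-in contents
TOOL_FOR_ALPHA=$(python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["Alpha"])' "$MAPPING")
echo "$TOOL_FOR_ALPHA"   # → bmad
```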
For the full 3-judge protocol (as used in the published report), run all three:

```bash
TASK=$TASK ROUND=1 ./scripts/judge-opus.sh <label>
TASK=$TASK ROUND=1 ./scripts/judge-codex.sh <label>
TASK=$TASK ROUND=1 ./scripts/judge-qwen.sh <label>
```

The published protocol uses 5 rounds on feature and 2 on bugfix and refactor; each round uses fresh sampling.
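Scripted, the multi-round loop might look like this. The round counts come from the paragraph above; the judge scripts are assumed to accept the same arguments as the single-round invocations, and the `echo` is a dry-run placeholder for the real call:

```bash
# Dry-run sketch of the full rounds × judges protocol.
TASK=${TASK:-feature}; LABEL=${LABEL:-Alpha}
case "$TASK" in
  feature) ROUNDS=5 ;;  # 5 rounds on feature
  *)       ROUNDS=2 ;;  # 2 on bugfix and refactor
esac
RUNS=0
for ROUND in $(seq 1 "$ROUNDS"); do
  for JUDGE in opus codex qwen; do
    # swap the echo for: TASK=$TASK ROUND=$ROUND ./scripts/judge-$JUDGE.sh "$LABEL"
    echo "round $ROUND: judge-$JUDGE.sh $LABEL"
    RUNS=$((RUNS + 1))
  done
done
```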
## 7. Aggregate & get the score (~30 sec)

```bash
TASK=$TASK ./scripts/aggregate-results.sh
```

Writes `results/<task>/final-report.md` — per-trial scores, a per-rubric-item breakdown, and a cross-judge comparison.
For cross-task stats (bootstrap CIs, tier groupings, ranking sensitivity, calibration):

```bash
python3 scripts/cross-task-analysis.py     # → results/FINAL-REPORT-*.md + cross-task-stats.json
python3 scripts/krippendorff-alpha.py      # → results/krippendorff-alpha.json
python3 scripts/audit-cohort-symmetry.py   # hard exit if the rerun protocol is violated
```
## 8. Check your numbers against the published benchmark

```bash
diff <(jq '.tasks.refactor.per_tool_ci.bmad' results/cross-task-stats.json) \
     <(curl -s https://raw.githubusercontent.com/infina-pfa/claude-tool-benchmark/main/results/cross-task-stats.json \
        | jq '.tasks.refactor.per_tool_ci.bmad')
```

If your numbers diverge, check: base-repo SHA, tool version, judge model versions, round count, random seed.
## Minimum viable re-run

If all you want is one score on one task with one judge (skipping multi-judge averaging):

```bash
TASK=refactor ./scripts/create-clones.sh 1
TASK=refactor ./scripts/setup-tool-config.sh pure 1
TASK=refactor ./scripts/manual-bench.sh pure 1
# paste + run + exit + printed one-liner
TASK=refactor ./scripts/blind-eval-setup.sh
TASK=refactor ROUND=1 ./scripts/judge-opus.sh Alpha
TASK=refactor ./scripts/aggregate-results.sh
```
## What this quickstart deliberately skips

- Multi-trial per tool. Production numbers use 2-4 trials per (tool, task). See ../methodology/pipeline.md §3.
- Multi-round judging. Production uses 2-5 rounds per judge. See pipeline §6.
- Cohort symmetry. Production re-runs all 9 tools whenever one is re-run. See pipeline §9a for valid rerun triggers and the archival procedure.
- Judge panel rotation. Production rotates opus/codex/qwen. See pipeline §6.
See the verification guide for how to reproduce a specific published claim.
See `extending.md` if you want to add a new tool or judge.