Extending — Add a new tool or a new judge
The benchmark is built to be forked. The scripts are shell, the stats are ~500 lines of Python, every artifact is on disk. No database, no service. This guide walks through the two extension paths.
Add a new tool
1. Use the scaffolding script
```bash
./scripts/add-tool.sh --dry-run   # preview the edits
./scripts/add-tool.sh             # interactive: name, install command, entry-point slash command
```
The script interactively wires your tool into:
- `scripts/env.sh` — appends to the `TOOLS` array.
- `scripts/setup-tool-config.sh` — inserts the per-tool install block (plugin marketplace, git clone, or npm install).
- `scripts/manual-bench.sh` — inserts the per-tool `PROMPT=` case. If your tool has no native planning skill, the script extends the `pure|mindful` plan-mode guard to include you.
- `scripts/create-clones.sh` — adds `.gitignore` safety patterns (so your tool’s state files stay out of the benchmark commits).
- `docs/methodology/pipeline.md` — bumps the “Tools under test (N)” count and list.
It runs `bash -n` against every modified script and can optionally run `create-clones.sh` to provision the T1/T2 clones immediately.
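For orientation, the wiring lands roughly in the shape below. This is an illustrative sketch only, not the script’s literal output: the tool name `mytool`, the `TOOL` dispatch variable, and `TASK_PROMPT` are placeholders, and `add-tool.sh` generates the real edits for you.

```bash
# scripts/env.sh -- the new tool joins the cohort array (name is a placeholder)
TOOLS+=("mytool")

# scripts/manual-bench.sh -- the per-tool PROMPT= case; variable names here are
# assumptions, so copy the real shape from an existing branch in that script
case "$TOOL" in
  mytool)
    PROMPT="/mytool:start ${TASK_PROMPT}"   # entry-point slash command
    ;;
esac
```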
2. Decide plan-mode vs. native-planning
| Your tool has … | Configure it as |
|---|---|
| No planning skill, plain entry point | In the `pure\|mindful\|<yourtool>` plan-mode guard. Runs under `--permission-mode plan`. |
| A native planning/review workflow (e.g. `/mytool:plan`, `/mytool:cook`) | Just a `PROMPT=` case. No plan-mode flag. |
| A mixed mode (plan for feature/refactor, build-fix for bugfix) | A per-task `case "$TASK"` inside your tool’s branch. See `ecc)` for the template, or the sketch below. |
Do not stack ceremonies. If your tool runs its own eng-review gate (like gstack’s `/ship`), don’t also set `--permission-mode plan`. The benchmark excludes gstack from plan-mode for exactly this reason.
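For the mixed-mode row above, the branch body is a nested `case` on `$TASK`. A minimal sketch, with placeholder slash commands (the canonical template is the `ecc)` branch in `manual-bench.sh`):

```bash
# Mixed-mode dispatch inside your tool's branch of manual-bench.sh.
# Slash commands are placeholders; mirror the ecc) branch for the real shape.
case "$TASK" in
  feature|refactor) PROMPT="/mytool:plan ${TASK_PROMPT}" ;;  # native planning path
  bugfix)           PROMPT="/mytool:fix ${TASK_PROMPT}"  ;;  # build-fix path
esac
```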
3. Run the cohort
```bash
TASK=feature ./scripts/manual-bench.sh <yourtool> 1
TASK=feature ./scripts/manual-bench.sh <yourtool> 2
TASK=feature ./scripts/manual-bench.sh <yourtool> 3
TASK=feature ./scripts/manual-bench.sh <yourtool> 4
TASK=bugfix ./scripts/manual-bench.sh <yourtool> 1
TASK=bugfix ./scripts/manual-bench.sh <yourtool> 2
TASK=refactor ./scripts/manual-bench.sh <yourtool> 1
TASK=refactor ./scripts/manual-bench.sh <yourtool> 2
```
Cohort symmetry: if you re-run any trial for your tool, you must re-run it for all 9 tools in the same cohort, and every judge-side artifact must come from the same trial SHA. `scripts/audit-cohort-symmetry.py` exits non-zero if this is violated. See `../methodology/pipeline.md` §9a for the full rerun protocol (valid triggers, archival procedure) and §4 of the same document for the metrics schema.
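If a rerun is triggered, looping over the whole `TOOLS` array is an easy way to keep the cohort symmetric. A sketch, assuming `scripts/env.sh` defines `TOOLS` and can be sourced standalone:

```bash
#!/usr/bin/env bash
# Re-run the full feature cohort for every tool (4 trials each).
set -euo pipefail
source scripts/env.sh   # assumed to define the TOOLS array

for tool in "${TOOLS[@]}"; do
  for trial in 1 2 3 4; do
    TASK=feature ./scripts/manual-bench.sh "$tool" "$trial"
  done
done
```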
4. Judge
Judging needs no changes. `blind-eval-setup.sh` auto-discovers your runs from `results/*/t*/commits.txt` and blind-labels them in the rotation.
```bash
TASK=feature ./scripts/blind-eval-setup.sh
TASK=feature ROUND=1 ./scripts/judge-opus.sh <label>
TASK=feature ROUND=1 ./scripts/judge-codex.sh <label>
TASK=feature ROUND=1 ./scripts/judge-qwen.sh <label>
# Repeat for rounds 2-5 on feature, 2 on bugfix, 2 on refactor
TASK=feature ./scripts/aggregate-results.sh
```
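The remaining rounds are the same three invocations repeated, so they can be scripted. A sketch, assuming the judge scripts accept `TASK`/`ROUND` exactly as above and that `Alpha` stands in for your tool’s blind label:

```bash
#!/usr/bin/env bash
# Judge rounds 2-5 on the feature task for one blind label.
set -euo pipefail
LABEL="Alpha"   # placeholder: use the blind label printed by blind-eval-setup.sh

for round in 2 3 4 5; do
  for judge in opus codex qwen; do
    TASK=feature ROUND="$round" ./scripts/judge-"$judge".sh "$LABEL"
  done
done
```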
5. Re-compute cross-task stats
```bash
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py
python3 scripts/audit-cohort-symmetry.py
```
Your tool now appears in `FINAL-REPORT-*.md`, `cross-task-stats.json`, and the landing page (which embeds `cross-task-stats.json` as inline JSON).
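A quick spot-check that the new tool actually landed in both places; the location of `cross-task-stats.json` below is an assumption, so adjust the path to wherever your checkout writes it:

```bash
# Confirm the new tool shows up in the aggregate stats and the landing page's inline JSON.
grep -n '"<yourtool>"' cross-task-stats.json   # path is an assumption
grep -n '<yourtool>' docs/index.html
```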
6. Write the tool profile
Add `docs/tools/<yourtool>.md` following the template in this folder. Sections: Upstream, Performance, Mechanism, How this benchmark invoked it, What actually happened in the transcripts (per task), Why it ranked where it did, Strengths & failure modes, References. Keep it transcript-grounded.
Add a new judge
Judges are OpenCode CLI sessions (or Claude Code for opus) that receive a fully-inlined prompt and return structured JSON scores. No tool access, no retrieval, no internet.
1. Copy the closest existing judge
```bash
cp scripts/judge-opus.sh scripts/judge-<yourjudge>.sh
```
The opus script runs inside Claude Code; the codex/qwen scripts run inside OpenCode. The two shapes differ slightly, so pick the one that matches your judge’s CLI.
2. Edit the judge script
Update:
- Model ID. The exact string your CLI accepts. Claude Code model IDs look like `claude-<name>-<version>`. OpenCode model IDs depend on the provider — check `opencode models ls`.
- Reasoning mode. `codex` and `qwen` run with `reasoning=high`. If your judge supports reasoning modes, pin it explicitly and document the choice in the script’s header comment.
- Sampler settings. Neither Claude nor OpenCode CLI exposes temperature or seed. Document this limitation in the script header (see `scripts/judge-opus.sh` for the template) so readers know round-to-round σ is sampler noise, not rubric disagreement. A header-comment sketch follows this list.
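A possible header shape, assuming a bash judge script; `scripts/judge-opus.sh` carries the canonical wording, so treat this as a reminder of what to document rather than text to copy:

```bash
#!/usr/bin/env bash
# judge-<yourjudge>.sh
# Model:     <exact model ID the CLI accepts>
# Reasoning: <mode, e.g. high>, pinned explicitly
# Sampling:  this CLI exposes no temperature or seed control, so round-to-round
#            variance is sampler noise, not rubric disagreement
set -euo pipefail
```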
3. Vet the judge on one round before committing it
```bash
TASK=refactor ROUND=sanitycheck ./scripts/judge-<yourjudge>.sh Alpha
cat results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json
```
Check: does the JSON validate against the schema? Are the rubric items all scored 0-10? Does the reasoning block look coherent? If any judge returns malformed JSON more than once in five rounds, retire it.
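A minimal first pass on the sanity-round output; `python3 -m json.tool` only proves the file parses, so the schema and 0-10 range checks still need a look at the actual scores:

```bash
# Does the judge output parse as JSON at all?
f="results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json"
python3 -m json.tool "$f" > /dev/null && echo "valid JSON" || echo "malformed JSON"
```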
4. Make the judge rotation-safe
The panel is 3-judge. Adding yours makes it 4. Decide:
- Replace one existing judge. Retire the outgoing judge (keep its scripts for reproducibility, stop running it) and document the retirement reason in `docs/methodology/pipeline.md` §1, following the template from the `glm`/`kimi`/`gemini` retirement note.
- Extend to a 4-judge panel. Then `scripts/aggregate-results.sh` needs to balance four judges. Check the balancing logic in `aggregate-results.sh` (balanced mean over judges — adding one more is trivial; just make sure every judge has judged every trial at least once, as the coverage check below illustrates).
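One way to check the every-judge-has-judged-every-trial condition, assuming the `results/_blind-eval/<label>/rounds/<round>/judge-<name>.json` layout shown earlier and that `mynewjudge` is your judge’s script name:

```bash
# Report any label/round directory that is missing a judge's score file.
for d in results/_blind-eval/*/rounds/*/; do
  for judge in opus codex qwen mynewjudge; do
    [ -f "${d}judge-${judge}.json" ] || echo "missing: ${d}judge-${judge}.json"
  done
done
```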
5. Re-run krippendorff-alpha.py
Inter-rater reliability tracks the whole panel. Adding a judge that disagrees systematically will drop α. That’s a finding, not a bug. Include the Δα in your addition’s PR notes.
6. Cost & speed profile
Document the new judge’s speed and cost per round in `docs/tools/judges.md` (create it if missing). A judge that costs 10× as much but only nudges Spearman ρ by 0.02 is a bad addition.
Checklist before opening a PR
New tool
- [ ] `add-tool.sh --dry-run` preview matches intent
- [ ] `scripts/env.sh` `TOOLS` array includes the new tool
- [ ] `config/<tool>-t<N>/` populated cleanly for N=1…4 (feature), 1…2 (bugfix, refactor)
- [ ] 8 runs complete with `commits.txt`, `auto-metrics.json`, `sessions/*.meta.json`
- [ ] All 3 judges × required rounds completed on all trials
- [ ] `audit-cohort-symmetry.py` exits 0
- [ ] `cross-task-analysis.py` re-runs without error; new tool appears in `cross-task-stats.json`
- [ ] `docs/tools/<tool>.md` written — Upstream, Performance, Mechanism, Transcripts, Why-it-ranked, Strengths, Failure modes
- [ ] Landing page inline JSON (`docs/index.html`) updated with new tool’s CI entries
New judge
- [ ] `scripts/judge-<name>.sh` written; header documents model + reasoning mode + sampler-limitation disclaimer
- [ ] Sanity round produces valid JSON with all 20 rubric items scored 0-10
- [ ] Retirement vs. extension decision documented in `pipeline.md` §1
- [ ] `aggregate-results.sh` panel-balancing still correct
- [ ] `krippendorff-alpha.py` re-run; Δα vs. old panel reported
- [ ] New judge’s cost + speed characterized
See `../methodology/pipeline.md` for the canonical pipeline reference and `verification.md` for the “how do I independently verify a claim?” walkthrough.