Extending — Add a new tool or a new judge

The benchmark is built to be forked. The scripts are shell, the stats are ~500 lines of Python, every artifact is on disk. No database, no service. This guide walks through the two extension paths.


Add a new tool

1. Use the scaffolding script

./scripts/add-tool.sh --dry-run     # preview the edits
./scripts/add-tool.sh               # interactive: name, install command, entry-point slash command

The script interactively wires your tool into the benchmark’s run scripts and configuration.

It syntax-checks every modified script with bash -n and can optionally run create-clones.sh to provision the T1/T2 clones immediately.
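If you later edit the generated wiring by hand, the same syntax pass is a one-liner; this sketch assumes every benchmark script lives under scripts/:

for f in scripts/*.sh; do bash -n "$f" || echo "syntax error: $f"; done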

2. Decide plan-mode vs. native-planning

Match the configuration to what your tool has:

- No planning skill, plain entry point: add it to the pure|mindful|<yourtool> plan-mode guard; it runs under --permission-mode plan.
- A native planning/review workflow (e.g. /mytool:plan, /mytool:cook): just a PROMPT= case, with no plan-mode flag.
- A mixed mode (plan for feature/refactor, build-fix for bugfix): a per-task case "$TASK" inside your tool’s branch; see the ecc) branch for the template and the sketch after this list.
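For the mixed-mode case, the per-task branch might look roughly like the sketch below. The prompt strings are placeholders and the real ecc) branch is the authoritative template; only the case "$TASK" shape is the point here.

<yourtool>)
  case "$TASK" in
    feature|refactor) PROMPT="/mytool:plan" ;;       # native planning for the larger tasks
    bugfix)           PROMPT="/mytool:build-fix" ;;  # lighter build-fix flow for bugfix
  esac
  ;;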

Do not stack ceremonies. If your tool runs its own eng-review gate (like gstack’s /ship), don’t also set --permission-mode plan. The benchmark excludes gstack from plan-mode for exactly this reason.

3. Run the cohort

TASK=feature  ./scripts/manual-bench.sh <yourtool> 1
TASK=feature  ./scripts/manual-bench.sh <yourtool> 2
TASK=feature  ./scripts/manual-bench.sh <yourtool> 3
TASK=feature  ./scripts/manual-bench.sh <yourtool> 4
TASK=bugfix   ./scripts/manual-bench.sh <yourtool> 1
TASK=bugfix   ./scripts/manual-bench.sh <yourtool> 2
TASK=refactor ./scripts/manual-bench.sh <yourtool> 1
TASK=refactor ./scripts/manual-bench.sh <yourtool> 2

Cohort symmetry: if you re-run any trial for your tool, you must re-run it for all 9 tools in the same cohort, and the judge-side artifacts must all come from the same trial SHA. scripts/audit-cohort-symmetry.py exits non-zero if this is violated. See ../methodology/pipeline.md §9a for the full rerun protocol (valid triggers, archival procedure) and §4 of the same document for the metrics schema.
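Because the audit exits non-zero on a violation, it can double as a gate before the judging step:

python3 scripts/audit-cohort-symmetry.py && TASK=feature ./scripts/blind-eval-setup.sh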

4. Judge

Judging needs no changes. blind-eval-setup.sh auto-discovers your runs from results/*/t*/commits.txt and blind-labels them in the rotation.
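A quick pre-flight check is to confirm the commits files exist where that glob will look for them (the results/<yourtool>/ layout is inferred from the glob):

ls results/<yourtool>/t*/commits.txt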

TASK=feature ./scripts/blind-eval-setup.sh
TASK=feature ROUND=1 ./scripts/judge-opus.sh  <label>
TASK=feature ROUND=1 ./scripts/judge-codex.sh <label>
TASK=feature ROUND=1 ./scripts/judge-qwen.sh  <label>
# Repeat for feature rounds 2-5, bugfix rounds 1-2, and refactor rounds 1-2
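# e.g. the remaining feature rounds as a loop:
for ROUND in 2 3 4 5; do
  TASK=feature ROUND=$ROUND ./scripts/judge-opus.sh  <label>
  TASK=feature ROUND=$ROUND ./scripts/judge-codex.sh <label>
  TASK=feature ROUND=$ROUND ./scripts/judge-qwen.sh  <label>
done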
TASK=feature ./scripts/aggregate-results.sh

5. Re-compute cross-task stats

python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py
python3 scripts/audit-cohort-symmetry.py

Your tool now appears in FINAL-REPORT-*.md, cross-task-stats.json, and the landing page (which embeds cross-task-stats.json as inline JSON).
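A quick way to confirm the new entry actually landed (the paths are assumptions; adjust to wherever your checkout writes these files):

grep -c '<yourtool>' cross-task-stats.json FINAL-REPORT-*.md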

6. Write the tool profile

Add docs/tools/<yourtool>.md following the template in this folder. Sections: Upstream, Performance, Mechanism, How this benchmark invoked it, What actually happened in the transcripts (per task), Why it ranked where it did, Strengths & failure modes, References. Keep it transcript-grounded.


Add a new judge

Judges are OpenCode CLI sessions (or Claude Code for opus) that receive a fully-inlined prompt and return structured JSON scores. No tool access, no retrieval, no internet.

1. Copy the closest existing judge

cp scripts/judge-opus.sh scripts/judge-<yourjudge>.sh

The opus script runs inside Claude Code; the codex and qwen scripts run inside OpenCode, so the two shapes differ slightly. Start from whichever matches how your judge will run.

2. Edit the judge script

Update the copied script so it invokes your judge’s model and writes its scores to judge-<yourjudge>.json in the round directory.

3. Vet the judge on one round before committing it

TASK=refactor ROUND=sanitycheck ./scripts/judge-<yourjudge>.sh Alpha
cat results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json

Check: does the JSON validate against the schema? Are the rubric items all scored 0-10? Does the reasoning block look coherent? If any judge returns malformed JSON more than once in five rounds, retire it.
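For the malformed-JSON check, python3 -m json.tool is enough; the schema and rubric checks still need a manual read:

python3 -m json.tool results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json > /dev/null && echo "well-formed" || echo "malformed"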

4. Make the judge rotation-safe

The panel is 3-judge; adding yours makes it 4. Decide whether the new judge replaces an existing one (keeping a 3-judge panel) or whether the rotation and aggregation scripts are extended to carry a fourth.

5. Re-run krippendorff-alpha.py

Inter-rater reliability tracks the whole panel. Adding a judge that disagrees systematically will drop α. That’s a finding, not a bug. Include the Δα in your addition’s PR notes.
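One way to capture the Δα is to run the script before and after folding in the new judge’s rounds and diff the outputs (this assumes the script prints α to stdout; adapt if it writes a file instead):

python3 scripts/krippendorff-alpha.py > /tmp/alpha-before.txt
# ...add the new judge's rounds, then:
python3 scripts/krippendorff-alpha.py > /tmp/alpha-after.txt
diff /tmp/alpha-before.txt /tmp/alpha-after.txt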

6. Cost & speed profile

Document the new judge’s speed and cost per round in docs/tools/judges.md (create the file if it’s missing). A judge that costs 10× as much but only nudges Spearman ρ by 0.02 is a bad addition.
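A rough wall-clock number is enough; timing a single sanity-check round gives a starting point, with the provider’s per-token price noted alongside it:

time TASK=refactor ROUND=sanitycheck ./scripts/judge-<yourjudge>.sh Alpha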


Checklist before opening a PR

New tool

- All 8 trials run (4 feature, 2 bugfix, 2 refactor) and judged for every round.
- scripts/audit-cohort-symmetry.py exits zero.
- cross-task-analysis.py and krippendorff-alpha.py re-run; your tool appears in cross-task-stats.json and FINAL-REPORT-*.md.
- docs/tools/<yourtool>.md added and transcript-grounded.

New judge

- judge-<yourjudge>.sh vetted on a sanity-check round: JSON validates, rubric items scored 0-10, reasoning coherent.
- Rotation decision documented (replace an existing judge or extend the panel to 4).
- Δα from krippendorff-alpha.py reported in the PR notes.
- Speed and cost per round documented in docs/tools/judges.md.


See ../methodology/pipeline.md for the canonical pipeline reference and verification.md for the “how do I independently verify a claim?” walkthrough.