Extending — Add a new tool or a new judge
The benchmark is built to be forked. The scripts are shell, the stats are ~500 lines of Python, every artifact is on disk. No database, no service. This guide walks through the two extension paths.
Add a new tool
1. Use the scaffolding script
```bash
./scripts/add-tool.sh --dry-run   # preview the edits
./scripts/add-tool.sh             # interactive: name, install command, entry-point slash command
```
The script interactively wires your tool into:
- `scripts/env.sh` — appends to the `TOOLS` array.
- `scripts/setup-tool-config.sh` — inserts the per-tool install block (plugin marketplace, git clone, or npm install).
- `scripts/manual-bench.sh` — inserts the per-tool `PROMPT=` case. If your tool has no native planning skill, the script extends the `pure|mindful` plan-mode guard to include you.
- `scripts/create-clones.sh` — adds `.gitignore` safety patterns (so your tool’s state files stay out of the benchmark commits).
- `docs/methodology/pipeline.md` — bumps the “Tools under test (N)” count and list.
It runs `bash -n` against every modified script and can optionally run `create-clones.sh` to provision the T1/T2 clones immediately.
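For orientation, the wiring lands roughly in the shape below. This is an illustrative sketch only, not the script’s literal output: the tool name `mytool`, the `TOOL` dispatch variable, and `TASK_PROMPT` are placeholders, and `add-tool.sh` generates the real edits for you.

```bash
# scripts/env.sh -- the new tool joins the cohort array (name is a placeholder)
TOOLS+=("mytool")

# scripts/manual-bench.sh -- the per-tool PROMPT= case; variable names here are
# assumptions, so copy the real shape from an existing branch in that script
case "$TOOL" in
  mytool)
    PROMPT="/mytool:start ${TASK_PROMPT}"   # entry-point slash command
    ;;
esac
```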
2. Decide plan-mode vs. native-planning
| Your tool has … | Configure it as |
|---|---|
| No planning skill, plain entry point | In the `pure\|mindful\|<yourtool>` plan-mode guard. Runs under `--permission-mode plan`. |
| A native planning/review workflow (e.g. `/mytool:plan`, `/mytool:cook`) | Just a `PROMPT=` case. No plan-mode flag. |
| A mixed mode (plan for feature/refactor, build-fix for bugfix) | A per-task `case "$TASK"` inside your tool’s branch. See `ecc)` for the template, or the sketch below. |
Do not stack ceremonies. If your tool runs its own eng-review gate (like gstack’s `/ship`), don’t also set `--permission-mode plan`. The benchmark excludes gstack from plan-mode for exactly this reason.
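For the mixed-mode row above, the branch body is a nested `case` on `$TASK`. A minimal sketch, with placeholder slash commands (the canonical template is the `ecc)` branch in `manual-bench.sh`):

```bash
# Mixed-mode dispatch inside your tool's branch of manual-bench.sh.
# Slash commands are placeholders; mirror the ecc) branch for the real shape.
case "$TASK" in
  feature|refactor) PROMPT="/mytool:plan ${TASK_PROMPT}" ;;  # native planning path
  bugfix)           PROMPT="/mytool:fix ${TASK_PROMPT}"  ;;  # build-fix path
esac
```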
3. Run the cohort
```bash
TASK=feature ./scripts/manual-bench.sh <yourtool> 1
TASK=feature ./scripts/manual-bench.sh <yourtool> 2
TASK=feature ./scripts/manual-bench.sh <yourtool> 3
TASK=feature ./scripts/manual-bench.sh <yourtool> 4
TASK=bugfix ./scripts/manual-bench.sh <yourtool> 1
TASK=bugfix ./scripts/manual-bench.sh <yourtool> 2
TASK=refactor ./scripts/manual-bench.sh <yourtool> 1
TASK=refactor ./scripts/manual-bench.sh <yourtool> 2
```
Cohort symmetry: if you re-run any trial for your tool, you must re-run it for all 9 tools in the same cohort, and every judge-side artifact must come from the same trial SHA. `scripts/audit-cohort-symmetry.py` exits non-zero if this is violated. See `../methodology/pipeline.md` §9a for the full rerun protocol (valid triggers, archival procedure) and §4 of the same document for the metrics schema.
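If a rerun is triggered, looping over the whole `TOOLS` array is an easy way to keep the cohort symmetric. A sketch, assuming `scripts/env.sh` defines `TOOLS` and can be sourced standalone:

```bash
#!/usr/bin/env bash
# Re-run the full feature cohort for every tool (4 trials each).
set -euo pipefail
source scripts/env.sh   # assumed to define the TOOLS array

for tool in "${TOOLS[@]}"; do
  for trial in 1 2 3 4; do
    TASK=feature ./scripts/manual-bench.sh "$tool" "$trial"
  done
done
```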
4. Judge
Judging needs no changes. `blind-eval-setup.sh` auto-discovers your runs from `results/*/t*/commits.txt` and blind-labels them in the rotation.
```bash
TASK=feature ./scripts/blind-eval-setup.sh
TASK=feature ROUND=1 ./scripts/judge-opus.sh <label>
TASK=feature ROUND=1 ./scripts/judge-codex.sh <label>
TASK=feature ROUND=1 ./scripts/judge-qwen.sh <label>
# Repeat for rounds 2-5 on feature, 2 on bugfix, 2 on refactor
TASK=feature ./scripts/aggregate-results.sh
```
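The remaining rounds are the same three invocations repeated, so they can be scripted. A sketch, assuming the judge scripts accept `TASK`/`ROUND` exactly as above and that `Alpha` stands in for your tool’s blind label:

```bash
#!/usr/bin/env bash
# Judge rounds 2-5 on the feature task for one blind label.
set -euo pipefail
LABEL="Alpha"   # placeholder: use the blind label printed by blind-eval-setup.sh

for round in 2 3 4 5; do
  for judge in opus codex qwen; do
    TASK=feature ROUND="$round" ./scripts/judge-"$judge".sh "$LABEL"
  done
done
```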
5. Re-compute cross-task stats
```bash
python3 scripts/cross-task-analysis.py
python3 scripts/krippendorff-alpha.py
python3 scripts/audit-cohort-symmetry.py
```
Your tool now appears in `FINAL-REPORT-*.md`, `cross-task-stats.json`, and the landing page (which embeds `cross-task-stats.json` as inline JSON).
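A quick spot-check that the new tool actually landed in both places; the location of `cross-task-stats.json` below is an assumption, so adjust the path to wherever your checkout writes it:

```bash
# Confirm the new tool shows up in the aggregate stats and the landing page's inline JSON.
grep -n '"<yourtool>"' cross-task-stats.json   # path is an assumption
grep -n '<yourtool>' docs/index.html
```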
6. Write the tool profile
Add `docs/tools/<yourtool>.md` following the template in this folder. Sections: Upstream, Performance, Mechanism, How this benchmark invoked it, What actually happened in the transcripts (per task), Why it ranked where it did, Strengths & failure modes, References. Keep it transcript-grounded.
Add a new judge
Judges are OpenCode CLI sessions (or Claude Code for opus) that receive a fully-inlined prompt and return structured JSON scores. No tool access, no retrieval, no internet.
1. Copy the closest existing judge
```bash
cp scripts/judge-opus.sh scripts/judge-<yourjudge>.sh
```
The opus script runs inside Claude Code; the codex/qwen scripts run inside OpenCode. The two shapes differ slightly, so pick the one that matches your judge’s CLI.
2. Edit the judge script
Update:
- Model ID. The exact string your CLI accepts. Claude Code model IDs look like `claude-<name>-<version>`. OpenCode model IDs depend on the provider — check `opencode models ls`.
- Reasoning mode. `codex` and `qwen` run with `reasoning=high`. If your judge supports reasoning modes, pin it explicitly and document the choice in the script’s header comment.
- Sampler settings. Neither Claude nor OpenCode CLI exposes temperature or seed. Document this limitation in the script header (see `scripts/judge-opus.sh` for the template) so readers know round-to-round σ is sampler noise, not rubric disagreement. A header-comment sketch follows this list.
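A possible header shape, assuming a bash judge script; `scripts/judge-opus.sh` carries the canonical wording, so treat this as a reminder of what to document rather than text to copy:

```bash
#!/usr/bin/env bash
# judge-<yourjudge>.sh
# Model:     <exact model ID the CLI accepts>
# Reasoning: <mode, e.g. high>, pinned explicitly
# Sampling:  this CLI exposes no temperature or seed control, so round-to-round
#            variance is sampler noise, not rubric disagreement
set -euo pipefail
```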
3. Vet the judge on one round before committing it
```bash
TASK=refactor ROUND=sanitycheck ./scripts/judge-<yourjudge>.sh Alpha
cat results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json
```
Check: does the JSON validate against the schema? Are the rubric items all scored 0-10? Does the reasoning block look coherent? If any judge returns malformed JSON more than once in five rounds, retire it.
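A minimal first pass on the sanity-round output; `python3 -m json.tool` only proves the file parses, so the schema and 0-10 range checks still need a look at the actual scores:

```bash
# Does the judge output parse as JSON at all?
f="results/_blind-eval/Alpha/rounds/sanitycheck/judge-<yourjudge>.json"
python3 -m json.tool "$f" > /dev/null && echo "valid JSON" || echo "malformed JSON"
```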
4. Make the judge rotation-safe
The panel is 3-judge. Adding yours makes it 4. Decide:
- Replace one existing judge. Retire the outgoing judge (keep its scripts for reproducibility, stop running it) and document the retirement reason in `docs/methodology/pipeline.md` §1, following the template from the `glm`/`kimi`/`gemini` retirement note.
- Extend to a 4-judge panel. Then `scripts/aggregate-results.sh` needs to balance four judges. Check the balancing logic in `aggregate-results.sh` (balanced mean over judges — adding one more is trivial; just make sure every judge has judged every trial at least once, as the coverage check below illustrates).
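One way to check the every-judge-has-judged-every-trial condition, assuming the `results/_blind-eval/<label>/rounds/<round>/judge-<name>.json` layout shown earlier and that `mynewjudge` is your judge’s script name:

```bash
# Report any label/round directory that is missing a judge's score file.
for d in results/_blind-eval/*/rounds/*/; do
  for judge in opus codex qwen mynewjudge; do
    [ -f "${d}judge-${judge}.json" ] || echo "missing: ${d}judge-${judge}.json"
  done
done
```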
5. Re-run krippendorff-alpha.py
Inter-rater reliability tracks the whole panel. Adding a judge that disagrees systematically will drop α. That’s a finding, not a bug. Include the Δα in your addition’s PR notes.
6. Cost & speed profile
Document the new judge’s speed and cost per round in `docs/tools/judges.md` (create it if missing). A judge that costs 10× as much but only nudges Spearman ρ by 0.02 is a bad addition.
Checklist before opening a PR
New tool
- [ ] `add-tool.sh --dry-run` preview matches intent
- [ ] `scripts/env.sh` `TOOLS` array includes the new tool
- [ ] `config/<tool>-t<N>/` populated cleanly for N=1…4 (feature), 1…2 (bugfix, refactor)
- [ ] 8 runs complete with `commits.txt`, `auto-metrics.json`, `sessions/*.meta.json`
- [ ] All 3 judges × required rounds completed on all trials
- [ ] `audit-cohort-symmetry.py` exits 0
- [ ] `cross-task-analysis.py` re-runs without error; new tool appears in `cross-task-stats.json`
- [ ] `docs/tools/<tool>.md` written — Upstream, Performance, Mechanism, Transcripts, Why-it-ranked, Strengths, Failure modes
- [ ] Landing page inline JSON (`docs/index.html`) updated with new tool’s CI entries
New judge
- [ ] `scripts/judge-<name>.sh` written; header documents model + reasoning mode + sampler-limitation disclaimer
- [ ] Sanity round produces valid JSON with all 20 rubric items scored 0-10
- [ ] Retirement vs. extension decision documented in `pipeline.md` §1
- [ ] `aggregate-results.sh` panel-balancing still correct
- [ ] `krippendorff-alpha.py` re-run; Δα vs. old panel reported
- [ ] New judge’s cost + speed characterized
See `../methodology/pipeline.md` for the canonical pipeline reference and `verification.md` for the “how do I independently verify a claim?” walkthrough.