Pre-publish rerun runbook
Patches applied to the harness
The re-publish pass applied the following script-level fixes; the runbook below assumes they are in place.
scripts/apply-r1-override.py—- Item 16 field: reads
tests_core_failed, matching the Mode-1 rubric wording. - Item 20 added to the
featurelock list with the same formula the judge prompt uses:s20 = 10 if lines_removed == 0 else max(0, 10 - ceil(lines_removed / 10)). scores_pre_r1snapshot is now always written on first sweep (previously only written when the override changed at least one item, so files where the LLM happened to be correct silently lacked the audit trail).
- Item 16 field: reads
scripts/blind-eval-setup.sh—- Path-level excludes extended with
docs/bmad/**,.superpowers/**,.compound-engineering/**,.ecc/**. - Awk content-scrub list extended with
plans/,research/and the new tool-state dirs above. auto-metrics.jsonanonymisation:plugin_versionsandcollected_atare stripped before the file lands under_blind-eval/(both are identity fingerprints).
- Path-level excludes extended with
scripts/aggregate-results.sh—- Idempotent R1 sweep runs before every aggregation, closing the single-judge-retry bypass orchestration gap.
- σ decomposition:
within_σ(trial-to-trial within-judge noise) andbetween_σ(judge base-rate spread) are emitted alongside pooled σ. - Cohort-span hours and weight pre-registration date are surfaced in the Caveats block.
versions.lock.json—judges.*.weightblock added with the pre-registered 3 / 2 / 1 / 1 / 1 scheme;_weight_commentdocuments the registration date (2026-05-12).scripts/env.sh—BENCH_COMMITfor all three tasks switched from 7-char short SHAs to full 40-char SHAs to remove ambiguity.
After applying these patches, regenerate the six leak-fix re-judge labels (Grove, Delta, Mike, Xray, November, Quebec — all feature) on clean diffs:
# from the repo root
TASK=feature ./scripts/blind-eval-setup.sh
TASK=feature ./scripts/judge-all.sh Grove Delta Mike Xray November Quebec
TASK=feature ./scripts/aggregate-results.sh
TASK=bugfix ./scripts/aggregate-results.sh
TASK=refactor ./scripts/aggregate-results.sh
Verify scores_pre_r1 coverage across the corpus (expect 360 / 360):
python3 -c "
import json, pathlib
files = list(pathlib.Path('results').rglob('_blind-eval/*/[!.]*-judge.json'))
files = [f for f in files if '_archive' not in f.parts and 'request' not in f.name and 'raw' not in f.name]
total = len(files); with_snap = sum(1 for f in files if 'scores_pre_r1' in json.loads(f.read_text()))
print(f'{with_snap}/{total} files carry scores_pre_r1')
"
Verify no fingerprint leaks:
grep -rlE "\.omc/|_bmad-output/|docs/bmad/|\.superpowers/|\.compound-engineering/|\.ecc/|CLAUDE\.md\.original|plugin_versions" \
results/{,bugfix/,refactor/}_blind-eval/*/{implementation-diff.patch,auto-metrics.json} 2>/dev/null \
| grep -v "\.pre-" \
| head
# Expect: empty.