Pre-publish rerun runbook

Patches applied to the harness

The pre-publish pass applied the following script-level fixes; the runbook below assumes they are in place.

After applying these patches, regenerate the judge labels for the six leak-fix re-judge runs (Grove, Delta, Mike, Xray, November, Quebec; all in the feature task) on clean diffs, then re-aggregate the results for all three tasks:

# from the repo root
TASK=feature ./scripts/blind-eval-setup.sh
TASK=feature ./scripts/judge-all.sh Grove Delta Mike Xray November Quebec
TASK=feature  ./scripts/aggregate-results.sh
TASK=bugfix   ./scripts/aggregate-results.sh
TASK=refactor ./scripts/aggregate-results.sh

Verify that every judge file in the corpus carries a scores_pre_r1 key (expect 360/360):

python3 -c "
import json, pathlib
files = list(pathlib.Path('results').rglob('_blind-eval/*/[!.]*-judge.json'))
files = [f for f in files if '_archive' not in f.parts and 'request' not in f.name and 'raw' not in f.name]
total = len(files)
with_snap = sum(1 for f in files if 'scores_pre_r1' in json.loads(f.read_text()))
print(f'{with_snap}/{total} files carry scores_pre_r1')
"
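The one-liner above only prints the ratio, so a shortfall is easy to miss in a long terminal session. A minimal sketch of the same check as a reusable function follows. The function name snapshot_coverage is hypothetical and not part of the harness; the glob pattern and the filename filters are copied from the one-liner.

```python
import json
import pathlib


def snapshot_coverage(results_root="results"):
    """Return (with_snapshot, total) for the judge files under results_root.

    Hypothetical helper: it mirrors the filters of the one-liner above
    (skip _archive directories, request files, and raw files).
    """
    files = [
        f
        for f in pathlib.Path(results_root).rglob("_blind-eval/*/[!.]*-judge.json")
        if "_archive" not in f.parts
        and "request" not in f.name
        and "raw" not in f.name
    ]
    # A file counts as covered when its top-level JSON object has the key.
    with_snap = sum(
        1 for f in files if "scores_pre_r1" in json.loads(f.read_text())
    )
    return with_snap, len(files)
```

Calling snapshot_coverage() from the repo root and comparing the result against (360, 360) gives a check that can fail a script instead of relying on someone reading the printed ratio.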

Verify that no fingerprint strings leak into the blind-eval patches or auto-metrics files:

grep -rlE "\.omc/|_bmad-output/|docs/bmad/|\.superpowers/|\.compound-engineering/|\.ecc/|CLAUDE\.md\.original|plugin_versions" \
  results/{,bugfix/,refactor/}_blind-eval/*/{implementation-diff.patch,auto-metrics.json} 2>/dev/null \
  | grep -v "\.pre-" \
  | head
# Expect: no output (no file matches).
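
The grep pipeline above relies on shell brace expansion, and the head at the end means its exit status does not show whether anything matched. A rough Python equivalent is sketched below under a few assumptions: the name find_leaks and the module-level constants are hypothetical, the directory layout is inferred from the brace pattern (results/, results/bugfix/, results/refactor/), and the ".pre-" exclusion is applied to the path relative to the results root.

```python
import pathlib
import re

# Same alternation as the grep -E pattern above.
FINGERPRINTS = re.compile(
    r"\.omc/|_bmad-output/|docs/bmad/|\.superpowers/|\.compound-engineering/"
    r"|\.ecc/|CLAUDE\.md\.original|plugin_versions"
)
ARTIFACTS = ("implementation-diff.patch", "auto-metrics.json")
TASK_DIRS = ("", "bugfix", "refactor")  # "" is the feature task at the root


def find_leaks(results_root="results"):
    """Return the artifact paths whose contents match a fingerprint pattern."""
    root = pathlib.Path(results_root)
    leaks = []
    for task_dir in TASK_DIRS:
        eval_dir = root / task_dir / "_blind-eval"
        if not eval_dir.is_dir():
            continue
        run_dirs = sorted(
            p for p in eval_dir.iterdir()
            if p.is_dir() and not p.name.startswith(".")  # shell * skips dot dirs
        )
        for run_dir in run_dirs:
            for name in ARTIFACTS:
                path = run_dir / name
                # Mirror `grep -v "\.pre-"`: skip paths containing ".pre-".
                if not path.is_file() or ".pre-" in str(path.relative_to(root)):
                    continue
                text = path.read_text(encoding="utf-8", errors="replace")
                if FINGERPRINTS.search(text):
                    leaks.append(path)
    return leaks
```

An empty list from find_leaks() corresponds to the empty grep output expected above; any returned path names a file that still needs a clean regeneration.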