Generated: 2026-05-12T10:00:19Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label dir. ^round[0-9]+$ subdir support is retained for multi-round stability studies (pilot/sample dirs excluded).
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: sum(scores.values()), not the stored total field.
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3/2/1/1/1 weighting is stored as judges.*.weight in versions.lock.json (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as final-report.equal-weight.md; the in-report Pooled Mean column is the same equal-weight comparator, so readers can verify rank stability without leaving this file.
- Judge scoring asymmetries. gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. mimo25pro's weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition. The per-tool standard deviation column is split into within_σ (trial-to-trial spread within a judge: the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread: the stdev of the per-judge means). Within > between would indicate the tool's output is genuinely unstable trial to trial; the reverse means most variance is judge disagreement. The within_σ label is preserved from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise.
- Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and OpenAI /v1/responses do not expose temperature/seed, so residual sampler variance is absorbed into per-judge σ rather than eliminated.
- R1 mechanical-fact override. Rubric items with deterministic answers (e.g. tsc_errors == 0) are rewritten post hoc from auto-metrics.json to remove LLM arithmetic and classification drift. Items locked per task: feature 12/13/16/20; bugfix 14/15; refactor 13/14. Pre-override scores are preserved under scores_pre_r1 on every judged file (scripts/aggregate-results.sh runs an idempotent R1 sweep before aggregating).
- Blind eval is structural, not semantic. Tool identity is hidden via NATO labels and a path- and content-level scrub of tool-specific directories (.omc/, _bmad/, _bmad-output/, _bmad-core/, docs/bmad/, docs/superpowers/, plans/, .claudekit/, .gstack/, .superpowers/, .compound-engineering/, .ecc/, CLAUDE.md.original). auto-metrics.json is anonymised by stripping plugin_versions and collected_at. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 27.0h (2026-05-09 → 2026-05-10). Spans >24h indicate the cohort did not complete within a single day; scripts/audit-cohort-symmetry.py flags this as a soft warning. The longest spans in this report stem from the leak-fix re-judge pass (see docs/RERUN-PRE-PUBLISH.md).
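The canonical-score and weighted-mean rules above can be sketched as follows. This is a minimal illustration, not the harness code (the real aggregation lives in scripts/aggregate-results.sh); the helper names and the judged-file shape are assumptions, while the weights are the pre-registered values from this report.

```python
from statistics import mean

# Pre-registered panel weights (judges.*.weight in versions.lock.json).
WEIGHTS = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

def canonical_score(judged: dict) -> int:
    # Re-sum the 20 rubric items; deliberately ignore the stored "total" field.
    return sum(judged["scores"].values())

def weighted_tool_mean(per_judge_scores: dict) -> float:
    # per_judge_scores: judge name -> list of canonical scores (one per trial).
    # Reported tool mean = weighted mean of the per-judge means.
    per_judge_means = {j: mean(s) for j, s in per_judge_scores.items()}
    total_weight = sum(WEIGHTS[j] for j in per_judge_means)
    return sum(WEIGHTS[j] * m for j, m in per_judge_means.items()) / total_weight
```

Feeding in ecc's per-judge means reproduces its Weighted Mean column to within rounding of the one-decimal table values.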
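The within_σ / between_σ split described in the notes can be sketched like this. A hypothetical helper using the population-stdev convention; the harness may use sample stdev instead, which would shift values slightly.

```python
from statistics import mean, pstdev

def sigma_split(per_judge_scores: dict):
    # per_judge_scores: judge name -> list of canonical scores (3 trials each).
    # within_σ: mean of each judge's own trial-to-trial stdev.
    within = mean(pstdev(trials) for trials in per_judge_scores.values())
    # between_σ: stdev of the per-judge means (judge base-rate spread).
    between = pstdev([mean(trials) for trials in per_judge_scores.values()])
    return within, between
```

With perfectly stable trials and disagreeing judges, within_σ is 0 and all variance lands in between_σ, matching the interpretation given above.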
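The R1 mechanical-fact override can be sketched as below. This is an illustration of the idempotent snapshot-then-overwrite pattern, assuming a judged-file dict with a scores map; the item id and the 0/10 mapping from tsc_errors are illustrative, not the harness's actual rubric wiring.

```python
def apply_r1(judged: dict, auto_metrics: dict, locked_items: dict) -> dict:
    # locked_items: rubric item id -> callable(auto_metrics) returning 0..10.
    # Snapshot pre-override scores exactly once, so repeated sweeps are idempotent.
    if "scores_pre_r1" not in judged:
        judged["scores_pre_r1"] = dict(judged["scores"])
    for item, scorer in locked_items.items():
        judged["scores"][item] = scorer(auto_metrics)
    return judged
```

Because the snapshot is guarded, running the sweep twice leaves scores_pre_r1 untouched, which is what lets aggregate-results.sh rerun it safely before every aggregation.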
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ecc | 149.17 | 152.80 | 15.15 | 4.72 | 15.65 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 144.58 | 151.07 | 17.83 | 7.08 | 17.88 | 15 | 3 | 3 | 3 | 3 | 3 |
| pure | 144.25 | 148.27 | 17.79 | 5.07 | 18.54 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 142.29 | 145.60 | 14.26 | 6.41 | 13.87 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 137.17 | 145.27 | 21.97 | 5.77 | 22.91 | 15 | 3 | 3 | 3 | 3 | 3 |
| claudekit | 135.96 | 138.87 | 21.14 | 15.45 | 17.27 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 135.17 | 139.60 | 20.63 | 9.91 | 19.60 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 121.17 | 126.73 | 23.81 | 12.37 | 21.59 | 15 | 3 | 3 | 3 | 3 | 3 |
- ecc — 149.17/200
- compound — 144.58/200
- pure — 144.25/200
- superpower — 142.29/200
- bmad — 137.17/200
- claudekit — 135.96/200
- omc — 135.17/200
- gstack — 121.17/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---|---|---|---|---|
| ecc | 151.7 | 162.0 | 161.7 | 126.0 | 162.7 |
| compound | 136.7 | 167.3 | 156.7 | 128.0 | 166.7 |
| pure | 147.3 | 167.0 | 150.7 | 118.0 | 158.3 |
| superpower | 144.0 | 155.7 | 150.3 | 122.3 | 155.7 |
| bmad | 129.0 | 161.0 | 159.7 | 113.0 | 163.7 |
| claudekit | 140.0 | 156.0 | 132.0 | 113.3 | 153.0 |
| omc | 138.0 | 155.7 | 142.3 | 107.3 | 154.7 |
| gstack | 121.7 | 147.3 | 131.0 | 92.3 | 141.3 |
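The per-judge means in the table above are enough for a reader to recompute both aggregate columns. A minimal cross-check for ecc (variable names are illustrative; weights are the pre-registered panel weights):

```python
from statistics import mean

# Per-judge means for ecc, read from the table above (rounded to one decimal).
ecc = {"opus": 151.7, "grok420": 162.0, "glm51": 161.7,
       "gpt54pro": 126.0, "mimo25pro": 162.7}
weights = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

pooled = mean(ecc.values())                    # equal-weight comparator
weighted = sum(weights[j] * m for j, m in ecc.items()) / sum(weights.values())
# Small offsets vs. the summary table (152.80 / 149.17) come from the summary
# being computed on unrounded per-judge means.
```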