Generated: 2026-05-12T10:00:22Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label directory. `^round[0-9]+$` subdirectory support is retained for multi-round stability studies (pilot/sample dirs excluded).
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: `sum(scores.values())`, not the stored `total` field.
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3/2/1/1/1 weighting is stored as `judges.*.weight` in `versions.lock.json` (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as `final-report.equal-weight.md`; the in-report Pooled Mean column is the same equal-weight comparator, so readers can verify rank stability without leaving this file.
- Judge scorer asymmetries: gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. Its weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition: the per-tool standard deviation column is split into within_σ (trial-to-trial spread within a judge — the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread — the stdev of the per-judge means). within_σ > between_σ would indicate the tool's output is genuinely unstable trial to trial; the reverse means most variance is judge disagreement. The within_σ label is retained from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise.
- Judge sampling not pinned: temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and the OpenAI `/v1/responses` endpoint do not expose temperature/seed, so residual sampler variance is absorbed into per-judge σ rather than eliminated.
- R1 mechanical-fact override: rubric items with deterministic answers (e.g. `tsc_errors == 0`) are rewritten post hoc from `auto-metrics.json` to remove LLM arithmetic/classification drift. Items locked per task: feature 12/13/16/20; bugfix 14/15; refactor 13/14. Pre-override scores are preserved under `scores_pre_r1` on every judged file (`scripts/aggregate-results.sh` runs an idempotent R1 sweep before aggregating).
- Blind eval is structural, not semantic: tool identity is hidden via NATO labels and a path- and content-level scrub of tool-specific directories (`.omc/`, `_bmad/`, `_bmad-output/`, `_bmad-core/`, `docs/bmad/`, `docs/superpowers/`, `plans/`, `.claudekit/`, `.gstack/`, `.superpowers/`, `.compound-engineering/`, `.ecc/`, `CLAUDE.md.original`). `auto-metrics.json` is anonymised by stripping `plugin_versions` and `collected_at`. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 22.4 h (2026-05-10 → 2026-05-11). `scripts/audit-cohort-symmetry.py` flags spans over 24 h as a soft warning; this cohort completed within that window.
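The canonical-score rule above can be sketched in a few lines. The JSON layout (a `scores` map of 20 rubric items plus a stored `total` field) is an assumption for illustration, not the harness's exact schema:

```python
import json

def canonical_score(judged: dict) -> int:
    """Canonical score = sum of the per-item rubric scores.

    The stored `total` field is deliberately ignored: judge models
    occasionally mis-sum their own rubric, so the item sum is treated
    as ground truth. (JSON layout assumed for illustration.)
    """
    return sum(judged["scores"].values())

# Hypothetical judge file whose stored total disagrees with its items:
judged = {"scores": {f"item_{i:02d}": 9 for i in range(1, 21)}, "total": 175}
assert canonical_score(judged) == 180  # 20 items × 9, not the stored 175

# In the harness this would be applied per judge file, e.g.:
# canonical_score(json.load(open(path)))
```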
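The R1 sweep's idempotence can likewise be sketched. Only `scores_pre_r1` and the locked item numbers come from the report; the `*_pass` metric keys and the all-or-nothing 10/0 rewrite rule are illustrative assumptions, not the actual `aggregate-results.sh` logic:

```python
# Per-task rubric items with deterministic (mechanically checkable) answers.
LOCKED_ITEMS = {
    "feature": [12, 13, 16, 20],
    "bugfix": [14, 15],
    "refactor": [13, 14],
}

def r1_sweep(judged: dict, auto_metrics: dict, task: str) -> dict:
    """Rewrite deterministic rubric items from auto-metrics.

    Pre-override scores are snapshotted once under `scores_pre_r1`,
    which makes repeated sweeps idempotent.
    """
    if "scores_pre_r1" not in judged:  # idempotence guard
        judged["scores_pre_r1"] = dict(judged["scores"])
    for item in LOCKED_ITEMS[task]:
        key = f"item_{item:02d}"
        # Assumed rule: mechanical check passes → full marks, else zero.
        judged["scores"][key] = 10 if auto_metrics.get(f"{key}_pass") else 0
    return judged
```

Running the sweep a second time leaves `scores_pre_r1` untouched, so aggregation order never matters.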
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| pure | 182.46 | 184.47 | 12.84 | 3.34 | 13.38 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 181.29 | 182.80 | 14.76 | 5.65 | 15.03 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 180.04 | 182.27 | 14.77 | 4.40 | 15.36 | 15 | 3 | 3 | 3 | 3 | 3 |
| ecc | 179.54 | 182.13 | 15.20 | 5.55 | 15.43 | 15 | 3 | 3 | 3 | 3 | 3 |
| claudekit | 178.25 | 180.27 | 18.92 | 4.37 | 19.89 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 173.46 | 177.20 | 17.81 | 5.54 | 18.48 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 170.79 | 174.40 | 21.86 | 14.02 | 19.11 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 169.17 | 174.00 | 18.92 | 8.75 | 18.52 | 15 | 3 | 3 | 3 | 3 | 3 |
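The within_σ / between_σ split reported above can be recomputed from raw trial scores. A minimal sketch, with fabricated scores chosen so judges agree internally but disagree with each other (the pattern this cohort's tables show):

```python
from statistics import mean, stdev

def sigma_decomposition(trials_by_judge: dict[str, list[float]]):
    """Split a tool's score spread into two components.

    within_σ: mean of each judge's sample stdev across their trials
              (trial-to-trial instability of the tool's output).
    between_σ: sample stdev of the per-judge means
               (judge base-rate disagreement).
    """
    within = mean(stdev(scores) for scores in trials_by_judge.values())
    between = stdev(mean(scores) for scores in trials_by_judge.values())
    return within, between

# Fabricated trials: stable within each judge, divergent across judges.
trials = {
    "opus": [188, 187, 188],
    "grok420": [184, 185, 184],
    "glm51": [192, 191, 192],
    "gpt54pro": [162, 161, 162],
    "mimo25pro": [197, 196, 197],
}
w, b = sigma_decomposition(trials)
assert w < b  # most variance here is judge disagreement
```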
- pure — 182.46/200
- bmad — 181.29/200
- superpower — 180.04/200
- ecc — 179.54/200
- claudekit — 178.25/200
- compound — 173.46/200
- gstack — 170.79/200
- omc — 169.17/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---:|---:|---:|---:|---:|
| pure | 187.7 | 184.3 | 191.7 | 162.0 | 196.7 |
| bmad | 190.0 | 185.0 | 190.7 | 156.3 | 192.0 |
| superpower | 187.0 | 187.3 | 190.7 | 155.0 | 191.3 |
| ecc | 185.3 | 187.7 | 189.7 | 155.0 | 193.0 |
| claudekit | 189.3 | 183.3 | 185.3 | 146.0 | 197.3 |
| compound | 177.7 | 182.3 | 184.0 | 146.3 | 195.7 |
| gstack | 176.7 | 184.0 | 182.3 | 141.0 | 188.0 |
| omc | 170.0 | 183.7 | 184.3 | 143.3 | 188.7 |
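Given the per-judge means above, the headline columns can be reproduced directly. A sketch using the reported 3/2/1/1/1 weights; because the tabled per-judge means are already rounded to 0.1, expect a few hundredths of drift against the report's unrounded aggregates:

```python
# Pre-registered judge weights from versions.lock.json (as reported).
WEIGHTS = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

def weighted_mean(per_judge_means: dict[str, float]) -> float:
    """Weighted mean of per-judge means (the report's headline metric)."""
    return sum(WEIGHTS[j] * m for j, m in per_judge_means.items()) / sum(WEIGHTS.values())

def pooled_mean(per_judge_means: dict[str, float]) -> float:
    """Equal-weight comparator (the Pooled Mean column)."""
    return sum(per_judge_means.values()) / len(per_judge_means)

# Per-judge means for `pure`, as tabled (rounded to 0.1).
pure = {"opus": 187.7, "grok420": 184.3, "glm51": 191.7,
        "gpt54pro": 162.0, "mimo25pro": 196.7}
assert abs(weighted_mean(pure) - 182.46) < 0.05  # report: 182.46
assert abs(pooled_mean(pure) - 184.47) < 0.05    # report: 184.47
```

Note how opus's ×3 and gpt54pro's ×2 pull `pure`'s weighted mean about two points below its pooled mean, since gpt54pro is the harshest scorer.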