Generated: 2026-05-12T10:00:20Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label dir. ^round[0-9]+$ subdir support is retained for multi-round stability studies (pilot/sample dirs excluded); a discovery sketch follows this list.
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: sum(scores.values()), not the stored total field (see the scoring sketch below the list).
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 weighting is stored as judges.*.weight in versions.lock.json (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as final-report.equal-weight.md; the in-report Pooled Mean column is the same equal-weight comparator and lets readers verify rank stability without leaving this file. Both aggregations are shown in a sketch after this list.
- Judge scorer asymmetries: gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. Its weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition: the per-tool standard deviation column is split into within_σ (trial-to-trial within-judge spread: the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread: the stdev of the per-judge means). Within > between would indicate the tool's output is genuinely unstable trial to trial; the reverse means most of the variance is judge disagreement. The within_σ label is preserved from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise. The split is computed as in the sketch after this list.
- Judge sampling not pinned: temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and OpenAI /v1/responses do not expose temperature/seed, so residual sampler variance is absorbed in per-judge σ rather than eliminated.
- R1 mechanical-fact override: rubric items with deterministic answers (e.g. tsc_errors == 0) are rewritten post-hoc from auto-metrics.json to remove LLM arithmetic/classification drift. Items locked per task: feature 12/13/16/20, bugfix 14/15, refactor 13/14. Pre-override scores are preserved under scores_pre_r1 on every judged file (scripts/aggregate-results.sh runs an idempotent R1 sweep before aggregating); the sweep is outlined in a sketch below.
- Blind eval is structural, not semantic: tool identity is hidden via NATO labels and a path-/content-level scrub of tool-specific directories (.omc/, _bmad/, _bmad-output/, _bmad-core/, docs/bmad/, docs/superpowers/, plans/, .claudekit/, .gstack/, .superpowers/, .compound-engineering/, .ecc/, CLAUDE.md.original). auto-metrics.json is anonymised by stripping plugin_versions and collected_at; the scrub is shown in a short sketch below. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 2.7h (2026-05-10 → 2026-05-10). scripts/audit-cohort-symmetry.py flags spans >24h as a soft warning; this cohort completed well within that window. The span check appears in a sketch below.
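The sketches below illustrate the mechanics referenced in the list above. They are minimal reconstructions, not excerpts from the harness; every file name, field name, and helper not quoted in the list is an assumption.

Discovery: judge files are read flat from each label dir, with ^round[0-9]+$ subdirs still honoured and pilot/sample dirs skipped. The *.judge.json glob is assumed for illustration.

```python
import re
from pathlib import Path

ROUND_RE = re.compile(r"^round[0-9]+$")
EXCLUDED = {"pilot", "sample"}          # assumed names for the excluded dirs

def judge_files(label_dir: Path):
    """Yield judge files for one blind label: flat layout plus optional round subdirs."""
    yield from sorted(label_dir.glob("*.judge.json"))           # single-round layout (this cohort)
    for sub in sorted(p for p in label_dir.iterdir() if p.is_dir()):
        if sub.name in EXCLUDED:
            continue
        if ROUND_RE.match(sub.name):                            # retained multi-round layout
            yield from sorted(sub.glob("*.judge.json"))
```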
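Canonical score: re-derive the per-file score by summing the 20 rubric items rather than trusting the stored total, as the list states. The JSON shape (a scores mapping plus a total field) is assumed.

```python
import json
from pathlib import Path

def canonical_score(judged_file: Path) -> int:
    """Canonical score: sum(scores.values()), never the stored total."""
    doc = json.loads(judged_file.read_text())
    score = sum(doc["scores"].values())     # 20 items x 0-10 pts
    assert 0 <= score <= 200
    return score
```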
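Aggregation: the reported Weighted Mean and the equal-weight Pooled Mean, both computed from per-judge means. The versions.lock.json shape follows the judges.*.weight wording above; scores_by_judge (judge name → list of canonical scores, three per judge in this cohort) is an assumed input.

```python
import json
from statistics import mean

def load_weights(lock_path: str = "versions.lock.json") -> dict[str, float]:
    """Pre-registered weights: opus x3, gpt54pro x2, others x1 (assumed lock-file shape)."""
    with open(lock_path) as fh:
        lock = json.load(fh)
    return {name: judge["weight"] for name, judge in lock["judges"].items()}

def tool_means(scores_by_judge: dict[str, list[int]], weights: dict[str, float]):
    """Return (Weighted Mean, Pooled Mean) for one tool."""
    per_judge = {j: mean(s) for j, s in scores_by_judge.items()}        # mean over that judge's trials
    weighted = (sum(weights[j] * m for j, m in per_judge.items())
                / sum(weights[j] for j in per_judge))                   # reported tool mean
    pooled = mean(per_judge.values())                                   # equal-weight comparator
    return weighted, pooled
```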
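σ decomposition: within_σ and between_σ over the same per-judge score lists. Reading Pooled σ as the stdev of all 15 judgments is an assumption consistent with the column name.

```python
from statistics import mean, stdev

def sigma_split(scores_by_judge: dict[str, list[int]]):
    """Return (within_σ, between_σ, pooled_σ) for one tool."""
    within = mean(stdev(trials) for trials in scores_by_judge.values())     # mean per-judge stdev across trials
    between = stdev(mean(trials) for trials in scores_by_judge.values())    # stdev of per-judge means
    pooled = stdev(s for trials in scores_by_judge.values() for s in trials)  # all 15 judgments pooled
    return within, between, pooled
```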
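R1 sweep: an idempotent override in miniature. The locked item ids per task come from the list above; the string item keys and the item_score callback (mapping a deterministic fact such as tsc_errors == 0 onto a 0-10 item score) are assumptions.

```python
import json
from pathlib import Path
from typing import Callable

# Locked rubric items per task type (from the list above).
R1_ITEMS = {"feature": [12, 13, 16, 20], "bugfix": [14, 15], "refactor": [13, 14]}

def r1_override(judged_file: Path, auto_metrics: dict, task: str,
                item_score: Callable[[int, dict], int]) -> None:
    """Rewrite mechanical-fact items from auto-metrics; safe to run repeatedly."""
    doc = json.loads(judged_file.read_text())
    doc.setdefault("scores_pre_r1", dict(doc["scores"]))     # keep the LLM's originals, first sweep only
    for item in R1_ITEMS.get(task, []):
        doc["scores"][str(item)] = item_score(item, auto_metrics)
    judged_file.write_text(json.dumps(doc, indent=2))
```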
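Metrics scrub: drop the two identifying fields from auto-metrics.json and leave everything else untouched.

```python
import json
from pathlib import Path

def anonymise_auto_metrics(path: Path) -> None:
    """Strip plugin_versions and collected_at so the metrics file cannot name the tool."""
    metrics = json.loads(path.read_text())
    metrics.pop("plugin_versions", None)
    metrics.pop("collected_at", None)
    path.write_text(json.dumps(metrics, indent=2))
```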
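Span check: the soft warning from scripts/audit-cohort-symmetry.py, reduced to its core comparison. The ISO-8601 timestamp input is an assumption.

```python
from datetime import datetime

def span_warning(timestamps: list[str], limit_hours: float = 24.0) -> str | None:
    """Return a soft warning when the cohort spans more than limit_hours, else None."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    span_h = (max(times) - min(times)).total_seconds() / 3600
    if span_h > limit_hours:
        return f"soft warning: cohort span {span_h:.1f}h exceeds {limit_hours:.0f}h"
    return None
```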
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claudekit | 184.33 | 186.33 | 10.81 | 8.01 | 8.11 | 15 | 3 | 3 | 3 | 3 | 3 |
| ecc | 181.38 | 183.73 | 12.03 | 7.92 | 9.94 | 15 | 3 | 3 | 3 | 3 | 3 |
| pure | 175.00 | 177.00 | 14.05 | 13.46 | 8.72 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 173.83 | 176.73 | 15.75 | 12.05 | 12.49 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 168.50 | 170.27 | 13.11 | 7.67 | 12.03 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 167.33 | 169.67 | 12.51 | 6.48 | 12.01 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 165.08 | 167.40 | 21.54 | 17.67 | 15.61 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 159.71 | 163.53 | 16.17 | 7.68 | 15.70 | 15 | 3 | 3 | 3 | 3 | 3 |
- claudekit — 184.33/200
- ecc — 181.38/200
- pure — 175.00/200
- bmad — 173.83/200
- superpower — 168.50/200
- compound — 167.33/200
- omc — 165.08/200
- gstack — 159.71/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---|---|---|---|---|
| claudekit | 184.3 | 191.0 | 186.0 | 174.3 | 196.0 |
| ecc | 182.7 | 191.3 | 187.3 | 167.0 | 190.3 |
| pure | 176.3 | 180.3 | 181.3 | 162.3 | 184.7 |
| bmad | 174.7 | 176.3 | 184.0 | 157.7 | 191.0 |
| superpower | 173.7 | 172.0 | 177.3 | 149.3 | 179.0 |
| compound | 170.3 | 173.3 | 173.0 | 149.7 | 182.0 |
| omc | 170.7 | 173.3 | 166.0 | 142.3 | 184.7 |
| gstack | 161.0 | 173.3 | 167.0 | 138.0 | 178.3 |
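As a spot check, claudekit's row in the per-judge table above reproduces the headline numbers up to rounding of the displayed means: Weighted Mean = (3×184.3 + 1×191.0 + 1×186.0 + 2×174.3 + 1×196.0) / 8 ≈ 184.3 and Pooled Mean = (184.3 + 191.0 + 186.0 + 174.3 + 196.0) / 5 ≈ 186.3, matching the 184.33 / 186.33 reported in the ranking table.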