Generated: 2026-05-12T10:00:22Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label directory. `^round[0-9]+$` subdirectory support is retained for multi-round stability studies (pilot/sample dirs excluded).
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: `sum(scores.values())`, not the stored `total` field.
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3/2/1/1/1 weighting is stored as `judges.*.weight` in `versions.lock.json` (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as `final-report.equal-weight.md`; the in-report Pooled Mean column is the same equal-weight comparator, so readers can verify rank stability without leaving this file.
- Judge scorer asymmetries: gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. Its weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition: the per-tool standard deviation column is split into within_σ (trial-to-trial spread within a judge — the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread — the stdev of the per-judge means). within_σ > between_σ would indicate the tool's output is genuinely unstable trial to trial; the reverse means most variance is judge disagreement. The within_σ label is retained from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise.
- Judge sampling not pinned: temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and the OpenAI `/v1/responses` endpoint do not expose temperature/seed, so residual sampler variance is absorbed into per-judge σ rather than eliminated.
- R1 mechanical-fact override: rubric items with deterministic answers (e.g. `tsc_errors == 0`) are rewritten post hoc from `auto-metrics.json` to remove LLM arithmetic/classification drift. Items locked per task: feature 12/13/16/20; bugfix 14/15; refactor 13/14. Pre-override scores are preserved under `scores_pre_r1` on every judged file (`scripts/aggregate-results.sh` runs an idempotent R1 sweep before aggregating).
- Blind eval is structural, not semantic: tool identity is hidden via NATO labels and a path- and content-level scrub of tool-specific directories (`.omc/`, `_bmad/`, `_bmad-output/`, `_bmad-core/`, `docs/bmad/`, `docs/superpowers/`, `plans/`, `.claudekit/`, `.gstack/`, `.superpowers/`, `.compound-engineering/`, `.ecc/`, `CLAUDE.md.original`). `auto-metrics.json` is anonymised by stripping `plugin_versions` and `collected_at`. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 22.4 h (2026-05-10 → 2026-05-11). `scripts/audit-cohort-symmetry.py` flags spans over 24 h as a soft warning; this cohort completed within that window.
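The canonical-score rule above can be sketched in a few lines. The JSON layout (a `scores` map of 20 rubric items plus a stored `total` field) is an assumption for illustration, not the harness's exact schema:

```python
import json

def canonical_score(judged: dict) -> int:
    """Canonical score = sum of the per-item rubric scores.

    The stored `total` field is deliberately ignored: judge models
    occasionally mis-sum their own rubric, so the item sum is treated
    as ground truth. (JSON layout assumed for illustration.)
    """
    return sum(judged["scores"].values())

# Hypothetical judge file whose stored total disagrees with its items:
judged = {"scores": {f"item_{i:02d}": 9 for i in range(1, 21)}, "total": 175}
assert canonical_score(judged) == 180  # 20 items × 9, not the stored 175

# In the harness this would be applied per judge file, e.g.:
# canonical_score(json.load(open(path)))
```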
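The R1 sweep's idempotence can likewise be sketched. Only `scores_pre_r1` and the locked item numbers come from the report; the `*_pass` metric keys and the all-or-nothing 10/0 rewrite rule are illustrative assumptions, not the actual `aggregate-results.sh` logic:

```python
# Per-task rubric items with deterministic (mechanically checkable) answers.
LOCKED_ITEMS = {
    "feature": [12, 13, 16, 20],
    "bugfix": [14, 15],
    "refactor": [13, 14],
}

def r1_sweep(judged: dict, auto_metrics: dict, task: str) -> dict:
    """Rewrite deterministic rubric items from auto-metrics.

    Pre-override scores are snapshotted once under `scores_pre_r1`,
    which makes repeated sweeps idempotent.
    """
    if "scores_pre_r1" not in judged:  # idempotence guard
        judged["scores_pre_r1"] = dict(judged["scores"])
    for item in LOCKED_ITEMS[task]:
        key = f"item_{item:02d}"
        # Assumed rule: mechanical check passes → full marks, else zero.
        judged["scores"][key] = 10 if auto_metrics.get(f"{key}_pass") else 0
    return judged
```

Running the sweep a second time leaves `scores_pre_r1` untouched, so aggregation order never matters.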
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| pure | 182.46 | 184.47 | 12.84 | 3.34 | 13.38 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 181.29 | 182.80 | 14.76 | 5.65 | 15.03 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 180.04 | 182.27 | 14.77 | 4.40 | 15.36 | 15 | 3 | 3 | 3 | 3 | 3 |
| ecc | 179.54 | 182.13 | 15.20 | 5.55 | 15.43 | 15 | 3 | 3 | 3 | 3 | 3 |
| claudekit | 178.25 | 180.27 | 18.92 | 4.37 | 19.89 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 173.46 | 177.20 | 17.81 | 5.54 | 18.48 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 170.79 | 174.40 | 21.86 | 14.02 | 19.11 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 169.17 | 174.00 | 18.92 | 8.75 | 18.52 | 15 | 3 | 3 | 3 | 3 | 3 |
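The within_σ / between_σ split reported above can be recomputed from raw trial scores. A minimal sketch, with fabricated scores chosen so judges agree internally but disagree with each other (the pattern this cohort's tables show):

```python
from statistics import mean, stdev

def sigma_decomposition(trials_by_judge: dict[str, list[float]]):
    """Split a tool's score spread into two components.

    within_σ: mean of each judge's sample stdev across their trials
              (trial-to-trial instability of the tool's output).
    between_σ: sample stdev of the per-judge means
               (judge base-rate disagreement).
    """
    within = mean(stdev(scores) for scores in trials_by_judge.values())
    between = stdev(mean(scores) for scores in trials_by_judge.values())
    return within, between

# Fabricated trials: stable within each judge, divergent across judges.
trials = {
    "opus": [188, 187, 188],
    "grok420": [184, 185, 184],
    "glm51": [192, 191, 192],
    "gpt54pro": [162, 161, 162],
    "mimo25pro": [197, 196, 197],
}
w, b = sigma_decomposition(trials)
assert w < b  # most variance here is judge disagreement
```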
- pure — 182.46/200
- bmad — 181.29/200
- superpower — 180.04/200
- ecc — 179.54/200
- claudekit — 178.25/200
- compound — 173.46/200
- gstack — 170.79/200
- omc — 169.17/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---:|---:|---:|---:|---:|
| pure | 187.7 | 184.3 | 191.7 | 162.0 | 196.7 |
| bmad | 190.0 | 185.0 | 190.7 | 156.3 | 192.0 |
| superpower | 187.0 | 187.3 | 190.7 | 155.0 | 191.3 |
| ecc | 185.3 | 187.7 | 189.7 | 155.0 | 193.0 |
| claudekit | 189.3 | 183.3 | 185.3 | 146.0 | 197.3 |
| compound | 177.7 | 182.3 | 184.0 | 146.3 | 195.7 |
| gstack | 176.7 | 184.0 | 182.3 | 141.0 | 188.0 |
| omc | 170.0 | 183.7 | 184.3 | 143.3 | 188.7 |
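Given the per-judge means above, the headline columns can be reproduced directly. A sketch using the reported 3/2/1/1/1 weights; because the tabled per-judge means are already rounded to 0.1, expect a few hundredths of drift against the report's unrounded aggregates:

```python
# Pre-registered judge weights from versions.lock.json (as reported).
WEIGHTS = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

def weighted_mean(per_judge_means: dict[str, float]) -> float:
    """Weighted mean of per-judge means (the report's headline metric)."""
    return sum(WEIGHTS[j] * m for j, m in per_judge_means.items()) / sum(WEIGHTS.values())

def pooled_mean(per_judge_means: dict[str, float]) -> float:
    """Equal-weight comparator (the Pooled Mean column)."""
    return sum(per_judge_means.values()) / len(per_judge_means)

# Per-judge means for `pure`, as tabled (rounded to 0.1).
pure = {"opus": 187.7, "grok420": 184.3, "glm51": 191.7,
        "gpt54pro": 162.0, "mimo25pro": 196.7}
assert abs(weighted_mean(pure) - 182.46) < 0.05  # report: 182.46
assert abs(pooled_mean(pure) - 184.47) < 0.05    # report: 184.47
```

Note how opus's ×3 and gpt54pro's ×2 pull `pure`'s weighted mean about two points below its pooled mean, since gpt54pro is the harshest scorer.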