Generated: 2026-05-12T10:00:20Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label dir. ^round[0-9]+$ subdir support is retained for multi-round stability studies (pilot/sample dirs excluded); a discovery sketch follows this list.
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: sum(scores.values()), not the stored total field (see the scoring sketch below the list).
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3 / 2 / 1 / 1 / 1 weighting is stored as judges.*.weight in versions.lock.json (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as final-report.equal-weight.md; the in-report Pooled Mean column is the same equal-weight comparator and lets readers verify rank stability without leaving this file. Both aggregations are shown in a sketch after this list.
- Judge scorer asymmetries: gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. Its weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition: the per-tool standard deviation column is split into within_σ (trial-to-trial within-judge spread: the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread: the stdev of the per-judge means). Within > between would indicate the tool's output is genuinely unstable trial to trial; the reverse means most of the variance is judge disagreement. The within_σ label is preserved from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise. The split is computed as in the sketch after this list.
- Judge sampling not pinned: temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and OpenAI /v1/responses do not expose temperature/seed, so residual sampler variance is absorbed in per-judge σ rather than eliminated.
- R1 mechanical-fact override: rubric items with deterministic answers (e.g. tsc_errors == 0) are rewritten post-hoc from auto-metrics.json to remove LLM arithmetic/classification drift. Items locked per task: feature 12/13/16/20, bugfix 14/15, refactor 13/14. Pre-override scores are preserved under scores_pre_r1 on every judged file (scripts/aggregate-results.sh runs an idempotent R1 sweep before aggregating); the sweep is outlined in a sketch below.
- Blind eval is structural, not semantic: tool identity is hidden via NATO labels and a path-/content-level scrub of tool-specific directories (.omc/, _bmad/, _bmad-output/, _bmad-core/, docs/bmad/, docs/superpowers/, plans/, .claudekit/, .gstack/, .superpowers/, .compound-engineering/, .ecc/, CLAUDE.md.original). auto-metrics.json is anonymised by stripping plugin_versions and collected_at; the scrub is shown in a short sketch below. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 2.7h (2026-05-10 → 2026-05-10). scripts/audit-cohort-symmetry.py flags spans >24h as a soft warning; this cohort completed well within that window. The span check appears in a sketch below.
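The sketches below illustrate the mechanics referenced in the list above. They are minimal reconstructions, not excerpts from the harness; every file name, field name, and helper not quoted in the list is an assumption.

Discovery: judge files are read flat from each label dir, with ^round[0-9]+$ subdirs still honoured and pilot/sample dirs skipped. The *.judge.json glob is assumed for illustration.

```python
import re
from pathlib import Path

ROUND_RE = re.compile(r"^round[0-9]+$")
EXCLUDED = {"pilot", "sample"}          # assumed names for the excluded dirs

def judge_files(label_dir: Path):
    """Yield judge files for one blind label: flat layout plus optional round subdirs."""
    yield from sorted(label_dir.glob("*.judge.json"))           # single-round layout (this cohort)
    for sub in sorted(p for p in label_dir.iterdir() if p.is_dir()):
        if sub.name in EXCLUDED:
            continue
        if ROUND_RE.match(sub.name):                            # retained multi-round layout
            yield from sorted(sub.glob("*.judge.json"))
```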
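Canonical score: re-derive the per-file score by summing the 20 rubric items rather than trusting the stored total, as the list states. The JSON shape (a scores mapping plus a total field) is assumed.

```python
import json
from pathlib import Path

def canonical_score(judged_file: Path) -> int:
    """Canonical score: sum(scores.values()), never the stored total."""
    doc = json.loads(judged_file.read_text())
    score = sum(doc["scores"].values())     # 20 items x 0-10 pts
    assert 0 <= score <= 200
    return score
```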
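Aggregation: the reported Weighted Mean and the equal-weight Pooled Mean, both computed from per-judge means. The versions.lock.json shape follows the judges.*.weight wording above; scores_by_judge (judge name → list of canonical scores, three per judge in this cohort) is an assumed input.

```python
import json
from statistics import mean

def load_weights(lock_path: str = "versions.lock.json") -> dict[str, float]:
    """Pre-registered weights: opus x3, gpt54pro x2, others x1 (assumed lock-file shape)."""
    with open(lock_path) as fh:
        lock = json.load(fh)
    return {name: judge["weight"] for name, judge in lock["judges"].items()}

def tool_means(scores_by_judge: dict[str, list[int]], weights: dict[str, float]):
    """Return (Weighted Mean, Pooled Mean) for one tool."""
    per_judge = {j: mean(s) for j, s in scores_by_judge.items()}        # mean over that judge's trials
    weighted = (sum(weights[j] * m for j, m in per_judge.items())
                / sum(weights[j] for j in per_judge))                   # reported tool mean
    pooled = mean(per_judge.values())                                   # equal-weight comparator
    return weighted, pooled
```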
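σ decomposition: within_σ and between_σ over the same per-judge score lists. Reading Pooled σ as the stdev of all 15 judgments is an assumption consistent with the column name.

```python
from statistics import mean, stdev

def sigma_split(scores_by_judge: dict[str, list[int]]):
    """Return (within_σ, between_σ, pooled_σ) for one tool."""
    within = mean(stdev(trials) for trials in scores_by_judge.values())     # mean per-judge stdev across trials
    between = stdev(mean(trials) for trials in scores_by_judge.values())    # stdev of per-judge means
    pooled = stdev(s for trials in scores_by_judge.values() for s in trials)  # all 15 judgments pooled
    return within, between, pooled
```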
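R1 sweep: an idempotent override in miniature. The locked item ids per task come from the list above; the string item keys and the item_score callback (mapping a deterministic fact such as tsc_errors == 0 onto a 0-10 item score) are assumptions.

```python
import json
from pathlib import Path
from typing import Callable

# Locked rubric items per task type (from the list above).
R1_ITEMS = {"feature": [12, 13, 16, 20], "bugfix": [14, 15], "refactor": [13, 14]}

def r1_override(judged_file: Path, auto_metrics: dict, task: str,
                item_score: Callable[[int, dict], int]) -> None:
    """Rewrite mechanical-fact items from auto-metrics; safe to run repeatedly."""
    doc = json.loads(judged_file.read_text())
    doc.setdefault("scores_pre_r1", dict(doc["scores"]))     # keep the LLM's originals, first sweep only
    for item in R1_ITEMS.get(task, []):
        doc["scores"][str(item)] = item_score(item, auto_metrics)
    judged_file.write_text(json.dumps(doc, indent=2))
```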
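Metrics scrub: drop the two identifying fields from auto-metrics.json and leave everything else untouched.

```python
import json
from pathlib import Path

def anonymise_auto_metrics(path: Path) -> None:
    """Strip plugin_versions and collected_at so the metrics file cannot name the tool."""
    metrics = json.loads(path.read_text())
    metrics.pop("plugin_versions", None)
    metrics.pop("collected_at", None)
    path.write_text(json.dumps(metrics, indent=2))
```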
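Span check: the soft warning from scripts/audit-cohort-symmetry.py, reduced to its core comparison. The ISO-8601 timestamp input is an assumption.

```python
from datetime import datetime

def span_warning(timestamps: list[str], limit_hours: float = 24.0) -> str | None:
    """Return a soft warning when the cohort spans more than limit_hours, else None."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    span_h = (max(times) - min(times)).total_seconds() / 3600
    if span_h > limit_hours:
        return f"soft warning: cohort span {span_h:.1f}h exceeds {limit_hours:.0f}h"
    return None
```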
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claudekit | 184.33 | 186.33 | 10.81 | 8.01 | 8.11 | 15 | 3 | 3 | 3 | 3 | 3 |
| ecc | 181.38 | 183.73 | 12.03 | 7.92 | 9.94 | 15 | 3 | 3 | 3 | 3 | 3 |
| pure | 175.00 | 177.00 | 14.05 | 13.46 | 8.72 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 173.83 | 176.73 | 15.75 | 12.05 | 12.49 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 168.50 | 170.27 | 13.11 | 7.67 | 12.03 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 167.33 | 169.67 | 12.51 | 6.48 | 12.01 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 165.08 | 167.40 | 21.54 | 17.67 | 15.61 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 159.71 | 163.53 | 16.17 | 7.68 | 15.70 | 15 | 3 | 3 | 3 | 3 | 3 |
- claudekit — 184.33/200
- ecc — 181.38/200
- pure — 175.00/200
- bmad — 173.83/200
- superpower — 168.50/200
- compound — 167.33/200
- omc — 165.08/200
- gstack — 159.71/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---|---|---|---|---|
| claudekit | 184.3 | 191.0 | 186.0 | 174.3 | 196.0 |
| ecc | 182.7 | 191.3 | 187.3 | 167.0 | 190.3 |
| pure | 176.3 | 180.3 | 181.3 | 162.3 | 184.7 |
| bmad | 174.7 | 176.3 | 184.0 | 157.7 | 191.0 |
| superpower | 173.7 | 172.0 | 177.3 | 149.3 | 179.0 |
| compound | 170.3 | 173.3 | 173.0 | 149.7 | 182.0 |
| omc | 170.7 | 173.3 | 166.0 | 142.3 | 184.7 |
| gstack | 161.0 | 173.3 | 167.0 | 138.0 | 178.3 |
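As a spot check, claudekit's row in the per-judge table above reproduces the headline numbers up to rounding of the displayed means: Weighted Mean = (3×184.3 + 1×191.0 + 1×186.0 + 2×174.3 + 1×196.0) / 8 ≈ 184.3 and Pooled Mean = (184.3 + 191.0 + 186.0 + 174.3 + 196.0) / 5 ≈ 186.3, matching the 184.33 / 186.33 reported in the ranking table.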