Generated: 2026-05-12T10:00:19Z
- Tools under test: 8
- Blind labels: 24
- Layout: single round per (artifact, judge); judge files are read flat from each label dir. ^round[0-9]+$ subdir support is retained for multi-round stability studies (pilot/sample dirs excluded).
- Judges: opus, grok420, glm51, gpt54pro, mimo25pro (5-judge panel; each artifact scored once by every judge)
- Rubric: 20 items × 0–10 pts = 200 pt max
- Canonical score per judge file: sum(scores.values()), not the stored total field.
- Reported tool mean: weighted mean of per-judge means (weights: opus×3, gpt54pro×2, grok420×1, glm51×1, mimo25pro×1)
- Total judgments aggregated: 120
- Judge weights are pre-registered, not derived. The 3/2/1/1/1 weighting is stored as judges.*.weight in versions.lock.json (committed 2026-05-12) and reflects the operator's prior trust in the Anthropic (opus) and OpenAI (gpt54pro) reviewers. An equal-weight aggregation is emitted alongside this report as final-report.equal-weight.md; the in-report Pooled Mean column is the same equal-weight comparator, so readers can verify rank stability without leaving this file.
- Judge scoring asymmetries. gpt54pro is consistently the harshest scorer on the panel (lowest mean across labels); mimo25pro is the most lenient and occasionally emits 200/200 saturations. mimo25pro's weight of 1 dilutes the impact, but right-tail scores should be read in that context.
- σ decomposition. The per-tool standard deviation column is split into within_σ (trial-to-trial spread within a judge: the mean of the per-judge stdevs across the 3 trials) and between_σ (judge base-rate spread: the stdev of the per-judge means). Within > between would indicate the tool's output is genuinely unstable trial to trial; the reverse means most variance is judge disagreement. The within_σ label is preserved from the multi-round harness era; in this single-round cohort it measures within-judge trial-to-trial noise.
- Judge sampling not pinned. Temperature is fixed to 0 where the provider exposes it (OpenRouter, OpenCode Go). Claude CLI and OpenAI /v1/responses do not expose temperature/seed, so residual sampler variance is absorbed into per-judge σ rather than eliminated.
- R1 mechanical-fact override. Rubric items with deterministic answers (e.g. tsc_errors == 0) are rewritten post hoc from auto-metrics.json to remove LLM arithmetic and classification drift. Items locked per task: feature 12/13/16/20; bugfix 14/15; refactor 13/14. Pre-override scores are preserved under scores_pre_r1 on every judged file (scripts/aggregate-results.sh runs an idempotent R1 sweep before aggregating).
- Blind eval is structural, not semantic. Tool identity is hidden via NATO labels and a path- and content-level scrub of tool-specific directories (.omc/, _bmad/, _bmad-output/, _bmad-core/, docs/bmad/, docs/superpowers/, plans/, .claudekit/, .gstack/, .superpowers/, .compound-engineering/, .ecc/, CLAUDE.md.original). auto-metrics.json is anonymised by stripping plugin_versions and collected_at. A skilled judge could still infer identity from idiosyncratic code style; we don't claim semantic anonymity.
- Cohort span: 27.0h (2026-05-09 → 2026-05-10). Spans >24h indicate the cohort did not complete within a single day; scripts/audit-cohort-symmetry.py flags this as a soft warning. The longest spans in this report stem from the leak-fix re-judge pass (see docs/RERUN-PRE-PUBLISH.md).
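The canonical-score and weighted-mean rules above can be sketched as follows. This is a minimal illustration, not the harness code (the real aggregation lives in scripts/aggregate-results.sh); the helper names and the judged-file shape are assumptions, while the weights are the pre-registered values from this report.

```python
from statistics import mean

# Pre-registered panel weights (judges.*.weight in versions.lock.json).
WEIGHTS = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

def canonical_score(judged: dict) -> int:
    # Re-sum the 20 rubric items; deliberately ignore the stored "total" field.
    return sum(judged["scores"].values())

def weighted_tool_mean(per_judge_scores: dict) -> float:
    # per_judge_scores: judge name -> list of canonical scores (one per trial).
    # Reported tool mean = weighted mean of the per-judge means.
    per_judge_means = {j: mean(s) for j, s in per_judge_scores.items()}
    total_weight = sum(WEIGHTS[j] for j in per_judge_means)
    return sum(WEIGHTS[j] * m for j, m in per_judge_means.items()) / total_weight
```

Feeding in ecc's per-judge means reproduces its Weighted Mean column to within rounding of the one-decimal table values.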
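The within_σ / between_σ split described in the notes can be sketched like this. A hypothetical helper using the population-stdev convention; the harness may use sample stdev instead, which would shift values slightly.

```python
from statistics import mean, pstdev

def sigma_split(per_judge_scores: dict):
    # per_judge_scores: judge name -> list of canonical scores (3 trials each).
    # within_σ: mean of each judge's own trial-to-trial stdev.
    within = mean(pstdev(trials) for trials in per_judge_scores.values())
    # between_σ: stdev of the per-judge means (judge base-rate spread).
    between = pstdev([mean(trials) for trials in per_judge_scores.values()])
    return within, between
```

With perfectly stable trials and disagreeing judges, within_σ is 0 and all variance lands in between_σ, matching the interpretation given above.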
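The R1 mechanical-fact override can be sketched as below. This is an illustration of the idempotent snapshot-then-overwrite pattern, assuming a judged-file dict with a scores map; the item id and the 0/10 mapping from tsc_errors are illustrative, not the harness's actual rubric wiring.

```python
def apply_r1(judged: dict, auto_metrics: dict, locked_items: dict) -> dict:
    # locked_items: rubric item id -> callable(auto_metrics) returning 0..10.
    # Snapshot pre-override scores exactly once, so repeated sweeps are idempotent.
    if "scores_pre_r1" not in judged:
        judged["scores_pre_r1"] = dict(judged["scores"])
    for item, scorer in locked_items.items():
        judged["scores"][item] = scorer(auto_metrics)
    return judged
```

Because the snapshot is guarded, running the sweep twice leaves scores_pre_r1 untouched, which is what lets aggregate-results.sh rerun it safely before every aggregation.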
| Tool | Weighted Mean | Pooled Mean | Pooled σ | within_σ | between_σ | N | n(opus) | n(grok420) | n(glm51) | n(gpt54pro) | n(mimo25pro) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ecc | 149.17 | 152.80 | 15.15 | 4.72 | 15.65 | 15 | 3 | 3 | 3 | 3 | 3 |
| compound | 144.58 | 151.07 | 17.83 | 7.08 | 17.88 | 15 | 3 | 3 | 3 | 3 | 3 |
| pure | 144.25 | 148.27 | 17.79 | 5.07 | 18.54 | 15 | 3 | 3 | 3 | 3 | 3 |
| superpower | 142.29 | 145.60 | 14.26 | 6.41 | 13.87 | 15 | 3 | 3 | 3 | 3 | 3 |
| bmad | 137.17 | 145.27 | 21.97 | 5.77 | 22.91 | 15 | 3 | 3 | 3 | 3 | 3 |
| claudekit | 135.96 | 138.87 | 21.14 | 15.45 | 17.27 | 15 | 3 | 3 | 3 | 3 | 3 |
| omc | 135.17 | 139.60 | 20.63 | 9.91 | 19.60 | 15 | 3 | 3 | 3 | 3 | 3 |
| gstack | 121.17 | 126.73 | 23.81 | 12.37 | 21.59 | 15 | 3 | 3 | 3 | 3 | 3 |
- ecc — 149.17/200
- compound — 144.58/200
- pure — 144.25/200
- superpower — 142.29/200
- bmad — 137.17/200
- claudekit — 135.96/200
- omc — 135.17/200
- gstack — 121.17/200
| Tool | opus | grok420 | glm51 | gpt54pro | mimo25pro |
|---|---|---|---|---|---|
| ecc | 151.7 | 162.0 | 161.7 | 126.0 | 162.7 |
| compound | 136.7 | 167.3 | 156.7 | 128.0 | 166.7 |
| pure | 147.3 | 167.0 | 150.7 | 118.0 | 158.3 |
| superpower | 144.0 | 155.7 | 150.3 | 122.3 | 155.7 |
| bmad | 129.0 | 161.0 | 159.7 | 113.0 | 163.7 |
| claudekit | 140.0 | 156.0 | 132.0 | 113.3 | 153.0 |
| omc | 138.0 | 155.7 | 142.3 | 107.3 | 154.7 |
| gstack | 121.7 | 147.3 | 131.0 | 92.3 | 141.3 |
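The per-judge means in the table above are enough for a reader to recompute both aggregate columns. A minimal cross-check for ecc (variable names are illustrative; weights are the pre-registered panel weights):

```python
from statistics import mean

# Per-judge means for ecc, read from the table above (rounded to one decimal).
ecc = {"opus": 151.7, "grok420": 162.0, "glm51": 161.7,
       "gpt54pro": 126.0, "mimo25pro": 162.7}
weights = {"opus": 3, "gpt54pro": 2, "grok420": 1, "glm51": 1, "mimo25pro": 1}

pooled = mean(ecc.values())                    # equal-weight comparator
weighted = sum(weights[j] * m for j, m in ecc.items()) / sum(weights.values())
# Small offsets vs. the summary table (152.80 / 149.17) come from the summary
# being computed on unrounded per-judge means.
```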