feature — Per-Task Aggregation

Generated: 2026-05-12T10:00:19Z

Methodology

Caveats / threats to validity

Aggregate Scores per Tool

Tool        Weighted Mean  Pooled Mean  Pooled σ  within_σ  between_σ  N   n(opus)  n(grok420)  n(glm51)  n(gpt54pro)  n(mimo25pro)
ecc         149.17         152.80       15.15     4.72      15.65      15  3        3           3         3            3
compound    144.58         151.07       17.83     7.08      17.88      15  3        3           3         3            3
pure        144.25         148.27       17.79     5.07      18.54      15  3        3           3         3            3
superpower  142.29         145.60       14.26     6.41      13.87      15  3        3           3         3            3
bmad        137.17         145.27       21.97     5.77      22.91      15  3        3           3         3            3
claudekit   135.96         138.87       21.14     15.45     17.27      15  3        3           3         3            3
omc         135.17         139.60       20.63     9.91      19.60      15  3        3           3         3            3
gstack      121.17         126.73       23.81     12.37     21.59      15  3        3           3         3            3

Ranking (Weighted Mean)

  1. ecc — 149.17/200
  2. compound — 144.58/200
  3. pure — 144.25/200
  4. superpower — 142.29/200
  5. bmad — 137.17/200
  6. claudekit — 135.96/200
  7. omc — 135.17/200
  8. gstack — 121.17/200

Per-Judge Means

Tool        opus   grok420  glm51  gpt54pro  mimo25pro
ecc         151.7  162.0    161.7  126.0     162.7
compound    136.7  167.3    156.7  128.0     166.7
pure        147.3  167.0    150.7  118.0     158.3
superpower  144.0  155.7    150.3  122.3     155.7
bmad        129.0  161.0    159.7  113.0     163.7
claudekit   140.0  156.0    132.0  113.3     153.0
omc         138.0  155.7    142.3  107.3     154.7
gstack      121.7  147.3    131.0  92.3      141.3
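
The per-judge means can be used to cross-check two columns of the aggregate table. The sketch below is a consistency check inferred from the numbers, not the report's documented method: it assumes that, with equal n per judge (3 each), the Pooled Mean equals the unweighted average of the five judge means, and that between_σ is the sample standard deviation (n - 1 denominator) of those judge means. The names PER_JUDGE_MEANS, summarize, and SUMMARY are illustrative. Because the inputs are rounded to one decimal, results may differ from the table in the second decimal place. The Weighted Mean is not reproduced because the weights are not stated here.

```python
from statistics import mean, stdev

# Per-judge means copied from the "Per-Judge Means" table.
# Judge order: opus, grok420, glm51, gpt54pro, mimo25pro.
PER_JUDGE_MEANS = {
    "ecc":        [151.7, 162.0, 161.7, 126.0, 162.7],
    "compound":   [136.7, 167.3, 156.7, 128.0, 166.7],
    "pure":       [147.3, 167.0, 150.7, 118.0, 158.3],
    "superpower": [144.0, 155.7, 150.3, 122.3, 155.7],
    "bmad":       [129.0, 161.0, 159.7, 113.0, 163.7],
    "claudekit":  [140.0, 156.0, 132.0, 113.3, 153.0],
    "omc":        [138.0, 155.7, 142.3, 107.3, 154.7],
    "gstack":     [121.7, 147.3, 131.0, 92.3, 141.3],
}


def summarize(judge_means):
    """Return (pooled_mean, between_sigma) for one tool.

    Assumptions (inferred, not stated in the report):
    - every judge contributed the same number of runs, so the pooled
      mean is the plain average of the judge means;
    - between_sigma is the sample standard deviation of the judge means.
    """
    return mean(judge_means), stdev(judge_means)


# Recomputed (pooled mean, between-judge sigma) per tool.
SUMMARY = {tool: summarize(values) for tool, values in PER_JUDGE_MEANS.items()}

if __name__ == "__main__":
    print(f"{'Tool':<12}{'Pooled Mean':>12}{'between_σ':>12}")
    for tool, (pooled, between) in SUMMARY.items():
        print(f"{tool:<12}{pooled:>12.2f}{between:>12.2f}")
```

Under these assumptions the recomputed values agree with the Pooled Mean and between_σ columns to within rounding, which suggests the two tables are mutually consistent.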