refactor — Per-Task Aggregation

Generated: 2026-05-12T10:00:22Z

Methodology

Caveats / threats to validity

Aggregate Scores per Tool

Tool        Weighted Mean  Pooled Mean  Pooled σ  within_σ  between_σ   N  n(opus)  n(grok420)  n(glm51)  n(gpt54pro)  n(mimo25pro)
pure               182.46       184.47     12.84      3.34      13.38  15        3           3         3            3             3
bmad               181.29       182.80     14.76      5.65      15.03  15        3           3         3            3             3
superpower         180.04       182.27     14.77      4.40      15.36  15        3           3         3            3             3
ecc                179.54       182.13     15.20      5.55      15.43  15        3           3         3            3             3
claudekit          178.25       180.27     18.92      4.37      19.89  15        3           3         3            3             3
compound           173.46       177.20     17.81      5.54      18.48  15        3           3         3            3             3
gstack             170.79       174.40     21.86     14.02      19.11  15        3           3         3            3             3
omc                169.17       174.00     18.92      8.75      18.52  15        3           3         3            3             3

Ranking (Weighted Mean)

  1. pure — 182.46/200
  2. bmad — 181.29/200
  3. superpower — 180.04/200
  4. ecc — 179.54/200
  5. claudekit — 178.25/200
  6. compound — 173.46/200
  7. gstack — 170.79/200
  8. omc — 169.17/200

Per-Judge Means

Tool         opus  grok420  glm51  gpt54pro  mimo25pro
pure        187.7    184.3  191.7     162.0      196.7
bmad        190.0    185.0  190.7     156.3      192.0
superpower  187.0    187.3  190.7     155.0      191.3
ecc         185.3    187.7  189.7     155.0      193.0
claudekit   189.3    183.3  185.3     146.0      197.3
compound    177.7    182.3  184.0     146.3      195.7
gstack      176.7    184.0  182.3     141.0      188.0
omc         170.0    183.7  184.3     143.3      188.7
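
The per-judge means above appear to be enough to recompute two of the aggregate columns. The sketch below is an inference from the published numbers, not a documented method, since the Methodology section is empty. It assumes that every judge contributes the same number of scores (n = 3), so the Pooled Mean equals the plain mean of the five judge means, and that between_σ is the sample standard deviation (n - 1 denominator) of those judge means. The names PER_JUDGE_MEANS, summarize, and SUMMARY are illustrative. The Weighted Mean is not reproduced because the judge weights are not stated in this report.

```python
from statistics import mean, stdev

# Per-judge means copied from the table above.
# Column order: opus, grok420, glm51, gpt54pro, mimo25pro.
PER_JUDGE_MEANS = {
    "pure":       [187.7, 184.3, 191.7, 162.0, 196.7],
    "bmad":       [190.0, 185.0, 190.7, 156.3, 192.0],
    "superpower": [187.0, 187.3, 190.7, 155.0, 191.3],
    "ecc":        [185.3, 187.7, 189.7, 155.0, 193.0],
    "claudekit":  [189.3, 183.3, 185.3, 146.0, 197.3],
    "compound":   [177.7, 182.3, 184.0, 146.3, 195.7],
    "gstack":     [176.7, 184.0, 182.3, 141.0, 188.0],
    "omc":        [170.0, 183.7, 184.3, 143.3, 188.7],
}


def summarize(judge_means):
    """Return (pooled mean, between-judge sigma) for one tool.

    Assumption: equal n per judge, so the pooled mean is the unweighted
    mean of the judge means. Assumption: between-judge sigma is the
    sample standard deviation of the judge means.
    """
    return mean(judge_means), stdev(judge_means)


# Recomputed aggregates keyed by tool name.
SUMMARY = {tool: summarize(vals) for tool, vals in PER_JUDGE_MEANS.items()}

if __name__ == "__main__":
    for tool, (pooled, between) in SUMMARY.items():
        print(f"{tool:<10}  {pooled:7.2f}  {between:6.2f}")
```

For pure this gives roughly 184.48 and 13.39, against the reported 184.47 and 13.38. Differences of this size are consistent with the per-judge means having been rounded to one decimal place before publication.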