bugfix — Per-Task Aggregation

Generated: 2026-05-12T10:00:20Z

Methodology

Caveats / threats to validity

Aggregate Scores per Tool

Tool        Weighted Mean  Pooled Mean  Pooled σ  within_σ  between_σ   N  n(opus)  n(grok420)  n(glm51)  n(gpt54pro)  n(mimo25pro)
claudekit          184.33       186.33     10.81      8.01       8.11  15        3           3         3            3             3
ecc                181.38       183.73     12.03      7.92       9.94  15        3           3         3            3             3
pure               175.00       177.00     14.05     13.46       8.72  15        3           3         3            3             3
bmad               173.83       176.73     15.75     12.05      12.49  15        3           3         3            3             3
superpower         168.50       170.27     13.11      7.67      12.03  15        3           3         3            3             3
compound           167.33       169.67     12.51      6.48      12.01  15        3           3         3            3             3
omc                165.08       167.40     21.54     17.67      15.61  15        3           3         3            3             3
gstack             159.71       163.53     16.17      7.68      15.70  15        3           3         3            3             3

Ranking (Weighted Mean)

  1. claudekit — 184.33/200
  2. ecc — 181.38/200
  3. pure — 175.00/200
  4. bmad — 173.83/200
  5. superpower — 168.50/200
  6. compound — 167.33/200
  7. omc — 165.08/200
  8. gstack — 159.71/200

Per-Judge Means

Tool         opus  grok420  glm51  gpt54pro  mimo25pro
claudekit   184.3    191.0  186.0     174.3      196.0
ecc         182.7    191.3  187.3     167.0      190.3
pure        176.3    180.3  181.3     162.3      184.7
bmad        174.7    176.3  184.0     157.7      191.0
superpower  173.7    172.0  177.3     149.3      179.0
compound    170.3    173.3  173.0     149.7      182.0
omc         170.7    173.3  166.0     142.3      184.7
gstack      161.0    173.3  167.0     138.0      178.3
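
As a consistency check on the two tables, the sketch below recomputes two of the aggregate columns from the per-judge means. The report does not state how these columns were derived, so both formulas are assumptions inferred from the numbers: Pooled Mean is taken to be the unweighted average of the five judge means (equivalent to averaging all 15 scores, since every judge contributes n = 3), and between_σ is taken to be the sample standard deviation (n - 1 denominator) of those five means. The names PER_JUDGE, REPORTED, and summarize are illustrative only. The Weighted Mean column is not reproduced because the judge weights are not given.

```python
from statistics import mean, stdev

# Per-judge means copied from the "Per-Judge Means" table.
# Column order: opus, grok420, glm51, gpt54pro, mimo25pro.
PER_JUDGE = {
    "claudekit":  [184.3, 191.0, 186.0, 174.3, 196.0],
    "ecc":        [182.7, 191.3, 187.3, 167.0, 190.3],
    "pure":       [176.3, 180.3, 181.3, 162.3, 184.7],
    "bmad":       [174.7, 176.3, 184.0, 157.7, 191.0],
    "superpower": [173.7, 172.0, 177.3, 149.3, 179.0],
    "compound":   [170.3, 173.3, 173.0, 149.7, 182.0],
    "omc":        [170.7, 173.3, 166.0, 142.3, 184.7],
    "gstack":     [161.0, 173.3, 167.0, 138.0, 178.3],
}

# (Pooled Mean, between_σ) as reported in the aggregate table.
REPORTED = {
    "claudekit":  (186.33, 8.11),
    "ecc":        (183.73, 9.94),
    "pure":       (177.00, 8.72),
    "bmad":       (176.73, 12.49),
    "superpower": (170.27, 12.03),
    "compound":   (169.67, 12.01),
    "omc":        (167.40, 15.61),
    "gstack":     (163.53, 15.70),
}


def summarize(judge_means):
    """Return (unweighted mean, sample std) of the per-judge means.

    Assumed formulas; the report's own methodology is not stated.
    """
    return mean(judge_means), stdev(judge_means)


for tool, judge_means in PER_JUDGE.items():
    pooled, between = summarize(judge_means)
    rep_pooled, rep_between = REPORTED[tool]
    print(
        f"{tool:<10} pooled={pooled:7.2f} (reported {rep_pooled:.2f})  "
        f"between={between:5.2f} (reported {rep_between:.2f})"
    )
```

Because the inputs are judge means rounded to one decimal place, the recomputed values can differ from the reported ones in the second decimal. Under these assumptions the gpt54pro column is the lowest for every tool, which accounts for much of the between-judge spread.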