Skill cost efficiency — output tokens per score point, per line

Generated: 2026-05-14
Source: scripts/audit-sessions.py, results/_audits/session-audit.json
Scope: feature task only (n=3 trials × 8 tools = 24 sessions). The 3-task pipeline (feature, bugfix, refactor) is also mined; per-task drill-downs live in session-audit.md.

This page joins two audit fields:

Summed per skill, we get “output tokens spent inside skill X”, which we then divide by the trial’s weighted feature score and by +Lines to get two efficiency ratios. The pure (no-addons) setup invokes no skills, so its row is all zeros by construction; it functions as the zero-skill-burn baseline.
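The summing and dividing described above can be sketched roughly as follows. This is a minimal illustration, not the logic of scripts/audit-sessions.py: the `(skill_name, output_tokens)` turn shape, the function names, and the example numbers are all assumptions made for the sketch.

```python
from collections import defaultdict


def skill_output_totals(turns):
    """Sum output tokens per skill for one trial.

    `turns` is a hypothetical list of (skill_name, output_tokens) pairs.
    Turns that ran outside any skill carry skill_name None and are skipped.
    """
    totals = defaultdict(int)
    for skill, output_tokens in turns:
        if skill is not None:
            totals[skill] += output_tokens
    return dict(totals)


def efficiency_ratios(skill_tokens, score, lines_added):
    """Return (tokens per score point, tokens per added line)."""
    total = sum(skill_tokens.values())
    tok_per_point = total / score if score else 0.0
    tok_per_line = total / lines_added if lines_added else 0.0
    return tok_per_point, tok_per_line


# Illustrative numbers only, not taken from the audit.
example_turns = [("plan", 1200), (None, 500), ("plan", 800), ("review", 400)]
totals = skill_output_totals(example_turns)
ratios = efficiency_ratios(totals, score=150.0, lines_added=1000)
print(totals, ratios)
```

A setup that invokes no skills yields an empty totals dict, so both ratios come out as zero, which matches how the pure baseline is treated on this page.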

Caveats

Cohort-wide: top skills by output tokens

Aggregated across all 8 tools × 3 tasks × 3 trials = 72 sessions. “Cells” = number of (tool, task) cells the skill appears in (max 3 since each addon is exclusive to one tool).

| Skill | Tool | Cells | Turns | Output tokens |
|---|---|---:|---:|---:|
| bmad-quick-dev | bmad | 3 | 1,380 | 693,733 |
| superpowers:subagent-driven-development | superpower | 1 | 2,668 | 669,379 |
| oh-my-claudecode:team | omc | 3 | 2,153 | 574,078 |
| oh-my-claudecode:ralplan | omc | 3 | 1,100 | 559,150 |
| cook | claudekit | 3 | 1,707 | 525,514 |
| autoplan | gstack | 2 | 456 | 349,544 |
| compound-engineering:ce-work | compound | 3 | 727 | 327,751 |
| compound-engineering:ce-plan | compound | 3 | 387 | 255,051 |
| ck-plan | claudekit | 3 | 540 | 200,432 |
| ship | gstack | 2 | 336 | 172,074 |
| superpowers:writing-plans | superpower | 1 | 47 | 162,288 |
| superpowers:brainstorming | superpower | 3 | 340 | 140,225 |
| everything-claude-code:plan | ecc | 3 | 427 | 137,366 |
| compound-engineering:ce-code-review | compound | 3 | 261 | 132,251 |
| oh-my-claudecode:hud | omc | 3 | 230 | 95,825 |

Each tool’s primary skill dominates its skill-output total. superpowers:subagent-driven-development is concentrated in a single (tool, task) cell (superpower on feature, 2,668 turns), yet within that one cell it generates nearly as much output (669,379 tokens) as bmad-quick-dev does across all three tasks combined (693,733).
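One way to picture how a cohort-wide row (Cells, Turns, Output tokens) could be assembled is the sketch below. The record layout (`tool`, `task`, `skill`, `turns`, `output_tokens`) is a hypothetical shape chosen for illustration; the actual schema of session-audit.json is not shown on this page, and the records here are made up.

```python
from collections import defaultdict


def aggregate_by_skill(records):
    """Roll per-session skill usage up to one row per skill.

    Each record is a hypothetical dict with keys
    tool, task, skill, turns, output_tokens.
    Returns (skill, cells, turns, output_tokens) tuples.
    """
    rows = defaultdict(lambda: {"cells": set(), "turns": 0, "output_tokens": 0})
    for r in records:
        row = rows[r["skill"]]
        # A cell is a distinct (tool, task) pair, however many trials hit it.
        row["cells"].add((r["tool"], r["task"]))
        row["turns"] += r["turns"]
        row["output_tokens"] += r["output_tokens"]
    table = [
        (skill, len(row["cells"]), row["turns"], row["output_tokens"])
        for skill, row in rows.items()
    ]
    # Largest output first, matching the sort order of the table above.
    return sorted(table, key=lambda t: t[3], reverse=True)


# Made-up records for illustration; not audit data.
records = [
    {"tool": "a", "task": "feature", "skill": "a:plan", "turns": 10, "output_tokens": 900},
    {"tool": "a", "task": "feature", "skill": "a:plan", "turns": 5, "output_tokens": 300},
    {"tool": "a", "task": "bugfix", "skill": "a:plan", "turns": 4, "output_tokens": 200},
    {"tool": "b", "task": "feature", "skill": "b:work", "turns": 50, "output_tokens": 5000},
]
summary = aggregate_by_skill(records)
for row in summary:
    print(row)
```

Counting cells with a set means two trials in the same (tool, task) cell add turns and tokens but do not raise the cell count, which is how a skill can show a large token total with Cells = 1.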

Feature task: tokens per score point, tokens per line

Sorted by feature weighted-mean score (feature-cohort.md § Pooled-mean table). Score and +Lines are the cohort means (3 trials per tool). Skill out tok is the per-trial-summed output across all skills attributed to that tool’s feature trials.

| Tool | Score | +Lines | Skill out tok | Tok / pt | Tok / line | Top skill (output) |
|---|---:|---:|---:|---:|---:|---|
| ecc | 152.2 | 1,924 | 48,956 | 322 | 25.4 | everything-claude-code:plan (48,956) |
| compound | 148.1 | 960 | 322,815 | 2,180 | 336.3 | compound-engineering:ce-work (132,134) |
| bmad | 147.9 | 521 | 242,106 | 1,637 | 464.7 | bmad-quick-dev (242,106) |
| pure | 146.5 | 821 | 0 | 0 | 0 | (no skills invoked) |
| superpower | 146.2 | 2,706 | 890,464 | 6,091 | 329.1 | superpowers:subagent-driven-development (669,379) |
| omc | 140.0 | 1,837 | 511,141 | 3,651 | 278.2 | oh-my-claudecode:team (290,393) |
| claudekit | 135.5 | 1,940 | 348,218 | 2,570 | 179.5 | cook (274,451) |
| gstack | 127.1 | 1,000 | 194,449 | 1,530 | 194.4 | autoplan (163,706) |
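The two ratio columns can be re-derived from the Score, +Lines, and Skill out tok columns. The sketch below copies those three columns from the table above and recomputes the ratios; the rounding (whole tokens per point, one decimal per line) is an assumption inferred from how the table is printed.

```python
# (tool, score, lines_added, skill_output_tokens), copied from the table above.
FEATURE_ROWS = [
    ("ecc", 152.2, 1924, 48956),
    ("compound", 148.1, 960, 322815),
    ("bmad", 147.9, 521, 242106),
    ("pure", 146.5, 821, 0),
    ("superpower", 146.2, 2706, 890464),
    ("omc", 140.0, 1837, 511141),
    ("claudekit", 135.5, 1940, 348218),
    ("gstack", 127.1, 1000, 194449),
]


def table_ratios(score, lines_added, tokens):
    """Return (Tok / pt, Tok / line) rounded the way the table prints them."""
    return round(tokens / score), round(tokens / lines_added, 1)


recomputed = {
    tool: table_ratios(score, lines_added, tokens)
    for tool, score, lines_added, tokens in FEATURE_ROWS
}

for tool, (tok_per_pt, tok_per_line) in recomputed.items():
    print(f"{tool:<11}{tok_per_pt:>7,}{tok_per_line:>8}")
```

Every recomputed pair agrees with the published Tok / pt and Tok / line columns, so the ratios are plain quotients of the cohort-mean columns rather than means of per-trial ratios.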

Observations

ecc is roughly 5× to 19× more skill-efficient than any other skill-using setup. It spends 25 output tokens per line shipped vs. 180–465 for the rest (7× to 18×), and 322 tokens per score point vs. 1,500–6,100 (5× to 19×). everything-claude-code:plan runs once at the start (427 turns cohort-wide across 3 cells, about 142 turns per cell, or roughly 47 per trial at 3 trials per cell) and then yields to the main agent loop; ecc’s skill ceremony front-loads into one planning pass rather than running a parallel skill context throughout execution. This shows up as the smallest nonzero Skill out tok value despite the third-largest +Lines (1,924, behind superpower’s 2,706 and claudekit’s 1,940).
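To put a number on the gap between ecc and the other skill-using setups, the sketch below divides each tool's ratios by ecc's. The values are copied from the feature table; the variable names are illustrative, and pure is left out because its ratios are zero.

```python
# tool -> (Tok / pt, Tok / line), copied from the feature table.
EFFICIENCY = {
    "ecc": (322, 25.4),
    "compound": (2180, 336.3),
    "bmad": (1637, 464.7),
    "superpower": (6091, 329.1),
    "omc": (3651, 278.2),
    "claudekit": (2570, 179.5),
    "gstack": (1530, 194.4),
}

base_pt, base_line = EFFICIENCY["ecc"]

# How many times more skill output each tool spends than ecc, per unit.
multiples = {
    tool: (round(pt / base_pt, 1), round(line / base_line, 1))
    for tool, (pt, line) in EFFICIENCY.items()
    if tool != "ecc"
}

pt_values = [m[0] for m in multiples.values()]
line_values = [m[1] for m in multiples.values()]
pt_range = (min(pt_values), max(pt_values))
line_range = (min(line_values), max(line_values))

print("tokens per point, multiple of ecc:", pt_range)
print("tokens per line, multiple of ecc:", line_range)
```

The closest competitors are gstack on tokens per point and claudekit on tokens per line; the widest gaps are superpower and bmad respectively.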

superpower’s 6,091 tok/pt is the cohort outlier. superpowers:subagent-driven-development runs 2,668 turns in a single feature trial and emits 669k output tokens, but superpower shows no score lift over pure: it scores 0.3 points lower (146.2 vs. 146.5). The skill is doing work, but that work is not converting into judged score on feature. Compare compound, where compound-engineering:ce-work spends 132k tokens and lands in the same 146–148 score band (148.1).

pure at 0 skill-burn is a usable null. Pure’s 146.5 score ranks fourth of eight with zero attributed skill output. Every tool above 0 in this column is choosing to spend output tokens on skill-internal text (plans, subagent prompts, hooks, status lines) on top of the same base model. The skill-cost-efficient setups are the ones where that spend (a) stays small (ecc) or (b) translates into measurable lift over the pure baseline. On feature, only ecc meets both conditions; compound and bmad edge past pure by 1.6 and 1.4 points at roughly 5× to 7× ecc’s skill output, while superpower and omc spend the most output and score below pure.
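The baseline comparison can be made explicit by subtracting pure's score from each tool's score and setting the difference beside the skill output spent. The sketch below uses the Score and Skill out tok columns from the feature table; it reports raw differences of cohort means and makes no claim about significance at 3 trials per tool.

```python
# tool -> (score, skill_output_tokens), copied from the feature table.
FEATURE = {
    "ecc": (152.2, 48956),
    "compound": (148.1, 322815),
    "bmad": (147.9, 242106),
    "pure": (146.5, 0),
    "superpower": (146.2, 890464),
    "omc": (140.0, 511141),
    "claudekit": (135.5, 348218),
    "gstack": (127.1, 194449),
}

pure_score = FEATURE["pure"][0]

# Score difference against the zero-skill baseline.
lift = {
    tool: round(score - pure_score, 1)
    for tool, (score, _tokens) in FEATURE.items()
    if tool != "pure"
}

above_pure = sorted(tool for tool, delta in lift.items() if delta > 0)
top_spenders = sorted(
    (tool for tool in FEATURE if tool != "pure"),
    key=lambda tool: FEATURE[tool][1],
    reverse=True,
)[:2]

for tool, delta in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool:<11}{delta:>+7.1f}{FEATURE[tool][1]:>10,}")
```

Read this way, the two largest spenders both have negative lift, and the only tool that combines positive lift with a five-digit token spend is ecc.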