Skill cost efficiency — output tokens per score point, per line
Generated: 2026-05-14
Source: scripts/audit-sessions.py → results/_audits/session-audit.json
Scope: feature task only (n=3 trials × 8 tools = 24 sessions). The 3-task pipeline (feature, bugfix, refactor) is also mined; per-task drill-downs live in session-audit.md.
This page joins two audit fields:
- `message.usage.output_tokens` — tokens the model generated on the assistant turn (i.e. the cost-bearing tokens billed to output).
- `attributionSkill` — the skill/slash-command that owned that turn (captured from Claude Code’s session JSONL).
Summed per skill, we get “output tokens spent inside skill X”, which we then divide by the trial’s weighted feature score and +Lines to get two efficiency ratios. The pure (no-addons) setup invokes no skills, so its row is empty by construction — it functions as the zero-skill-burn baseline.
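The join above can be sketched in a few lines. This is a minimal illustration, assuming turn records shaped like the two audit fields just described; the real miner in `audit-sessions.py` may structure things differently:

```python
from collections import defaultdict

def skill_efficiency(turns, feature_score, lines_added):
    """Sum output tokens per skill, then form the two efficiency ratios.

    `turns` is a list of dicts mined from a session JSONL; the field
    names mirror the audit fields described above, but the exact record
    shape is an assumption of this sketch.
    """
    per_skill = defaultdict(int)
    for t in turns:
        skill = t.get("attributionSkill")
        if skill is None:  # main-agent-loop turns carry no attribution
            continue
        per_skill[skill] += t["message"]["usage"]["output_tokens"]

    total = sum(per_skill.values())
    tok_per_point = total / feature_score if feature_score else 0
    tok_per_line = total / lines_added if lines_added else 0
    return per_skill, tok_per_point, tok_per_line
```

A pure (no-addons) trial passes through the loop without ever matching a skill, so `total` stays 0 and both ratios come out 0, matching the baseline row below.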
Caveats
- Output tokens ≠ full billed cost. Cache reads and cache creation dominate Claude Opus’s $ figure (see `Cost` column in `feature-cohort.md`); this page reports the verbosity-inside-skill axis only. For $/score see `feature-cohort.md`.
- `attributionSkill` is best-effort. Turns outside any skill (the main agent loop) are excluded from these totals. A skill that delegates to a sub-agent loses attribution on the sub-agent’s turns — sub-agent output appears in cohort totals (see `session-audit.md` § Subagent dispatches) but not in the parent skill’s column.
- n=3 per tool. A single outlier trial can swing a tool’s per-skill mean by ±30 %. Pairs within ~20 % on these ratios should be read as ties.
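The ±30 % sensitivity is easy to see numerically. The token counts below are illustrative, not audit data:

```python
# n=3 sensitivity: one outlier trial moves a per-skill mean sharply.
# These counts are made up for illustration, not taken from the audit.
baseline = [100_000, 100_000, 100_000]
with_outlier = [100_000, 100_000, 190_000]  # one long-running trial

mean_base = sum(baseline) / len(baseline)             # 100,000
mean_outlier = sum(with_outlier) / len(with_outlier)  # 130,000
swing = mean_outlier / mean_base - 1                  # +30% from one trial
print(f"{swing:+.0%}")
```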
Cohort-wide: top skills by output tokens
Aggregated across all 8 tools × 3 tasks × 3 trials = 72 sessions. “Cells” = number of (tool, task) cells the skill appears in (max 3: each addon is exclusive to one tool, so a skill can appear in at most that tool’s three task cells).
| Skill | Tool | Cells | Turns | Output tokens |
|---|---|---|---|---|
| bmad-quick-dev | bmad | 3 | 1,380 | 693,733 |
| superpowers:subagent-driven-development | superpower | 1 | 2,668 | 669,379 |
| oh-my-claudecode:team | omc | 3 | 2,153 | 574,078 |
| oh-my-claudecode:ralplan | omc | 3 | 1,100 | 559,150 |
| cook | claudekit | 3 | 1,707 | 525,514 |
| autoplan | gstack | 2 | 456 | 349,544 |
| compound-engineering:ce-work | compound | 3 | 727 | 327,751 |
| compound-engineering:ce-plan | compound | 3 | 387 | 255,051 |
| ck-plan | claudekit | 3 | 540 | 200,432 |
| ship | gstack | 2 | 336 | 172,074 |
| superpowers:writing-plans | superpower | 1 | 47 | 162,288 |
| superpowers:brainstorming | superpower | 3 | 340 | 140,225 |
| everything-claude-code:plan | ecc | 3 | 427 | 137,366 |
| compound-engineering:ce-code-review | compound | 3 | 261 | 132,251 |
| oh-my-claudecode:hud | omc | 3 | 230 | 95,825 |
Each tool’s primary skill dominates its skill-output total. superpowers:subagent-driven-development is concentrated in a single (tool, task) cell (superpower on feature, 2,668 turns) — it generates more output in one feature trial than bmad-quick-dev does across all three tasks combined.
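The “Cells” column above is just a set count over (tool, task) pairs. A minimal sketch, where `sessions` is a hypothetical list of (tool, task, skill) triples rather than the real miner’s data structure:

```python
from collections import defaultdict

def count_cells(sessions):
    """Count distinct (tool, task) cells each skill appears in.

    `sessions` is a list of (tool, task, skill) triples; repeated
    trials in the same cell are deduplicated by the set.
    """
    cells = defaultdict(set)
    for tool, task, skill in sessions:
        cells[skill].add((tool, task))
    return {skill: len(pairs) for skill, pairs in cells.items()}
```

Because each addon is exclusive to one tool, the tool component of every pair for a given skill is constant, so the count is bounded by the 3 tasks.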
Feature task: tokens per score point, tokens per line
Sorted by feature weighted-mean score (feature-cohort.md § Pooled-mean table). Score and +Lines are the cohort means (3 trials per tool). Skill out tok is the per-trial-summed output across all skills attributed to that tool’s feature trials.
| Tool | Score | +Lines | Skill out tok | Tok / pt | Tok / line | Top skill (output) |
|---|---|---|---|---|---|---|
| ecc | 152.2 | 1,924 | 48,956 | 322 | 25.4 | everything-claude-code:plan (48,956) |
| compound | 148.1 | 960 | 322,815 | 2,180 | 336.3 | compound-engineering:ce-work (132,134) |
| bmad | 147.9 | 521 | 242,106 | 1,637 | 464.7 | bmad-quick-dev (242,106) |
| pure | 146.5 | 821 | 0 | 0 | 0 | (no skills invoked) |
| superpower | 146.2 | 2,706 | 890,464 | 6,091 | 329.1 | superpowers:subagent-driven-development (669,379) |
| omc | 140.0 | 1,837 | 511,141 | 3,651 | 278.2 | oh-my-claudecode:team (290,393) |
| claudekit | 135.5 | 1,940 | 348,218 | 2,570 | 179.5 | cook (274,451) |
| gstack | 127.1 | 1,000 | 194,449 | 1,530 | 194.4 | autoplan (163,706) |
Observations
ecc is an order of magnitude more skill-efficient than any other skill-invoking setup: 25 output tokens per line shipped vs. 180–465 for the rest, and 322 tokens per score point vs. 1,500–6,100. everything-claude-code:plan runs once at the start (427 turns cohort-wide across 3 trials = ~142 turns/trial) and then yields to the main agent loop; ecc’s skill ceremony front-loads into one planning pass rather than running a parallel skill context throughout execution. This shows up as the smallest non-zero Skill out tok column despite one of the largest +Lines counts (1,924, second only to superpower’s 2,706).
superpower’s 6,091 tok/pt is the cohort outlier. superpowers:subagent-driven-development runs 2,668 turns in a single feature trial and emits 669k output tokens, but the corresponding score lift over pure (146.2 vs 146.5) is statistically zero. The skill is doing work, but that work is not converting into judged score on feature. Compare to compound (compound-engineering:ce-work at 132k tokens for the same +148 score band).
pure at 0 skill-burn is a usable null. Pure’s 146.5 score sits in the top-4 with zero attributed skill output. Every tool above 0 in this column is choosing to spend output tokens on skill-internal text (plans, subagent prompts, hooks, status lines) on top of the same base model. The skill-cost-efficient setups are the ones where that spend either (a) stays small (ecc) or (b) translates into measurable lift over the pure baseline. On feature, only ecc clears both bars; superpower and omc spend the most output and score below pure.
Related
- `feature-cohort.md` — full feature-task table with billed $ cost (cache-aware) and per-trial run-time stats.
- `../../results/_audits/session-audit.md` — cohort behavioural fingerprints, top-5 skills per tool, full per-(tool, task) metrics.
- `../../scripts/audit-sessions.py` — miner; the `skill_token_cost` field is at `TrialMetrics.skill_token_cost`.