superpower

Superpowers is a skill-pack plugin for Claude Code by Jesse Vincent (obra). It ships a library of named skills — brainstorming, writing-plans, test-driven-development, systematic-debugging, verification-before-completion, and roughly two dozen more — that the base model is expected to invoke via the Skill tool when appropriate. In this benchmark it ranks 9 of 9 overall (z̄ = −0.320), driven by a bugfix gap (z = −0.740) and a refactor dip (z = −0.370); on feature it is competitive with the leaders (z = +0.149). See ../analysis/skill-content-effectiveness.md for the wording-level mechanism behind the bugfix gap.

Source. obra/superpowers; skills live in obra/superpowers-skills (MIT), marketplace manifest in obra/superpowers-marketplace. The benchmark installs the plugin pinned at version 5.0.7 (SHA 917e5f53). Setup writes two lines to settings.json (enabledPlugins + marketplace entry). No wrapper command, no hook, no MCP server: a pure skill registry.
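
As a rough illustration of the two settings.json entries, the sketch below merges an enabledPlugins flag and a marketplace entry into an existing settings dictionary. The plugin identifier, the marketplace name, and the `extraKnownMarketplaces` key are assumptions made for this example; the source only states that setup writes "enabledPlugins + marketplace entry", and the benchmark's real setup script may differ.

```python
import json

# Assumed identifiers for illustration only; not copied from the benchmark setup.
PLUGIN_ID = "superpowers@superpowers-marketplace"
MARKETPLACE = "superpowers-marketplace"
MARKETPLACE_KEY = "extraKnownMarketplaces"  # assumed key name for the marketplace entry


def add_superpowers(settings: dict) -> dict:
    """Return a copy of `settings` with the plugin enabled and its marketplace registered."""
    merged = dict(settings)
    # Keep any plugins that were already enabled and add this one.
    merged["enabledPlugins"] = {**settings.get("enabledPlugins", {}), PLUGIN_ID: True}
    # Register the marketplace repository named in the Source paragraph.
    merged[MARKETPLACE_KEY] = {
        **settings.get(MARKETPLACE_KEY, {}),
        MARKETPLACE: {"source": {"source": "github", "repo": "obra/superpowers-marketplace"}},
    }
    return merged


if __name__ == "__main__":
    print(json.dumps(add_superpowers({}), indent=2))
```

The function returns a new dictionary rather than mutating its input, so an existing settings file can be compared before and after the merge.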

Benchmark invocation. On feature and refactor, the harness passes the raw task prompt unchanged (PROMPT="$SHARED_TASK"); skill activation is up to the base model. On bugfix, the harness pins two slash-command triggers — /superpowers:systematic-debugging at session start and /superpowers:verification-before-completion as a completion gate — so skill activation is isolated from base-model trigger-phrase sensitivity. The scores below are under those operating conditions.

What the plugin actually installs

A named skill registry plus short SKILL.md files loaded into context. Relevant skills for this benchmark: superpowers:brainstorming, writing-plans, executing-plans, test-driven-development, systematic-debugging, verification-before-completion, dispatching-parallel-agents, and using-superpowers (the meta-skill that tells the model to consult the registry before responding). On feature and refactor none are auto-triggered by the harness — the only control surface is the skill descriptions themselves and whatever the base model has learned about when to invoke them. On bugfix the harness names the triggers directly.

What the benchmark measured

Feature — rank 3. z = +0.149, 581 lines across 8 files, 90/90 tests passing, 10 ESLint errors. The session fired superpowers:brainstorming exactly once, then ran 125 turns of a standard build using TaskCreate / TaskUpdate for internal tracking. No parallel agents, no plan document, no TDD skill. The brainstorm appears to have disambiguated the Mode 2 CD Batch requirements before coding. Mid-pack.

Refactor — rank 9 (last). z = −0.370, 123 added / 42 removed across 12 files, 61/61 tests passing, 9 ESLint errors. One superpowers:brainstorming invocation, then 110 turns with two Explore subagents. All tests pass; judges placed it below T2. Per-task CI [141.8, 155.7] overlaps the middle tier and §4.3 flags refactor as noise-dominated — read this rank as weak evidence.

Bugfix — rank 9 cluster. Mean 156.08 / 200, z = −0.740, 95% CI [151.2, 161.0]. Tier T2 ({claudekit, compound, omc, superpower}). Both t1 and t2 fired 2 Skill calls each (systematic-debugging at session start, verification-before-completion at finish). Hard gates: 5/5 PASS (both trials). Scope files touched: 2. Diff sizes: +176 (t1), +125 (t2).

The harness prompt for bugfix explicitly names both skills:

/superpowers:systematic-debugging

<SHARED_TASK>

When you believe the fix is done, run /superpowers:verification-before-completion before claiming success.

This isolates skill quality from the base model’s trigger-phrase sensitivity.
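
The two operating conditions can be sketched as a single prompt builder. This is a minimal illustration, not the harness's code: the function name `build_prompt` and the exact whitespace between the slash-command, the task text, and the completion gate are assumptions.

```python
def build_prompt(task: str, shared_task: str) -> str:
    """Compose the entry prompt for one benchmark task (hypothetical helper)."""
    if task == "bugfix":
        # Bugfix pins both skills: one trigger at session start, one completion gate.
        return (
            "/superpowers:systematic-debugging\n\n"
            f"{shared_task}\n\n"
            "When you believe the fix is done, run "
            "/superpowers:verification-before-completion before claiming success."
        )
    # Feature and refactor: the raw task prompt passes through unchanged.
    return shared_task
```

For example, `build_prompt("feature", text)` returns `text` untouched, while `build_prompt("bugfix", text)` wraps it in the two named triggers.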

Why it ranked 9/9

superpower is the only setup whose bugfix CI is fully disjoint from the cluster above it (T2 vs. T1), even with skills explicitly invoked. T2 on bugfix still trails T1 by 12–23 pts. On feature (z = +0.149) it is competitive with the leaders; on refactor (z = −0.370) it trails T1 inside the noise-dominated task (§4.4). The combined 9/9 rank is the cross-task consequence, not a statement about any single task.
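
The overall figure is consistent with an unweighted mean of the three per-task z-scores. The sketch below checks that arithmetic using only the numbers quoted in this document; treating z̄ as a plain mean is an assumption that happens to reproduce −0.320, and the function name is illustrative.

```python
from statistics import mean

# Per-task z-scores as reported in this document.
Z_BY_TASK = {"feature": 0.149, "bugfix": -0.740, "refactor": -0.370}


def overall_z(z_by_task: dict) -> float:
    """Unweighted mean of per-task z-scores (assumed aggregation)."""
    return mean(z_by_task.values())


if __name__ == "__main__":
    print(f"overall z = {overall_z(Z_BY_TASK):.3f}")  # overall z = -0.320
    # The task with the lowest z contributes most to the negative mean.
    print(min(Z_BY_TASK, key=Z_BY_TASK.get))  # bugfix
```

Bugfix alone accounts for roughly three quarters of the negative total (−0.740 of −0.961), which is why the combined rank says little about feature work.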

Failure modes

Activation dependence on entry-point wording. The one-shot harness — used on feature and refactor — passes a terse prompt with no language that pattern-matches the skills’ trigger vocabulary. The model can read the prompt as a direct execution request and skip the registry. On bugfix the benchmark controls for this by naming the slash-commands in the prompt; without that, activation is not guaranteed. The wording-level mechanism is described in ../analysis/skill-content-effectiveness.md.
Skill output lands in T2 on bugfix, not T1. Once activated, the skills run a clean systematic-debugging → verification-before-completion pipeline that clears all five hard gates and keeps the diff scoped. Judges still score it 12–23 pts below ecc/bmad/pure. The remaining gap to T1 is skill-content headroom, not an activation failure.

Honest positioning

Superpowers is a well-constructed skill library. On feature-class work it costs nothing and adds one useful reflective pause. The 9/9 combined rank is driven by bugfix (z = −0.740) and refactor (z = −0.370, inside the noise band). Read the remaining gap to T1 leaders as “skills work, but mid-pack on this corpus” rather than “the plugin is broken.” The base-model dependency on trigger-phrase pattern-matching for activation is a real operating-condition caveat for anyone deploying this in a one-shot harness: if the entry prompt does not name the skills, the registry may stay dormant.

Cross-reference: ../analysis/skill-content-effectiveness.md — wording patterns that drive top-5 bugfix performance, including caveats.

Observed in trial timelines

On bugfix, where the harness explicitly names the skills as slash-commands, superpower records the highest Bash volume of any benchmarked tool (mean 51, peak 63) and the highest test count (mean 24.5), yet still scores last. Verification effort and judged quality decouple here: more test runs do not produce higher rubric scores. On feature, where no skill is forced, superpower lands at the cohort mean for both Bash (30.5) and tests (16) and scores rank 3.
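
The means and peak quoted above can be reproduced from the per-trial cards further down. The sketch below copies the Bash and test-run counts from those cards; it covers superpower's own trials only, so it cannot confirm the cross-tool comparison ("highest", "cohort mean"), which depends on data not shown here. The names `TRIALS` and `summarize` are illustrative.

```python
from statistics import mean

# (Bash calls, test-run calls) per trial, copied from the trial cards in this document.
TRIALS = {
    "feature": [(28, 10), (47, 20), (30, 24), (17, 10)],
    "bugfix": [(39, 16), (63, 33)],
}


def summarize(trials):
    """Return the mean and peak Bash volume and the mean test-run count."""
    bash = [b for b, _ in trials]
    tests = [t for _, t in trials]
    return {"bash_mean": mean(bash), "bash_peak": max(bash), "tests_mean": mean(tests)}


if __name__ == "__main__":
    for task, trials in TRIALS.items():
        print(task, summarize(trials))
```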

Detail: see the per-trial timeline files linked below.

Trial timelines

Per-trial event timelines auto-extracted from session-logs/*.jsonl — skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash usage:
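
A minimal sketch of that kind of extraction is shown below. It assumes each JSONL line is an event whose `message.content` is a list of blocks and that tool calls appear as blocks with `"type": "tool_use"` and a `name` field; the actual log schema and the benchmark's extractor are not shown in this document, so treat the field names as assumptions.

```python
import json
from collections import Counter


def tool_counts(lines):
    """Count tool_use blocks by tool name across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partial lines
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-string content carries no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts
```

A file can be passed directly, for example `with open(path, encoding="utf-8") as fh: tool_counts(fh)`. Under the assumed schema, the `Bash` and `Skill` entries of the result would correspond to the Bash and Skills counters on each card.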


Trials per task: Feature 4, Bugfix 2, Refactor 2. The cards below appear in that order.

Feature trials

t1 02:25 → 02:38 UTC · 12 min

2 commits · 8 files · +581

“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”

New files 4
Edits 6
Bash 28
Skills 1
Todos 7

Bash command mix · 28 calls

other 12
tests 10
inspection 4
lint/format 1
git ops 1

Skill activations (1)

superpowers:brainstorming at 02:25

New files created (4)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.util.ts

t2 15:58 → 16:19 UTC · 20 min

2 commits · 12 files · +443

Agents 1
New files 5
Edits 16
Bash 47
Skills 1

Bash command mix · 47 calls

other 20
tests 20
lint/format 3
inspection 2
typecheck 1
git ops 1

Skill activations (1)

superpowers:writing-plans at 15:58

Subagents dispatched (1)

Explore · Map Mode 1 implementation surface at 15:59

Subagent transcripts (1)

agent-a0ba463d07aa… — We need to implement Mode 2 CD Batch for TD-CD by analogy to the existing Mode 1 implementation in t… [Bash×30, Read×24]

New files created (5)

libs/core/src/domain/savings-cd/cd-aging.util.spec.ts
libs/core/src/domain/savings-cd/cd-aging.util.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
libs/core/src/port/service/td-cd-batch-resolver.port.ts

t3 04:34 → 04:59 UTC · 24 min

2 commits · 5 files · +435

“<local-command-caveat>Caveat: The messages below were generated by the user while running local commands. DO NOT respond to these messages or otherwise consider them in your response unless the user explicitly asks you t…”

Agents 3
New files 2
Edits 1
Bash 30

Bash command mix · 30 calls

tests 24
inspection 2
git ops 2
typecheck 1
other 1

Subagents dispatched (3)

Explore · Explore Mode 1 implementation at 04:35
Explore · Find CD batch/holding entities at 04:38
Explore · Find barrel exports and imports at 04:49

Subagent transcripts (3)

agent-a1119fd3217d… — I need to find: 1. The barrel/index file that exports `TDCDMode1Strategy` — search for files that ex… [Read×22, Bash×21, Grep×12, Glob×5]
agent-a21776702c68… — Explore the Mode 1 CD implementation in this NestJS/TypeORM monorepo thoroughly. I need to understan… [Read×30, Bash×21, Grep×3, Glob×2]
agent-a314d11136da… — I need to understand how CD batches and holdings work in this codebase for implementing Mode 2 per-b… [Read×22, Grep×10, Bash×9, Glob×4]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

t4 09:11 → 09:30 UTC · 18 min

2 commits · 3 files · +663

“<command-message>superpowers:using-superpowers</command-message> <command-name>/superpowers:using-superpowers</command-name> <command-args>Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD…”

Agents 2
New files 2
Edits 1
Bash 17

Bash command mix · 17 calls

tests 10
git ops 4
other 3

Subagents dispatched (2)

Explore · Explore Mode 1 implementation at 09:11
Explore · Find strategy wiring patterns at 09:15

Subagent transcripts (2)

agent-a5ba2b3cc91b… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/superpower-t4, find how TDCDMode1Strate… [Read×18, Bash×17, Glob×7, Grep×7]
agent-acd862899c2a… — Explore the Mode 1 TD-CD implementation thoroughly in this repo. I need to understand: 1. The file l… [Read×26, Bash×21]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Bugfix trials

t1 15:51 → 16:05 UTC · 13 min

2 commits · 2 files · +176

“/superpowers:systematic-debugging A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected behaviour, and the definition of done. Your job: investigate the code…”

Agents 1
New files 1
Edits 4
Bash 39
Skills 2

Bash command mix · 39 calls

tests 16
other 12
install/build 5
git ops 4
inspection 2

Skill activations (2)

superpowers:systematic-debugging at 15:51
superpowers:verification-before-completion at 16:04

Subagents dispatched (1)

Explore · Explore savings-cd batch selection at 15:52

Subagent transcripts (1)

agent-a3c23db3c9db… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/bugfix/superpower-t1, I need to underst… [Read×15, Grep×14, Glob×3]

New files created (1)

superpower-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

t2 15:52 → 16:10 UTC · 17 min

2 commits · 2 files · +125

Agents 1
New files 1
Edits 3
Bash 63
Skills 2

Bash command mix · 63 calls

tests 33
other 13
install/build 10
git ops 5
inspection 2

Skill activations (2)

superpowers:systematic-debugging at 15:52
superpowers:verification-before-completion at 16:08

Subagents dispatched (1)

Explore · Find savings CD batch selection code at 15:52

Subagent transcripts (1)

agent-add4945f9f50… — I'm debugging SHP-2376: deposits into near-maturity Savings CD schemes get stuck because the batch s… [Read×17, Grep×7, Glob×5, Bash×4]

New files created (1)

superpower-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

Refactor trials

t1 03:50 → 04:00 UTC · 10 min

2 commits · 12 files · +123

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

Agents 2
New files 1
Edits 22
Bash 10
Skills 1

Bash command mix · 10 calls

other 4
git ops 3
tests 2
inspection 1

Skill activations (1)

superpowers:brainstorming — Refactor TD-CD Mode 2 scheme-to-batch binding: move cdBatchId from Scheme to TSSchemeSetting, create a projection DTO to… at 03:50

Subagents dispatched (2)

Explore · Explore Scheme model/entity at 03:50
Explore · Explore CDBatch and strategy at 03:50

Subagent transcripts (2)

agent-a269ed25cb87… — In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/superpower-t1, find all files related to: 1… [Bash×15, Read×15, Grep×5]
agent-a2bc34724118… — In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/superpower-t1, find all files related to th… [Read×16, Grep×4, Glob×3, Bash×3]

New files created (1)

superpower-t1/libs/core/src/model/cd-batch-info.model.ts

t2 06:31 → 06:46 UTC · 14 min

2 commits · 12 files · +178

Agents 3
New files 5
Edits 16
Bash 12
Skills 1
Todos 6

Bash command mix · 12 calls

other 5
git ops 3
tests 2
inspection 2

Skill activations (1)

superpowers:brainstorming — Refactor TD-CD Mode 2 scheme-to-batch binding: move cdBatchId from Scheme to TSSchemeSetting, create a batch projection … at 06:31

Subagents dispatched (3)

Explore · Explore Scheme model and entity at 06:32
Explore · Explore strategy port and CDBatch at 06:32
Explore · Explore savings-cd usecase layer at 06:32

Subagent transcripts (3)

agent-a80631d6edfd… — Thoroughness: very thorough In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/sup… [Read×21, Glob×5, Grep×5, Bash×4]
agent-ae297b6a4fd1… — Thoroughness: very thorough In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/sup… [Read×12, Bash×4, Glob×4, Grep×4]
agent-aee5973681a3… — Thoroughness: very thorough In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/sup… [Read×13, Bash×11, Glob×3, Grep×3]

New files created (4)

superpower-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
superpower-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
superpower-t2/libs/core/src/model/cd-batch-info.model.ts
superpower-t2/libs/core/src/port/service/td-cd-mode-strategy.port.ts