# docs/tools/ — Tool profiles
One profile per setup under test. Read the individual files for mechanism detail; this index is for comparison at a glance.
Top-4 CIs overlap on every task — do not read rank-ordering within the top cluster as statistically significant. The mindful and omc tools are author-adjacent (declared in each profile’s disclosure block).
## Leaderboard & upstream

| # | Tool | Upstream | License | Version at run | Profile |
|---|---|---|---|---|---|
| 1 | bmad | bmad-code-org/BMAD-METHOD | MIT | `npx bmad-method@6.3.0` | bmad.md |
| 2 | ecc | affaan-m/everything-claude-code | MIT | plugin 1.10.0 (commit 7eb7c598) | ecc.md |
| 3 | pure | Anthropic Claude Code (stock) | proprietary (CLI) | CLI 2.1.107 | pure.md |
| 4 | gstack | garrytan/gstack | MIT | skills 1.0.0 / VERSION 0.17.0.0 | gstack.md |
| 5 | mindful | quangtran88/mindful-claude | MIT | plugin 1.0.1 | mindful.md |
| 6 | claudekit | carlrannaberg/claudekit | MIT | npm 0.8.x | claudekit.md |
| 7 | compound | EveryInc/compound-engineering-plugin | MIT | 2.65.0 | compound.md |
| 8 | omc | Yeachan-Heo/oh-my-claudecode | MIT | plugin 4.13.x | omc.md |
| 9 | superpower | obra/superpowers | MIT | 5.0.7 (SHA 917e5f53) | superpower.md |
All tools run on the same base executor model: claude-opus-4-6. The only thing that varies between rows is the setup (plugin, skill pack, hook kit, or bare-CLI configuration) wrapping Claude Code.
## Cross-task z-scores
| Tool | feature | bugfix | refactor | z̄ | Note |
|---|---|---|---|---|---|
| bmad | +0.215 | +0.640 | +0.020 | +0.292 | Rank 1 — top-4 tie |
| ecc | +0.091 | +0.680 | −0.039 | +0.244 | Top-4 tie |
| pure | +0.077 | +0.508 | +0.086 | +0.224 | Top-4 tie — baseline in leader CI |
| gstack | +0.158 | +0.165 | −0.111 | +0.071 | Top-4 tie |
| mindful | −0.388 | +0.180 | +0.162 | −0.016 | Bimodal — best on refactor, worst on feature |
| claudekit | −0.113 | −0.346 | +0.057 | −0.134 | |
| compound | −0.082 | −0.487 | +0.070 | −0.167 | Middle-pack; collapses on bugfix |
| omc | −0.108 | −0.599 | +0.125 | −0.194 | |
| superpower | +0.149 | −0.740 | −0.370 | −0.320 | Outlier on bugfix (T2, CI disjoint from T1) — see superpower.md |
Exact CIs live in `results/cross-task-stats.json`. Top-4 ordering inside the overlap band is sampler noise, not signal.
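To re-derive the overlap claim yourself, a minimal sketch is below. It assumes `results/cross-task-stats.json` maps each tool to a mean z-score and a CI pair; the field names (`z_mean`, `ci`) are illustrative guesses, so check the actual file.

```python
import json

# Assumed schema (illustrative): {"bmad": {"z_mean": 0.292, "ci": [lo, hi]}, ...}
# Check results/cross-task-stats.json for the real field names.
with open("results/cross-task-stats.json") as f:
    stats = json.load(f)

# Rank tools by mean cross-task z-score, best first.
ranked = sorted(stats.items(), key=lambda kv: kv[1]["z_mean"], reverse=True)
top4 = ranked[:4]

def overlaps(a, b):
    # Two intervals overlap iff neither lies entirely above the other.
    return a[0] <= b[1] and b[0] <= a[1]

# The "top-4 tie" claim: every pair of top-4 CIs overlaps.
tie = all(overlaps(t1["ci"], t2["ci"])
          for i, (_, t1) in enumerate(top4)
          for _, t2 in top4[i + 1:])
print("top-4 CIs pairwise overlapping:", tie)
```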
## Mechanism taxonomy

Grouping the 9 setups by the primary enhancement mechanism:

| Mechanism | Tools | Works by |
|---|---|---|
| No-addon baseline | pure | Vanilla Claude Code + `--permission-mode plan` — nothing else |
| Reasoning-layer + hooks | mindful | CLAUDE.md cognitive principles + 2 PreToolUse bash hooks; no agents, no MCP |
| Skill registry (model-selected) | superpower | Named skill files the base model chooses to invoke via the Skill tool |
| Skill pack + hook gates | claudekit, gstack | Slash commands, skills, and hooks enforcing gates (typecheck/eslint, freeze/scope-lock, eng-review) |
| Multi-agent orchestrator (sequential) | compound, ecc | Fixed phase pipelines (plan → work → review) with stop-gates between phases |
| Multi-agent orchestrator (role-based) | bmad | Agent personas (PM, architect, dev, QA) with hand-offs between them |
| Meta-orchestrator (delegation-heavy) | omc | Top-level planner that dispatches to specialized subagents/skills |
“Orchestrator” here means the setup adds its own agent/subagent layer on top of Claude Code, not just prompt + tool choices. The compound/ecc/bmad/omc rows are not a ranking of orchestration quality — compound ranks 7, bmad ranks 1. In this corpus, architecture alone under-determines outcome.
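To make “hook gates” concrete: a PreToolUse hook is a script Claude Code runs before a pending tool call, and it can veto that call. The sketch below is illustrative only, not claudekit’s or gstack’s actual hook; it assumes Claude Code’s documented hook contract (pending tool call as JSON on stdin, exit code 2 to block and surface stderr) and gates commits on a TypeScript typecheck.

```python
#!/usr/bin/env python3
# Illustrative PreToolUse gate; NOT the actual claudekit/gstack hook.
# Assumed contract: Claude Code pipes the pending tool call as JSON on
# stdin; exit code 2 blocks the call and feeds stderr back to the model.
import json
import subprocess
import sys

call = json.load(sys.stdin)
cmd = call.get("tool_input", {}).get("command", "")

# Gate only commits; let every other Bash call through untouched.
if "git commit" in cmd:
    check = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True)
    if check.returncode != 0:
        print("typecheck failed: commit blocked by gate", file=sys.stderr)
        sys.exit(2)  # blocking exit code per the assumed contract

sys.exit(0)
```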
## Invocation & planning profile

| Tool | Entry command | Planning layer | Plan-mode flag? | Setup turn? |
|---|---|---|---|---|
| pure | (no prefix) | Claude native plan-mode | yes | no |
| superpower | (no prefix) | skill registry (model chooses) | no | no |
| mindful | (no prefix, after setup turn) | Claude native plan-mode | yes | yes (`/mindful-claude:setup`) |
| bmad | `/bmad-quick-dev` | BMad-Method phases | no | no |
| claudekit | `/ck:cook --auto` | ck-plan skill | no | no |
| ecc | `/everything-claude-code:plan` (feat/refactor), build-fix (bugfix) | own plan skill | no | no |
| gstack | `/investigate` (bugfix only); raw otherwise | `/autoplan`, `/ship` gate | no | no |
| compound | `/compound-engineering:lfg` | `/ce:plan` phase (step 2 of 6) | no | no |
| omc | `/oh-my-claudecode:autopilot` (after setup turn) | internal planner subagent | no | yes (`/omc-setup`) |
Why pure and mindful get `--permission-mode plan`: neither setup ships a planning layer of its own, so the harness enables Claude’s native plan mode. Every other setup brings its own plan step, and stacking Claude’s plan mode on top of a setup’s plan would double-plan. gstack is excluded from plan-mode specifically because its `/ship` skill runs an eng-review gate at the tail — see ../methodology/pipeline.md §1.
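In harness terms, the table collapses to a small dispatch. The sketch below is a hypothetical runner, not the benchmark’s actual code; the `claude -p`, `--continue`, and `--permission-mode plan` flags are real CLI surface, while `run_task` and the dicts are stand-ins (`ENTRY` is abbreviated).

```python
import subprocess

# Hypothetical runner mirroring the table above; NOT the benchmark's harness.
PLAN_MODE = {"pure", "mindful"}  # only setups with no plan layer of their own
SETUP_TURN = {"mindful": "/mindful-claude:setup", "omc": "/omc-setup"}
ENTRY = {"bmad": "/bmad-quick-dev", "claudekit": "/ck:cook --auto"}  # etc.

def run_task(tool: str, task_prompt: str) -> None:
    if tool in SETUP_TURN:
        # The setup-turn tax: one extra prompt round before the task.
        subprocess.run(["claude", "-p", SETUP_TURN[tool]], check=True)
    args = ["claude", "-p"]
    if tool in SETUP_TURN:
        # Keep the setup turn's session; assumed, real harness may differ.
        args.append("--continue")
    if tool in PLAN_MODE:
        args += ["--permission-mode", "plan"]
    prefix = ENTRY.get(tool, "")
    args.append(f"{prefix} {task_prompt}".strip())
    subprocess.run(args, check=True)
```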
## Per-task winners
| Task | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| feature (TD-CD Mode 2) | bmad | gstack | superpower |
| bugfix (near-maturity) | ecc | bmad | pure |
| refactor (scoped) | omc | pure | mindful |
No tool wins on all three tasks. Only bmad (feature, bugfix) and pure (bugfix, refactor) make top-3 on two tasks. Rank-1 on refactor is omc, but on a task where the judge Spearman ρ CIs straddle zero — treat rank-1 there as less meaningful than rank-1 on feature.
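For readers unfamiliar with the caveat: “CIs straddle zero” means the bootstrap interval for the judge-agreement rank correlation contains 0, so agreement between judges on that task is indistinguishable from noise. A minimal sketch with synthetic scores (not the benchmark’s data):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins for two judges' scores over 40 runs; NOT real data.
judge_a = rng.normal(size=40)
judge_b = 0.1 * judge_a + rng.normal(size=40)  # deliberately weak agreement

# Bootstrap the rank correlation.
rhos = []
for _ in range(2000):
    idx = rng.integers(0, len(judge_a), len(judge_a))
    rhos.append(spearmanr(judge_a[idx], judge_b[idx]).correlation)

lo, hi = np.percentile(rhos, [2.5, 97.5])
# If the 95% interval contains 0, rank-1 on that task carries less weight.
print(f"95% CI for rho: [{lo:.2f}, {hi:.2f}]; straddles zero: {lo < 0 < hi}")
```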
## Observed failure modes
Cross-cutting patterns from reading the transcripts (see ../analysis/skill-and-hook-patterns.md for the deep dive):
- Over-orchestration on small tasks — compound, omc. Multi-agent setups pay a high fixed coordination cost that a 30-minute task never amortizes; they suffer when the task doesn’t use their multi-phase capacity.
- Setup-turn tax — mindful, omc. A dedicated setup turn (`/mindful-claude:setup`, `/omc-setup`) before the task costs one round of prompt + context. Hurts on greenfield-feature where planning context is tight.
- `--auto` gate-suppression — claudekit. `/ck:cook --auto` removes the human-review gates the setup relies on; the scripted workflow runs end-to-end, but the “stop on test failure” behavior doesn’t kick in, resulting in 15 failing tests in a green commit.
- Skill activation depends on entry-point wording — superpower. A registry-style skill library whose triggers are generic vocabulary (“bug”, “error”, “root cause”) is not guaranteed to activate on a terse prompt that does not pattern-match (see the sketch after this list). The bugfix harness names the slash commands explicitly (`/superpowers:systematic-debugging`, `/superpowers:verification-before-completion`) to isolate skill content from trigger-phrase sensitivity; see superpower.md and ../analysis/skill-content-effectiveness.md.
- Eng-review gate ≠ code-quality gate — gstack. `/ship` enforces process cleanliness but doesn’t catch all code-quality issues; average performance on `bugfix`/`refactor` despite tier-1 ranking on `feature`.
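The trigger-sensitivity failure above, reduced to its simplest form. The real superpowers mechanism is the model reading skill descriptions and choosing via the Skill tool, not literal keyword matching, but a keyword matcher is enough to show why generic vocabulary misses terse prompts (everything here is hypothetical):

```python
# Hypothetical registry matcher; NOT superpowers' actual activation logic,
# which has the model itself choose skills from their trigger descriptions.
SKILL_TRIGGERS = {
    "systematic-debugging": {"bug", "error", "root cause"},
}

def activated_skills(prompt: str) -> set[str]:
    text = prompt.lower()
    return {name for name, triggers in SKILL_TRIGGERS.items()
            if any(t in text for t in triggers)}

# Verbose prompt happens to contain "bug": the skill fires.
print(activated_skills("fix the flaky retry bug in the queue worker"))
# -> {'systematic-debugging'}

# Terse prompt never says "bug"/"error": the skill silently stays off.
print(activated_skills("tests fail intermittently; make them pass"))
# -> set()
```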
## How to read these profiles
Each profile follows roughly:
- Upstream & identity — repo, version, license, maintainer.
- Performance — per-task z, CI, tier; disclosure of author-adjacency where applicable.
- Mechanism — what actually runs: skills, hooks, sub-agents, MCP, permission layer.
- How this benchmark invoked it — exact prompt, plan-mode flag, setup turn.
- What the transcripts show — per-task: tool-call mix, sub-agent use, commit shape.
- Why it ranked where it did — grounded in transcript evidence.
- Strengths & failure modes — per-task observations.
- References — links back to upstream and to the benchmark artifacts.
The intent is: a reader should be able to close the profile with an accurate mental model of how the tool operates in practice on this corpus, not a re-phrasing of the upstream README.