docs/tools/ — Tool profiles

One profile per setup under test. Read the individual files for mechanism detail; this index is for comparison at a glance.

Top-4 CIs overlap on every task — do not read rank-ordering within the top cluster as statistically significant. The mindful and omc tools are author-adjacent (declared in each profile’s disclosure block).


Leaderboard & upstream

| Rank | Tool | Upstream | License | Version at run | Profile |
|------|------|----------|---------|----------------|---------|
| 1 | bmad | bmad-code-org/BMAD-METHOD | MIT | npx bmad-method@6.3.0 | bmad.md |
| 2 | ecc | affaan-m/everything-claude-code | MIT | plugin 1.10.0 (commit 7eb7c598) | ecc.md |
| 3 | pure | Anthropic Claude Code (stock) | proprietary (CLI) | CLI 2.1.107 | pure.md |
| 4 | gstack | garrytan/gstack | MIT | skills 1.0.0 / VERSION 0.17.0.0 | gstack.md |
| 5 | mindful | quangtran88/mindful-claude | MIT | plugin 1.0.1 | mindful.md |
| 6 | claudekit | carlrannaberg/claudekit | MIT | npm 0.8.x | claudekit.md |
| 7 | compound | EveryInc/compound-engineering-plugin | MIT | 2.65.0 | compound.md |
| 8 | omc | Yeachan-Heo/oh-my-claudecode | MIT | plugin 4.13.x | omc.md |
| 9 | superpower | obra/superpowers | MIT | 5.0.7 (SHA 917e5f53) | superpower.md |

All tools run on the same base executor model: claude-opus-4-6. The only thing that varies between rows is the setup (plugin, skill pack, hook kit, or bare-CLI configuration) wrapping Claude Code.


Cross-task z-scores

| Tool | feature | bugfix | refactor | mean | Note |
|------|---------|--------|----------|------|------|
| bmad | +0.215 | +0.640 | +0.020 | +0.292 | Rank 1 — top-4 tie |
| ecc | +0.091 | +0.680 | −0.039 | +0.244 | Top-4 tie |
| pure | +0.077 | +0.508 | +0.086 | +0.224 | Top-4 tie — baseline in leader CI |
| gstack | +0.158 | +0.165 | −0.111 | +0.071 | Top-4 tie |
| mindful | −0.388 | +0.180 | +0.162 | −0.016 | Bimodal — its best task is refactor, its worst is feature |
| claudekit | −0.113 | −0.346 | +0.057 | −0.134 | |
| compound | −0.082 | −0.487 | +0.070 | −0.167 | Middle-pack; collapses on bugfix |
| omc | −0.108 | −0.599 | +0.125 | −0.194 | |
| superpower | +0.149 | −0.740 | −0.370 | −0.320 | Outlier on bugfix (T2, CI disjoint from T1) — see superpower.md |

Exact CIs are in results/cross-task-stats.json. Top-4 ordering inside the overlap band is sampler noise, not signal.
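The rightmost numeric column tracks the unweighted mean of the three per-task z-scores. That interpretation is this sketch's assumption (the authoritative definitions live in results/cross-task-stats.json), and the published values are rounded to three decimals, so recomputing from the table can drift by ±0.001:

```python
# Per-task z-scores (feature, bugfix, refactor) and the published cross-task
# value, both copied from the table above. Treating the published value as an
# unweighted mean is an assumption of this sketch, not a documented definition.
Z = {
    "bmad":       ((+0.215, +0.640, +0.020), +0.292),
    "ecc":        ((+0.091, +0.680, -0.039), +0.244),
    "pure":       ((+0.077, +0.508, +0.086), +0.224),
    "gstack":     ((+0.158, +0.165, -0.111), +0.071),
    "mindful":    ((-0.388, +0.180, +0.162), -0.016),
    "claudekit":  ((-0.113, -0.346, +0.057), -0.134),
    "compound":   ((-0.082, -0.487, +0.070), -0.167),
    "omc":        ((-0.108, -0.599, +0.125), -0.194),
    "superpower": ((+0.149, -0.740, -0.370), -0.320),
}

def cross_task_mean(per_task):
    """Unweighted mean over (feature, bugfix, refactor)."""
    return sum(per_task) / len(per_task)

# Published figures are rounded to 3 decimals, so allow 0.001 of drift.
for tool, (per_task, published) in Z.items():
    assert abs(cross_task_mean(per_task) - published) <= 0.001, tool
```

Every row checks out within rounding, which is why the note above points to the JSON file rather than the table for exact values.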


Mechanism taxonomy

Grouping the 9 setups by the primary enhancement mechanism:

| Mechanism | Tools | Works by |
|-----------|-------|----------|
| No-addon baseline | pure | Vanilla Claude Code + --permission-mode plan — nothing else |
| Reasoning-layer + hooks | mindful | CLAUDE.md cognitive principles + 2 PreToolUse bash hooks; no agents, no MCP |
| Skill registry (model-selected) | superpower | Named skill files the base model chooses to invoke via the Skill tool |
| Skill pack + hook gates | claudekit, gstack | Slash commands, skills, and hooks enforcing gates (typecheck/eslint, freeze/scope-lock, eng-review) |
| Multi-agent orchestrator (sequential) | compound, ecc | Fixed phase pipelines (plan → work → review) with stop-gates between phases |
| Multi-agent orchestrator (role-based) | bmad | Agent personas (PM, architect, dev, QA) with hand-offs between them |
| Meta-orchestrator (delegation-heavy) | omc | Top-level planner that dispatches to specialized subagents/skills |

“Orchestrator” here means the setup adds its own agent/subagent layer on top of Claude Code, not just prompts and tool choices. The orchestrator rows (compound, ecc, bmad, omc) are not a ranking of orchestration quality — compound ranks 7 while bmad ranks 1. Architecture alone does not determine outcome in this corpus.
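The spread is easy to see if the taxonomy is treated as data. A minimal sketch (tool-to-mechanism strings paraphrase the table, ranks come from the leaderboard; the `MECHANISM`/`RANK` names are this sketch's own):

```python
# Primary enhancement mechanism per tool, paraphrasing the taxonomy table.
MECHANISM = {
    "pure":       "no-addon baseline",
    "mindful":    "reasoning-layer + hooks",
    "superpower": "skill registry (model-selected)",
    "claudekit":  "skill pack + hook gates",
    "gstack":     "skill pack + hook gates",
    "compound":   "multi-agent orchestrator (sequential)",
    "ecc":        "multi-agent orchestrator (sequential)",
    "bmad":       "multi-agent orchestrator (role-based)",
    "omc":        "meta-orchestrator (delegation-heavy)",
}

# Leaderboard position per tool, from the table at the top of this index.
RANK = {"bmad": 1, "ecc": 2, "pure": 3, "gstack": 4, "mindful": 5,
        "claudekit": 6, "compound": 7, "omc": 8, "superpower": 9}

# The orchestrator-style setups span nearly the whole leaderboard:
orch_ranks = sorted(RANK[t] for t, m in MECHANISM.items() if "orchestrator" in m)
assert orch_ranks == [1, 2, 7, 8]  # bmad, ecc, compound, omc
```

Ranks 1, 2, 7, and 8 inside a single mechanism family is the concrete form of "architecture alone does not determine outcome".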


Invocation & planning profile

| Tool | Entry command | Planning layer | Plan-mode flag? | Setup turn? |
|------|---------------|----------------|-----------------|-------------|
| pure | (no prefix) | Claude native plan-mode | yes | no |
| superpower | (no prefix) | skill registry (model chooses) | no | no |
| mindful | (no prefix, after setup turn) | Claude native plan-mode | yes | yes (/mindful-claude:setup) |
| bmad | /bmad-quick-dev | BMad-Method phases | no | no |
| claudekit | /ck:cook --auto | ck-plan skill | no | no |
| ecc | /everything-claude-code:plan (feat/refactor), build-fix (bugfix) | own plan skill | no | no |
| gstack | /investigate (bugfix only); raw otherwise | /autoplan, /ship gate | no | no |
| compound | /compound-engineering:lfg | /ce:plan phase (step 2 of 6) | no | no |
| omc | /oh-my-claudecode:autopilot (after setup turn) | internal planner subagent | no | yes (/omc-setup) |

Why pure and mindful get --permission-mode plan: neither brings a planning layer of its own. Every other setup has its own plan step, and stacking Claude’s native plan mode on top of it would double-plan. gstack is excluded from plan-mode specifically because its /ship skill runs an eng-review gate at the tail — see ../methodology/pipeline.md §1.
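The flag rule reduces to a single predicate. A hypothetical sketch (the per-tool planning layers paraphrase the table above; `OWN_PLANNING` and `claude_flags` are this sketch's names, not benchmark config keys):

```python
# Planning layer each setup brings on its own. None means the setup relies on
# Claude Code's native plan mode instead. Values paraphrase the table above.
OWN_PLANNING = {
    "pure":       None,
    "mindful":    None,
    "superpower": "skill registry (model chooses)",
    "bmad":       "BMad-Method phases",
    "claudekit":  "ck-plan skill",
    "ecc":        "own plan skill",
    "gstack":     "/autoplan + /ship gate",
    "compound":   "/ce:plan phase",
    "omc":        "internal planner subagent",
}

def claude_flags(tool):
    """Extra CLI flags for a setup: stack Claude's native plan mode only when
    the setup has no planner of its own (otherwise it would double-plan)."""
    return ["--permission-mode", "plan"] if OWN_PLANNING[tool] is None else []

assert claude_flags("mindful") == ["--permission-mode", "plan"]
assert claude_flags("gstack") == []  # never stack plan mode over /autoplan + /ship
```

The predicate reproduces the "Plan-mode flag?" column exactly: only pure and mindful get the flag.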


Per-task winners

| Task | Rank 1 | Rank 2 | Rank 3 |
|------|--------|--------|--------|
| feature (TD-CD Mode 2) | bmad | gstack | superpower |
| bugfix (near-maturity) | ecc | bmad | pure |
| refactor (scoped) | omc | pure | mindful |

No tool wins on all three tasks. pure is the only tool to make top-3 on two tasks (bugfix, refactor). Rank-1 on refactor is omc, but on a task where judge Spearman ρ CIs straddle zero — treat rank-1 there as less meaningful than rank-1 on feature.


Observed failure modes

Cross-cutting patterns from reading the transcripts are catalogued in ../analysis/skill-and-hook-patterns.md (the deep dive); per-tool observations live in each profile’s strengths-and-failure-modes section.


How to read these profiles

Each profile follows roughly:

  1. Upstream & identity — repo, version, license, maintainer.
  2. Performance — per-task z, CI, tier; disclosure of author-adjacency where applicable.
  3. Mechanism — what actually runs: skills, hooks, sub-agents, MCP, permission layer.
  4. How this benchmark invoked it — exact prompt, plan-mode flag, setup turn.
  5. What the transcripts show — per-task: tool-call mix, sub-agent use, commit shape.
  6. Why it ranked where it did — grounded in transcript evidence.
  7. Strengths & failure modes — per-task observations.
  8. References — links back to upstream and to the benchmark artifacts.

The intent: a reader should be able to close a profile with an accurate mental model of how the tool operates in practice on this corpus, not with a re-phrasing of the upstream README.