# docs/tools/ — Tool profiles
One profile per setup under test. Read the individual files for mechanism detail; this index is for comparison at a glance.
Top-4 CIs overlap on every task — do not read rank-ordering within the top cluster as statistically significant. The mindful and omc tools are author-adjacent (declared in each profile’s disclosure block).
## Leaderboard & upstream

| # | Tool | Upstream | License | Version at run | Profile |
|---|---|---|---|---|---|
| 1 | bmad | bmad-code-org/BMAD-METHOD | MIT | `npx bmad-method@6.3.0` | bmad.md |
| 2 | ecc | affaan-m/everything-claude-code | MIT | plugin 1.10.0 (commit 7eb7c598) | ecc.md |
| 3 | pure | Anthropic Claude Code (stock) | proprietary (CLI) | CLI 2.1.107 | pure.md |
| 4 | gstack | garrytan/gstack | MIT | skills 1.0.0 / VERSION 0.17.0.0 | gstack.md |
| 5 | mindful | quangtran88/mindful-claude | MIT | plugin 1.0.1 | mindful.md |
| 6 | claudekit | carlrannaberg/claudekit | MIT | npm 0.8.x | claudekit.md |
| 7 | compound | EveryInc/compound-engineering-plugin | MIT | 2.65.0 | compound.md |
| 8 | omc | Yeachan-Heo/oh-my-claudecode | MIT | plugin 4.13.x | omc.md |
| 9 | superpower | obra/superpowers | MIT | 5.0.7 (SHA 917e5f53) | superpower.md |
All tools run on the same base executor model: claude-opus-4-6. The only thing that varies between rows is the setup (plugin, skill pack, hook kit, or bare-CLI configuration) wrapping Claude Code.
## Cross-task z-scores
| Tool | feature | bugfix | refactor | z̄ | Note |
|---|---|---|---|---|---|
| bmad | +0.215 | +0.640 | +0.020 | +0.292 | Rank 1 — top-4 tie |
| ecc | +0.091 | +0.680 | −0.039 | +0.244 | Top-4 tie |
| pure | +0.077 | +0.508 | +0.086 | +0.224 | Top-4 tie — baseline in leader CI |
| gstack | +0.158 | +0.165 | −0.111 | +0.071 | Top-4 tie |
| mindful | −0.388 | +0.180 | +0.162 | −0.016 | Bimodal — best on refactor, worst on feature |
| claudekit | −0.113 | −0.346 | +0.057 | −0.134 | |
| compound | −0.082 | −0.487 | +0.070 | −0.167 | Middle-pack; collapses on bugfix |
| omc | −0.108 | −0.599 | +0.125 | −0.194 | |
| superpower | +0.149 | −0.740 | −0.370 | −0.320 | Outlier on bugfix (T2, CI disjoint from T1) — see superpower.md |
Exact CIs live in `results/cross-task-stats.json`. Top-4 ordering inside the overlap band is sampler noise, not signal.
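To re-derive the overlap claim yourself, a minimal sketch is below. It assumes `results/cross-task-stats.json` maps each tool to a mean z-score and a CI pair; the field names (`z_mean`, `ci`) are illustrative guesses, so check the actual file.

```python
import json

# Assumed schema (illustrative): {"bmad": {"z_mean": 0.292, "ci": [lo, hi]}, ...}
# Check results/cross-task-stats.json for the real field names.
with open("results/cross-task-stats.json") as f:
    stats = json.load(f)

# Rank tools by mean cross-task z-score, best first.
ranked = sorted(stats.items(), key=lambda kv: kv[1]["z_mean"], reverse=True)
top4 = ranked[:4]

def overlaps(a, b):
    # Two intervals overlap iff neither lies entirely above the other.
    return a[0] <= b[1] and b[0] <= a[1]

# The "top-4 tie" claim: every pair of top-4 CIs overlaps.
tie = all(overlaps(t1["ci"], t2["ci"])
          for i, (_, t1) in enumerate(top4)
          for _, t2 in top4[i + 1:])
print("top-4 CIs pairwise overlapping:", tie)
```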
## Mechanism taxonomy

Grouping the 9 setups by the primary enhancement mechanism:

| Mechanism | Tools | Works by |
|---|---|---|
| No-addon baseline | pure | Vanilla Claude Code + `--permission-mode plan` — nothing else |
| Reasoning-layer + hooks | mindful | CLAUDE.md cognitive principles + 2 PreToolUse bash hooks; no agents, no MCP |
| Skill registry (model-selected) | superpower | Named skill files the base model chooses to invoke via the Skill tool |
| Skill pack + hook gates | claudekit, gstack | Slash commands, skills, and hooks enforcing gates (typecheck/eslint, freeze/scope-lock, eng-review) |
| Multi-agent orchestrator (sequential) | compound, ecc | Fixed phase pipelines (plan → work → review) with stop-gates between phases |
| Multi-agent orchestrator (role-based) | bmad | Agent personas (PM, architect, dev, QA) with hand-offs between them |
| Meta-orchestrator (delegation-heavy) | omc | Top-level planner that dispatches to specialized subagents/skills |
“Orchestrator” here means the setup adds its own agent/subagent layer on top of Claude Code, not just prompt + tool choices. The compound/ecc/bmad/omc rows are not a ranking of orchestration quality — compound ranks 7, bmad ranks 1. In this corpus, architecture alone under-determines outcome.
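To make “hook gates” concrete: a PreToolUse hook is a script Claude Code runs before a pending tool call, and it can veto that call. The sketch below is illustrative only, not claudekit’s or gstack’s actual hook; it assumes Claude Code’s documented hook contract (pending tool call as JSON on stdin, exit code 2 to block and surface stderr) and gates commits on a TypeScript typecheck.

```python
#!/usr/bin/env python3
# Illustrative PreToolUse gate; NOT the actual claudekit/gstack hook.
# Assumed contract: Claude Code pipes the pending tool call as JSON on
# stdin; exit code 2 blocks the call and feeds stderr back to the model.
import json
import subprocess
import sys

call = json.load(sys.stdin)
cmd = call.get("tool_input", {}).get("command", "")

# Gate only commits; let every other Bash call through untouched.
if "git commit" in cmd:
    check = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True)
    if check.returncode != 0:
        print("typecheck failed: commit blocked by gate", file=sys.stderr)
        sys.exit(2)  # blocking exit code per the assumed contract

sys.exit(0)
```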
## Invocation & planning profile

| Tool | Entry command | Planning layer | Plan-mode flag? | Setup turn? |
|---|---|---|---|---|
| pure | (no prefix) | Claude native plan-mode | yes | no |
| superpower | (no prefix) | skill registry (model chooses) | no | no |
| mindful | (no prefix, after setup turn) | Claude native plan-mode | yes | yes (`/mindful-claude:setup`) |
| bmad | `/bmad-quick-dev` | BMad-Method phases | no | no |
| claudekit | `/ck:cook --auto` | ck-plan skill | no | no |
| ecc | `/everything-claude-code:plan` (feat/refactor), build-fix (bugfix) | own plan skill | no | no |
| gstack | `/investigate` (bugfix only); raw otherwise | `/autoplan`, `/ship` gate | no | no |
| compound | `/compound-engineering:lfg` | `/ce:plan` phase (step 2 of 6) | no | no |
| omc | `/oh-my-claudecode:autopilot` (after setup turn) | internal planner subagent | no | yes (`/omc-setup`) |
Why pure and mindful get `--permission-mode plan`: neither setup ships a planning layer of its own, so the harness enables Claude’s native plan mode. Every other setup brings its own plan step, and stacking Claude’s plan mode on top of a setup’s plan would double-plan. gstack is excluded from plan-mode specifically because its `/ship` skill runs an eng-review gate at the tail — see ../methodology/pipeline.md §1.
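In harness terms, the table collapses to a small dispatch. The sketch below is a hypothetical runner, not the benchmark’s actual code; the `claude -p`, `--continue`, and `--permission-mode plan` flags are real CLI surface, while `run_task` and the dicts are stand-ins (`ENTRY` is abbreviated).

```python
import subprocess

# Hypothetical runner mirroring the table above; NOT the benchmark's harness.
PLAN_MODE = {"pure", "mindful"}  # only setups with no plan layer of their own
SETUP_TURN = {"mindful": "/mindful-claude:setup", "omc": "/omc-setup"}
ENTRY = {"bmad": "/bmad-quick-dev", "claudekit": "/ck:cook --auto"}  # etc.

def run_task(tool: str, task_prompt: str) -> None:
    if tool in SETUP_TURN:
        # The setup-turn tax: one extra prompt round before the task.
        subprocess.run(["claude", "-p", SETUP_TURN[tool]], check=True)
    args = ["claude", "-p"]
    if tool in SETUP_TURN:
        # Keep the setup turn's session; assumed, real harness may differ.
        args.append("--continue")
    if tool in PLAN_MODE:
        args += ["--permission-mode", "plan"]
    prefix = ENTRY.get(tool, "")
    args.append(f"{prefix} {task_prompt}".strip())
    subprocess.run(args, check=True)
```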
## Per-task winners
| Task | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| feature (TD-CD Mode 2) | bmad | gstack | superpower |
| bugfix (near-maturity) | ecc | bmad | pure |
| refactor (scoped) | omc | pure | mindful |
No tool wins on all three tasks. Only bmad (feature, bugfix) and pure (bugfix, refactor) make top-3 on two tasks. Rank-1 on refactor is omc, but on a task where the judge Spearman ρ CIs straddle zero — treat rank-1 there as less meaningful than rank-1 on feature.
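For readers unfamiliar with the caveat: “CIs straddle zero” means the bootstrap interval for the judge-agreement rank correlation contains 0, so agreement between judges on that task is indistinguishable from noise. A minimal sketch with synthetic scores (not the benchmark’s data):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins for two judges' scores over 40 runs; NOT real data.
judge_a = rng.normal(size=40)
judge_b = 0.1 * judge_a + rng.normal(size=40)  # deliberately weak agreement

# Bootstrap the rank correlation.
rhos = []
for _ in range(2000):
    idx = rng.integers(0, len(judge_a), len(judge_a))
    rhos.append(spearmanr(judge_a[idx], judge_b[idx]).correlation)

lo, hi = np.percentile(rhos, [2.5, 97.5])
# If the 95% interval contains 0, rank-1 on that task carries less weight.
print(f"95% CI for rho: [{lo:.2f}, {hi:.2f}]; straddles zero: {lo < 0 < hi}")
```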
## Observed failure modes
Cross-cutting patterns from reading the transcripts (see ../analysis/skill-and-hook-patterns.md for the deep dive):
- Over-orchestration on small tasks — compound, omc. Multi-agent setups pay a high fixed coordination cost that a 30-minute task never amortizes; they suffer when the task doesn’t use their multi-phase capacity.
- Setup-turn tax — mindful, omc. A dedicated setup turn (`/mindful-claude:setup`, `/omc-setup`) before the task costs one round of prompt + context. Hurts on greenfield-feature where planning context is tight.
- `--auto` gate-suppression — claudekit. `/ck:cook --auto` removes the human-review gates the setup relies on; the scripted workflow runs end-to-end, but the “stop on test failure” behavior doesn’t kick in, resulting in 15 failing tests in a green commit.
- Skill activation depends on entry-point wording — superpower. A registry-style skill library whose triggers are generic vocabulary (“bug”, “error”, “root cause”) is not guaranteed to activate on a terse prompt that does not pattern-match (see the sketch after this list). The bugfix harness names the slash commands explicitly (`/superpowers:systematic-debugging`, `/superpowers:verification-before-completion`) to isolate skill content from trigger-phrase sensitivity; see superpower.md and ../analysis/skill-content-effectiveness.md.
- Eng-review gate ≠ code-quality gate — gstack. `/ship` enforces process cleanliness but doesn’t catch all code-quality issues; average performance on `bugfix`/`refactor` despite tier-1 ranking on `feature`.
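The trigger-sensitivity failure above, reduced to its simplest form. The real superpowers mechanism is the model reading skill descriptions and choosing via the Skill tool, not literal keyword matching, but a keyword matcher is enough to show why generic vocabulary misses terse prompts (everything here is hypothetical):

```python
# Hypothetical registry matcher; NOT superpowers' actual activation logic,
# which has the model itself choose skills from their trigger descriptions.
SKILL_TRIGGERS = {
    "systematic-debugging": {"bug", "error", "root cause"},
}

def activated_skills(prompt: str) -> set[str]:
    text = prompt.lower()
    return {name for name, triggers in SKILL_TRIGGERS.items()
            if any(t in text for t in triggers)}

# Verbose prompt happens to contain "bug": the skill fires.
print(activated_skills("fix the flaky retry bug in the queue worker"))
# -> {'systematic-debugging'}

# Terse prompt never says "bug"/"error": the skill silently stays off.
print(activated_skills("tests fail intermittently; make them pass"))
# -> set()
```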
## How to read these profiles
Each profile follows roughly:
- Upstream & identity — repo, version, license, maintainer.
- Performance — per-task z, CI, tier; disclosure of author-adjacency where applicable.
- Mechanism — what actually runs: skills, hooks, sub-agents, MCP, permission layer.
- How this benchmark invoked it — exact prompt, plan-mode flag, setup turn.
- What the transcripts show — per-task: tool-call mix, sub-agent use, commit shape.
- Why it ranked where it did — grounded in transcript evidence.
- Strengths & failure modes — per-task observations.
- References — links back to upstream and to the benchmark artifacts.
The intent is: a reader should be able to close the profile with an accurate mental model of how the tool operates in practice on this corpus, not a re-phrasing of the upstream README.