pure
The no-addon baseline. Vanilla Claude Code with plan mode forced on and nothing else.
Upstream
- Product: Anthropic Claude Code.
- CLI version captured in transcripts: 2.1.107 (Claude Code).
- Model: claude-opus-4-6.
- Docs: https://docs.anthropic.com/en/docs/claude-code/overview (plan mode is described in the permissions documentation).
There is no third-party repository. “pure” is the CLI out of the box.
Performance
| Task | Cohort mean | pure mean | z | Rank |
|---|---|---|---|---|
| feature | 119.54 | 121.53 | +0.077 | 5 / 9 |
| bugfix | 168.29 | 176.67 | +0.508 | 3 / 9 |
| refactor | 160.15 | 161.08 | +0.086 | 3 / 9 |
Overall z̄ = +0.224, aggregate rank 3 / 9, rank-sum 11. In the feature and bugfix task-tiers pure sits inside the top cluster (feature tier-1 with bmad/gstack/superpower/ecc; bugfix tier-1 with ecc/bmad). This is the most important single result in the benchmark: the no-addon baseline lands in the same tier as the best-performing setups, inside the CI of the leaders.
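The aggregate figures follow from the three table rows. A minimal sketch of the arithmetic, assuming z̄ is the unweighted mean of the per-task z values and rank-sum is the plain sum of the per-task ranks; `mean_z` and `rank_sum` are hypothetical helper names, not functions from the benchmark scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the table; not part of the benchmark repo.

# Unweighted mean of the arguments, printed with an explicit sign and 3 decimals.
mean_z() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%+.3f\n", s / NR }'
}

# Sum of integer ranks.
rank_sum() {
  local total=0 r
  for r in "$@"; do
    total=$(( total + r ))
  done
  echo "$total"
}

mean_z 0.077 0.508 0.086   # feature, bugfix, refactor z values from the table
rank_sum 5 3 3             # per-task ranks from the table
```

With the table's values this prints `+0.224` and `11`, matching the figures quoted above.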
Mechanism
There is no mechanism beyond Claude Code itself. The only non-default is --permission-mode plan, which:
- starts the session in plan mode (read / analyse only; no file writes, no shell side-effects),
- requires an explicit ExitPlanMode tool call to transition to write-mode,
- forces the model to produce a written plan before any edit can happen.
Everything else — TodoWrite, the general-purpose Agent sub-agent, Bash, Read, Grep, Glob, Edit, Write — is Claude Code’s stock tool set. No plugins, no skills, no hooks, no MCP servers, no custom slash commands.
How this benchmark invoked it
config/pure-t1/ is deliberately empty of customisation. setup-tool-config.sh line 159 is a no-op:
“Pure Claude Code — no external tools to install. Config dir stays empty (no plugins, no MCP, no skills, no hooks).”
config/pure-t1/settings.json contains one key: "skipDangerousModePermissionPrompt": true.
manual-bench.sh guards pure (and mindful) with a plan-mode flag:
PLAN_MODE_FLAG=""
case "$TOOL" in
  pure|mindful)
    PLAN_MODE_FLAG="--permission-mode plan"
    ;;
esac
The per-tool prompt is the shared task text verbatim: PROMPT="$SHARED_TASK". No slash command, no setup preamble. The CLI is launched with claude --model claude-opus-4-6 --dangerously-skip-permissions --permission-mode plan.
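The guard and the launch line can be combined into one self-contained sketch. The function names `plan_mode_flag` and `build_launch_cmd` are hypothetical and do not come from `manual-bench.sh`; only flags quoted in this section are used, and the command is printed rather than executed. How tools other than pure and mindful are launched is an assumption here.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from manual-bench.sh.

# Print the plan-mode flag for guarded tools, or an empty line otherwise.
plan_mode_flag() {
  case "$1" in
    pure|mindful) echo "--permission-mode plan" ;;
    *)            echo "" ;;
  esac
}

# Print (do not run) the launch command for a given tool name.
# Assumption: unguarded tools share the same base command without the flag.
build_launch_cmd() {
  local tool="$1" flag cmd
  flag="$(plan_mode_flag "$tool")"
  cmd="claude --model claude-opus-4-6 --dangerously-skip-permissions"
  if [ -n "$flag" ]; then
    cmd="$cmd $flag"
  fi
  echo "$cmd"
}

build_launch_cmd pure
```

For `pure` this prints the launch line quoted above, ending in `--permission-mode plan`.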
What the transcripts look like
Feature (TD-CD Mode 2, 193 messages, ~25 min): 24 × Bash, 16 × Read, 14 × Grep, 3 × Write, 3 × Edit, 2 × Agent, 1 × ExitPlanMode. The ExitPlanMode payload is a ~1,500-word structured plan with an axis-by-axis Mode 1 vs Mode 2 table, an exhaustive file list, and a quote of the dev-tech-spec promise that “Mode 2 only requires a new TDCDMode2Strategy — zero service changes.” The model reads the PRD, surveys Mode 1, commits to a minimal strategy-only patch, exits plan mode, writes the strategy and tests. Result: 372 insertions, 0 tsc errors, 78/78 tests passing.
Bugfix (180 messages in the main session; a 13-message warm-up session pre-compacted context): 17 × Bash, 10 × Read, 10 × Grep, 2 × Write, 2 × Edit, 2 × Agent, 1 × ExitPlanMode in the main session. Three sub-agents total used for scoped investigation (1 dispatched in the warm-up, 2 in the main session — all read-only). 88/103 tests pass post-fix — notably 15 failing tests remain, but the fix is scoped and the judges still rated it rank-3.
Refactor (241 messages): 22 × Edit, 15 × Read, 10 × Bash, 7 × Grep, 4 × Agent, 1 × Write, 1 × ExitPlanMode. Edit-heavy shape: the planning cost is paid once, then the patch is mechanical. 9 files changed, 121 insertions / 72 deletions, 61/61 tests passing.
Zero TodoWrite across all three sessions. The plan document emitted via ExitPlanMode is the todo list; it’s produced once, consulted implicitly by the model’s own context, and never re-materialised as a structured task object.
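Tallies like the ones above can be reproduced from a session log with standard text tools. The sketch below assumes each tool call appears in the JSONL as a compact `"type":"tool_use", ..., "name":"<Tool>"` block; that is an assumption about the log format, not something this page confirms, and `tally_tools` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumes compact JSON (no spaces after colons) and
# tool names made only of letters and underscores, as with the stock tools.

# Read JSONL on stdin and print "<count> × <Tool>", most frequent first.
tally_tools() {
  grep -o '"type":"tool_use"[^{}]*"name":"[A-Za-z_]*"' \
    | sed 's/.*"name":"\([A-Za-z_]*\)"$/\1/' \
    | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{ printf "%d × %s\n", $1, $2 }'
}

# Demo on three synthetic log lines (not real transcript data).
printf '%s\n' \
  '{"message":{"content":[{"type":"tool_use","id":"a","name":"Bash","input":{}}]}}' \
  '{"message":{"content":[{"type":"tool_use","id":"b","name":"Read","input":{}}]}}' \
  '{"message":{"content":[{"type":"tool_use","id":"c","name":"Bash","input":{}}]}}' \
  | tally_tools
```

On the synthetic input the demo prints `2 × Bash` followed by `1 × Read`. A JSON-aware tool would be more robust than pattern matching if the real log layout differs.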
Why the baseline performs so well
Three observations from the transcripts:
- Plan mode supplies most of what the other setups supply. It enforces the investigate-then-write discipline. Every pure run begins with an uninterrupted read-only investigation phase that ends in a committed written plan. Setups that wrap Claude Code typically add exactly this: a plan document, a scope boundary, a “do not edit before planning” guardrail. When the CLI is already doing it, the marginal value of an addon shrinks.
- Opus 4.6 is sufficient on its own for benchmark-sized tasks. The tool mix is unremarkable: Read, Grep, Bash, Edit. Sub-agent use is sparse (2, 2–3, and 4 Agent calls in the feature, bugfix, and refactor sessions described above) and scoped to investigation, never delegation of implementation. The model is not stretching for extra scaffolding because it doesn’t need it.
- The shape of a strong baseline run is short-but-structured. One plan artefact, one ExitPlanMode transition, then direct execution. No todo list churn, no orchestration overhead, no preamble tokens spent on setup ceremony. The feature run writes 372 lines in ~25 minutes; the refactor run lands 9 files in 241 messages that are mostly small Edits.
Strengths and failure modes
Strengths. Low variance in rank (5, 3, 3). Lowest ceremony cost per trial. No setup-introduced regressions: the only failures it can incur are the underlying model’s own mistakes. Self-preference bias is mild: pure is penalised by the opus judge on feature (−7.3) and neutral-positive elsewhere, so there is no evidence the baseline is being inflated by judge affinity.
Failure modes. No enforced verification step: the bugfix left 15 failing tests committed. No automatic scope broadening — if the task needs exploration beyond one plan pass, pure won’t initiate a second. No memory across runs; every session is cold. Lint hygiene is worse than task outcome (6–11 ESLint errors per trial) because nothing pushes the model to clean up after itself.
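The missing verification step could be as small as a gate on the test summary. A minimal sketch, assuming the result is available as a `passed/total` string such as the `88/103` quoted above; `failing_count` and `verify_gate` are hypothetical names, and nothing like this exists in the benchmark harness or in Claude Code itself:

```shell
#!/usr/bin/env bash
# Hypothetical post-run gate; not part of Claude Code or the benchmark scripts.

# Print the number of failing tests given a "passed/total" summary.
failing_count() {
  local passed="${1%%/*}" total="${1##*/}"
  echo $(( total - passed ))
}

# Succeed only when no tests are failing.
verify_gate() {
  [ "$(failing_count "$1")" -eq 0 ]
}

failing_count 88/103                              # bugfix run
verify_gate 78/78 && echo "feature run: clean"    # feature run
```

For the bugfix run this prints `15`, the number of failing tests left committed; the feature run's `78/78` passes the gate.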
The takeaway is editorial rather than surprising: a well-planned vanilla session is competitive with anything the other setups in this cohort add on top.
References
- config/pure-t1/settings.json
- scripts/setup-tool-config.sh (case `pure)` at line 159)
- scripts/manual-bench.sh (plan-mode guard at lines 46–53; prompt passthrough at 122–124; launch at 506)
- results/{pure,bugfix/pure,refactor/pure}/t1/
- results/cross-task-stats.json (pure rank-sum 11; z̄ = +0.224)
- Claude Code documentation: https://docs.anthropic.com/en/docs/claude-code/overview
Observed in trial timelines
pure has 0 skill activations across all 24 trials (no addon installed) but consistently dispatches Claude Code’s built-in Explore subagent (mean 3.0 on feature, 2.5 on bugfix, 4.0 on refactor). The plan-mode CLI flag forces ExitPlanMode to fire exactly once per trial — no Skill calls are needed because the discipline is enforced at the harness layer.
Detail: see the per-trial timeline files linked below.
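The per-task means quoted above can be checked against the per-trial Agents counts in the trial cards on this page (feature 2, 1, 4, 5; bugfix 3, 2; refactor 4, 4). Those counts include Plan dispatches as well as Explore dispatches. A minimal sketch; `mean_agents` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper for checking the per-task means; not from the benchmark repo.

# Mean of the arguments, printed with one decimal place.
mean_agents() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.1f\n", s / NR }'
}

mean_agents 2 1 4 5   # feature trials
mean_agents 3 2       # bugfix trials
mean_agents 4 4       # refactor trials
```

This prints `3.0`, `2.5`, and `4.0`, in line with the means reported above.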
Trial timelines
Per-trial event timelines auto-extracted from session-logs/*.jsonl — skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash usage:
“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”
- Agents: 2
- New files: 3
- Edits: 3
- Bash: 24
Subagents dispatched (2)
- Explore · Read Mode 2 PRD at 14:31
- Explore · Explore Mode 1 implementation at 14:31
Subagent transcripts (2)
- agent-a4626f2b3c05…: Thoroughly explore the existing Mode 1 CD implementation in the repo at `/Users/randytran/Codes/ai-t… [Bash×36, Read×19]
- agent-a8d49133c231…: Read the PRD file at `/Users/randytran/Codes/ai-tool-benchmark/runs/pure-t1/docs/infina-product-docs… [Read×1]
New files created (3)
- /Users/randytran/Codes/ai-tool-benchmark/config/pure-t1/plans/fancy-moseying-blum.md
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”
- Agents: 1
- New files: 7
- Edits: 5
- Bash: 15
Subagents dispatched (1)
- Explore · Explore Mode 1 CD implementation at 14:31
Subagent transcripts (1)
- agent-a76fff253717…: I'm planning to implement Mode 2 CD Batch for the TD-CD product in an NX monorepo at /Users/randytra… [Bash×27, Read×19, Glob×4]
New files created (7)
- /Users/randytran/Codes/ai-tool-benchmark/config/pure-t2/plans/tingly-sleeping-kettle.md
- libs/core/src/domain/savings-cd/td-cd-mode2-payment-schedule.service.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2-payment-schedule.service.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.util.ts
“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”
- Agents: 4
- New files: 3
- Edits: 7
- Bash: 22
Subagents dispatched (4)
- Explore · Read PRD for Mode 2 CD Batch at 04:50
- Explore · Explore Mode 1 strategy implementation at 04:50
- Explore · Explore CD entities and constants at 04:50
- Plan · Design Mode 2 implementation plan at 04:55
Subagent transcripts (4)
- agent-a18948bc51e5…: Thoroughly explore the existing Mode 1 CD implementation to understand patterns, interfaces, and fil… [Bash×34, Read×26, Grep×2]
- agent-a73befd088bd…: Explore the TD-CD related entities, constants, enums, and DTOs. Focus on: 1. CD-related entities in… [Read×28, Bash×17, Glob×11]
- agent-a843868002a1…: Read the PRD file at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User… [Read×1]
- agent-a9d81b9fc452…: Design an implementation plan for Mode 2 CD Batch strategy for the TD-CD product. ## Context The cod… [Grep×28, Read×26, Glob×3]
New files created (3)
- /Users/randytran/Codes/ai-tool-benchmark/config/pure-t3/plans/mellow-foraging-pudding.md
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”
- Agents: 5
- New files: 6
- Edits: 6
- Bash: 24
Subagents dispatched (5)
- Explore · Read PRD for Mode 2 CD Batch at 08:47
- Explore · Explore Mode 1 implementation in libs/core at 08:47
- Explore · Explore libs/savings-cd and related files at 08:47
- Explore · Explore Mode 1 tests and CD batch model at 08:51
- Plan · Design Mode 2 implementation plan at 08:54
Subagent transcripts (5)
- agent-a0bc82fe50c9…: I need to understand two things thoroughly: 1. **Mode 1 Strategy Tests**: Find and read test files f… [Read×18, Glob×10, Bash×8, Grep×5]
- agent-a2c80e921370…: Design the implementation plan for TD-CD Mode 2 CD Batch strategy. Here's the full context: ## Exist… [Grep×20, Read×16, Glob×3]
- agent-a4385b9e0ab7…: Read the full PRD document at: docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [T… [Read×1]
- agent-a57b3dde8410…: Thoroughly explore the Mode 1 CD implementation in libs/core/src/domain/savings-cd/ directory. I nee… [Read×22, Bash×18, Glob×1]
- agent-afefd3eb6452…: Thoroughly explore libs/savings-cd/ directory and any related TD-CD files. I need to understand: 1.… [Read×38, Bash×15, Grep×1]
New files created (6)
- /Users/randytran/Codes/ai-tool-benchmark/config/pure-t4/plans/enchanted-wobbling-horizon.md
- libs/cd-core/src/domain/utils/cd-aging-days.util.spec.ts
- libs/cd-core/src/domain/utils/cd-aging-days.util.ts
- libs/cd-core/src/domain/utils/index.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
“A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected behaviour, and the definition of done. Your job: investigate the codebase, find the root cause, and shi…”
- Agents: 3
- New files: 2
- Edits: 2
- Bash: 17
- Sessions: 2
Subagents dispatched (3)
- Explore · Find savings CD batch eligibility code at 16:46
- Explore · Explore savings-cd batch eligibility at 16:46
- Explore · Explore core ntd-cd-savings service at 16:47
Subagent transcripts (3)
- agent-a0a28b10007c…: In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t1, I need to understand h… [Read×17, Bash×16, Grep×3, Glob×1]
- agent-a598ac2cf984…: I'm investigating bug SHP-2376 in an NX monorepo. The issue: deposits into Savings CD schemes near m… [Bash×1]
- agent-a69b3b260a7c…: In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t1, I need to understand t… [Read×9, Grep×7, Glob×3, Bash×2]
New files created (2)
- /Users/randytran/Codes/ai-tool-benchmark/config/shp2376/pure-t1/plans/a-qa-bug-report-wild-shannon.md
- pure-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
“A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected behaviour, and the definition of done. Your job: investigate the codebase, find the root cause, and shi…”
- Agents: 2
- New files: 2
- Edits: 4
- Bash: 20
Subagents dispatched (2)
- Explore · Explore Savings CD batch eligibility at 16:41
- Plan · Plan SHP-2376 bugfix approach at 16:45
Subagent transcripts (2)
- agent-a7536ac7fe87…: I'm investigating bug SHP-2376 in /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t2. The… [Read×17, Grep×10, Bash×6, Glob×3]
- agent-a9b4d590a3c9…: ## Context Bug SHP-2376: Deposits into near-maturity Savings CD batches get stuck "In progress" fore… [Read×12, Grep×8, Glob×6]
New files created (2)
- /Users/randytran/Codes/ai-tool-benchmark/config/shp2376/pure-t2/plans/a-qa-bug-report-humble-boole.md
- pure-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”
- Agents: 4
- New files: 1
- Edits: 22
- Bash: 10
Subagents dispatched (4)
- Explore · Explore Scheme model/entity at 03:49
- Explore · Explore ITDCDModeStrategy port at 03:49
- Explore · Explore tests and imports at 03:49
- Plan · Design refactor plan for SHP-2317 at 03:55
Subagent transcripts (1)
- agent-acfdfe0596b2…: Design an implementation plan for the SHP-2317 refactor. I've done thorough exploration. Here's what… [no tools]
New files created (1)
- /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/pure-t1/plans/a-refactor-has-been-twinkly-squirrel.md
“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”
- Agents: 4
- New files: 2
- Edits: 27
- Bash: 11
Subagents dispatched (4)
- Explore · Explore Scheme model and entity at 06:31
- Explore · Explore ITDCDModeStrategy and CDBatch at 06:31
- Explore · Explore cdBatchId usage across codebase at 06:31
- Plan · Design refactor implementation plan at 06:35
Subagent transcripts (4)
- agent-a6cfec3fc944…: I need to understand the Scheme model/entity in libs/core. Specifically: 1. Find the Scheme model (d… [Bash×9, Read×8, Grep×4, Glob×3]
- agent-aaac459c5e26…: I need to understand the strategy port and CDBatch usage in libs/core and libs/savings-cd. Search in… [Read×13, Grep×4, Glob×3, Bash×3]
- agent-ab6e77f23d57…: I need to find ALL references to cdBatchId and cdBatch across the codebase at /Users/randytran/Codes… [Read×8, Bash×7, Grep×4, Glob×2]
- agent-afcc8910a8b6…: Design an implementation plan for the SHP-2317 refactor described below. I've already explored the c… [Read×12, Glob×7, Grep×5]
New files created (2)
- /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/pure-t2/plans/a-refactor-has-been-mutable-crane.md
- pure-t2/libs/core/src/model/cd-batch-info.model.ts