pure

The no-addon baseline. Vanilla Claude Code with plan mode forced on and nothing else.

Upstream

There is no third-party repository. “pure” is the CLI out of the box.

Performance

| Task | Cohort mean | pure mean | z | Rank |
|---|---|---|---|---|
| feature | 119.54 | 121.53 | +0.077 | 5 / 9 |
| bugfix | 168.29 | 176.67 | +0.508 | 3 / 9 |
| refactor | 160.15 | 161.08 | +0.086 | 3 / 9 |

Overall z̄ = +0.224, aggregate rank 3 / 9, rank-sum 11. In the feature and bugfix task-tiers pure sits inside the top cluster (feature tier-1 with bmad/gstack/superpower/ecc; bugfix tier-1 with ecc/bmad). This is the most important single result in the benchmark: the no-addon baseline lands in the same tier as the best-performing setups, inside the CI of the leaders.
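The aggregate figures can be checked against the per-task rows. The arithmetic below assumes z̄ is the unweighted mean of the three per-task z values and that rank-sum is the plain sum of the three ranks; the source does not state the aggregation rule, but both assumptions reproduce the reported numbers.

```latex
\bar{z} = \frac{0.077 + 0.508 + 0.086}{3} = \frac{0.671}{3} \approx 0.224,
\qquad
\text{rank-sum} = 5 + 3 + 3 = 11
```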

Mechanism

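There is no mechanism beyond Claude Code itself. The only non-default is --permission-mode plan, which holds the session in a read-only investigation phase until the model commits a written plan and calls ExitPlanMode.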

Everything else — TodoWrite, the general-purpose Agent sub-agent, Bash, Read, Grep, Glob, Edit, Write — is Claude Code’s stock tool set. No plugins, no skills, no hooks, no MCP servers, no custom slash commands.

How this benchmark invoked it

config/pure-t1/ is deliberately empty of customisation. setup-tool-config.sh line 159 is a no-op:

“Pure Claude Code — no external tools to install. Config dir stays empty (no plugins, no MCP, no skills, no hooks).”

config/pure-t1/settings.json contains one key: "skipDangerousModePermissionPrompt": true.

manual-bench.sh guards pure (and mindful) with a plan-mode flag:

PLAN_MODE_FLAG=""
case "$TOOL" in
  pure|mindful)
    PLAN_MODE_FLAG="--permission-mode plan"
    ;;
esac

The per-tool prompt is the shared task text verbatim: PROMPT="$SHARED_TASK". No slash command, no setup preamble. The CLI is launched with claude --model claude-opus-4-6 --dangerously-skip-permissions --permission-mode plan.
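The guard and the launch line above can be sketched as one small function. This is a minimal illustration, not the contents of manual-bench.sh: the function name build_claude_args is hypothetical, and it assumes every tool shares the same --model and --dangerously-skip-permissions flags, which the source only confirms for pure. It prints the command line rather than running it, because the way the prompt is handed to the CLI is not shown in the source.

```shell
# Hypothetical helper (not from manual-bench.sh): print the CLI invocation
# that would be used for a given tool name.
build_claude_args() {
  plan_mode_flag=""
  case "$1" in
    pure|mindful)
      # Only these two tools get plan mode forced on.
      plan_mode_flag="--permission-mode plan"
      ;;
  esac
  # $plan_mode_flag is deliberately unquoted: it splits into two words when
  # set and disappears entirely when empty.
  # ASSUMPTION: the model and skip-permissions flags are common to all tools.
  echo claude --model claude-opus-4-6 --dangerously-skip-permissions $plan_mode_flag
}

build_claude_args pure
build_claude_args other-tool
```

For pure and mindful the printed line ends in --permission-mode plan; for any other name the flag is simply absent.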

What the transcripts look like

Feature (TD-CD Mode 2, 193 messages, ~25 min): 24 × Bash, 16 × Read, 14 × Grep, 3 × Write, 3 × Edit, 2 × Agent, 1 × ExitPlanMode. The ExitPlanMode payload is a ~1,500-word structured plan with an axis-by-axis Mode 1 vs Mode 2 table, an exhaustive file list, and a quote of the dev-tech-spec promise that “Mode 2 only requires a new TDCDMode2Strategy — zero service changes.” The model reads the PRD, surveys Mode 1, commits to a minimal strategy-only patch, exits plan mode, writes the strategy and tests. Result: 372 insertions, 0 tsc errors, 78/78 tests passing.

Bugfix (180 messages in the main session; a 13-message warm-up session pre-compacted context): 17 × Bash, 10 × Read, 10 × Grep, 2 × Write, 2 × Edit, 2 × Agent, 1 × ExitPlanMode in the main session. Three sub-agents were used in total for scoped investigation (1 dispatched in the warm-up, 2 in the main session, all read-only). 88/103 tests pass post-fix, so 15 failing tests remain, but the fix is scoped and the judges still placed it at rank 3.

Refactor (241 messages): 22 × Edit, 15 × Read, 10 × Bash, 7 × Grep, 4 × Agent, 1 × Write, 1 × ExitPlanMode. Edit-heavy shape: the planning cost was paid once, then the patch is mechanical. 9 files, 121 insertions / 72 deletions, 61/61 tests passing.

Zero TodoWrite across all three sessions. The plan document emitted via ExitPlanMode is the todo list; it’s produced once, consulted implicitly by the model’s own context, and never re-materialised as a structured task object.
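Tallies such as "24 × Bash, 16 × Read" can be approximated straight from a session log. The sketch below is an assumption-heavy illustration: the function name count_tool_calls is hypothetical, and it assumes each tool call appears in the JSONL as an object carrying "type":"tool_use" followed by a "name" field before any nested braces, with no whitespace between keys and values. The sample log is made up for the demonstration and is not benchmark data.

```shell
# Hypothetical helper: count tool calls per tool name across session logs.
# ASSUMPTION: tool calls look like {"type":"tool_use", ..., "name":"Bash", ...}
# with "name" appearing before any nested { } in the same object.
count_tool_calls() {
  cat "$@" \
    | grep -o '"type":"tool_use"[^{}]*"name":"[^"]*"' \
    | sed 's/.*"name":"\([^"]*\)"$/\1/' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1 " x " $2 }'
}

# Usage against a tiny made-up log (not real benchmark data).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{"type":"assistant","message":{"content":[{"type":"tool_use","id":"a1","name":"Bash","input":{"command":"ls"}}]}}
{"type":"assistant","message":{"content":[{"type":"tool_use","id":"a2","name":"Read","input":{"file_path":"x.ts"}}]}}
{"type":"assistant","message":{"content":[{"type":"tool_use","id":"a3","name":"Bash","input":{"command":"pwd"}}]}}
EOF
count_tool_calls "$sample"
rm -f "$sample"
```

A pattern match like this is fragile against schema changes; a JSON-aware parser would be the safer choice for anything beyond a quick check.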

Why the baseline performs so well

Three observations from the transcripts:

  1. Plan mode supplies most of what the other setups supply. It enforces the investigate-then-write discipline. Every pure run begins with an uninterrupted read-only investigation phase that ends in a committed written plan. Setups that wrap Claude Code typically add exactly this — a plan document, a scope boundary, a “do not edit before planning” guardrail. When the CLI is already doing it, the marginal value of an addon shrinks.

  2. Opus 4.6 is sufficient on its own for benchmark-sized tasks. The tool mix is unremarkable: Read, Grep, Bash, Edit. Sub-agent use is light (2 on feature, 2–3 on bugfix depending on whether the warm-up session is counted, 4 on refactor) and scoped to investigation and planning, never delegation of implementation. The model is not stretching for extra scaffolding because it doesn’t need it.

  3. The shape of a strong baseline run is short-but-structured. One plan artefact, one ExitPlanMode transition, then direct execution. No todo list churn, no orchestration overhead, no preamble tokens spent on setup ceremony. The feature run writes 372 lines in ~25 minutes; the refactor run lands 9 files in 241 messages that are mostly small Edits.

Strengths and failure modes

Strengths. Low variance in rank (5, 3, 3). Lowest ceremony cost per trial. No setup-introduced regressions: the only way it can fail a rubric is through the underlying model’s own mistakes. Self-preference bias is mild (pure is penalised by the opus judge on feature, −7.3, and neutral-positive elsewhere), so there is no evidence the baseline is being inflated by judge affinity.

Failure modes. No enforced verification step: the bugfix left 15 failing tests committed. No automatic scope broadening — if the task needs exploration beyond one plan pass, pure won’t initiate a second. No memory across runs; every session is cold. Lint hygiene is worse than task outcome (6–11 ESLint errors per trial) because nothing pushes the model to clean up after itself.

The takeaway is editorial rather than surprising: a well-planned vanilla session is competitive with anything the other setups in this cohort add on top.

References

Observed in trial timelines

pure has 0 skill activations across all 24 trials (no addon installed) but consistently dispatches Claude Code’s built-in Explore and Plan subagents (mean 3.0 dispatches per trial on feature, 2.5 on bugfix, 4.0 on refactor). The plan-mode CLI flag forces ExitPlanMode to fire exactly once per trial, so no Skill calls are needed: the discipline is enforced at the harness layer.
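The per-task means quoted above match the Agents counts on the eight trial cards further down this page, assuming each mean is a plain average over that task's cards. Those counts include Plan dispatches as well as Explore dispatches.

```latex
\text{feature: } \frac{2 + 1 + 4 + 5}{4} = 3.0,
\qquad
\text{bugfix: } \frac{3 + 2}{2} = 2.5,
\qquad
\text{refactor: } \frac{4 + 4}{2} = 4.0
```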

Detail: see the per-trial timeline files linked below.

Trial timelines

Per-trial event timelines are auto-extracted from each trial's session-logs/*.jsonl. Each card shows the subagents dispatched, skill activations, plugin/skill file reads, code mutations, the Bash command mix, and the final diff. The cards are listed by task: four feature trials, then two bugfix trials, then two refactor trials.

Feature t1 · 14:31 → 14:52 UTC · 21 min
2 commits · 4 files · +372

“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”

  • Agents: 2
  • New files: 3
  • Edits: 3
  • Bash: 24
Bash command mix · 24 calls
  • other 10
  • tests 10
  • inspection 2
  • lint/format 1
  • git ops 1
Subagents dispatched (2)
  • Explore · Read Mode 2 PRD at 14:31
  • Explore · Explore Mode 1 implementation at 14:31
Subagent transcripts (2)
  • agent-a4626f2b3c05… — Thoroughly explore the existing Mode 1 CD implementation in the repo at `/Users/randytran/Codes/ai-t… [Bash×36, Read×19]
  • agent-a8d49133c231… — Read the PRD file at `/Users/randytran/Codes/ai-tool-benchmark/runs/pure-t1/docs/infina-product-docs… [Read×1]
New files created (3)
  • /Users/randytran/Codes/ai-tool-benchmark/config/pure-t1/plans/fancy-moseying-blum.md
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
Feature t2 · 14:30 → 15:08 UTC · 37 min
2 commits · 9 files · +655

“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”

  • Agents: 1
  • New files: 7
  • Edits: 5
  • Bash: 15
Bash command mix · 15 calls
  • tests 9
  • other 3
  • inspection 2
  • git ops 1
Subagents dispatched (1)
  • Explore · Explore Mode 1 CD implementation at 14:31
Subagent transcripts (1)
  • agent-a76fff253717… — I'm planning to implement Mode 2 CD Batch for the TD-CD product in an NX monorepo at /Users/randytra… [Bash×27, Read×19, Glob×4]
New files created (7)
  • /Users/randytran/Codes/ai-tool-benchmark/config/pure-t2/plans/tingly-sleeping-kettle.md
  • libs/core/src/domain/savings-cd/td-cd-mode2-payment-schedule.service.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2-payment-schedule.service.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.util.ts
Feature t3 · 04:49 → 05:22 UTC · 33 min
2 commits · 6 files · +410

“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”

  • Agents: 4
  • New files: 3
  • Edits: 7
  • Bash: 22
Bash command mix · 22 calls
  • tests 14
  • git ops 4
  • typecheck 2
  • lint/format 1
  • inspection 1
Subagents dispatched (4)
  • Explore · Read PRD for Mode 2 CD Batch at 04:50
  • Explore · Explore Mode 1 strategy implementation at 04:50
  • Explore · Explore CD entities and constants at 04:50
  • Plan · Design Mode 2 implementation plan at 04:55
Subagent transcripts (4)
  • agent-a18948bc51e5… — Thoroughly explore the existing Mode 1 CD implementation to understand patterns, interfaces, and fil… [Bash×34, Read×26, Grep×2]
  • agent-a73befd088bd… — Explore the TD-CD related entities, constants, enums, and DTOs. Focus on: 1. CD-related entities in… [Read×28, Bash×17, Glob×11]
  • agent-a843868002a1… — Read the PRD file at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User… [Read×1]
  • agent-a9d81b9fc452… — Design an implementation plan for Mode 2 CD Batch strategy for the TD-CD product. ## Context The cod… [Grep×28, Read×26, Glob×3]
New files created (3)
  • /Users/randytran/Codes/ai-tool-benchmark/config/pure-t3/plans/mellow-foraging-pudding.md
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
Feature t4 · 08:47 → 09:08 UTC · 20 min
2 commits · 10 files · +430

“Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD Batch.md and study the existing Mode 1 implementation in libs/core/src/domain/savings-cd/ and libs/savin…”

  • Agents: 5
  • New files: 6
  • Edits: 6
  • Bash: 24
Bash command mix · 24 calls
  • tests 13
  • inspection 4
  • other 2
  • typecheck 2
  • git ops 2
  • lint/format 1
Subagents dispatched (5)
  • Explore · Read PRD for Mode 2 CD Batch at 08:47
  • Explore · Explore Mode 1 implementation in libs/core at 08:47
  • Explore · Explore libs/savings-cd and related files at 08:47
  • Explore · Explore Mode 1 tests and CD batch model at 08:51
  • Plan · Design Mode 2 implementation plan at 08:54
Subagent transcripts (5)
  • agent-a0bc82fe50c9… — I need to understand two things thoroughly: 1. **Mode 1 Strategy Tests**: Find and read test files f… [Read×18, Glob×10, Bash×8, Grep×5]
  • agent-a2c80e921370… — Design the implementation plan for TD-CD Mode 2 CD Batch strategy. Here's the full context: ## Exist… [Grep×20, Read×16, Glob×3]
  • agent-a4385b9e0ab7… — Read the full PRD document at: docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [T… [Read×1]
  • agent-a57b3dde8410… — Thoroughly explore the Mode 1 CD implementation in libs/core/src/domain/savings-cd/ directory. I nee… [Read×22, Bash×18, Glob×1]
  • agent-afefd3eb6452… — Thoroughly explore libs/savings-cd/ directory and any related TD-CD files. I need to understand: 1.… [Read×38, Bash×15, Grep×1]
New files created (6)
  • /Users/randytran/Codes/ai-tool-benchmark/config/pure-t4/plans/enchanted-wobbling-horizon.md
  • libs/cd-core/src/domain/utils/cd-aging-days.util.spec.ts
  • libs/cd-core/src/domain/utils/cd-aging-days.util.ts
  • libs/cd-core/src/domain/utils/index.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
Bugfix t1 · 16:46 → 16:58 UTC · 11 min
2 commits · 2 files · +138

“A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected behaviour, and the definition of done. Your job: investigate the codebase, find the root cause, and shi…”

  • Agents: 3
  • New files: 2
  • Edits: 2
  • Bash: 17
  • Sessions: 2
Bash command mix · 17 calls
  • tests 7
  • other 7
  • git ops 2
  • inspection 1
Subagents dispatched (3)
  • Explore · Find savings CD batch eligibility code at 16:46
  • Explore · Explore savings-cd batch eligibility at 16:46
  • Explore · Explore core ntd-cd-savings service at 16:47
Subagent transcripts (3)
  • agent-a0a28b10007c… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t1, I need to understand h… [Read×17, Bash×16, Grep×3, Glob×1]
  • agent-a598ac2cf984… — I'm investigating bug SHP-2376 in an NX monorepo. The issue: deposits into Savings CD schemes near m… [Bash×1]
  • agent-a69b3b260a7c… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t1, I need to understand t… [Read×9, Grep×7, Glob×3, Bash×2]
New files created (2)
  • /Users/randytran/Codes/ai-tool-benchmark/config/shp2376/pure-t1/plans/a-qa-bug-report-wild-shannon.md
  • pure-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
Bugfix t2 · 16:40 → 16:54 UTC · 13 min
2 commits · 4 files · +163

“A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected behaviour, and the definition of done. Your job: investigate the codebase, find the root cause, and shi…”

  • Agents: 2
  • New files: 2
  • Edits: 4
  • Bash: 20
Bash command mix · 20 calls
  • tests 9
  • other 6
  • inspection 2
  • git ops 2
  • lint/format 1
Subagents dispatched (2)
  • Explore · Explore Savings CD batch eligibility at 16:41
  • Plan · Plan SHP-2376 bugfix approach at 16:45
Subagent transcripts (2)
  • agent-a7536ac7fe87… — I'm investigating bug SHP-2376 in /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/pure-t2. The… [Read×17, Grep×10, Bash×6, Glob×3]
  • agent-a9b4d590a3c9… — ## Context Bug SHP-2376: Deposits into near-maturity Savings CD batches get stuck "In progress" fore… [Read×12, Grep×8, Glob×6]
New files created (2)
  • /Users/randytran/Codes/ai-tool-benchmark/config/shp2376/pure-t2/plans/a-qa-bug-report-humble-boole.md
  • pure-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
Refactor t1 · 03:49 → 04:02 UTC · 13 min
2 commits · 9 files · +121

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

  • Agents: 4
  • New files: 1
  • Edits: 22
  • Bash: 10
Bash command mix · 10 calls
  • other 4
  • git ops 3
  • tests 2
  • inspection 1
Subagents dispatched (4)
  • Explore · Explore Scheme model/entity at 03:49
  • Explore · Explore ITDCDModeStrategy port at 03:49
  • Explore · Explore tests and imports at 03:49
  • Plan · Design refactor plan for SHP-2317 at 03:55
Subagent transcripts (1)
  • agent-acfdfe0596b2… — Design an implementation plan for the SHP-2317 refactor. I've done thorough exploration. Here's what… [no tools]
New files created (1)
  • /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/pure-t1/plans/a-refactor-has-been-twinkly-squirrel.md
Refactor t2 · 06:31 → 06:47 UTC · 15 min
2 commits · 12 files · +115

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

  • Agents: 4
  • New files: 2
  • Edits: 27
  • Bash: 11
Bash command mix · 11 calls
  • other 4
  • git ops 4
  • tests 2
  • inspection 1
Subagents dispatched (4)
  • Explore · Explore Scheme model and entity at 06:31
  • Explore · Explore ITDCDModeStrategy and CDBatch at 06:31
  • Explore · Explore cdBatchId usage across codebase at 06:31
  • Plan · Design refactor implementation plan at 06:35
Subagent transcripts (4)
  • agent-a6cfec3fc944… — I need to understand the Scheme model/entity in libs/core. Specifically: 1. Find the Scheme model (d… [Bash×9, Read×8, Grep×4, Glob×3]
  • agent-aaac459c5e26… — I need to understand the strategy port and CDBatch usage in libs/core and libs/savings-cd. Search in… [Read×13, Grep×4, Glob×3, Bash×3]
  • agent-ab6e77f23d57… — I need to find ALL references to cdBatchId and cdBatch across the codebase at /Users/randytran/Codes… [Read×8, Bash×7, Grep×4, Glob×2]
  • agent-afcc8910a8b6… — Design an implementation plan for the SHP-2317 refactor described below. I've already explored the c… [Read×12, Glob×7, Grep×5]
New files created (2)
  • /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/pure-t2/plans/a-refactor-has-been-mutable-crane.md
  • pure-t2/libs/core/src/model/cd-batch-info.model.ts