gstack

Upstream: github.com/garrytan/gstack — MIT, by Garry Tan (YC). Pinned at the single-branch depth-1 clone taken on 2026-04-15; skill manifests declare version: 1.0.0, the repo’s VERSION file reports 0.17.0.0, and package.json reports 0.16.2.0. Runtime stack: Claude Code + Bun v1.0+ (plus Node on Windows). Marketed as “Garry’s Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.”

What it is

gstack is a skill pack — 37 directories under config/gstack-t1/skills/ — that frames Claude Code as a simulated product team: CEO (/plan-ceo-review), eng manager (/plan-eng-review), designer (/plan-design-review, /design-review), reviewer (/review), QA (/qa, /qa-only), security officer (/cso), release engineer (/ship, /land-and-deploy, /canary), plus debugging (/investigate), planning (/autoplan, /office-hours), retros, scope locks (/freeze, /guard), and a headless-browser binary. The installer (./setup --no-prefix) drops each skill into $CLAUDE_CONFIG_DIR/skills/, where Claude Code reads them as top-level slash commands.
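To make the install layout concrete, the sketch below lists the slash commands a skills directory would expose. It assumes each skill is a directory holding a SKILL.md file (that file name appears in the trial logs) and that the command name equals the directory name. The helper `list_slash_commands` is hypothetical and is not how Claude Code itself discovers skills.

```python
from pathlib import Path
from typing import List


def list_slash_commands(skills_dir: Path) -> List[str]:
    """Return '/name' for every sub-directory that contains a SKILL.md file."""
    if not skills_dir.is_dir():
        return []
    return sorted(
        f"/{entry.name}"
        for entry in skills_dir.iterdir()
        # Assumption: a directory only counts as a skill if SKILL.md is present.
        if entry.is_dir() and (entry / "SKILL.md").is_file()
    )
```

Pointing it at the skills folder under $CLAUDE_CONFIG_DIR after running the installer would give a quick check that the expected entry points (for example /ship and /investigate) are present.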

Every skill begins with a long # Preamble (run first) Bash block that probes ~/.gstack/ for telemetry, proactive-suggest, routing, and vendoring markers, optionally appends a “Skill routing” block to the project CLAUDE.md, and (when proactive is on) auto-invokes peer skills from conversational triggers — "why is this broken" routes to /investigate, "ship it" to /ship, "architecture review" to /plan-eng-review, and so on. Each skill’s description: frontmatter explicitly names its trigger phrases; that prose is what Claude Code matches against the user turn.
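The routing idea can be illustrated with a toy lookup. The three phrase-to-skill pairs are the ones quoted above; the substring match is a stand-in assumption, because the real matching is done by the model against each skill's description prose, not by string comparison. The names `TRIGGERS` and `route` are hypothetical.

```python
from typing import Optional

# Trigger phrases quoted in the text above. Substring matching is only a
# stand-in: the actual routing is performed by the model, not by this table.
TRIGGERS = {
    "why is this broken": "/investigate",
    "ship it": "/ship",
    "architecture review": "/plan-eng-review",
}


def route(user_turn: str, proactive: bool = True) -> Optional[str]:
    """Return the skill a turn would route to, or None if nothing matches."""
    if not proactive:
        return None  # auto-invocation only applies when proactive mode is on
    text = user_turn.lower()
    for phrase, skill in TRIGGERS.items():
        if phrase in text:
            return skill
    return None
```

A deterministic table like this would miss paraphrases that a model-driven match can catch, which is one plausible reason the auto-route does not fire in every trial.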

Benchmark configuration

Config dir: config/gstack-t1/. settings.json contains only {"skipDangerousModePermissionPrompt": true}; no MCP, no hooks, no allow-list. All behaviour comes from the skill tree. Install shim in scripts/setup-tool-config.sh (gstack case, ~L176) clones upstream into $TOOL_CONFIG/skills/gstack and runs HOME=$TOOL_CONFIG ./setup --no-prefix so ~/.gstack/ writes stay inside the trial.
Launch prompt (scripts/manual-bench.sh, ~L158): feature and refactor run the raw $SHARED_TASK with the suffix “When implementation is complete and committed, run /ship to review the diff and finalize.”; bugfix prepends /investigate $SHARED_TASK with the same /ship tail. /investigate is the only gstack entry point fired by command — everywhere else, activation is left to the skill’s prose triggers.
Plan mode is deliberately off for gstack (pipeline.md §1; manual-bench.sh L47-54). The rationale in the script comment: “gstack deliberately excluded — its /ship workflow already runs an eng review gate; we evaluate gstack on its native surface without forcing Claude’s plan mode.” Only pure and mindful get --permission-mode plan; every tool with its own planning surface (bmad, omc, ecc, compound, gstack, …) runs without it so results reflect the setup’s own gate, not Claude’s.
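The launch rules above can be summarised in a short sketch. The real logic lives in shell (scripts/manual-bench.sh); the Python below only mirrors the behaviour described here. The function names and the single-space joiner between task text and suffix are assumptions.

```python
from typing import List

SHIP_SUFFIX = (
    "When implementation is complete and committed, "
    "run /ship to review the diff and finalize."
)
# Per the text, only these two setups are launched in Claude's plan mode.
PLAN_MODE_TOOLS = {"pure", "mindful"}


def build_prompt(task_kind: str, shared_task: str) -> str:
    """Compose the launch prompt for a gstack trial."""
    if task_kind not in {"feature", "bugfix", "refactor"}:
        raise ValueError(f"unknown task kind: {task_kind}")
    # Only bugfix is fired through a slash command; the rest use the raw task.
    body = f"/investigate {shared_task}" if task_kind == "bugfix" else shared_task
    return f"{body} {SHIP_SUFFIX}"  # joiner is an assumption


def permission_args(tool: str) -> List[str]:
    """Extra CLI arguments: plan mode only for setups without their own gate."""
    return ["--permission-mode", "plan"] if tool in PLAN_MODE_TOOLS else []
```

Under these rules `permission_args("gstack")` is empty, so the only review gate in a gstack trial is the one /ship runs itself.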

How the entry points behave

/investigate is a five-phase debugger enforcing what the skill calls the Iron Law: no fixes without root cause. Phase 1 is root-cause investigation with a regression-diff check; Phase 2 is pattern analysis with an optional sanitized web search; Phase 3 is explicit hypothesis confirmation via temporary logs/assertions before any edit; Phase 4 is the minimal fix; Phase 5 writes a regression test and a capture-learnings entry to ~/.gstack/projects/<slug>/learnings.jsonl. PreToolUse hooks on Edit/Write call freeze/bin/check-freeze.sh to enforce a scope lock.
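The ordering constraint in the phases listed above can be sketched as a small gate. This is an illustration of the Iron Law as described here, not code from the skill: the skill enforces the order through prompt text, and the class and method names below are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    ROOT_CAUSE = 1        # investigation plus regression-diff check
    PATTERN_ANALYSIS = 2  # optional sanitized web search
    HYPOTHESIS = 3        # confirm via temporary logs/assertions
    FIX = 4               # minimal fix
    REGRESSION_TEST = 5   # regression test plus learnings entry


class Investigation:
    """Tracks completed phases and refuses to skip ahead."""

    def __init__(self) -> None:
        self.completed = 0  # highest phase finished so far

    def complete(self, phase: Phase) -> None:
        if phase != self.completed + 1:
            raise RuntimeError(
                f"{phase.name} attempted before phase {self.completed + 1}"
            )
        self.completed = int(phase)

    def may_edit_code(self) -> bool:
        """Iron Law: no fixes until the hypothesis (Phase 3) is confirmed."""
        return self.completed >= Phase.HYPOTHESIS
```

Calling `complete(Phase.FIX)` on a fresh `Investigation` raises, which is the behaviour the Iron Law is meant to guarantee.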

/ship is where gstack’s “eng-review gate” lives — the reason gstack is excluded from plan mode. It runs: Step 0 platform detection → Step 1 pre-flight + Review Readiness Dashboard (tallies prior /plan-ceo-review, /codex review, /plan-eng-review, /plan-design-review, /plan-devex-review runs from ~/.gstack/ logs; verdict is NO REVIEWS YET until they’ve run) → Step 2 merge base branch → Step 2.5 test-framework bootstrap → Step 3 tests, with a Test Failure Ownership Triage that classifies failures as in-branch vs pre-existing and, on collaborative repos, can open and assign a GitHub issue via gh. Only after review+tests+eval-suites clear does it bump VERSION, update CHANGELOG, commit, push, and open the PR.
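The Test Failure Ownership Triage can be pictured as a set comparison. The sketch assumes the classification compares failing test ids on the branch against those already failing on the base branch; the skill may use other evidence, and `triage_failures` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def triage_failures(
    branch_failures: Iterable[str], base_failures: Iterable[str]
) -> Dict[str, List[str]]:
    """Split failing test ids into in-branch vs pre-existing buckets."""
    base = set(base_failures)
    result: Dict[str, List[str]] = {"in_branch": [], "pre_existing": []}
    for test_id in sorted(set(branch_failures)):
        # Assumption: a test that already fails on base is not owned by the branch.
        bucket = "pre_existing" if test_id in base else "in_branch"
        result[bucket].append(test_id)
    return result
```

Only the `in_branch` bucket would block the ship step under this reading; `pre_existing` failures are the ones that could be filed as a GitHub issue on collaborative repos.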

Results

Rank 4 / 9 overall, z̄ = +0.071, top-4 tie with bmad / pure / ecc. Per-task: feature rank 2 (z = +0.158, mean 123.65), bugfix rank 5 (z = +0.165), refactor rank 8 (z = −0.111). sigma_round on feature is 3.34, the highest single-task round-to-round spread in the cohort. Self-preference is negative on feature (−3.22) and strongly positive on bugfix (+9.0) and refactor (+8.5) — gstack under-rates its own feature work relative to the panel. Auto-metrics: feature 11 files / +772 / −4 / 94 tests pass / 10 ESLint errors; bugfix 2 files / +185 / 0 / 15 test failures / 10 ESLint errors; refactor 9 files / +121 / −71 / 0 test failures / 7 ESLint errors. The feature session shows Claude auto-routing to /autoplan from the task prose before any user-typed slash command (50 Bash / 18 Read / 7 Edit / 5 Write / 1 Skill invocation across 239 log lines).
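As a quick arithmetic check, the overall score is consistent with an unweighted mean of the three per-task z-scores quoted above. The unweighted mean is an assumption; the benchmark's exact aggregation is not stated in this section.

```python
from statistics import mean

# Per-task z-scores quoted in the paragraph above.
task_z = {"feature": 0.158, "bugfix": 0.165, "refactor": -0.111}

z_bar = mean(list(task_z.values()))
print(f"{z_bar:+.3f}")  # prints +0.071
```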

Where it fits

gstack trades blank-prompt flexibility for a fixed organisational metaphor. The ceiling is high when the task lines up with a role gstack already has — greenfield feature work, PR shipping — and lower when the bottleneck is exploratory reasoning the preamble didn’t anticipate. On this benchmark the eng-review gate was enough to tie with the leaders on feature work but didn’t rescue the refactor run.

Observed in trial timelines

Skill activations land at mean 0.5 (feature), 1.0 (bugfix), 1.0 (refactor) — the auto-route to /autoplan from task prose is visible in feature t1 but does not always fire. Subagent dispatch is light (mean 2.2 / 1.0 / 1.0) and Bash dominates on refactor (mean 36) — the /ship test+lint+gate sequence is Bash-heavy.
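The means above can be reproduced from the per-trial cards. The counts below are transcribed from those cards; treating a card with no "Skills" or "Agents" row as zero is an assumption.

```python
from statistics import mean

# Per-trial counts transcribed from the trial cards. A card with no
# "Skills" or "Agents" row is counted as zero (an assumption).
skills = {"feature": [1, 1, 0, 0], "bugfix": [1, 1], "refactor": [1, 1]}
agents = {"feature": [0, 2, 4, 3], "bugfix": [1, 1], "refactor": [1, 1]}
bash_refactor = [42, 30]

skill_means = {task: mean(counts) for task, counts in skills.items()}
agent_means = {task: mean(counts) for task, counts in agents.items()}
bash_refactor_mean = mean(bash_refactor)
```

The feature subagent mean works out to 2.25, which the text reports as 2.2; the other values match exactly.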

Detail: see the per-trial timeline files linked below.

Trial timelines

Per-trial event timelines are auto-extracted from each trial's session-logs/*.jsonl. Each card shows the skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash command mix, and the final diff.
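The extraction described above can be approximated by counting tool calls per JSONL line. The sketch assumes each log line is a JSON object whose `message.content` is a list of blocks, with tool calls marked `"type": "tool_use"` and a `"name"` field; that layout is an assumption about the session-log format, and `count_tool_calls` is a hypothetical helper, not the benchmark's extractor.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_calls(lines: Iterable[str]) -> Counter:
    """Count tool_use blocks by tool name across JSONL log lines."""
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-string content carries no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts
```

Run over one session file, the resulting counter would correspond to the per-card tallies such as Bash, Edits, and Skills, subject to the format assumption above.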

Feature: 4 trials · Bugfix: 2 trials · Refactor: 2 trials (cards follow in that order)

Feature t1 · 03:40 → 03:53 UTC · 12 min

2 commits · 11 files · +772

“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”

New files 5
Edits 7
Bash 50
Skills 1

Bash command mix · 50 calls

inspection 14
tests 14
other 13
lint/format 5
git ops 3
typecheck 1

Skill activations (1)

autoplan — Implement Mode 2 CD Batch for TD-CD end-to-end. PRD: docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] … at 03:40

New files created (5)

libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
libs/core/src/port/service/td-cd-mode2-batch-resolver.port.ts

Feature t2 · 14:32 → 15:06 UTC · 34 min

2 commits · 10 files · +617

Agents 2
New files 6
Edits 4
Bash 16
Skills 1
Todos 6

Bash command mix · 16 calls

other 7
tests 4
typecheck 2
inspection 1
lint/format 1
git ops 1

Skill activations (1)

autoplan at 14:32

Subagents dispatched (2)

Explore · Read TD-CD Mode 2 PRD at 14:33
Explore · Map Mode 1 CD implementation at 14:33

Subagent transcripts (2)

agent-ab5886c56785… — I need a detailed map of the existing TD-CD Mode 1 implementation in this NX monorepo at `/Users/ran… [Read×17, Bash×11, Grep×4]
agent-ae0ffd7627d3… — Read the PRD at `/Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t2/docs/infina-product-docs/do… [Read×1]

New files created (6)

libs/core/src/domain/savings-cd/cd-mode.spec.ts
libs/core/src/domain/savings-cd/cd-mode.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.util.ts

Feature t3 · 04:37 → 05:11 UTC · 34 min

2 commits · 4 files · +406

Agents 4
New files 4
Edits 4
Bash 26
Skill files 1
Sessions 2

Bash command mix · 26 calls

tests 10
other 9
git ops 5
inspection 2

Plugin/skill files read (1 unique)

CLAUDE.md

Subagents dispatched (4)

Explore · Explore Mode 1 CD implementation at 04:37
Explore · Explore Mode 1 TD-CD strategy at 04:48
Explore · Explore Mode 1 entities & types at 04:48
Explore · Explore Mode 1 tests & PRD at 04:49

Subagent transcripts (4)

agent-a0095fdbb95c… — Explore the TD-CD data model and types in this NestJS monorepo. I need to understand: 1. All TypeORM… [Read×24, Bash×19, Glob×1]
agent-aabbcff25891… — Thoroughly explore the Mode 1 CD implementation in this NestJS monorepo. I need to understand: 1. Th… [Glob×2, Bash×1]
agent-ab5e41b191f5… — Explore the TD-CD Mode 1 implementation in this NestJS monorepo. I need to understand: 1. The ITDCDM… [Bash×30, Read×17, Grep×3, Glob×2]
agent-af7e40d8465e… — Explore test patterns and the Mode 1 PRD for TD-CD in this NestJS monorepo. I need: 1. Test files fo… [Read×13, Grep×7, Bash×6, Glob×4]

New files created (3)

/Users/randytran/Codes/ai-tool-benchmark/config/gstack-t3/plans/synchronous-splashing-mccarthy.md
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Feature t4 · 09:02 → 09:20 UTC · 17 min

2 commits · 8 files · +575

“/office-hours Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD…”

Agents 3
New files 2
Edits 10
Bash 26
Sessions 2

Bash command mix · 26 calls

tests 12
other 11
git ops 2
inspection 1

Subagents dispatched (3)

Explore · Explore CD batch models and infra at 09:04
Explore · Read strategy port and related files at 08:57
Explore · Read test files for Mode 1 at 08:57

Subagent transcripts (3)

agent-a5cdb48597dc… — In /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the CD batch-relate… [Read×25, Grep×9, Glob×8, Bash×1]
agent-a8051acb63ee… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, find and read the test files… [Bash×8, Read×6]
agent-aca8082e09e9… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the ITD… [Read×13, Glob×11, Bash×4, Grep×3]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Bugfix t1 · 16:44 → 16:57 UTC · 12 min

2 commits · 2 files · +185

“/investigate A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected…”

Agents 1
New files 1
Edits 2
Bash 16
Skills 1

Bash command mix · 16 calls

git ops 6
tests 5
other 4
lint/format 1

Skill activations (1)

ship at 16:55

Subagents dispatched (1)

Explore · Explore savings-cd codebase at 16:45

Subagent transcripts (1)

agent-abd288a49184… — Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t1 thoroughly.… [Read×21, Bash×10, Grep×1]

New files created (1)

gstack-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

Bugfix t2 · 16:17 → 16:30 UTC · 13 min

2 commits · 2 files · +183

Agents 1
New files 1
Edits 4
Bash 22
Skills 1

Bash command mix · 22 calls

git ops 8
other 7
tests 5
inspection 2

Skill activations (1)

ship at 16:29

Subagents dispatched (1)

Explore · Explore savings-cd codebase at 16:17

Subagent transcripts (1)

agent-aba563046c48… — Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t2 thoroughly.… [Read×19, Bash×14, Grep×3, Glob×2]

New files created (1)

gstack-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

Refactor t1 · 03:56 → 04:23 UTC · 27 min

2 commits · 9 files · +121

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

Agents 1
New files 3
Edits 17
Bash 42
Skills 1
Skill files 2

Bash command mix · 42 calls

tests 20
git ops 10
other 9
inspection 3

Skill activations (1)

ship at 04:15

Plugin/skill files read (2 unique)

/Users/randytran/.claude/skills/gstack/ship/SKILL.md
/Users/randytran/Codes/ai-tool-benchmark/config/shp2317/gstack-t1/skills/gstack/ship/SKILL.md

Subagents dispatched (1)

Explore · Explore refactor targets at 03:56

Subagent transcripts (1)

agent-af6e019dbddf… — I need to understand the current state of a codebase for a refactor. The working directory is /Users… [Read×17, Glob×10, Grep×5, Bash×3]

New files created (3)

gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
gstack-t1/libs/core/src/port/service/td-cd-mode-strategy.port.ts

Refactor t2 · 06:33 → 06:50 UTC · 16 min

2 commits · 10 files · +107

Agents 1
New files 2
Edits 14
Bash 30
Skills 1

Bash command mix · 30 calls

tests 14
other 7
git ops 7
inspection 2

Skill activations (1)

ship at 06:47

Subagents dispatched (1)

Explore · Explore cdBatchId in Scheme at 06:33

Subagent transcripts (1)

agent-a1cf6a9617d1… — In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/gstack-t2, I need a thorough investigation… [Read×20, Bash×11, Grep×9, Glob×1]

New files created (2)

gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts