gstack
Upstream: github.com/garrytan/gstack — MIT, by Garry Tan (YC). Pinned at the single-branch depth-1 clone taken on 2026-04-15; skill manifests declare version: 1.0.0, the repo’s VERSION file reports 0.17.0.0, and package.json reports 0.16.2.0. Runtime stack: Claude Code + Bun v1.0+ (plus Node on Windows). Marketed as “Garry’s Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.”
What it is
gstack is a skill pack — 37 directories under config/gstack-t1/skills/ — that frames Claude Code as a simulated product team: CEO (/plan-ceo-review), eng manager (/plan-eng-review), designer (/plan-design-review, /design-review), reviewer (/review), QA (/qa, /qa-only), security officer (/cso), release engineer (/ship, /land-and-deploy, /canary), plus debugging (/investigate), planning (/autoplan, /office-hours), retros, scope locks (/freeze, /guard), and a headless-browser binary. The installer (./setup --no-prefix) drops each skill into $CLAUDE_CONFIG_DIR/skills/, where Claude Code reads them as top-level slash commands.
Every skill begins with a long # Preamble (run first) Bash block that probes ~/.gstack/ for telemetry, proactive-suggest, routing, and vendoring markers, optionally appends a “Skill routing” block to the project CLAUDE.md, and (when proactive is on) auto-invokes peer skills from conversational triggers — "why is this broken" routes to /investigate, "ship it" to /ship, "architecture review" to /plan-eng-review, and so on. Each skill’s description: frontmatter explicitly names its trigger phrases; that prose is what Claude Code matches against the user turn.
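To make the routing idea concrete, here is a toy sketch of phrase-to-skill routing. It is not how Claude Code matches triggers (that matching is done by the model against each skill's description: prose); the substring lookup, the `TRIGGERS` table, and the `route` helper are all hypothetical and only cover the three example phrases quoted above.

```python
from typing import Optional

# The three trigger phrases named in the text; real skills declare more
# triggers in their description: frontmatter. This table is illustrative only.
TRIGGERS = {
    "why is this broken": "/investigate",
    "ship it": "/ship",
    "architecture review": "/plan-eng-review",
}


def route(user_turn: str) -> Optional[str]:
    """Return the first skill whose trigger phrase occurs in the turn.

    Hypothetical helper: a case-insensitive substring match stands in for
    the model-driven matching that Claude Code actually performs.
    """
    text = user_turn.lower()
    for phrase, skill in TRIGGERS.items():
        if phrase in text:
            return skill
    return None
```

Under these assumptions, `route("Looks good, ship it!")` returns `"/ship"`, and a turn with no known phrase returns `None`, which corresponds to no peer skill being auto-invoked.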
Benchmark configuration
- Config dir: config/gstack-t1/. settings.json contains only {"skipDangerousModePermissionPrompt": true}; no MCP, no hooks, no allow-list. All behaviour comes from the skill tree. The install shim in scripts/setup-tool-config.sh (gstack case, ~L176) clones upstream into $TOOL_CONFIG/skills/gstack and runs HOME=$TOOL_CONFIG ./setup --no-prefix so ~/.gstack/ writes stay inside the trial.
- Launch prompt (scripts/manual-bench.sh, ~L158): feature and refactor run the raw $SHARED_TASK with the suffix “When implementation is complete and committed, run /ship to review the diff and finalize.”; bugfix prepends /investigate, giving /investigate $SHARED_TASK with the same /ship tail. /investigate is the only gstack entry point fired by command; everywhere else, activation is left to the skill’s prose triggers.
- Plan mode is deliberately off for gstack (pipeline.md §1; manual-bench.sh L47-54). The rationale in the script comment: “gstack deliberately excluded — its /ship workflow already runs an eng review gate; we evaluate gstack on its native surface without forcing Claude’s plan mode.” Only pure and mindful get --permission-mode plan; every tool with its own planning surface (bmad, omc, ecc, compound, gstack, …) runs without it so results reflect the setup’s own gate, not Claude’s.
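The launch rules in the list above can be sketched in a few lines. This is not the contents of scripts/manual-bench.sh: the function names, the single-space join between task and suffix, and the `PLAN_MODE_TOOLS` set are assumptions made for illustration. Only the suffix text, the /investigate prefix for bugfix, and the pure/mindful plan-mode rule come from the description.

```python
from typing import List

# Suffix quoted from the launch-prompt description above.
SHIP_TAIL = (
    "When implementation is complete and committed, "
    "run /ship to review the diff and finalize."
)

# Per the text, only these two setups are launched in Claude's plan mode.
PLAN_MODE_TOOLS = {"pure", "mindful"}


def build_prompt(task_type: str, shared_task: str) -> str:
    """Compose the gstack launch prompt for one task type (hypothetical helper)."""
    if task_type not in {"feature", "bugfix", "refactor"}:
        raise ValueError(f"unknown task type: {task_type}")
    # Bugfix is the only task that starts with an explicit slash command.
    body = f"/investigate {shared_task}" if task_type == "bugfix" else shared_task
    # Joining with a single space is an assumption, not taken from the script.
    return f"{body} {SHIP_TAIL}"


def permission_flags(tool: str) -> List[str]:
    """Return the plan-mode CLI flags for a tool, or no flags at all."""
    return ["--permission-mode", "plan"] if tool in PLAN_MODE_TOOLS else []
```

For example, `permission_flags("gstack")` yields an empty list, while `permission_flags("pure")` yields `["--permission-mode", "plan"]`.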
How the entry points behave
/investigate is a five-phase debugger enforcing what the skill calls the Iron Law: no fixes without root cause. Phase 1 is root-cause investigation with a regression-diff check; Phase 2 is pattern analysis with an optional sanitized web search; Phase 3 is explicit hypothesis confirmation via temporary logs/assertions before any edit; Phase 4 is the minimal fix; Phase 5 writes a regression test and a capture-learnings entry to ~/.gstack/projects/<slug>/learnings.jsonl. PreToolUse hooks on Edit/Write call freeze/bin/check-freeze.sh to enforce a scope lock.
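The capture-learnings step appends to a JSON Lines file, one JSON object per line. A minimal sketch of that append follows, assuming a caller-supplied root directory in place of ~/.gstack/; the `append_learning` helper and the entry fields in the usage note are hypothetical, since the skill's actual schema is not described here.

```python
import json
from pathlib import Path


def append_learning(root: Path, slug: str, entry: dict) -> Path:
    """Append one entry to <root>/projects/<slug>/learnings.jsonl.

    Hypothetical helper: the path layout follows the text above, but the
    entry fields are whatever the caller passes in.
    """
    path = root / "projects" / slug / "learnings.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier learnings; each record occupies one line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path
```

A call such as `append_learning(Path("/tmp/gstack-demo"), "my-project", {"root_cause": "stale cache"})` would add a single line to the file without touching previous entries.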
/ship is where gstack’s “eng-review gate” lives — the reason gstack is excluded from plan mode. It runs: Step 0 platform detection → Step 1 pre-flight + Review Readiness Dashboard (tallies prior /plan-ceo-review, /codex review, /plan-eng-review, /plan-design-review, /plan-devex-review runs from ~/.gstack/ logs; verdict is NO REVIEWS YET until they’ve run) → Step 2 merge base branch → Step 2.5 test-framework bootstrap → Step 3 tests, with a Test Failure Ownership Triage that classifies failures as in-branch vs pre-existing and, on collaborative repos, can open and assign a GitHub issue via gh. Only after review+tests+eval-suites clear does it bump VERSION, update CHANGELOG, commit, push, and open the PR.
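The in-branch versus pre-existing split in the Test Failure Ownership Triage can be illustrated with a set comparison. How /ship actually decides ownership is not stated here; the sketch below assumes a failure counts as pre-existing when the same test also fails on the base branch, and the `triage` function and its inputs are hypothetical.

```python
from typing import Dict, Iterable, List


def triage(
    branch_failures: Iterable[str], base_failures: Iterable[str]
) -> Dict[str, List[str]]:
    """Split failing test names into in-branch and pre-existing buckets.

    Assumption: a test that also fails on the base branch is pre-existing;
    everything else is attributed to the current branch.
    """
    base = set(base_failures)
    result: Dict[str, List[str]] = {"in-branch": [], "pre-existing": []}
    # Sort for a stable, reproducible report order.
    for name in sorted(set(branch_failures)):
        bucket = "pre-existing" if name in base else "in-branch"
        result[bucket].append(name)
    return result
```

With branch failures `["a", "b"]` and base failures `["b"]`, the sketch attributes `a` to the branch and `b` to the pre-existing bucket; only the latter would be a candidate for a separately filed issue.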
Results
Rank 4 / 9 overall, z̄ = +0.071, top-4 tie with bmad / pure / ecc. Per-task: feature rank 2 (z = +0.158, mean 123.65), bugfix rank 5 (z = +0.165), refactor rank 8 (z = −0.111). sigma_round on feature is 3.34, the highest single-task round-to-round spread in the cohort. Self-preference is negative on feature (−3.22) and strongly positive on bugfix (+9.0) and refactor (+8.5) — gstack under-rates its own feature work relative to the panel. Auto-metrics: feature 11 files / +772 / −4 / 94 tests pass / 10 ESLint errors; bugfix 2 files / +185 / 0 / 15 test failures / 10 ESLint errors; refactor 9 files / +121 / −71 / 0 test failures / 7 ESLint errors. The feature session shows Claude auto-routing to /autoplan from the task prose before any user-typed slash command (50 Bash / 18 Read / 7 Edit / 5 Write / 1 Skill invocation across 239 log lines).
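As a quick consistency check, the overall z̄ can be reproduced from the three per-task z values, assuming it is their unweighted mean (the report does not state the formula, but the numbers agree under that assumption).

```python
from statistics import mean

# Per-task z-scores as reported above.
Z_BY_TASK = {"feature": 0.158, "bugfix": 0.165, "refactor": -0.111}

# Assumption: the overall score is the plain average across the three tasks.
z_bar = mean(Z_BY_TASK.values())
print(f"{z_bar:+.3f}")  # prints +0.071
```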
Where it fits
gstack trades blank-prompt flexibility for a fixed organisational metaphor. The ceiling is high when the task lines up with a role gstack already has — greenfield feature work, PR shipping — and lower when the bottleneck is exploratory reasoning the preamble didn’t anticipate. On this benchmark the eng-review gate was enough to tie with the leaders on feature work but didn’t rescue the refactor run.
Observed in trial timelines
Skill activations land at mean 0.5 (feature), 1.0 (bugfix), 1.0 (refactor) — the auto-route to /autoplan from task prose is visible in feature t1 but does not always fire. Subagent dispatch is light (mean 2.2 / 1.0 / 1.0) and Bash dominates on refactor (mean 36) — the /ship test+lint+gate sequence is Bash-heavy.
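These means can be recomputed from the per-trial counts shown in the trial cards. The sketch below transcribes those counts by hand; treating a card with no Agents or Skills line as zero is an assumption, as is the `task_means` helper itself.

```python
from statistics import mean

# Per-trial counts transcribed from the trial cards, in card order.
# A missing "Agents" or "Skills" line is assumed to mean zero.
COUNTS = {
    "feature": {"skills": [1, 1, 0, 0], "agents": [0, 2, 4, 3], "bash": [50, 16, 26, 26]},
    "bugfix": {"skills": [1, 1], "agents": [1, 1], "bash": [16, 22]},
    "refactor": {"skills": [1, 1], "agents": [1, 1], "bash": [42, 30]},
}


def task_means(metric: str) -> dict:
    """Mean of one metric per task type (hypothetical helper)."""
    return {task: mean(values[metric]) for task, values in COUNTS.items()}
```

Under these assumptions, skill activations average 0.5 / 1 / 1, refactor Bash averages 36, and feature subagent dispatch averages 2.25, which is consistent with the reported 2.2 after rounding.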
Detail: see the per-trial timeline files linked below.
Trial timelines
Per-trial event timelines auto-extracted from session-logs/*.jsonl — skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash usage:
“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”
- New files: 5
- Edits: 7
- Bash: 50
- Skills: 1
Skill activations (1)
- autoplan: Implement Mode 2 CD Batch for TD-CD end-to-end. PRD: docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] … at 03:40
New files created (5)
- libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
- libs/core/src/port/service/td-cd-mode2-batch-resolver.port.ts
“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”
- Agents: 2
- New files: 6
- Edits: 4
- Bash: 16
- Skills: 1
- Todos: 6
Skill activations (1)
- autoplan at 14:32
Subagents dispatched (2)
- Explore · Read TD-CD Mode 2 PRD at 14:33
- Explore · Map Mode 1 CD implementation at 14:33
Subagent transcripts (2)
- agent-ab5886c56785…: I need a detailed map of the existing TD-CD Mode 1 implementation in this NX monorepo at `/Users/ran… [Read×17, Bash×11, Grep×4]
- agent-ae0ffd7627d3…: Read the PRD at `/Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t2/docs/infina-product-docs/do… [Read×1]
New files created (6)
- libs/core/src/domain/savings-cd/cd-mode.spec.ts
- libs/core/src/domain/savings-cd/cd-mode.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.util.ts
“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”
- Agents: 4
- New files: 4
- Edits: 4
- Bash: 26
- Skill files: 1
- Sessions: 2
Plugin/skill files read (1 unique)
- CLAUDE.md
Subagents dispatched (4)
- Explore · Explore Mode 1 CD implementation at 04:37
- Explore · Explore Mode 1 TD-CD strategy at 04:48
- Explore · Explore Mode 1 entities & types at 04:48
- Explore · Explore Mode 1 tests & PRD at 04:49
Subagent transcripts (4)
- agent-a0095fdbb95c…: Explore the TD-CD data model and types in this NestJS monorepo. I need to understand: 1. All TypeORM… [Read×24, Bash×19, Glob×1]
- agent-aabbcff25891…: Thoroughly explore the Mode 1 CD implementation in this NestJS monorepo. I need to understand: 1. Th… [Glob×2, Bash×1]
- agent-ab5e41b191f5…: Explore the TD-CD Mode 1 implementation in this NestJS monorepo. I need to understand: 1. The ITDCDM… [Bash×30, Read×17, Grep×3, Glob×2]
- agent-af7e40d8465e…: Explore test patterns and the Mode 1 PRD for TD-CD in this NestJS monorepo. I need: 1. Test files fo… [Read×13, Grep×7, Bash×6, Glob×4]
New files created (3)
- /Users/randytran/Codes/ai-tool-benchmark/config/gstack-t3/plans/synchronous-splashing-mccarthy.md
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
“<command-message>office-hours</command-message> <command-name>/office-hours</command-name> <command-args>Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD…”
- Agents: 3
- New files: 2
- Edits: 10
- Bash: 26
- Sessions: 2
Subagents dispatched (3)
- Explore · Explore CD batch models and infra at 09:04
- Explore · Read strategy port and related files at 08:57
- Explore · Read test files for Mode 1 at 08:57
Subagent transcripts (3)
- agent-a5cdb48597dc…: In /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the CD batch-relate… [Read×25, Grep×9, Glob×8, Bash×1]
- agent-a8051acb63ee…: In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, find and read the test files… [Bash×8, Read×6]
- agent-aca8082e09e9…: In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the ITD… [Read×13, Glob×11, Bash×4, Grep×3]
New files created (2)
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
“<command-message>investigate</command-message> <command-name>/investigate</command-name> <command-args>A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected…”
- Agents: 1
- New files: 1
- Edits: 2
- Bash: 16
- Skills: 1
Skill activations (1)
- ship at 16:55
Subagents dispatched (1)
- Explore · Explore savings-cd codebase at 16:45
Subagent transcripts (1)
- agent-abd288a49184…: Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t1 thoroughly.… [Read×21, Bash×10, Grep×1]
New files created (1)
- gstack-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
“<command-message>investigate</command-message> <command-name>/investigate</command-name> <command-args>A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected…”
- Agents: 1
- New files: 1
- Edits: 4
- Bash: 22
- Skills: 1
Skill activations (1)
- ship at 16:29
Subagents dispatched (1)
- Explore · Explore savings-cd codebase at 16:17
Subagent transcripts (1)
- agent-aba563046c48…: Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t2 thoroughly.… [Read×19, Bash×14, Grep×3, Glob×2]
New files created (1)
- gstack-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts
“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”
- Agents: 1
- New files: 3
- Edits: 17
- Bash: 42
- Skills: 1
- Skill files: 2
Skill activations (1)
- ship at 04:15
Plugin/skill files read (2 unique)
- /Users/randytran/.claude/skills/gstack/ship/SKILL.md
- /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/gstack-t1/skills/gstack/ship/SKILL.md
Subagents dispatched (1)
- Explore · Explore refactor targets at 03:56
Subagent transcripts (1)
- agent-af6e019dbddf…: I need to understand the current state of a codebase for a refactor. The working directory is /Users… [Read×17, Glob×10, Grep×5, Bash×3]
New files created (3)
- gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
- gstack-t1/libs/core/src/port/service/td-cd-mode-strategy.port.ts
“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”
- Agents: 1
- New files: 2
- Edits: 14
- Bash: 30
- Skills: 1
Skill activations (1)
- ship at 06:47
Subagents dispatched (1)
- Explore · Explore cdBatchId in Scheme at 06:33
Subagent transcripts (1)
- agent-a1cf6a9617d1…: In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/gstack-t2, I need a thorough investigation… [Read×20, Bash×11, Grep×9, Glob×1]
New files created (2)
- gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
- gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts