gstack

Upstream: github.com/garrytan/gstack — MIT, by Garry Tan (YC). Pinned at the single-branch depth-1 clone taken on 2026-04-15; skill manifests declare version: 1.0.0, the repo’s VERSION file reports 0.17.0.0, and package.json reports 0.16.2.0. Runtime stack: Claude Code + Bun v1.0+ (plus Node on Windows). Marketed as “Garry’s Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.”

What it is

gstack is a skill pack — 37 directories under config/gstack-t1/skills/ — that frames Claude Code as a simulated product team: CEO (/plan-ceo-review), eng manager (/plan-eng-review), designer (/plan-design-review, /design-review), reviewer (/review), QA (/qa, /qa-only), security officer (/cso), release engineer (/ship, /land-and-deploy, /canary), plus debugging (/investigate), planning (/autoplan, /office-hours), retros, scope locks (/freeze, /guard), and a headless-browser binary. The installer (./setup --no-prefix) drops each skill into $CLAUDE_CONFIG_DIR/skills/, where Claude Code reads them as top-level slash commands.
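
As a rough illustration of how installed skills surface as slash commands, the sketch below lists one command per skill directory. It assumes each skill directory holds a SKILL.md file and that the command name equals the directory name; the helper name slash_commands is hypothetical and not part of gstack or Claude Code.

```python
from pathlib import Path
from typing import List


def slash_commands(skills_dir: Path) -> List[str]:
    """List one slash command per skill directory that contains a SKILL.md.

    Assumption: the command name is the directory name. This is an
    illustration, not how Claude Code is documented to resolve skills.
    """
    return sorted(
        f"/{child.name}"
        for child in skills_dir.iterdir()
        if child.is_dir() and (child / "SKILL.md").is_file()
    )
```

Pointed at a skills directory such as $CLAUDE_CONFIG_DIR/skills/, this would return entries like /ship or /investigate for every directory that carries a manifest.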

Every skill begins with a long # Preamble (run first) Bash block that probes ~/.gstack/ for telemetry, proactive-suggest, routing, and vendoring markers, optionally appends a “Skill routing” block to the project CLAUDE.md, and (when proactive is on) auto-invokes peer skills from conversational triggers — "why is this broken" routes to /investigate, "ship it" to /ship, "architecture review" to /plan-eng-review, and so on. Each skill’s description: frontmatter explicitly names its trigger phrases; that prose is what Claude Code matches against the user turn.
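
The trigger-phrase routing described above can be pictured as a lookup from phrase to skill. The sketch below is only an illustration built from the three examples quoted in the text; the real matching is done by Claude Code against each skill's description: frontmatter, and the substring rule and the names TRIGGERS and route are assumptions.

```python
from typing import Optional

# Trigger phrases quoted in the text, mapped to the skill they route to.
# Hypothetical table: the actual phrases live in each skill's frontmatter.
TRIGGERS = {
    "why is this broken": "/investigate",
    "ship it": "/ship",
    "architecture review": "/plan-eng-review",
}


def route(user_turn: str) -> Optional[str]:
    """Return the first skill whose trigger phrase appears in the user turn."""
    text = user_turn.lower()
    for phrase, skill in TRIGGERS.items():
        if phrase in text:
            return skill
    return None  # no trigger matched; no skill is auto-invoked
```

A plain substring match is far cruder than a model reading the description prose, but it shows why a task prompt that happens to contain a trigger phrase can activate a skill before the user types any slash command.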

Benchmark configuration

How the entry points behave

/investigate is a five-phase debugger enforcing what the skill calls the Iron Law: no fixes without root cause. Phase 1 is root-cause investigation with a regression-diff check; Phase 2 is pattern analysis with an optional sanitized web search; Phase 3 is explicit hypothesis confirmation via temporary logs/assertions before any edit; Phase 4 is the minimal fix; Phase 5 writes a regression test and appends a capture-learnings entry to ~/.gstack/projects/<slug>/learnings.jsonl. PreToolUse hooks on Edit/Write call freeze/bin/check-freeze.sh to enforce a scope lock.
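
One way to picture the Iron Law is as an ordered phase gate: edits stay blocked until the hypothesis-confirmation phase has been completed. The sketch below is an assumption-laden model of that ordering, not the skill's implementation (the skill is prompt text plus a Bash hook); the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Phase names paraphrased from the description above.
PHASES = [
    "root-cause investigation",
    "pattern analysis",
    "hypothesis confirmation",
    "minimal fix",
    "regression test and learnings",
]


@dataclass
class Investigation:
    completed: List[str] = field(default_factory=list)

    def complete(self, phase: str) -> None:
        """Mark a phase done, refusing to skip ahead."""
        if len(self.completed) >= len(PHASES):
            raise RuntimeError("all phases are already complete")
        expected = PHASES[len(self.completed)]
        if phase != expected:
            raise RuntimeError(f"cannot run {phase!r} before {expected!r}")
        self.completed.append(phase)

    def may_edit(self) -> bool:
        """Iron Law: no fix until the hypothesis has been confirmed."""
        return "hypothesis confirmation" in self.completed
```

In this model a call to complete("minimal fix") on a fresh Investigation raises, which mirrors the rule that a fix cannot precede root cause.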

/ship is where gstack’s “eng-review gate” lives — the reason gstack is excluded from plan mode. It runs: Step 0 platform detection → Step 1 pre-flight + Review Readiness Dashboard (tallies prior /plan-ceo-review, /codex review, /plan-eng-review, /plan-design-review, /plan-devex-review runs from ~/.gstack/ logs; verdict is NO REVIEWS YET until they’ve run) → Step 2 merge base branch → Step 2.5 test-framework bootstrap → Step 3 tests, with a Test Failure Ownership Triage that classifies failures as in-branch vs pre-existing and, on collaborative repos, can open and assign a GitHub issue via gh. Only after review+tests+eval-suites clear does it bump VERSION, update CHANGELOG, commit, push, and open the PR.
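
The Test Failure Ownership Triage can be sketched as a set comparison. The version below assumes a failure counts as pre-existing when the same test also fails on the base branch; the source does not say how /ship actually decides, so treat the rule and the function name triage as hypothetical.

```python
from typing import Dict, List, Set


def triage(branch_failures: Set[str], base_failures: Set[str]) -> Dict[str, List[str]]:
    """Split failing tests into in-branch vs pre-existing.

    Assumption: a test that also fails on the base branch is pre-existing;
    any other failure is attributed to the current branch.
    """
    return {
        "in_branch": sorted(branch_failures - base_failures),
        "pre_existing": sorted(branch_failures & base_failures),
    }
```

Under this rule only the in_branch list would block shipping, while pre_existing failures are the candidates for a GitHub issue on collaborative repos.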

Results

Rank 4 / 9 overall, z̄ = +0.071, in a top-4 tie with bmad / pure / ecc. Per-task: feature rank 2 (z = +0.158, mean 123.65), bugfix rank 5 (z = +0.165), refactor rank 8 (z = −0.111). sigma_round on feature is 3.34, the highest single-task round-to-round spread in the cohort. Self-preference is negative on feature (−3.22) and strongly positive on bugfix (+9.0) and refactor (+8.5): gstack under-rates its own feature work relative to the panel.

Auto-metrics:
  • feature: 11 files / +772 / −4 / 94 tests pass / 10 ESLint errors
  • bugfix: 2 files / +185 / 0 / 15 test failures / 10 ESLint errors
  • refactor: 9 files / +121 / −71 / 0 test failures / 7 ESLint errors

The feature session shows Claude auto-routing to /autoplan from the task prose before any user-typed slash command (50 Bash / 18 Read / 7 Edit / 5 Write / 1 Skill invocation across 239 log lines).
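
The overall z̄ is consistent with an unweighted mean of the three per-task z-scores. The report does not state the aggregation rule, so the equal weighting below is an assumption, and mean_z is a hypothetical helper.

```python
from typing import Dict

# Per-task z-scores as reported in the results above.
Z_SCORES: Dict[str, float] = {"feature": 0.158, "bugfix": 0.165, "refactor": -0.111}


def mean_z(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-task z-scores (assumed aggregation rule)."""
    return sum(scores.values()) / len(scores)


print(f"{mean_z(Z_SCORES):+.3f}")  # prints +0.071
```

The sum is 0.212 across three tasks, which rounds to the reported +0.071.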

Where it fits

gstack trades blank-prompt flexibility for a fixed organisational metaphor. The ceiling is high when the task lines up with a role gstack already has — greenfield feature work, PR shipping — and lower when the bottleneck is exploratory reasoning the preamble didn’t anticipate. On this benchmark the eng-review gate was enough to tie with the leaders on feature work but didn’t rescue the refactor run.

Observed in trial timelines

Skill activations land at mean 0.5 (feature), 1.0 (bugfix), 1.0 (refactor) — the auto-route to /autoplan from task prose is visible in feature t1 but does not always fire. Subagent dispatch is light (mean 2.2 / 1.0 / 1.0) and Bash dominates on refactor (mean 36) — the /ship test+lint+gate sequence is Bash-heavy.
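
These means can be checked against the per-trial cards further down. The sketch below assumes the first four cards are the feature trials, the next two bugfix, and the last two refactor, and that a card with no Agents or Skills line counts as zero; both are inferences from the card contents, not something the page states.

```python
from statistics import mean
from typing import Dict, List

# Per-trial counts transcribed from the timeline cards (task grouping assumed).
TRIALS: Dict[str, Dict[str, List[int]]] = {
    "feature": {"skills": [1, 1, 0, 0], "agents": [0, 2, 4, 3], "bash": [50, 16, 26, 26]},
    "bugfix": {"skills": [1, 1], "agents": [1, 1], "bash": [16, 22]},
    "refactor": {"skills": [1, 1], "agents": [1, 1], "bash": [42, 30]},
}


def task_means(trials: Dict[str, Dict[str, List[int]]]) -> Dict[str, Dict[str, float]]:
    """Average each counter over the trials of a task."""
    return {
        task: {name: mean(values) for name, values in counters.items()}
        for task, counters in trials.items()
    }
```

Under those assumptions the feature subagent mean comes out at 2.25, which the text reports as 2.2, and the refactor Bash mean is 36.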

Detail: see the per-trial timeline files linked below.

Trial timelines

Per-trial event timelines auto-extracted from session-logs/*.jsonl — skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash usage:

Feature t1 · 03:40 → 03:53 UTC · 12 min
2 commits · 11 files · +772

“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”

  • New files 5
  • Edits 7
  • Bash 50
  • Skills 1
Bash command mix · 50 calls
  • inspection 14
  • tests 14
  • other 13
  • lint/format 5
  • git ops 3
  • typecheck 1
Skill activations (1)
  • autoplan — Implement Mode 2 CD Batch for TD-CD end-to-end. PRD: docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] … at 03:40
New files created (5)
  • libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2-pricing.util.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
  • libs/core/src/port/service/td-cd-mode2-batch-resolver.port.ts

Feature t2 · 14:32 → 15:06 UTC · 34 min
2 commits · 10 files · +617

“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”

  • Agents 2
  • New files 6
  • Edits 4
  • Bash 16
  • Skills 1
  • Todos 6
Bash command mix · 16 calls
  • other 7
  • tests 4
  • typecheck 2
  • inspection 1
  • lint/format 1
  • git ops 1
Skill activations (1)
  • autoplan at 14:32
Subagents dispatched (2)
  • Explore · Read TD-CD Mode 2 PRD at 14:33
  • Explore · Map Mode 1 CD implementation at 14:33
Subagent transcripts (2)
  • agent-ab5886c56785… — I need a detailed map of the existing TD-CD Mode 1 implementation in this NX monorepo at `/Users/ran… [Read×17, Bash×11, Grep×4]
  • agent-ae0ffd7627d3… — Read the PRD at `/Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t2/docs/infina-product-docs/do… [Read×1]
New files created (6)
  • libs/core/src/domain/savings-cd/cd-mode.spec.ts
  • libs/core/src/domain/savings-cd/cd-mode.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.util.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.util.ts

Feature t3 · 04:37 → 05:11 UTC · 34 min
2 commits · 4 files · +406

“Use gstack's canonical feature-build sequence: run /autoplan first to produce a reviewed plan (CEO + design + eng + DX review pipeline with auto-decisions), then implement the plan, then run /ship to review the diff, run…”

  • Agents 4
  • New files 4
  • Edits 4
  • Bash 26
  • Skill files 1
  • Sessions 2
Bash command mix · 26 calls
  • tests 10
  • other 9
  • git ops 5
  • inspection 2
Plugin/skill files read (1 unique)
  • CLAUDE.md
Subagents dispatched (4)
  • Explore · Explore Mode 1 CD implementation at 04:37
  • Explore · Explore Mode 1 TD-CD strategy at 04:48
  • Explore · Explore Mode 1 entities & types at 04:48
  • Explore · Explore Mode 1 tests & PRD at 04:49
Subagent transcripts (4)
  • agent-a0095fdbb95c… — Explore the TD-CD data model and types in this NestJS monorepo. I need to understand: 1. All TypeORM… [Read×24, Bash×19, Glob×1]
  • agent-aabbcff25891… — Thoroughly explore the Mode 1 CD implementation in this NestJS monorepo. I need to understand: 1. Th… [Glob×2, Bash×1]
  • agent-ab5e41b191f5… — Explore the TD-CD Mode 1 implementation in this NestJS monorepo. I need to understand: 1. The ITDCDM… [Bash×30, Read×17, Grep×3, Glob×2]
  • agent-af7e40d8465e… — Explore test patterns and the Mode 1 PRD for TD-CD in this NestJS monorepo. I need: 1. Test files fo… [Read×13, Grep×7, Bash×6, Glob×4]
New files created (3)
  • /Users/randytran/Codes/ai-tool-benchmark/config/gstack-t3/plans/synchronous-splashing-mccarthy.md
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Feature t4 · 09:02 → 09:20 UTC · 17 min
2 commits · 8 files · +575

“<command-message>office-hours</command-message> <command-name>/office-hours</command-name> <command-args>Read the PRD at docs/infina-product-docs/docs/core-products/td-cd/user-logic/[PRD] [TD-CD] User stories - Mode 2 CD…”

  • Agents 3
  • New files 2
  • Edits 10
  • Bash 26
  • Sessions 2
Bash command mix · 26 calls
  • tests 12
  • other 11
  • git ops 2
  • inspection 1
Subagents dispatched (3)
  • Explore · Explore CD batch models and infra at 09:04
  • Explore · Read strategy port and related files at 08:57
  • Explore · Read test files for Mode 1 at 08:57
Subagent transcripts (3)
  • agent-a5cdb48597dc… — In /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the CD batch-relate… [Read×25, Grep×9, Glob×8, Bash×1]
  • agent-a8051acb63ee… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, find and read the test files… [Bash×8, Read×6]
  • agent-aca8082e09e9… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/gstack-t4, I need to understand the ITD… [Read×13, Glob×11, Bash×4, Grep×3]
New files created (2)
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Bugfix t1 · 16:44 → 16:57 UTC · 12 min
2 commits · 2 files · +185

“<command-message>investigate</command-message> <command-name>/investigate</command-name> <command-args>A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected…”

  • Agents 1
  • New files 1
  • Edits 2
  • Bash 16
  • Skills 1
Bash command mix · 16 calls
  • git ops 6
  • tests 5
  • other 4
  • lint/format 1
Skill activations (1)
  • ship at 16:55
Subagents dispatched (1)
  • Explore · Explore savings-cd codebase at 16:45
Subagent transcripts (1)
  • agent-abd288a49184… — Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t1 thoroughly.… [Read×21, Bash×10, Grep×1]
New files created (1)
  • gstack-t1/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

Bugfix t2 · 16:17 → 16:30 UTC · 13 min
2 commits · 2 files · +183

“<command-message>investigate</command-message> <command-name>/investigate</command-name> <command-args>A QA bug report was filed. Read docs/benchmark/TASK.md for the full report: reproduction steps, observed vs expected…”

  • Agents 1
  • New files 1
  • Edits 4
  • Bash 22
  • Skills 1
Bash command mix · 22 calls
  • git ops 8
  • other 7
  • tests 5
  • inspection 2
Skill activations (1)
  • ship at 16:29
Subagents dispatched (1)
  • Explore · Explore savings-cd codebase at 16:17
Subagent transcripts (1)
  • agent-aba563046c48… — Explore the codebase at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/gstack-t2 thoroughly.… [Read×19, Bash×14, Grep×3, Glob×2]
New files created (1)
  • gstack-t2/libs/savings-cd/src/domain/savings-cd-batch-data-source.spec.ts

Refactor t1 · 03:56 → 04:23 UTC · 27 min
2 commits · 9 files · +121

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

  • Agents 1
  • New files 3
  • Edits 17
  • Bash 42
  • Skills 1
  • Skill files 2
Bash command mix · 42 calls
  • tests 20
  • git ops 10
  • other 9
  • inspection 3
Skill activations (1)
  • ship at 04:15
Plugin/skill files read (2 unique)
  • /Users/randytran/.claude/skills/gstack/ship/SKILL.md
  • /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/gstack-t1/skills/gstack/ship/SKILL.md
Subagents dispatched (1)
  • Explore · Explore refactor targets at 03:56
Subagent transcripts (1)
  • agent-af6e019dbddf… — I need to understand the current state of a codebase for a refactor. The working directory is /Users… [Read×17, Glob×10, Grep×5, Bash×3]
New files created (3)
  • gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • gstack-t1/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
  • gstack-t1/libs/core/src/port/service/td-cd-mode-strategy.port.ts

Refactor t2 · 06:33 → 06:50 UTC · 16 min
2 commits · 10 files · +107

“A refactor has been requested by the tech lead. Read docs/benchmark/TASK.md for the full brief: the two design seams being cleaned up, the behavioral "done" criteria, the non-goals, and the judgment calls you're trusted…”

  • Agents 1
  • New files 2
  • Edits 14
  • Bash 30
  • Skills 1
Bash command mix · 30 calls
  • tests 14
  • other 7
  • git ops 7
  • inspection 2
Skill activations (1)
  • ship at 06:47
Subagents dispatched (1)
  • Explore · Explore cdBatchId in Scheme at 06:33
Subagent transcripts (1)
  • agent-a1cf6a9617d1… — In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/gstack-t2, I need a thorough investigation… [Read×20, Bash×11, Grep×9, Glob×1]
New files created (2)
  • gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
  • gstack-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
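
The per-trial tool counts in the cards above come from session-logs/*.jsonl. A tally along these lines might look like the sketch below. It assumes each log line is a JSON object whose message.content list holds items of type tool_use with a name field; that schema is an assumption about the log format, and tally_tool_calls is a hypothetical helper rather than the extractor actually used.

```python
import json
from collections import Counter
from typing import Iterable


def tally_tool_calls(lines: Iterable[str]) -> Counter:
    """Count tool invocations by name across JSONL session-log lines.

    Assumed schema (not confirmed by the source):
    {"message": {"content": [{"type": "tool_use", "name": "Bash", ...}]}}
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_use":
                counts[item.get("name", "unknown")] += 1
    return counts
```

Passing an open log file to tally_tool_calls would yield a Counter keyed by tool name, comparable in shape to the Bash / Read / Edit / Write figures quoted in the results.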