# docs/ — Benchmark Documentation

Everything a reader needs to understand, reproduce, extend, or critique the benchmark.

This folder is organized by reader intent. The top-level `README.md` and `PAPER.md` are the author-facing entry points; this folder is the reader-facing reference set.
## Map
### guides/ — “How do I …?”

| File | Answers |
|---|---|
| `guides/quickstart.md` | How do I clone this repo and run one task end-to-end in ~10 minutes? |
| `guides/verification.md` | How do I independently verify a specific claim (e.g. “why is superpower ranked 9th on bugfix”)? |
| `guides/extending.md` | How do I add a new tool or a new judge to the panel? |
### methodology/ — “How was it done?”

| File | Covers |
|---|---|
| `methodology/pipeline.md` | End-to-end flow: clone → execute → judge → aggregate. Canonical reference for the pipeline, including the §9a pre-registered rerun protocol. |
| `methodology/interaction-protocol.md` | One-shot operator↔tool interaction card: what you may and may not say during a trial. |
| `methodology/tasks/bugfix-near-maturity.md` | The full bugfix task brief. (Feature and refactor briefs live in each task’s `_blind-eval/` root.) |
### tools/ — “What is each setup, actually?”

One profile per tool: version, upstream repo, mechanism (skills/hooks/prompts), what each trial loaded, and observed strengths and failure modes.
| File | Tool | Mechanism |
|---|---|---|
| `tools/README.md` | — | Comparison matrix across all 9 tools (leaderboard, mechanism taxonomy, per-task winners, failure modes) |
| `tools/bmad.md` | bmad | Role-based multi-agent (`/bmad-quick-dev`) |
| `tools/claudekit.md` | claudekit | Skill pack + hook gates (`/ck:cook --auto`) |
| `tools/compound.md` | compound | Multi-agent pipeline (`/lfg`) |
| `tools/ecc.md` | ecc | Plugin pack (`/everything-claude-code:plan`, `/build-fix`) |
| `tools/gstack.md` | gstack | Product-team simulator (`/autoplan`, `/investigate`, `/ship`) |
| `tools/mindful.md` | mindful | `CLAUDE.md` principles + PreToolUse hooks |
| `tools/omc.md` | omc | Meta-orchestrator (`/oh-my-claudecode:autopilot`) |
| `tools/pure.md` | pure | Vanilla Claude Code + `--permission-mode plan` |
| `tools/superpower.md` | superpower | Skill registry (`/superpowers:*`) |
### analysis/ — “What did we learn?”

Rank order alone is low-signal once the top-4 confidence intervals overlap. This folder holds the why.
| File | Covers |
|---|---|
| `analysis/README.md` | Index + caveats for the analysis set. |
| `analysis/why-top-4-tied.md` | What bmad / ecc / pure / gstack share, and why the ordering inside the tie is sampler noise. |
| `analysis/skill-and-hook-patterns.md` | Seven mechanism→outcome patterns drawn from the session transcripts, with non-findings listed explicitly. |
| `analysis/skill-content-effectiveness.md` | Wording-level analysis of the slash commands and skills that drove top-5 bugfix performance. |
| `analysis/trial-timelines/` | Per-trial event timelines auto-extracted from every `session-logs/*.jsonl`. Contains `README.md` (methodology), `aggregate.md`/`aggregate.json` (cross-tool statistics), and `feature/`, `bugfix/`, `refactor/` subfolders with one file per tool. Regenerate with `python3 scripts/extract-trial-timeline.py`. |
### preview/ — rendered markdown for the site

HTML renders of every docs markdown file, so readers can view them in-browser without leaving the landing page. Rebuilt whenever the source markdown changes.
## Landing page (served from this folder)

`index.html`, `styles.css`, `favicon.svg`, and `_headers` are served by Cloudflare Pages from the `docs/` root. The canonical public URL is https://claude-tool-benchmark.pages.dev/. (The repo slug is `claude-tool-benchmark`; the friendly title used on the site is `ai-tool-benchmark` — same project, two names.)
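For readers unfamiliar with it, Cloudflare Pages treats `_headers` as a plain-text rules file: a URL pattern on one line, followed by indented `Header: value` lines applied to matching responses. The rules below are only an illustrative sketch of the format, not this repo’s actual `_headers` contents:

```text
/*
  X-Content-Type-Options: nosniff
  Referrer-Policy: strict-origin-when-cross-origin
```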
## Reader routes

- “I just want the numbers.” → `../results/FINAL-REPORT-3JUDGE-20260422.md` (tabular) or `../PAPER.md` (narrative).
- “I want to verify a specific claim.” → `guides/verification.md`.
- “I want to re-run one trial.” → `guides/quickstart.md`, or `methodology/pipeline.md` §11 for the full command sequence.
- “I want to understand what each tool actually does.” → `tools/README.md` (comparison), then the per-tool profile that catches your eye.
- “I want the learnings, not the ranks.” → `analysis/README.md` → `why-top-4-tied.md`.
- “I want to add my own tool or judge.” → `guides/extending.md`.
- “I want to see what a specific tool did in a specific trial.” → `analysis/trial-timelines/<task>/<tool>.md`.
## Upstream references

The two foundational docs live at the repo root, not here, because they are the public interface of the project:

- `../README.md` — TL;DR + TOC + caveats headline.
- `../PAPER.md` — research-paper-style full report.