docs/ — Benchmark Documentation

Everything a reader needs to understand, reproduce, extend, or critique the benchmark.

This folder is organized by reader intent. The top-level README.md and PAPER.md are the author-facing entry points; this folder is the reader-facing reference set.


Map

guides/ — “How do I …?”

| File | Answers |
| --- | --- |
| guides/quickstart.md | How do I clone this repo and run one task end-to-end in ~10 minutes? |
| guides/verification.md | How do I independently verify a specific claim (e.g. "why is superpower ranked 9th on bugfix")? |
| guides/extending.md | How do I add a new tool or a new judge to the panel? |

methodology/ — “How was it done?”

| File | Covers |
| --- | --- |
| methodology/pipeline.md | End-to-end flow: clone → execute → judge → aggregate. The canonical reference for the pipeline, including the §9a pre-registered rerun protocol. |
| methodology/interaction-protocol.md | One-shot operator↔tool interaction card: what you may and may not say during a trial. |
| methodology/tasks/bugfix-near-maturity.md | The full bugfix task brief. (Feature and refactor briefs live in each task's _blind-eval/ root.) |

tools/ — “What is each setup, actually?”

One profile per tool: version, upstream repo, mechanism (skills/hooks/prompts), what each trial loaded, and observed strengths and failure modes.

| File | Tool | Mechanism |
| --- | --- | --- |
| tools/README.md | — | Comparison matrix across all 9 tools (leaderboard, mechanism taxonomy, per-task winners, failure modes) |
| tools/bmad.md | bmad | Role-based multi-agent (/bmad-quick-dev) |
| tools/claudekit.md | claudekit | Skill pack + hook gates (/ck:cook --auto) |
| tools/compound.md | compound | Multi-agent pipeline (/lfg) |
| tools/ecc.md | ecc | Plugin pack (/everything-claude-code:plan, /build-fix) |
| tools/gstack.md | gstack | Product-team simulator (/autoplan, /investigate, /ship) |
| tools/mindful.md | mindful | CLAUDE.md principles + PreToolUse hooks |
| tools/omc.md | omc | Meta-orchestrator (/oh-my-claudecode:autopilot) |
| tools/pure.md | pure | Vanilla Claude Code + --permission-mode plan |
| tools/superpower.md | superpower | Skill registry (/superpowers:*) |

analysis/ — “What did we learn?”

Rank order alone carries little signal once the top-4 confidence intervals overlap. This folder holds the why.

| File | Covers |
| --- | --- |
| analysis/README.md | Index and caveats for the analysis set. |
| analysis/why-top-4-tied.md | What bmad / ecc / pure / gstack share, and why the ordering inside the tie is sampler noise. |
| analysis/skill-and-hook-patterns.md | Seven mechanism→outcome patterns drawn from the session transcripts, with non-findings listed explicitly. |
| analysis/skill-content-effectiveness.md | Wording-level analysis of the slash commands and skills that drove top-5 bugfix performance. |
| analysis/trial-timelines/ | Per-trial event timelines auto-extracted from every session-logs/*.jsonl file. Contains README.md (methodology), aggregate.md / aggregate.json (cross-tool statistics), and feature/, bugfix/, refactor/ subfolders with one file per tool. Regenerate with python3 scripts/extract-trial-timeline.py. |
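
To give a feel for what the regeneration step does, here is a minimal sketch of a timeline extractor in the spirit of scripts/extract-trial-timeline.py. It assumes each line of a session-logs/*.jsonl file is a standalone JSON object; the field names "ts" and "type" are illustrative assumptions, not the real log schema.

```python
import json

def extract_timeline(jsonl_text: str) -> list[tuple[float, str]]:
    """Return (timestamp, event-type) pairs in chronological order.

    Sketch only: the actual script reads files from session-logs/ and
    writes per-tool markdown; field names here are assumptions.
    """
    events = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        events.append((record["ts"], record["type"]))
    return sorted(events)

sample = '{"ts": 2, "type": "tool_use"}\n{"ts": 1, "type": "user_prompt"}'
print(extract_timeline(sample))  # [(1, 'user_prompt'), (2, 'tool_use')]
```

The real script additionally aggregates across tools into aggregate.md/aggregate.json; this sketch only shows the per-file parse-and-sort step.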

preview/ — rendered markdown for the site

HTML renders of every docs markdown file so readers can view them in-browser without leaving the landing page. Rebuilt whenever the source markdown changes.

Landing page (served from this folder)

index.html, styles.css, favicon.svg, _headers are served by Cloudflare Pages from the docs/ root. The canonical public URL is https://claude-tool-benchmark.pages.dev/. (The repo slug is claude-tool-benchmark; the friendly title used on the site is ai-tool-benchmark — same project, two names.)
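
For readers unfamiliar with Cloudflare Pages: the _headers file sets per-path HTTP response headers at deploy time. A representative sketch of the format (these rules are illustrative, not the actual contents of this repo's _headers):

```
/*
  X-Content-Type-Options: nosniff
  Referrer-Policy: strict-origin-when-cross-origin
```

Each unindented line is a URL pattern; the indented lines beneath it are the headers applied to responses matching that pattern.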


Reader routes


Upstream references

The two foundational docs live at the repo root, not here, because they’re the public interface of the project: