# docs/ — Benchmark Documentation

Everything a reader needs to understand, reproduce, extend, or critique the benchmark.

This folder is organized by reader intent. The top-level `README.md` and `PAPER.md` are the author-facing entry points; this folder is the reader-facing reference set.
## Map
### guides/ — “How do I …?”

| File | Answers |
|---|---|
| `guides/quickstart.md` | How do I clone this repo and run one task end-to-end in ~10 minutes? |
| `guides/verification.md` | How do I independently verify a specific claim (e.g. “why is superpower ranked 9th on bugfix”)? |
| `guides/extending.md` | How do I add a new tool or a new judge to the panel? |
### methodology/ — “How was it done?”

| File | Covers |
|---|---|
| `methodology/pipeline.md` | End-to-end flow: clone → execute → judge → aggregate. Canonical reference for the pipeline, including the §9a pre-registered rerun protocol. |
| `methodology/interaction-protocol.md` | One-shot operator↔tool interaction card: what you may and may not say during a trial. |
| `methodology/tasks/bugfix-near-maturity.md` | The full bugfix task brief. (Feature and refactor briefs live in each task’s `_blind-eval/` root.) |
### tools/ — “What is each setup, actually?”

One profile per tool: version, upstream repo, mechanism (skills/hooks/prompts), what each trial loaded, and observed strengths and failure modes.
| File | Tool | Mechanism |
|---|---|---|
| `tools/README.md` | — | Comparison matrix across all 9 tools (leaderboard, mechanism taxonomy, per-task winners, failure modes) |
| `tools/bmad.md` | bmad | Role-based multi-agent (`/bmad-quick-dev`) |
| `tools/claudekit.md` | claudekit | Skill pack + hook gates (`/ck:cook --auto`) |
| `tools/compound.md` | compound | Multi-agent pipeline (`/lfg`) |
| `tools/ecc.md` | ecc | Plugin pack (`/everything-claude-code:plan`, `/build-fix`) |
| `tools/gstack.md` | gstack | Product-team simulator (`/autoplan`, `/investigate`, `/ship`) |
| `tools/mindful.md` | mindful | `CLAUDE.md` principles + PreToolUse hooks |
| `tools/omc.md` | omc | Meta-orchestrator (`/oh-my-claudecode:autopilot`) |
| `tools/pure.md` | pure | Vanilla Claude Code + `--permission-mode plan` |
| `tools/superpower.md` | superpower | Skill registry (`/superpowers:*`) |
### analysis/ — “What did we learn?”

Rank order alone is low-signal once the top-4 confidence intervals overlap. This folder holds the why.
| File | Covers |
|---|---|
| `analysis/README.md` | Index + caveats for the analysis set. |
| `analysis/why-top-4-tied.md` | What bmad / ecc / pure / gstack share, and why the ordering inside the tie is sampler noise. |
| `analysis/skill-and-hook-patterns.md` | Seven mechanism→outcome patterns drawn from the session transcripts, with non-findings listed explicitly. |
| `analysis/skill-content-effectiveness.md` | Wording-level analysis of the slash commands and skills that drove top-5 bugfix performance. |
| `analysis/trial-timelines/` | Per-trial event timelines auto-extracted from every `session-logs/*.jsonl`. Contains `README.md` (methodology), `aggregate.md`/`aggregate.json` (cross-tool statistics), and `feature/`, `bugfix/`, `refactor/` subfolders with one file per tool. Regenerate with `python3 scripts/extract-trial-timeline.py`. |
### preview/ — rendered markdown for the site

HTML renders of every docs markdown file, so readers can view them in-browser without leaving the landing page. Rebuilt whenever the source markdown changes.
## Landing page (served from this folder)

`index.html`, `styles.css`, `favicon.svg`, and `_headers` are served by Cloudflare Pages from the `docs/` root. The canonical public URL is https://claude-tool-benchmark.pages.dev/. (The repo slug is `claude-tool-benchmark`; the friendly title used on the site is `ai-tool-benchmark` — same project, two names.)
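For readers unfamiliar with it, Cloudflare Pages treats `_headers` as a plain-text rules file: a URL pattern on one line, followed by indented `Header: value` lines applied to matching responses. The rules below are only an illustrative sketch of the format, not this repo’s actual `_headers` contents:

```text
/*
  X-Content-Type-Options: nosniff
  Referrer-Policy: strict-origin-when-cross-origin
```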
## Reader routes

- “I just want the numbers.” → `../results/FINAL-REPORT-3JUDGE-20260422.md` (tabular) or `../PAPER.md` (narrative).
- “I want to verify a specific claim.” → `guides/verification.md`.
- “I want to re-run one trial.” → `guides/quickstart.md`, or `methodology/pipeline.md` §11 for the full command sequence.
- “I want to understand what each tool actually does.” → `tools/README.md` (comparison), then the per-tool profile that catches your eye.
- “I want the learnings, not the ranks.” → `analysis/README.md` → `why-top-4-tied.md`.
- “I want to add my own tool or judge.” → `guides/extending.md`.
- “I want to see what a specific tool did in a specific trial.” → `analysis/trial-timelines/<task>/<tool>.md`.
## Upstream references

The two foundational docs live at the repo root, not here, because they are the public interface of the project:

- `../README.md` — TL;DR + TOC + caveats headline.
- `../PAPER.md` — research-paper-style full report.