omc (oh-my-claudecode)

Disclosure. omc is maintained by the benchmark author (Yeachan-Heo/oh-my-claudecode); see README caveat 12. The ranking below should be read in that light — the tool was evaluated on a benchmark its own author designed, and it still lands near the bottom. That is the expected direction for an honest self-report, but it does not cancel the conflict of interest. Every rubric item, task brief, and judge prompt was chosen by the same author; an external replication is the only thing that would remove the concern.

Identity

Repo: github.com/Yeachan-Heo/oh-my-claudecode
Plugin version at run: 4.13.0 (bugfix trial), 4.13.1 (refactor trial); feature trial predates the version-stamp hook
Package: oh-my-claude-sisyphus on npm
License: MIT, Copyright 2025 Yeachan Heo
Install mechanism: Claude Code plugin marketplace (claude plugin install oh-my-claudecode@omc)
Invocation in this benchmark: two-message flow — /oh-my-claudecode:omc-setup, then /oh-my-claudecode:autopilot <task>

Mechanism

omc is an orchestration layer. It ships a large surface area (agents, hooks, skills, a status line, a .omc/ state directory, and an experimental agent-teams flag, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) and routes work through specialized subagents via a top-level “autopilot” skill. The stated model is delegate-first: a planner decomposes the task, executor subagents write code, and reviewer/verifier subagents close the loop before a final commit. Session logs for the bench runs show Task-tool usage that varies by task. The feature and refactor t1 trials each persisted three subagent logs (omc-feature-t1: executor + Explore + verifier; omc-refactor-t1: a statusline-setup agent plus two Explore dispatches), while the bugfix t1 trial persisted one (a single Explore dispatch). See docs/analysis/trial-timelines/{feature,bugfix,refactor}/omc.md for the per-trial breakdown. Parent transcripts of 1.3 MB on the feature run and 1.0 MB on refactor are not unusual for this tool: it spends tokens coordinating.

Results

Task	Mean (/200)	95% CI	z	Per-task rank
Feature	116.7	[114.2, 119.4]	−0.108	7 / 9
Bugfix	158.4	[152.5, 164.3]	−0.599	8 / 9
Refactor	162.1	[159.8, 164.6]	+0.125	2 / 9
Combined z̄			−0.194	8 / 9

Rank-sum: 17 (tied with compound).
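Assuming z̄ is the unweighted mean of the three per-task z-scores and the rank-sum is the plain sum of the three per-task ranks (this section defines neither, but both reproduce the table), the summary rows work out as:

```latex
\bar{z} = \frac{z_{\text{feature}} + z_{\text{bugfix}} + z_{\text{refactor}}}{3}
        = \frac{-0.108 - 0.599 + 0.125}{3}
        = \frac{-0.582}{3} = -0.194,
\qquad
\text{rank-sum} = 7 + 8 + 2 = 17.
```

The bugfix z-score alone contributes roughly −0.2 to z̄, which is why a single weak task is enough to pull omc to eighth place.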

The refactor result is not a defense. Inter-judge Spearman ρ CIs straddle zero on the refactor task (PAPER §4.3, README caveat 3), so rank positions there are noise-dominated. Refactor self-preference for omc is +11.875, second-highest in the cohort, but PAPER §4.5 is explicit that this metric is not identified in a cohort where every executor runs on the same Anthropic base model. The honest read of the four rows is: omc ranks 8th overall, driven by a clearly below-cohort bugfix trial (z = −0.599) and a marginally below-cohort feature trial (z = −0.108). Because the refactor ranking is noise-dominated, its rank-2 cell does not contradict that picture.

Automated artifacts line up with the judge scores. Feature: 8 files changed, +547 / −2 lines on a brief that did not require that much surface; 10 ESLint errors survived. Bugfix: only 2 files touched / +44 lines, but 15 core-test failures on commit — the orchestration did not prevent a shallow fix that broke the existing suite. Refactor: 11 files / +109 / −44, 8 ESLint errors, tests green.
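The diff sizes quoted above can be recomputed from a finished clone with plain git. The helper below is a hypothetical sketch, not part of the benchmark scripts: the name diff_stats is illustrative, and it assumes the pre-trial commit is HEAD~2 because each timeline card reports two commits per trial. Pass a different base revision if the clone's history does not fit that assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the benchmark scripts): print the net
# change a trial made in its clone.
# Assumption: the pre-trial commit is HEAD~2, i.e. the commit before the
# trial's two commits; pass another revision as $2 if that does not hold.
diff_stats() {
  local clone_dir="$1"
  local base="${2:-HEAD~2}"
  # --shortstat prints one summary line, e.g.
  #   " 8 files changed, 547 insertions(+), 2 deletions(-)"
  git -C "$clone_dir" diff --shortstat "$base" HEAD
}
```

Run it as `diff_stats <path-to-clone>`. git omits the deletions clause when nothing was removed, so parse the line rather than splitting on fixed positions.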

What the mechanism did here

Two things stand out when the artifacts are read next to the session logs.

The ceremony entry point costs a full message. /oh-my-claudecode:omc-setup is a setup step — it bootstraps CLAUDE.md injection, hooks, and the .omc/ state directory — and no other tool in the cohort requires it. The bench runner has a dedicated OMC_TWO_MSG=1 branch precisely to accommodate this. In a single-session evaluation the setup message contributes no code; it is pure prelude. That alone does not explain an eight-point gap, but it sets the tone for the rest of the trace.
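The two-message flow can be pictured as a message list that gains a setup prelude when the omc switch is on. This is a hypothetical sketch, not the runner's code: the function name bench_messages and the single-message fallback for the other tools are assumptions; only the OMC_TWO_MSG=1 switch and the two slash commands come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two-message branch; NOT the bench runner's code.
# Prints, one per line, the messages a tool would receive for a task prompt.
bench_messages() {
  local task_prompt="$1"
  if [ "${OMC_TWO_MSG:-0}" = "1" ]; then
    # Message 1: setup only (CLAUDE.md injection, hooks, .omc/); no code.
    printf '%s\n' "/oh-my-claudecode:omc-setup"
    # Message 2: the actual task, routed through the autopilot skill.
    printf '%s\n' "/oh-my-claudecode:autopilot $task_prompt"
  else
    # Assumed single-message path for the other tools in the cohort.
    printf '%s\n' "$task_prompt"
  fi
}
```

In the OMC_TWO_MSG=1 branch the first line is pure prelude; only the second line carries the task.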

Orchestration did not substitute for scope discipline. The bugfix result is the most telling artifact in the set: omc produced a 2-file, 44-line patch that left 15 tests red. A simpler tool could have failed the same way, but the premise of an autopilot with multiple reviewer passes is that it catches exactly this class of mistake. On this trial it did not. The bugfix t1 session log records a single Explore dispatch and no verifier or reviewer subagent, so any verification that ran did so inside the parent session and let the regression through. On feature, the opposite failure mode shows: +547 lines for a PRD that the top-of-cohort tools closed in 100–300 lines of net change. Heavier does not correlate with higher rubric scores in this cohort: pure, the unadorned baseline, sits in the top-4 tie.

Honest read

omc is a plausible design. Its per-round σ is low (0.6 on bugfix, 0.6 on refactor, 1.8 on feature), meaning its scores are stable across judge rounds: re-running the judges would not move its mean much. The problem is not noise. It is that the orchestration surface, on these three tasks, did not produce code that judges preferred over a plain Claude session. The bench author's own tool underperforms a zero-ceremony baseline, and the fix is not to reweight the judges; it is to take the artifacts at face value and investigate why a multi-agent loop produced a shallow bugfix and an oversized feature.

Reproducing

TASK=feature ./scripts/create-clones.sh 1
TASK=feature ./scripts/manual-bench.sh omc 1
# Paste message 1 (/oh-my-claudecode:omc-setup), wait, then paste message 2
# (/oh-my-claudecode:autopilot <task>). Exit when the tool has committed.

Artifacts live under results/omc/t1/, results/bugfix/omc/t1/, results/refactor/omc/t1/. Session logs and subagent transcripts are retained verbatim; the .omc/ state directory is gitignored by the benchmark safety rules and does not leave the clone.

Observed in trial timelines

omc is the only tool that consistently produces multi-file session logs (feature t1: 6 separate *.jsonl files; refactor t1 and t2: 2 each), reflecting the autopilot pipeline's process boundaries. Feature t1 is by far the longest-running trial in the cohort: its timeline card records 2224 minutes of wall-clock time (about 37 hours; the 02:26 → 15:30 UTC stamps cross a day boundary).
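Session and subagent counts like these can be tallied straight from a trial's log directory. The helper below is a hypothetical sketch: the name count_session_logs and the directory layout are assumptions, subagent transcripts are taken to be the agent-*.jsonl files (matching the agent-<id> names in the cards), and every other *.jsonl file is counted as one parent session. The extractor behind the cards may count differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally a trial's transcripts.
# Assumptions (not documented here): logs live in <trial_dir>/session-logs/,
# subagent transcripts are the agent-*.jsonl files, and every other *.jsonl
# file is one parent session.
count_session_logs() {
  local logs="$1/session-logs"
  local sessions subagents
  # Parent sessions: any *.jsonl that is not a subagent transcript.
  sessions=$(find "$logs" -type f -name '*.jsonl' ! -name 'agent-*.jsonl' | wc -l | tr -d ' ')
  # Subagent transcripts, searched recursively.
  subagents=$(find "$logs" -type f -name 'agent-*.jsonl' | wc -l | tr -d ' ')
  printf 'sessions=%s subagents=%s\n' "$sessions" "$subagents"
}
```

`tr -d ' '` strips the padding that some `wc` implementations add, so the output is stable across Linux and macOS.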

Detail: see the per-trial timeline cards below and docs/analysis/trial-timelines/{feature,bugfix,refactor}/omc.md.
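Each timeline card buckets the trial's Bash calls into a command mix (tests, install/build, git ops, inspection, typecheck, lint/format, other). The extractor's rules are not given in this document, so the classifier below is a hypothetical illustration only: the function name and every pattern are guesses, meant to show the kind of first-match bucketing that produces such a mix.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for the "Bash command mix" buckets in each card.
# The extractor's real rules are not documented here; these patterns are
# illustrative guesses. The first matching branch wins, so order matters.
classify_cmd() {
  case "$1" in
    *jest*|*vitest*|*"npm test"*|*"npm run test"*|*"pnpm test"*) echo "tests" ;;
    *tsc*) echo "typecheck" ;;
    *eslint*|*prettier*|*"run lint"*) echo "lint/format" ;;
    "npm install"*|"npm ci"*|"pnpm install"*|*"run build"*) echo "install/build" ;;
    "git "*) echo "git ops" ;;
    ls*|cat*|head*|tail*|wc*|find*) echo "inspection" ;;
    *) echo "other" ;;
  esac
}
```

A catch-all "other" bucket is why that category dominates most cards: any command that matches none of the narrower patterns lands there.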

Trial timelines

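Per-trial event timelines, auto-extracted from each trial's session-logs/*.jsonl: skill activations, plugin/skill file reads, subagents dispatched, code mutations, Bash usage, and the final diff. The cards are grouped by task (feature, bugfix, refactor), one card per trial.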

Feature (4 trials)

t1 02:26 → 15:30 UTC · 2224 min

2 commits · 8 files · +547

“Use /oh-my-claudecode:autopilot to handle this task end-to-end. Let it run its full pipeline (spec → ralplan → execute → verify → commit) in this session. Read the PRD at docs/infina-product-docs/docs/core-products/td-cd…”

Agents: 3
New files: 2
Edits: 11
Bash: 66
Skills: 6
Sessions: 6
Todos: 5

Bash command mix · 66 calls

other 36
tests 14
install/build 8
git ops 4
inspection 3
typecheck 1

Skill activations (6)

hud — setup at 02:28
oh-my-claudecode:cancel at 02:37
oh-my-claudecode:autopilot — Implement Mode 2 CD Batch for TD-CD product end-to-end. Read PRD at docs/infina-product-docs/docs/core-products/td-cd/us… at 02:38

Subagents dispatched (3)

oh-my-claudecode:executor · Implement TD-CD Mode 2 at 02:39
Explore · Explore Mode 1 implementation at 15:08
oh-my-claudecode:verifier · Verify Mode 2 implementation at 15:24

Subagent transcripts (3)

agent-a05accac6ff2… — Implement TD-CD Mode 2 (CD Batch) end-to-end in /Users/randytran/Codes/ai-tool-benchmark/runs/omc-t1… [Bash×31, Read×10, Grep×10, Edit×5]
agent-a63e3c279ebf… — Verify the Mode 2 CD Batch implementation in this NestJS monorepo. The commit just landed at HEAD. K… [Bash×25, Read×10, ToolSearch×1, Monitor×1]
agent-a8659f479f53… — Thoroughly explore the Mode 1 CD implementation in this NestJS monorepo. I need to understand: 1. Th… [Bash×32, Read×28]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

t2 10:27 → 10:42 UTC · 14 min

2 commits · 14 files · +608

“<local-command-caveat>Caveat: The messages below were generated by the user while running local commands. DO NOT respond to these messages or otherwise consider them in your response unless the user explicitly asks you t…”

Agents: 2
New files: 8
Edits: 11
Bash: 29
Skills: 1
Sessions: 2

Bash command mix · 29 calls

tests 13
other 7
git ops 5
inspection 2
typecheck 1
lint/format 1

Skill activations (1)

oh-my-claudecode:cancel at 10:41

Subagents dispatched (2)

Explore · Read PRD document at 10:27
Explore · Explore Mode 1 implementation at 10:27

Subagent transcripts (2)

agent-a3c3f2709101… — Explore the Mode 1 TD-CD implementation thoroughly in this NestJS monorepo. I need to understand the… [Read×19, Bash×17]
agent-a579c004fbb2… — Read the file at this exact path and return its full contents: /Users/randytran/Codes/ai-tool-benchm… [Read×1]

New files created (7)

libs/core/src/domain/savings-cd/savings-cd-payment-schedule.service.spec.ts
libs/core/src/domain/savings-cd/savings-cd-payment-schedule.service.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
libs/core/src/model/savings-cd-payment-schedule.model.ts
libs/core/src/port/service/savings-cd-payment-schedule-service.port.ts
libs/core/src/storage/entity/savings-cd-payment-schedule.entity.ts

t3 04:35 → 05:07 UTC · 31 min

2 commits · 5 files · +718

Agents: 7
New files: 2
Edits: 5
Bash: 19
Skills: 1
Sessions: 3

Bash command mix · 19 calls

tests 10
other 5
git ops 3
lint/format 1

Skill activations (1)

oh-my-claudecode:cancel at 05:02

Subagents dispatched (7)

Explore · Explore Mode 1 implementation at 04:35
Explore · Study Mode 1 implementation at 04:41
Explore · Study savings-cd entities and ports at 04:42
Explore · Explore strategy port and tests at 04:49
Explore · Explore libs/savings-cd patterns at 04:49
oh-my-claudecode:architect · Architecture completeness review at 05:01
oh-my-claudecode:code-reviewer · Code quality review at 05:02

Subagent transcripts (7)

agent-a1a99f3955a2… — Explore the Mode 1 CD implementation thoroughly in the NestJS monorepo at /Users/randytran/Codes/ai-… [Bash×51, Read×20, Grep×3]
agent-a237bc2b52cf… — Review the Mode 2 CD Batch strategy implementation for functional completeness against the PRD requi… [Read×5]
agent-a56bb0d94809… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/omc-t3, I need to find and read several… [Read×19, Bash×18, Glob×15, Grep×8]
agent-a5ff031cce3c… — Review the Mode 2 CD Batch strategy implementation for code quality. Files to review: - /Users/randy… [Read×4, mcp__plugin_oh-my-claudecode_t__lsp_diagnostics×2, ToolSearch×1]
agent-a66c927275dc… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/omc-t3, explore the libs/savings-cd/ di… [Read×9]
agent-ab8afdbed00e… — Explore the TD-CD related entities, repositories, DTOs, and constants in this NestJS monorepo. I nee… [Read×34, Glob×12, Grep×9, Bash×7]
agent-af9355976725… — Thoroughly explore the Mode 1 TD-CD implementation in this NestJS monorepo. I need to understand: 1.… [Bash×42, Read×29]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

t4 09:08 → 09:30 UTC · 22 min

2 commits · 3 files · +432

“<command-message>oh-my-claudecode:omc-setup</command-message> <command-name>/oh-my-claudecode:omc-setup</command-name>”

Agents: 4
New files: 2
Edits: 13
Bash: 48
Skills: 2
Sessions: 2
Todos: 6

Bash command mix · 48 calls

other 21
tests 13
install/build 6
git ops 4
inspection 2
lint/format 2

Skill activations (2)

oh-my-claudecode:hud — setup at 09:09
oh-my-claudecode:cancel at 09:29

Subagents dispatched (4)

Explore · Study Mode 1 implementation patterns at 09:15
oh-my-claudecode:architect · Architecture validation at 09:26
oh-my-claudecode:security-reviewer · Security review at 09:26
oh-my-claudecode:code-reviewer · Code quality review at 09:27

Subagent transcripts (4)

agent-a1f2e65b660f… — I need to understand the existing Mode 1 TD-CD implementation patterns in this NestJS monorepo. Plea… [Bash×24, Read×16, Glob×5]
agent-a2bbcd0e51a2… — Security review the new Mode 2 TD-CD strategy implementation. Files to review: - libs/core/src/domai… [Read×2]
agent-a66771480dd5… — Review the Mode 2 TD-CD strategy implementation for functional completeness against the PRD requirem… [Read×3]
agent-a71a4d6f9058… — Code quality review of the new Mode 2 TD-CD strategy implementation. Files to review: - libs/core/sr… [Read×5, Bash×2, mcp__plugin_oh-my-claudecode_t__lsp_diagnostics×2, Grep×1]

New files created (2)

libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts

Bugfix (2 trials)

t1 16:08 → 16:17 UTC · 9 min

2 commits · 2 files · +44

“<command-message>oh-my-claudecode:omc-setup</command-message> <command-name>/oh-my-claudecode:omc-setup</command-name>”

Agents: 1
Edits: 3
Bash: 37
Skills: 1
Sessions: 2

Bash command mix · 37 calls

other 19
tests 6
install/build 5
inspection 3
git ops 2
typecheck 1
lint/format 1

Skill activations (1)

oh-my-claudecode:hud — setup at 16:09

Subagents dispatched (1)

Explore · Explore savings-cd codebase at 16:12

Subagent transcripts (1)

agent-ab807dd96f8a… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/omc-t1, I need to understand th… [Read×20, Grep×10, Glob×5]

t2 16:21 → 16:33 UTC · 11 min

2 commits · 4 files · +66

Agents: 1
Edits: 11
Bash: 43
Skills: 1
Sessions: 2

Bash command mix · 43 calls

other 22
tests 7
install/build 6
inspection 4
git ops 3
lint/format 1

Skill activations (1)

oh-my-claudecode:hud — setup at 16:14

Subagents dispatched (1)

Explore · Explore savings-cd codebase at 16:22

Subagent transcripts (1)

agent-a168697353b5… — In the repo at /Users/randytran/Codes/ai-tool-benchmark/runs/shp2376/omc-t2, I need to find code rel… [Read×17, Bash×12, Grep×12]

Refactor (2 trials)

t1 03:51 → 04:12 UTC · 21 min

2 commits · 11 files · +109

“<command-message>oh-my-claudecode:omc-setup</command-message> <command-name>/oh-my-claudecode:omc-setup</command-name>”

Agents: 3
New files: 1
Edits: 23
Bash: 37
Skills: 2
Sessions: 2

Bash command mix · 37 calls

other 22
install/build 6
git ops 4
tests 3
inspection 2

Skill activations (2)

oh-my-claudecode:hud — setup at 03:52
oh-my-claudecode:cancel at 04:12

Subagents dispatched (3)

statusline-setup · Configure HUD statusline at 03:52
Explore · Explore Scheme and TSSchemeSetting at 04:03
Explore · Explore ITDCDModeStrategy and CDBatch at 04:03

Subagent transcripts (3)

agent-aedb78dad12a… — Configure the statusLine in /Users/randytran/Codes/ai-tool-benchmark/config/shp2317/omc-t1/settings.… [Read×1, Edit×1]
agent-aee2d682a002… — I need to understand the current structure of Scheme and TSSchemeSetting in this NestJS/TypeORM mono… [Read×10, Grep×6, Glob×4, Bash×2]
agent-af5bca4639f1… — I need to understand the current structure of ITDCDModeStrategy and CDBatch in this NestJS/TypeORM m… [Read×11, Glob×5, Grep×4, Bash×3]

New files created (1)

omc-t1/libs/core/src/model/cd-batch-info.model.ts

t2 03:51 → 04:32 UTC · 40 min

2 commits · 11 files · +163

“<command-message>oh-my-claudecode:omc-setup</command-message> <command-name>/oh-my-claudecode:omc-setup</command-name>”

Agents: 3
New files: 3
Edits: 18
Bash: 35
Skills: 1
Sessions: 2

Bash command mix · 35 calls

other 21
install/build 4
tests 4
git ops 4
inspection 2

Skill activations (1)

oh-my-claudecode:cancel at 04:30

Subagents dispatched (3)

statusline-setup · Setup HUD statusline at 03:53
Explore · Explore Scheme and TSSchemeSetting at 04:14
Explore · Explore strategy port and CDBatch at 04:14

Subagent transcripts (3)

agent-a4a321d0ce12… — Thoroughness: very thorough In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/omc-t2, I need… [Bash×16, Glob×9, Read×8, Grep×2]
agent-ab5ae7b3104d… — Thoroughness: very thorough In /Users/randytran/Codes/ai-tool-benchmark/runs/shp2317/omc-t2, I need… [Read×13, Bash×5, Grep×4, Glob×4]
agent-acd4695cc5ee… — Set up the OMC HUD statusline for Claude Code. Install the HUD wrapper script and configure the stat… [Read×8, Edit×1]

New files created (3)

omc-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.spec.ts
omc-t2/libs/core/src/domain/savings-cd/td-cd-mode2.strategy.ts
omc-t2/libs/core/src/model/cd-batch-info.model.ts