refactor: collapse legacy command bodies into skills

2026-05-20 07:43:07 +08:00 · 2026-04-01 02:25:42 -07:00
parent 6833454778
commit 975100db55
19 changed files with 630 additions and 702 deletions
--- a/commands/eval.md
+++ b/commands/eval.md
@@ -1,120 +1,23 @@
-# Eval Command
+---
+description: Legacy slash-entry shim for the eval-harness skill. Prefer the skill directly.
+---

-Manage eval-driven development workflow.
+# Eval Command (Legacy Shim)

-## Usage
+Use this only if you still invoke `/eval`. The maintained workflow lives in `skills/eval-harness/SKILL.md`.

-`/eval [define|check|report|list] [feature-name]`
+## Canonical Surface

-## Define Evals
-
-`/eval define feature-name`
-
-Create a new eval definition:
-
-1. Create `.claude/evals/feature-name.md` with template:
-
-```markdown
-## EVAL: feature-name
-Created: $(date)
-
-### Capability Evals
- [ ] [Description of capability 1]
- [ ] [Description of capability 2]
-
-### Regression Evals
- [ ] [Existing behavior 1 still works]
- [ ] [Existing behavior 2 still works]
-
-### Success Criteria
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
-```
-
-2. Prompt user to fill in specific criteria
-
-## Check Evals
-
-`/eval check feature-name`
-
-Run evals for a feature:
-
-1. Read eval definition from `.claude/evals/feature-name.md`
-2. For each capability eval:
-   - Attempt to verify criterion
-   - Record PASS/FAIL
-   - Log attempt in `.claude/evals/feature-name.log`
-3. For each regression eval:
-   - Run relevant tests
-   - Compare against baseline
-   - Record PASS/FAIL
-4. Report current status:
-
-```
-EVAL CHECK: feature-name
-========================
-Capability: X/Y passing
-Regression: X/Y passing
-Status: IN PROGRESS / READY
-```
-
-## Report Evals
-
-`/eval report feature-name`
-
-Generate comprehensive eval report:
-
-```
-EVAL REPORT: feature-name
-=========================
-Generated: $(date)
-
-CAPABILITY EVALS
----------------
-[eval-1]: PASS (pass@1)
-[eval-2]: PASS (pass@2) - required retry
-[eval-3]: FAIL - see notes
-
-REGRESSION EVALS
----------------
-[test-1]: PASS
-[test-2]: PASS
-[test-3]: PASS
-
-METRICS
-------
-Capability pass@1: 67%
-Capability pass@3: 100%
-Regression pass^3: 100%
-
-NOTES
-----
-[Any issues, edge cases, or observations]
-
-RECOMMENDATION
--------------
-[SHIP / NEEDS WORK / BLOCKED]
-```
-
-## List Evals
-
-`/eval list`
-
-Show all eval definitions:
-
-```
-EVAL DEFINITIONS
-================
-feature-auth      [3/5 passing] IN PROGRESS
-feature-search    [5/5 passing] READY
-feature-export    [0/4 passing] NOT STARTED
-```
+- Prefer the `eval-harness` skill directly.
+- Keep this file only as a compatibility entry point.

 ## Arguments

-$ARGUMENTS:
- `define <name>` - Create new eval definition
- `check <name>` - Run and check evals
- `report <name>` - Generate full report
- `list` - Show all evals
- `clean` - Remove old eval logs (keeps last 10 runs)
+`$ARGUMENTS`
+
+## Delegation
+
+Apply the `eval-harness` skill.
+- Support the same user intents as before: define, check, report, list, and cleanup.
+- Keep evals capability-first, regression-backed, and evidence-based.
+- Use the skill as the canonical evaluator instead of maintaining a separate command-specific playbook.