feat(commands): improve learn-eval with checklist-based holistic verdict

Replace the 5-dimension numeric scoring rubric with a checklist + holistic
verdict system (Save / Improve then Save / Absorb into [X] / Drop).

Key improvements:
- Explicit pre-save checklist: grep skills/ for duplicates, check MEMORY.md,
  consider appending to existing skills, confirm reusability
- 4-way verdict instead of binary save/don't-save: adds "Absorb into [X]"
  to prevent skill file proliferation, and "Improve then Save" for iterative
  refinement
- Verdict-specific confirmation flows tailored to each outcome
- Design rationale explaining why holistic judgment outperforms numeric
  scoring with modern frontier models
Tatsuya Shimomoto
2026-03-08 19:35:28 +09:00
committed by Affaan Mustafa
parent 4fa817cd7d
commit 32e11b8701

@@ -1,10 +1,10 @@
 ---
-description: Extract reusable patterns from the session, self-evaluate quality before saving, and determine the right save location (Global vs Project).
+description: "Extract reusable patterns from the session, self-evaluate quality before saving, and determine the right save location (Global vs Project)."
 ---
 # /learn-eval - Extract, Evaluate, then Save
-Extends `/learn` with a quality gate and save-location decision before writing any skill file.
+Extends `/learn` with a quality gate, save-location decision, and knowledge-placement awareness before writing any skill file.
 ## What to Extract
@@ -51,36 +51,60 @@ origin: auto-extracted
 [Trigger conditions]
 ```
-5. **Self-evaluate before saving** using this rubric:
+5. **Quality gate — Checklist + Holistic verdict**
-| Dimension | 1 | 3 | 5 |
-|-----------|---|---|---|
-| Specificity | Abstract principles only, no code examples | Representative code example present | Rich examples covering all usage patterns |
-| Actionability | Unclear what to do | Main steps are understandable | Immediately actionable, edge cases covered |
-| Scope Fit | Too broad or too narrow | Mostly appropriate, some boundary ambiguity | Name, trigger, and content perfectly aligned |
-| Non-redundancy | Nearly identical to another skill | Some overlap but unique perspective exists | Completely unique value |
-| Coverage | Covers only a fraction of the target task | Main cases covered, common variants missing | Main cases, edge cases, and pitfalls covered |
+#### 5a. Required checklist (verify by actually reading files)
-- Score each dimension 1-5
-- If any dimension scores 1-2, improve the draft and re-score until all dimensions are ≥ 3
-- Show the user the scores table and the final draft
+Execute **all** of the following before evaluating the draft:
-6. Ask user to confirm:
-- Show: proposed save path + scores table + final draft
-- Wait for explicit confirmation before writing
+- [ ] Grep `~/.claude/skills/` by keyword to check for content overlap
+- [ ] Check MEMORY.md (both project and global) for overlap
+- [ ] Consider whether appending to an existing skill would suffice
+- [ ] Confirm this is a reusable pattern, not a one-off fix
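The first checklist item (grepping the skills directory by keyword) could be approximated as below. This is only a sketch under stated assumptions: the function name `find_overlaps`, the `*.md` file pattern, and plain case-insensitive substring matching are all hypothetical choices, not something the command file prescribes. A hit only flags a file worth reading; the overlap judgment itself stays with the reader.

```python
from pathlib import Path


def find_overlaps(skills_dir: Path, keywords: list[str]) -> dict[str, list[str]]:
    """Map each skill file (path relative to skills_dir) to the keywords it mentions.

    Hypothetical helper: matching is a case-insensitive substring test,
    and only files ending in .md are scanned (an assumption).
    """
    hits: dict[str, list[str]] = {}
    if not skills_dir.is_dir():
        return hits  # nothing to compare against
    for path in sorted(skills_dir.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        if matched:
            hits[str(path.relative_to(skills_dir))] = matched
    return hits


if __name__ == "__main__":
    # Example keywords are placeholders for terms taken from the draft skill.
    overlaps = find_overlaps(Path.home() / ".claude" / "skills", ["retry", "backoff"])
    for name, kws in overlaps.items():
        print(f"{name}: {', '.join(kws)}")
```

The same function could be pointed at a directory containing MEMORY.md for the second checklist item, assuming that file is also plain Markdown.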
-7. Save to the determined location
+#### 5b. Holistic verdict
-## Output Format for Step 5 (scores table)
+Synthesize the checklist results and draft quality, then choose **one** of the following:
-| Dimension | Score | Rationale |
-|-----------|-------|-----------|
-| Specificity | N/5 | ... |
-| Actionability | N/5 | ... |
-| Scope Fit | N/5 | ... |
-| Non-redundancy | N/5 | ... |
-| Coverage | N/5 | ... |
-| **Total** | **N/25** | |
+| Verdict | Meaning | Next Action |
+|---------|---------|-------------|
+| **Save** | Unique, specific, well-scoped | Proceed to Step 6 |
+| **Improve then Save** | Valuable but needs refinement | List improvements → revise → re-evaluate (once) |
+| **Absorb into [X]** | Should be appended to an existing skill | Show target skill and additions → Step 6 |
+| **Drop** | Trivial, redundant, or too abstract | Explain reasoning and stop |
+**Guideline dimensions** (informing the verdict, not scored):
+- **Specificity & Actionability**: Contains code examples or commands that are immediately usable
+- **Scope Fit**: Name, trigger conditions, and content are aligned and focused on a single pattern
+- **Uniqueness**: Provides value not covered by existing skills (informed by checklist results)
+- **Reusability**: Realistic trigger scenarios exist in future sessions
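The four verdicts and the "re-evaluate (once)" bound can be pictured as a small state machine. The sketch below is illustrative only: `judge` and `revise` are hypothetical callables standing in for the model's holistic judgment and its revision pass, and the command file does not say what happens if the verdict is still "Improve then Save" after the single revision, so the sketch simply returns it unchanged.

```python
from enum import Enum


class Verdict(Enum):
    SAVE = "Save"
    IMPROVE_THEN_SAVE = "Improve then Save"
    ABSORB = "Absorb into [X]"
    DROP = "Drop"


# Next action per verdict, paraphrasing the verdict table.
NEXT_ACTION = {
    Verdict.SAVE: "proceed to Step 6",
    Verdict.IMPROVE_THEN_SAVE: "list improvements, revise, re-evaluate once",
    Verdict.ABSORB: "show target skill and additions, then Step 6",
    Verdict.DROP: "explain reasoning and stop",
}


def evaluate(draft, judge, revise, max_revisions=1):
    """Judge the draft, allowing at most `max_revisions` revise-and-re-judge passes.

    `judge(draft) -> Verdict` and `revise(draft) -> draft` are hypothetical
    stand-ins; the holistic judgment itself is not reducible to code.
    """
    verdict = judge(draft)
    revisions = 0
    while verdict is Verdict.IMPROVE_THEN_SAVE and revisions < max_revisions:
        draft = revise(draft)
        revisions += 1
        verdict = judge(draft)
    return draft, verdict
```

Capping the loop keeps the quality gate from cycling indefinitely on a draft that never quite converges.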
+6. **Verdict-specific confirmation flow**
+- **Save**: Present save path + checklist results + 1-line verdict rationale + full draft → save after user confirmation
+- **Absorb into [X]**: Present target path + additions (diff format) + checklist results + verdict rationale → append after user confirmation
+- **Drop**: Show checklist results + reasoning only (no confirmation needed)
+7. Save / Absorb to the determined location
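The confirmation flows in Step 6 amount to a lookup from final verdict to "what to present" and "whether to wait for the user". A minimal sketch, assuming a hypothetical `Flow` record and `may_write` helper (neither appears in the command file); "Improve then Save" is deliberately absent because the text lists no confirmation flow for it:

```python
from typing import NamedTuple


class Flow(NamedTuple):
    present: tuple[str, ...]   # items shown to the user
    needs_confirmation: bool   # whether a write waits on explicit approval


# Transcribed from the three flows listed in Step 6.
FLOWS = {
    "Save": Flow(
        ("save path", "checklist results", "1-line verdict rationale", "full draft"),
        True,
    ),
    "Absorb into [X]": Flow(
        ("target path", "additions (diff format)", "checklist results", "verdict rationale"),
        True,
    ),
    "Drop": Flow(("checklist results", "reasoning"), False),
}


def may_write(verdict: str, user_confirmed: bool) -> bool:
    """Return True only when Step 7 is allowed to touch a skill file."""
    flow = FLOWS.get(verdict)
    if flow is None:
        raise ValueError(f"no confirmation flow for verdict: {verdict!r}")
    # Drop never writes; Save and Absorb write only after confirmation.
    return flow.needs_confirmation and user_confirmed
```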
+## Output Format for Step 5
+```
+### Checklist
+- [x] skills/ grep: no overlap (or: overlap found → details)
+- [x] MEMORY.md: no overlap (or: overlap found → details)
+- [x] Existing skill append: new file appropriate (or: should append to [X])
+- [x] Reusability: confirmed (or: one-off → Drop)
+### Verdict: Save / Improve then Save / Absorb into [X] / Drop
+**Rationale:** (1-2 sentences explaining the verdict)
+```
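For illustration, the Step 5 output template could be filled in programmatically along these lines. The function name `render_report` and its parameters are hypothetical; in practice the model writes this block directly.

```python
def render_report(checks: dict[str, str], verdict: str, rationale: str) -> str:
    """Render checklist results and the verdict in the Step 5 output layout.

    `checks` maps a checklist label (e.g. "skills/ grep") to its finding.
    """
    lines = ["### Checklist"]
    lines += [f"- [x] {name}: {result}" for name, result in checks.items()]
    lines += [f"### Verdict: {verdict}", f"**Rationale:** {rationale}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_report(
            {
                "skills/ grep": "no overlap",
                "MEMORY.md": "no overlap",
                "Existing skill append": "new file appropriate",
                "Reusability": "confirmed",
            },
            "Save",
            "Pattern is specific and not covered by any existing skill.",
        )
    )
```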
+## Design Rationale
+This version replaces the previous 5-dimension numeric scoring rubric (Specificity, Actionability, Scope Fit, Non-redundancy, Coverage scored 1-5) with a checklist-based holistic verdict system. Modern frontier models (Opus 4.6+) have strong contextual judgment — forcing rich qualitative signals into numeric scores loses nuance and can produce misleading totals. The holistic approach lets the model weigh all factors naturally, producing more accurate save/drop decisions while the explicit checklist ensures no critical check is skipped.
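One way to see how a numeric total can mislead: under the old rubric, two drafts can reach the same total and both clear the "all dimensions ≥ 3" gate even though one overlaps an existing skill. The scores below are invented purely for illustration and do not come from any real evaluation.

```python
DIMENSIONS = ("Specificity", "Actionability", "Scope Fit", "Non-redundancy", "Coverage")

# Hypothetical scores: draft_a overlaps an existing skill (Non-redundancy 3),
# draft_b is unique but slightly less polished elsewhere.
draft_a = dict(zip(DIMENSIONS, (5, 5, 5, 3, 5)))
draft_b = dict(zip(DIMENSIONS, (5, 5, 4, 5, 4)))


def total(scores: dict[str, int]) -> int:
    """Sum of the five dimension scores (out of 25)."""
    return sum(scores.values())


def passes_old_gate(scores: dict[str, int]) -> bool:
    """Old rule: every dimension must score at least 3."""
    return all(score >= 3 for score in scores.values())


if __name__ == "__main__":
    for name, scores in (("draft_a", draft_a), ("draft_b", draft_b)):
        print(name, total(scores), passes_old_gate(scores))
```

Both drafts look identical by total and by gate, yet a reviewer reading the checklist results might choose "Absorb into [X]" for the first and "Save" for the second.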
 ## Notes
@@ -88,4 +112,4 @@ origin: auto-extracted
 - Don't extract one-time issues (specific API outages, etc.)
 - Focus on patterns that will save time in future sessions
 - Keep skills focused — one pattern per skill
-- If Coverage score is low, add related variants before saving
+- When the verdict is Absorb, append to the existing skill rather than creating a new file