mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-03-30 13:43:26 +08:00
* feat(skills): add skill-comply — automated behavioral compliance measurement Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs, runs scenarios at 3 strictness levels, classifies tool calls via LLM, and produces self-contained reports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(skill-comply): address bot review feedback - AGENTS.md: fix stale skill count (115 → 117) in project structure - run.py: replace remaining print() with logger, add zero-division guard, create parent dirs for --output path - runner.py: add returncode check for claude subprocess, clarify relative_to path traversal validation - parser.py: use is_file() instead of exists(), catch KeyError for missing trace fields, add file check in parse_spec - classifier.py: log warnings on malformed classification output, guard against non-dict JSON responses - grader.py: filter negative indices from LLM classification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
59 lines
2.3 KiB
Markdown
59 lines
2.3 KiB
Markdown
---
|
|
name: skill-comply
|
|
description: Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines
|
|
origin: ECC
|
|
tools: Read, Bash
|
|
---
|
|
|
|
# skill-comply: Automated Compliance Measurement
|
|
|
|
Measures whether coding agents actually follow skills, rules, or agent definitions by:
|
|
1. Auto-generating expected behavioral sequences (specs) from any .md file
|
|
2. Auto-generating scenarios with decreasing prompt strictness (supportive → neutral → competing)
|
|
3. Running `claude -p` and capturing tool call traces via stream-json
|
|
4. Classifying tool calls against spec steps using LLM (not regex)
|
|
5. Checking temporal ordering deterministically
|
|
6. Generating self-contained reports with spec, prompts, and timelines
|
|
|
|
## Supported Targets
|
|
|
|
- **Skills** (`skills/*/SKILL.md`): Workflow skills like search-first, TDD guides
|
|
- **Rules** (`rules/common/*.md`): Mandatory rules like testing.md, security.md, git-workflow.md
|
|
- **Agent definitions** (`agents/*.md`): Whether an agent gets invoked when expected (internal workflow verification not yet supported)
|
|
|
|
## When to Activate
|
|
|
|
- User runs `/skill-comply <path>`
|
|
- User asks "is this rule actually being followed?"
|
|
- After adding new rules/skills, to verify agent compliance
|
|
- Periodically as part of quality maintenance
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
# Full run
|
|
uv run python -m scripts.run ~/.claude/rules/common/testing.md
|
|
|
|
# Dry run (no cost, spec + scenarios only)
|
|
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md
|
|
|
|
# Custom models
|
|
uv run python -m scripts.run --gen-model haiku --model sonnet <path>
|
|
```
|
|
|
|
## Key Concept: Prompt Independence
|
|
|
|
Measures whether a skill/rule is followed even when the prompt doesn't explicitly support it.
|
|
|
|
## Report Contents
|
|
|
|
Reports are self-contained and include:
|
|
1. Expected behavioral sequence (auto-generated spec)
|
|
2. Scenario prompts (what was asked at each strictness level)
|
|
3. Compliance scores per scenario
|
|
4. Tool call timelines with LLM classification labels
|
|
|
|
### Advanced (optional)
|
|
|
|
For users familiar with hooks, reports also include hook promotion recommendations for steps with low compliance. This is informational — the main value is the compliance visibility itself.
|