Files
Shimo a2e465c74d feat(skills): add skill-comply — automated behavioral compliance measurement (#724)
* feat(skills): add skill-comply — automated behavioral compliance measurement

Automated compliance measurement for skills, rules, and agent definitions.
Generates behavioral specs, runs scenarios at 3 strictness levels,
classifies tool calls via LLM, and produces self-contained reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(skill-comply): address bot review feedback

- AGENTS.md: fix stale skill count (115 → 117) in project structure
- run.py: replace remaining print() with logger, add zero-division guard,
  create parent dirs for --output path
- runner.py: add returncode check for claude subprocess, clarify
  relative_to path traversal validation
- parser.py: use is_file() instead of exists(), catch KeyError for
  missing trace fields, add file check in parse_spec
- classifier.py: log warnings on malformed classification output,
  guard against non-dict JSON responses
- grader.py: filter negative indices from LLM classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 21:51:49 -07:00

2.3 KiB

name, description, origin, tools
name description origin tools
skill-comply Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines ECC Read, Bash

skill-comply: Automated Compliance Measurement

Measures whether coding agents actually follow skills, rules, or agent definitions by:

  1. Auto-generating expected behavioral sequences (specs) from any .md file
  2. Auto-generating scenarios with decreasing prompt strictness (supportive → neutral → competing)
  3. Running claude -p and capturing tool call traces via stream-json
  4. Classifying tool calls against spec steps using LLM (not regex)
  5. Checking temporal ordering deterministically
  6. Generating self-contained reports with spec, prompts, and timelines

Supported Targets

  • Skills (skills/*/SKILL.md): Workflow skills like search-first, TDD guides
  • Rules (rules/common/*.md): Mandatory rules like testing.md, security.md, git-workflow.md
  • Agent definitions (agents/*.md): Whether an agent gets invoked when expected (internal workflow verification not yet supported)

When to Activate

  • User runs /skill-comply <path>
  • User asks "is this rule actually being followed?"
  • After adding new rules/skills, to verify agent compliance
  • Periodically as part of quality maintenance

Usage

# Full run
uv run python -m scripts.run ~/.claude/rules/common/testing.md

# Dry run (no cost, spec + scenarios only)
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md

# Custom models
uv run python -m scripts.run --gen-model haiku --model sonnet <path>

Key Concept: Prompt Independence

Measures whether a skill/rule is followed even when the prompt doesn't explicitly support it.

Report Contents

Reports are self-contained and include:

  1. Expected behavioral sequence (auto-generated spec)
  2. Scenario prompts (what was asked at each strictness level)
  3. Compliance scores per scenario
  4. Tool call timelines with LLM classification labels

Advanced (optional)

For users familiar with hooks, reports also include hook promotion recommendations for steps with low compliance. This is informational — the main value is the compliance visibility itself.