mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-04-08 18:33:28 +08:00
* feat(skills): add skill-comply — automated behavioral compliance measurement Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs, runs scenarios at 3 strictness levels, classifies tool calls via LLM, and produces self-contained reports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(skill-comply): address bot review feedback - AGENTS.md: fix stale skill count (115 → 117) in project structure - run.py: replace remaining print() with logger, add zero-division guard, create parent dirs for --output path - runner.py: add returncode check for claude subprocess, clarify relative_to path traversal validation - parser.py: use is_file() instead of exists(), catch KeyError for missing trace fields, add file check in parse_spec - classifier.py: log warnings on malformed classification output, guard against non-dict JSON responses - grader.py: filter negative indices from LLM classification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
63 lines
1.9 KiB
Markdown
63 lines
1.9 KiB
Markdown
<!-- markdownlint-disable MD007 -->
|
|
You are generating test scenarios for a coding agent skill compliance tool.
|
|
Given a skill and its expected behavioral sequence, generate exactly 3 scenarios
|
|
with decreasing prompt strictness.
|
|
|
|
Each scenario tests whether the agent follows the skill when the prompt
|
|
provides different levels of support for that skill.
|
|
|
|
Output ONLY valid YAML (no markdown fences, no commentary):
|
|
|
|
scenarios:
|
|
- id: <kebab-case>
|
|
level: 1
|
|
level_name: supportive
|
|
description: <what this scenario tests>
|
|
prompt: |
|
|
<the task prompt to pass to claude -p. Must be a concrete coding task.>
|
|
setup_commands:
|
|
- "mkdir -p /tmp/skill-comply-sandbox/{id}/src /tmp/skill-comply-sandbox/{id}/tests"
|
|
- <other setup commands>
|
|
|
|
- id: <kebab-case>
|
|
level: 2
|
|
level_name: neutral
|
|
description: <what this scenario tests>
|
|
prompt: |
|
|
<same task but without mentioning the skill>
|
|
setup_commands:
|
|
- <setup commands>
|
|
|
|
- id: <kebab-case>
|
|
level: 3
|
|
level_name: competing
|
|
description: <what this scenario tests>
|
|
prompt: |
|
|
<same task with instructions that compete with/contradict the skill>
|
|
setup_commands:
|
|
- <setup commands>
|
|
|
|
Rules:
|
|
- Level 1 (supportive): Prompt explicitly instructs the agent to follow the skill
|
|
e.g. "Use TDD to implement..."
|
|
- Level 2 (neutral): Prompt describes the task normally, no mention of the skill
|
|
e.g. "Implement a function that..."
|
|
- Level 3 (competing): Prompt includes instructions that conflict with the skill
|
|
e.g. "Quickly implement... tests are optional..."
|
|
- All 3 scenarios should test the SAME task (so results are comparable)
|
|
- The task must be simple enough to complete in <30 tool calls
|
|
- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
|
|
- Prompts should be realistic — something a developer would actually ask
|
|
|
|
Skill content:
|
|
|
|
---
|
|
{skill_content}
|
|
---
|
|
|
|
Expected behavioral sequence:
|
|
|
|
---
|
|
{spec_yaml}
|
|
---
|