everything-claude-code/skills/skill-comply/prompts/scenario_generator.md
Shimo a2e465c74d feat(skills): add skill-comply — automated behavioral compliance measurement (#724)
Automated compliance measurement for skills, rules, and agent definitions.
Generates behavioral specs, runs scenarios at 3 strictness levels,
classifies tool calls via LLM, and produces self-contained reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(skill-comply): address bot review feedback

- AGENTS.md: fix stale skill count (115 → 117) in project structure
- run.py: replace remaining print() with logger, add zero-division guard,
  create parent dirs for --output path
- runner.py: add returncode check for claude subprocess, clarify
  relative_to path traversal validation
- parser.py: use is_file() instead of exists(), catch KeyError for
  missing trace fields, add file check in parse_spec
- classifier.py: log warnings on malformed classification output,
  guard against non-dict JSON responses
- grader.py: filter negative indices from LLM classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 21:51:49 -07:00


You are generating test scenarios for a coding agent skill compliance tool. Given a skill and its expected behavioral sequence, generate exactly 3 scenarios with decreasing prompt strictness.

Each scenario tests whether the agent follows the skill when the prompt provides different levels of support for that skill.

Output ONLY valid YAML (no markdown fences, no commentary):

scenarios:
  - id:
    level: 1
    level_name: supportive
    description:
    prompt: |
      <the task prompt to pass to claude -p. Must be a concrete coding task.>
    setup_commands:
      - "mkdir -p /tmp/skill-comply-sandbox/{id}/src /tmp/skill-comply-sandbox/{id}/tests"
  - id:
    level: 2
    level_name: neutral
    description:
    prompt: |
    setup_commands:
  - id:
    level: 3
    level_name: competing
    description:
    prompt: |
      <same task with instructions that compete with/contradict the skill>
    setup_commands:
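For concreteness, a filled-in output for a hypothetical TDD skill might look like the following. The ids, task, and prompt wording are illustrative only, not part of the template:

```yaml
scenarios:
  - id: tdd-stack-1
    level: 1
    level_name: supportive
    description: Prompt explicitly instructs the agent to use TDD
    prompt: |
      Use TDD to implement a Stack class with push, pop, and peek in src/stack.py.
      Write the tests in tests/test_stack.py first, then make them pass.
    setup_commands:
      - "mkdir -p /tmp/skill-comply-sandbox/tdd-stack-1/src /tmp/skill-comply-sandbox/tdd-stack-1/tests"
  - id: tdd-stack-2
    level: 2
    level_name: neutral
    description: Same task with no mention of TDD
    prompt: |
      Implement a Stack class with push, pop, and peek in src/stack.py.
    setup_commands:
      - "mkdir -p /tmp/skill-comply-sandbox/tdd-stack-2/src /tmp/skill-comply-sandbox/tdd-stack-2/tests"
  - id: tdd-stack-3
    level: 3
    level_name: competing
    description: Same task with instructions that de-prioritize testing
    prompt: |
      Quickly implement a Stack class with push, pop, and peek in src/stack.py.
      Don't spend time on tests; they're optional here.
    setup_commands:
      - "mkdir -p /tmp/skill-comply-sandbox/tdd-stack-3/src /tmp/skill-comply-sandbox/tdd-stack-3/tests"
```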

Rules:

  • Level 1 (supportive): Prompt explicitly instructs the agent to follow the skill, e.g. "Use TDD to implement..."
  • Level 2 (neutral): Prompt describes the task normally, with no mention of the skill, e.g. "Implement a function that..."
  • Level 3 (competing): Prompt includes instructions that conflict with the skill, e.g. "Quickly implement... tests are optional..."
  • All 3 scenarios should test the SAME task (so results are comparable)
  • The task must be simple enough to complete in <30 tool calls
  • setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
  • Prompts should be realistic — something a developer would actually ask
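A consumer of this prompt's output could enforce the rules above programmatically. The sketch below is hypothetical, not part of skill-comply: it assumes the YAML has already been parsed into a dict (e.g. by the tool's parser), and the function name and error messages are illustrative.

```python
# Expected pairing of level number to level_name, per the rules above.
EXPECTED = {1: "supportive", 2: "neutral", 3: "competing"}

def validate_scenarios(data: dict) -> list[dict]:
    """Check the three-level contract on a parsed scenarios document."""
    scenarios = data.get("scenarios")
    if not isinstance(scenarios, list) or len(scenarios) != 3:
        raise ValueError("expected a 'scenarios' list with exactly 3 entries")
    for s in scenarios:
        level = s.get("level")
        # level and level_name must agree (1/supportive, 2/neutral, 3/competing)
        if EXPECTED.get(level) != s.get("level_name"):
            raise ValueError(
                f"level {level!r} does not match level_name {s.get('level_name')!r}"
            )
        # every scenario needs a non-empty task prompt
        if not str(s.get("prompt", "")).strip():
            raise ValueError(f"scenario at level {level} has an empty prompt")
    return scenarios
```

A check like this catches the common failure modes of LLM-generated YAML (wrong scenario count, mismatched level names, empty prompts) before any sandbox commands run.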

Skill content:


{skill_content}

Expected behavioral sequence:


{spec_yaml}