mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-03-30 13:43:26 +08:00
feat(skills): add skill-comply — automated behavioral compliance measurement (#724)
* feat(skills): add skill-comply — automated behavioral compliance measurement

  Automated compliance measurement for skills, rules, and agent definitions.
  Generates behavioral specs, runs scenarios at 3 strictness levels, classifies
  tool calls via LLM, and produces self-contained reports.

* fix(skill-comply): address bot review feedback

  - AGENTS.md: fix stale skill count (115 → 117) in project structure
  - run.py: replace remaining print() with logger, add zero-division guard, create parent dirs for --output path
  - runner.py: add returncode check for claude subprocess, clarify relative_to path traversal validation
  - parser.py: use is_file() instead of exists(), catch KeyError for missing trace fields, add file check in parse_spec
  - classifier.py: log warnings on malformed classification output, guard against non-dict JSON responses
  - grader.py: filter negative indices from LLM classification

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AGENTS.md

@@ -1,6 +1,6 @@
 # Everything Claude Code (ECC) — Agent Instructions
 
-This is a **production-ready AI coding plugin** providing 28 specialized agents, 116 skills, 60 commands, and automated hook workflows for software development.
+This is a **production-ready AI coding plugin** providing 28 specialized agents, 119 skills, 60 commands, and automated hook workflows for software development.
 
 **Version:** 1.9.0
 
@@ -142,7 +142,7 @@ Troubleshoot failures: check test isolation → verify mocks → fix implementat
 ```
 agents/ — 28 specialized subagents
-skills/ — 115 workflow skills and domain knowledge
+skills/ — 117 workflow skills and domain knowledge
 commands/ — 60 slash commands
 hooks/ — Trigger-based automations
 rules/ — Always-follow guidelines (common + per-language)
@@ -212,7 +212,7 @@ For manual install instructions see the README in the `rules/` folder.
 /plugin list everything-claude-code@everything-claude-code
 ```
 
-✨ **That's it!** You now have access to 28 agents, 116 skills, and 60 commands.
+✨ **That's it!** You now have access to 28 agents, 119 skills, and 60 commands.
 
 ---
 
@@ -1085,7 +1085,7 @@ The configuration is automatically detected from `.opencode/opencode.json`.
 |---------|-------------|----------|--------|
 | Agents | ✅ 28 agents | ✅ 12 agents | **Claude Code leads** |
 | Commands | ✅ 60 commands | ✅ 31 commands | **Claude Code leads** |
-| Skills | ✅ 116 skills | ✅ 37 skills | **Claude Code leads** |
+| Skills | ✅ 119 skills | ✅ 37 skills | **Claude Code leads** |
 | Hooks | ✅ 8 event types | ✅ 11 events | **OpenCode has more!** |
 | Rules | ✅ 29 rules | ✅ 13 instructions | **Claude Code leads** |
 | MCP Servers | ✅ 14 servers | ✅ Full | **Full parity** |
skills/skill-comply/.gitignore (new file, 7 lines, vendored)

.venv/
__pycache__/
*.py[cod]
results/*.md
.pytest_cache/
.coverage
uv.lock
skills/skill-comply/SKILL.md (new file, 58 lines)

---
name: skill-comply
description: Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines
origin: ECC
tools: Read, Bash
---

# skill-comply: Automated Compliance Measurement

Measures whether coding agents actually follow skills, rules, or agent definitions by:

1. Auto-generating expected behavioral sequences (specs) from any .md file
2. Auto-generating scenarios with decreasing prompt strictness (supportive → neutral → competing)
3. Running `claude -p` and capturing tool call traces via stream-json
4. Classifying tool calls against spec steps using an LLM (not regex)
5. Checking temporal ordering deterministically
6. Generating self-contained reports with spec, prompts, and timelines

## Supported Targets

- **Skills** (`skills/*/SKILL.md`): Workflow skills like search-first, TDD guides
- **Rules** (`rules/common/*.md`): Mandatory rules like testing.md, security.md, git-workflow.md
- **Agent definitions** (`agents/*.md`): Whether an agent gets invoked when expected (internal workflow verification not yet supported)

## When to Activate

- User runs `/skill-comply <path>`
- User asks "is this rule actually being followed?"
- After adding new rules/skills, to verify agent compliance
- Periodically as part of quality maintenance

## Usage

```bash
# Full run
uv run python -m scripts.run ~/.claude/rules/common/testing.md

# Dry run (no cost, spec + scenarios only)
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md

# Custom models
uv run python -m scripts.run --gen-model haiku --model sonnet <path>
```

## Key Concept: Prompt Independence

Measures whether a skill/rule is followed even when the prompt doesn't explicitly support it.

## Report Contents

Reports are self-contained and include:

1. Expected behavioral sequence (auto-generated spec)
2. Scenario prompts (what was asked at each strictness level)
3. Compliance scores per scenario
4. Tool call timelines with LLM classification labels

### Advanced (optional)

For users familiar with hooks, reports also include hook promotion recommendations for steps with low compliance. This is informational — the main value is the compliance visibility itself.
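The prompt-independence idea can be made concrete as a drop-off calculation: how much compliance survives when the prompt stops supporting the skill. A minimal sketch, assuming per-level compliance rates are already available as a dict; the `prompt_independence` helper is hypothetical and not part of the shipped scripts:

```python
# Hypothetical sketch: quantify prompt independence as the compliance
# drop-off from the supportive prompt (level 1) to the competing
# prompt (level 3). Not part of skill-comply itself.
def prompt_independence(rates_by_level: dict[int, float]) -> float:
    """Return level-3 compliance as a fraction of level-1 compliance."""
    supportive = rates_by_level.get(1, 0.0)
    competing = rates_by_level.get(3, 0.0)
    if supportive == 0.0:
        return 0.0  # no baseline to compare against; avoid division by zero
    return competing / supportive


# A skill followed fully under a supportive prompt but only half the
# time under a competing prompt has independence 0.5.
print(prompt_independence({1: 1.0, 2: 0.75, 3: 0.5}))  # → 0.5
```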
skills/skill-comply/fixtures/compliant_trace.jsonl (new file, 5 lines)

{"timestamp":"2026-03-20T10:00:01Z","event":"tool_complete","tool":"Write","session":"sess-001","input":"{\"file_path\":\"tests/test_fib.py\",\"content\":\"def test_fib(): assert fib(0) == 0\"}","output":"File created"}
{"timestamp":"2026-03-20T10:00:10Z","event":"tool_complete","tool":"Bash","session":"sess-001","input":"{\"command\":\"cd /tmp/sandbox && pytest tests/\"}","output":"FAILED - 1 failed"}
{"timestamp":"2026-03-20T10:00:20Z","event":"tool_complete","tool":"Write","session":"sess-001","input":"{\"file_path\":\"src/fib.py\",\"content\":\"def fib(n): return n if n <= 1 else fib(n-1)+fib(n-2)\"}","output":"File created"}
{"timestamp":"2026-03-20T10:00:30Z","event":"tool_complete","tool":"Bash","session":"sess-001","input":"{\"command\":\"cd /tmp/sandbox && pytest tests/\"}","output":"1 passed"}
{"timestamp":"2026-03-20T10:00:40Z","event":"tool_complete","tool":"Edit","session":"sess-001","input":"{\"file_path\":\"src/fib.py\",\"old_string\":\"return n if\",\"new_string\":\"if n < 0: raise ValueError\\n    return n if\"}","output":"File edited"}
skills/skill-comply/fixtures/noncompliant_trace.jsonl (new file, 3 lines)

{"timestamp":"2026-03-20T10:00:01Z","event":"tool_complete","tool":"Write","session":"sess-002","input":"{\"file_path\":\"src/fib.py\",\"content\":\"def fib(n): return n if n <= 1 else fib(n-1)+fib(n-2)\"}","output":"File created"}
{"timestamp":"2026-03-20T10:00:10Z","event":"tool_complete","tool":"Write","session":"sess-002","input":"{\"file_path\":\"tests/test_fib.py\",\"content\":\"def test_fib(): assert fib(0) == 0\"}","output":"File created"}
{"timestamp":"2026-03-20T10:00:20Z","event":"tool_complete","tool":"Bash","session":"sess-002","input":"{\"command\":\"cd /tmp/sandbox && pytest tests/\"}","output":"1 passed"}
skills/skill-comply/fixtures/tdd_spec.yaml (new file, 44 lines)

id: tdd-workflow
name: TDD Workflow Compliance
source_rule: rules/common/testing.md
version: "2.0"

steps:
  - id: write_test
    description: "Write test file BEFORE implementation"
    required: true
    detector:
      description: "A Write or Edit to a test file (filename contains 'test')"
      before_step: write_impl

  - id: run_test_red
    description: "Run test and confirm FAIL (RED phase)"
    required: true
    detector:
      description: "Run pytest or test command that produces a FAIL/ERROR result"
      after_step: write_test
      before_step: write_impl

  - id: write_impl
    description: "Write minimal implementation (GREEN phase)"
    required: true
    detector:
      description: "Write or Edit an implementation file (not a test file)"
      after_step: run_test_red

  - id: run_test_green
    description: "Run test and confirm PASS (GREEN phase)"
    required: true
    detector:
      description: "Run pytest or test command that produces a PASS result"
      after_step: write_impl

  - id: refactor
    description: "Refactor (IMPROVE phase)"
    required: false
    detector:
      description: "Edit a source file for refactoring after tests pass"
      after_step: run_test_green

scoring:
  threshold_promote_to_hook: 0.6
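A spec like the one above can be sanity-checked before spending tokens on a run, for example by verifying that every `before_step`/`after_step` reference points at a declared step id. A minimal sketch operating on a spec already loaded as a plain dict; `check_spec_refs` is a hypothetical helper, not part of the shipped scripts:

```python
def check_spec_refs(spec: dict) -> list[str]:
    """Return ordering references that point at undeclared step ids."""
    step_ids = {s["id"] for s in spec["steps"]}
    errors = []
    for s in spec["steps"]:
        det = s.get("detector", {})
        for key in ("after_step", "before_step"):
            ref = det.get(key)
            if ref is not None and ref not in step_ids:
                errors.append(f"{s['id']}.{key} -> unknown step '{ref}'")
    return errors


# 'run_test_red' is referenced but never declared, so it is flagged.
spec = {
    "steps": [
        {"id": "write_test", "detector": {"before_step": "write_impl"}},
        {"id": "write_impl", "detector": {"after_step": "run_test_red"}},
    ]
}
print(check_spec_refs(spec))  # → ["write_impl.after_step -> unknown step 'run_test_red'"]
```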
skills/skill-comply/prompts/classifier.md (new file, 24 lines)

You are classifying tool calls from a coding agent session against expected behavioral steps.

For each tool call, determine which step (if any) it belongs to. A tool call can match at most one step.

Steps:
{steps_description}

Tool calls (numbered):
{tool_calls}

Respond with ONLY a JSON object mapping step_id to a list of matching tool call numbers.
Include only steps that have at least one match. If no tool calls match a step, omit it.

Example response:
{"write_test": [0, 1], "run_test_red": [2], "write_impl": [3, 4]}

Rules:
- Match based on the MEANING of the tool call, not just keywords
- A Write to "test_calculator.py" is a test file write, even if the content is implementation-like
- A Write to "calculator.py" is an implementation write, even if it contains test helpers
- A Bash running "pytest" that outputs "FAILED" is a RED phase test run
- A Bash running "pytest" that outputs "passed" is a GREEN phase test run
- Each tool call should match at most one step (pick the best match)
- If a tool call doesn't match any step, don't include it
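The JSON mapping this prompt asks for ({step_id: [call numbers]}) can be inverted into a per-call label list, which is roughly the shape a tool call timeline needs. A small sketch under that assumption; `label_timeline` is illustrative, not the shipped report code:

```python
def label_timeline(classification: dict[str, list[int]], n_calls: int) -> list[str]:
    """Map each tool call index to its classified step id (or '-' if unmatched)."""
    labels = ["-"] * n_calls
    for step_id, indices in classification.items():
        for i in indices:
            if 0 <= i < n_calls:  # ignore out-of-range indices from the LLM
                labels[i] = step_id
    return labels


print(label_timeline({"write_test": [0, 1], "run_test_red": [2]}, 4))
# → ['write_test', 'write_test', 'run_test_red', '-']
```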
skills/skill-comply/prompts/scenario_generator.md (new file, 62 lines)

<!-- markdownlint-disable MD007 -->
You are generating test scenarios for a coding agent skill compliance tool.
Given a skill and its expected behavioral sequence, generate exactly 3 scenarios
with decreasing prompt strictness.

Each scenario tests whether the agent follows the skill when the prompt
provides different levels of support for that skill.

Output ONLY valid YAML (no markdown fences, no commentary):

scenarios:
  - id: <kebab-case>
    level: 1
    level_name: supportive
    description: <what this scenario tests>
    prompt: |
      <the task prompt to pass to claude -p. Must be a concrete coding task.>
    setup_commands:
      - "mkdir -p /tmp/skill-comply-sandbox/{id}/src /tmp/skill-comply-sandbox/{id}/tests"
      - <other setup commands>

  - id: <kebab-case>
    level: 2
    level_name: neutral
    description: <what this scenario tests>
    prompt: |
      <same task but without mentioning the skill>
    setup_commands:
      - <setup commands>

  - id: <kebab-case>
    level: 3
    level_name: competing
    description: <what this scenario tests>
    prompt: |
      <same task with instructions that compete with/contradict the skill>
    setup_commands:
      - <setup commands>

Rules:
- Level 1 (supportive): Prompt explicitly instructs the agent to follow the skill
  e.g. "Use TDD to implement..."
- Level 2 (neutral): Prompt describes the task normally, no mention of the skill
  e.g. "Implement a function that..."
- Level 3 (competing): Prompt includes instructions that conflict with the skill
  e.g. "Quickly implement... tests are optional..."
- All 3 scenarios should test the SAME task (so results are comparable)
- The task must be simple enough to complete in <30 tool calls
- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
- Prompts should be realistic — something a developer would actually ask

Skill content:

---
{skill_content}
---

Expected behavioral sequence:

---
{spec_yaml}
---
skills/skill-comply/prompts/spec_generator.md (new file, 42 lines)

<!-- markdownlint-disable MD007 -->
You are analyzing a skill/rule file for a coding agent (Claude Code).
Your task: extract the **observable behavioral sequence** that an agent should follow when this skill is active.

Each step should be described in natural language. Do NOT use regex patterns.

Output ONLY valid YAML in this exact format (no markdown fences, no commentary):

id: <kebab-case-id>
name: <Human readable name>
source_rule: <file path provided>
version: "1.0"

steps:
  - id: <snake_case>
    description: <what the agent should do>
    required: true|false
    detector:
      description: <natural language description of what tool call to look for>
      after_step: <step_id this must come after, optional — omit if not needed>
      before_step: <step_id this must come before, optional — omit if not needed>

scoring:
  threshold_promote_to_hook: 0.6

Rules:
- detector.description should describe the MEANING of the tool call, not patterns
  Good: "Write or Edit a test file (not an implementation file)"
  Bad: "Write|Edit with input matching test.*\\.py"
- Use before_step/after_step for skills where ORDER matters (e.g. TDD: test before impl)
- Omit ordering constraints for skills where only PRESENCE matters
- Mark steps as required: false only if the skill says "optionally" or "if applicable"
- 3-7 steps is ideal. Don't over-decompose
- IMPORTANT: Quote all YAML string values containing colons with double quotes
  Good: description: "Use conventional commit format (type: description)"
  Bad: description: Use conventional commit format (type: description)

Skill file to analyze:

---
{skill_content}
---
skills/skill-comply/pyproject.toml (new file, 15 lines)

[project]
name = "skill-comply"
version = "0.1.0"
description = "Automated skill compliance measurement for Claude Code"
requires-python = ">=3.11"
dependencies = ["pyyaml>=6.0"]

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]

[dependency-groups]
dev = [
    "pytest>=9.0.2",
]
skills/skill-comply/scripts/__init__.py (new file, empty)
skills/skill-comply/scripts/classifier.py (new file, 85 lines)

"""Classify tool calls against compliance steps using LLM."""

from __future__ import annotations

import json
import logging
import subprocess
from pathlib import Path

from scripts.parser import ComplianceSpec, ObservationEvent

logger = logging.getLogger(__name__)

PROMPTS_DIR = Path(__file__).parent.parent / "prompts"


def classify_events(
    spec: ComplianceSpec,
    trace: list[ObservationEvent],
    model: str = "haiku",
) -> dict[str, list[int]]:
    """Classify which tool calls match which compliance steps.

    Returns {step_id: [event_indices]} via a single LLM call.
    """
    if not trace:
        return {}

    steps_desc = "\n".join(
        f"- {step.id}: {step.detector.description}"
        for step in spec.steps
    )

    tool_calls = "\n".join(
        f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
        for i, event in enumerate(trace)
    )

    prompt_template = (PROMPTS_DIR / "classifier.md").read_text()
    prompt = (
        prompt_template
        .replace("{steps_description}", steps_desc)
        .replace("{tool_calls}", tool_calls)
    )

    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model, "--output-format", "text"],
        capture_output=True,
        text=True,
        timeout=60,
    )

    if result.returncode != 0:
        raise RuntimeError(
            f"classifier subprocess failed (rc={result.returncode}): "
            f"{result.stderr[:500]}"
        )

    return _parse_classification(result.stdout)


def _parse_classification(text: str) -> dict[str, list[int]]:
    """Parse LLM classification output into {step_id: [event_indices]}."""
    text = text.strip()
    # Strip markdown fences
    lines = text.splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    cleaned = "\n".join(lines)

    try:
        parsed = json.loads(cleaned)
        if not isinstance(parsed, dict):
            logger.warning("Classifier returned non-dict JSON: %s", type(parsed).__name__)
            return {}
        return {
            k: [int(i) for i in v]
            for k, v in parsed.items()
            if isinstance(v, list)
        }
    except (json.JSONDecodeError, ValueError, TypeError) as e:
        logger.warning("Failed to parse classification output: %s", e)
        return {}
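The fence handling in `_parse_classification` matters because models often wrap JSON in a ```json block despite the "ONLY a JSON object" instruction. A standalone copy of just that stripping step, shown here so the snippet runs on its own:

```python
import json

def strip_fences(text: str) -> str:
    """Remove a leading/trailing ``` fence line, mirroring the stripping
    step inside _parse_classification above (standalone copy for illustration)."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)


# A typical fenced response still parses cleanly after stripping.
raw = '```json\n{"write_test": [0], "run_test_red": [2]}\n```'
parsed = json.loads(strip_fences(raw))
print(parsed)  # → {'write_test': [0], 'run_test_red': [2]}
```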
skills/skill-comply/scripts/grader.py (new file, 122 lines)

"""Grade observation traces against compliance specs using LLM classification."""

from __future__ import annotations

from dataclasses import dataclass

from scripts.classifier import classify_events
from scripts.parser import ComplianceSpec, ObservationEvent, Step


@dataclass(frozen=True)
class StepResult:
    step_id: str
    detected: bool
    evidence: tuple[ObservationEvent, ...]
    failure_reason: str | None


@dataclass(frozen=True)
class ComplianceResult:
    spec_id: str
    steps: tuple[StepResult, ...]
    compliance_rate: float
    recommend_hook_promotion: bool
    classification: dict[str, list[int]]


def _check_temporal_order(
    step: Step,
    event: ObservationEvent,
    resolved: dict[str, list[ObservationEvent]],
    classified: dict[str, list[ObservationEvent]],
) -> str | None:
    """Check before_step/after_step constraints. Returns failure reason or None."""
    if step.detector.after_step is not None:
        after_events = resolved.get(step.detector.after_step, [])
        if not after_events:
            return f"after_step '{step.detector.after_step}' not yet detected"
        latest_after = max(e.timestamp for e in after_events)
        if event.timestamp <= latest_after:
            return (
                f"must occur after '{step.detector.after_step}' "
                f"(last at {latest_after}), but found at {event.timestamp}"
            )

    if step.detector.before_step is not None:
        # Look ahead using LLM classification results
        before_events = resolved.get(step.detector.before_step)
        if before_events is None:
            before_events = classified.get(step.detector.before_step, [])
        if before_events:
            earliest_before = min(e.timestamp for e in before_events)
            if event.timestamp >= earliest_before:
                return (
                    f"must occur before '{step.detector.before_step}' "
                    f"(first at {earliest_before}), but found at {event.timestamp}"
                )

    return None


def grade(
    spec: ComplianceSpec,
    trace: list[ObservationEvent],
    classifier_model: str = "haiku",
) -> ComplianceResult:
    """Grade a trace against a compliance spec using LLM classification."""
    sorted_trace = sorted(trace, key=lambda e: e.timestamp)

    # Step 1: LLM classifies all events in one batch call
    classification = classify_events(spec, sorted_trace, model=classifier_model)

    # Convert indices to events
    classified: dict[str, list[ObservationEvent]] = {
        step_id: [sorted_trace[i] for i in indices if 0 <= i < len(sorted_trace)]
        for step_id, indices in classification.items()
    }

    # Step 2: Check temporal ordering (deterministic)
    resolved: dict[str, list[ObservationEvent]] = {}
    step_results: list[StepResult] = []

    for step in spec.steps:
        candidates = classified.get(step.id, [])
        matched: list[ObservationEvent] = []
        failure_reason: str | None = None

        for event in candidates:
            temporal_fail = _check_temporal_order(step, event, resolved, classified)
            if temporal_fail is None:
                matched.append(event)
                break
            else:
                failure_reason = temporal_fail

        detected = len(matched) > 0
        if detected:
            resolved[step.id] = matched
        elif failure_reason is None:
            failure_reason = f"no matching event classified for step '{step.id}'"

        step_results.append(StepResult(
            step_id=step.id,
            detected=detected,
            evidence=tuple(matched),
            failure_reason=failure_reason if not detected else None,
        ))

    required_ids = {s.id for s in spec.steps if s.required}
    required_steps = [s for s in step_results if s.step_id in required_ids]
    detected_required = sum(1 for s in required_steps if s.detected)
    total_required = len(required_steps)

    compliance_rate = detected_required / total_required if total_required > 0 else 0.0

    return ComplianceResult(
        spec_id=spec.id,
        steps=tuple(step_results),
        compliance_rate=compliance_rate,
        recommend_hook_promotion=compliance_rate < spec.threshold_promote_to_hook,
        classification=classification,
    )
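The scoring at the end of `grade()` reduces to simple arithmetic: the fraction of required steps detected, with optional steps (like `refactor` in the fixture spec) excluded. A simplified sketch of just that calculation; `compliance_rate` here is an illustrative standalone helper, not the shipped code:

```python
def compliance_rate(detected: dict[str, bool], required: set[str]) -> float:
    """Fraction of required steps detected, mirroring grade()'s scoring."""
    if not required:
        return 0.0  # same zero-division guard as grade()
    hits = sum(1 for step_id in required if detected.get(step_id, False))
    return hits / len(required)


# run_test_red was missed, so 3 of 4 required steps were detected.
# The optional refactor step does not affect the score.
detected = {"write_test": True, "run_test_red": False,
            "write_impl": True, "run_test_green": True, "refactor": True}
required = {"write_test", "run_test_red", "write_impl", "run_test_green"}
print(compliance_rate(detected, required))  # → 0.75
```

A 0.75 rate against the fixture spec's 0.6 threshold would not trigger a hook promotion recommendation.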
skills/skill-comply/scripts/parser.py (new file, 107 lines)

"""Parse observation traces (JSONL) and compliance specs (YAML)."""

from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True)
class ObservationEvent:
    timestamp: str
    event: str
    tool: str
    session: str
    input: str
    output: str


@dataclass(frozen=True)
class Detector:
    description: str
    after_step: str | None = None
    before_step: str | None = None


@dataclass(frozen=True)
class Step:
    id: str
    description: str
    required: bool
    detector: Detector


@dataclass(frozen=True)
class ComplianceSpec:
    id: str
    name: str
    source_rule: str
    version: str
    steps: tuple[Step, ...]
    threshold_promote_to_hook: float


def parse_trace(path: Path) -> list[ObservationEvent]:
    """Parse a JSONL observation trace file into sorted events."""
    if not path.is_file():
        raise FileNotFoundError(f"Trace file not found: {path}")

    text = path.read_text().strip()
    if not text:
        return []

    events: list[ObservationEvent] = []
    for i, line in enumerate(text.splitlines(), 1):
        try:
            raw = json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON at line {i}: {e}") from e
        try:
            events.append(ObservationEvent(
                timestamp=raw["timestamp"],
                event=raw["event"],
                tool=raw["tool"],
                session=raw["session"],
                input=raw.get("input", ""),
                output=raw.get("output", ""),
            ))
        except KeyError as e:
            raise ValueError(f"Missing required field {e} at line {i}") from e

    return sorted(events, key=lambda e: e.timestamp)


def parse_spec(path: Path) -> ComplianceSpec:
    """Parse a YAML compliance spec file."""
    if not path.is_file():
        raise FileNotFoundError(f"Spec file not found: {path}")
    raw = yaml.safe_load(path.read_text())

    steps: list[Step] = []
    for s in raw["steps"]:
        d = s["detector"]
        steps.append(Step(
            id=s["id"],
            description=s["description"],
            required=s["required"],
            detector=Detector(
                description=d["description"],
                after_step=d.get("after_step"),
                before_step=d.get("before_step"),
            ),
        ))

    if "scoring" not in raw:
        raise KeyError("Missing 'scoring' section in compliance spec")

    return ComplianceSpec(
        id=raw["id"],
        name=raw["name"],
        source_rule=raw["source_rule"],
        version=raw["version"],
        steps=tuple(steps),
        threshold_promote_to_hook=raw["scoring"]["threshold_promote_to_hook"],
    )
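The core of `parse_trace` is "one JSON object per line, then sort by timestamp". A minimal sketch of that behavior using plain dicts instead of the `ObservationEvent` dataclass, with the trace supplied inline rather than read from a file:

```python
import json

# A two-line trace in the fixture format; timestamps deliberately out of
# order to show that parsing sorts by timestamp, as parse_trace does.
raw = "\n".join([
    '{"timestamp":"2026-03-20T10:00:10Z","event":"tool_complete","tool":"Bash","session":"s","input":"{}","output":"1 passed"}',
    '{"timestamp":"2026-03-20T10:00:01Z","event":"tool_complete","tool":"Write","session":"s","input":"{}","output":"File created"}',
])

events = sorted((json.loads(line) for line in raw.splitlines()),
                key=lambda e: e["timestamp"])
print([e["tool"] for e in events])  # → ['Write', 'Bash']
```

Sorting by the ISO-8601 timestamp string works here because lexicographic and chronological order coincide for a fixed-format UTC timestamp.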
170
skills/skill-comply/scripts/report.py
Normal file
170
skills/skill-comply/scripts/report.py
Normal file
@@ -0,0 +1,170 @@
"""Generate Markdown compliance reports."""

from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path

from scripts.grader import ComplianceResult
from scripts.parser import ComplianceSpec, ObservationEvent
from scripts.scenario_generator import Scenario


def generate_report(
    skill_path: Path,
    spec: ComplianceSpec,
    results: list[tuple[str, ComplianceResult, list[ObservationEvent]]],
    scenarios: list[Scenario] | None = None,
) -> str:
    """Generate a Markdown compliance report.

    Args:
        skill_path: Path to the skill file that was tested.
        spec: The compliance spec used for grading.
        results: List of (scenario_level_name, ComplianceResult, observations) tuples.
        scenarios: Original scenario definitions with prompts.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    overall = _overall_compliance(results)
    threshold = spec.threshold_promote_to_hook

    lines: list[str] = []
    lines.append(f"# skill-comply Report: {skill_path.name}")
    lines.append(f"Generated: {now}")
    lines.append("")

    # Summary
    lines.append("## Summary")
    lines.append("")
    lines.append("| Metric | Value |")
    lines.append("|--------|-------|")
    lines.append(f"| Skill | `{skill_path}` |")
    lines.append(f"| Spec | {spec.id} |")
    lines.append(f"| Scenarios | {len(results)} |")
    lines.append(f"| Overall Compliance | {overall:.0%} |")
    lines.append(f"| Threshold | {threshold:.0%} |")

    promote_steps = _steps_to_promote(spec, results, threshold)
    if promote_steps:
        step_names = ", ".join(promote_steps)
        lines.append(f"| Recommendation | **Promote {step_names} to hooks** |")
    else:
        lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
    lines.append("")

    # Expected Behavioral Sequence
    lines.append("## Expected Behavioral Sequence")
    lines.append("")
    lines.append("| # | Step | Required | Description |")
    lines.append("|---|------|----------|-------------|")
    for i, step in enumerate(spec.steps, 1):
        req = "Yes" if step.required else "No"
        lines.append(f"| {i} | {step.id} | {req} | {step.detector.description} |")
    lines.append("")

    # Scenario Results
    lines.append("## Scenario Results")
    lines.append("")
    lines.append("| Scenario | Compliance | Failed Steps |")
    lines.append("|----------|-----------|----------------|")
    for level_name, result, _obs in results:
        failed = [s.step_id for s in result.steps if not s.detected
                  and any(sp.id == s.step_id and sp.required for sp in spec.steps)]
        failed_str = ", ".join(failed) if failed else "—"
        lines.append(f"| {level_name} | {result.compliance_rate:.0%} | {failed_str} |")
    lines.append("")

    # Scenario Prompts
    if scenarios:
        lines.append("## Scenario Prompts")
        lines.append("")
        for s in scenarios:
            lines.append(f"### {s.level_name} (Level {s.level})")
            lines.append("")
            for prompt_line in s.prompt.splitlines():
                lines.append(f"> {prompt_line}")
            lines.append("")

    # Hook Promotion Recommendations (optional/advanced)
    if promote_steps:
        lines.append("## Advanced: Hook Promotion Recommendations (optional)")
        lines.append("")
        for step_id in promote_steps:
            rate = _step_compliance_rate(step_id, results)
            step = next(s for s in spec.steps if s.id == step_id)
            lines.append(
                f"- **{step_id}** (compliance {rate:.0%}): {step.description}"
            )
        lines.append("")

    # Per-scenario details with timeline
    lines.append("## Detail")
    lines.append("")
    for level_name, result, observations in results:
        lines.append(f"### {level_name} (Compliance: {result.compliance_rate:.0%})")
        lines.append("")
        lines.append("| Step | Required | Detected | Reason |")
        lines.append("|------|----------|----------|--------|")
        for sr in result.steps:
            req = "Yes" if any(
                sp.id == sr.step_id and sp.required for sp in spec.steps
            ) else "No"
            det = "YES" if sr.detected else "NO"
            reason = sr.failure_reason or "—"
            lines.append(f"| {sr.step_id} | {req} | {det} | {reason} |")
        lines.append("")

        # Timeline: show what the agent actually did
        if observations:
            # Build reverse index: event_index → step_id
            index_to_step: dict[int, str] = {}
            for step_id, indices in result.classification.items():
                for idx in indices:
                    index_to_step[idx] = step_id

            lines.append(f"**Tool Call Timeline ({len(observations)} calls)**")
            lines.append("")
            lines.append("| # | Tool | Input | Output | Classified As |")
            lines.append("|---|------|-------|--------|------|")
            for i, obs in enumerate(observations):
                step_label = index_to_step.get(i, "—")
                input_summary = obs.input[:100].replace("|", "\\|").replace("\n", " ")
                output_summary = obs.output[:50].replace("|", "\\|").replace("\n", " ")
                lines.append(
                    f"| {i} | {obs.tool} | {input_summary} | {output_summary} | {step_label} |"
                )
            lines.append("")

    return "\n".join(lines)


def _overall_compliance(results: list[tuple[str, ComplianceResult, list[ObservationEvent]]]) -> float:
    if not results:
        return 0.0
    return sum(r.compliance_rate for _, r, _obs in results) / len(results)


def _step_compliance_rate(
    step_id: str,
    results: list[tuple[str, ComplianceResult, list[ObservationEvent]]],
) -> float:
    detected = sum(
        1 for _, r, _obs in results
        for s in r.steps if s.step_id == step_id and s.detected
    )
    return detected / len(results) if results else 0.0


def _steps_to_promote(
    spec: ComplianceSpec,
    results: list[tuple[str, ComplianceResult, list[ObservationEvent]]],
    threshold: float,
) -> list[str]:
    promote = []
    for step in spec.steps:
        if not step.required:
            continue
        rate = _step_compliance_rate(step.id, results)
        if rate < threshold:
            promote.append(step.id)
    return promote
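The promotion helpers above reduce to simple per-step rate arithmetic. A minimal standalone sketch with hypothetical step names and detection outcomes (the real code works on `ComplianceResult` objects, not plain dicts):

```python
# Sketch of the _step_compliance_rate / _steps_to_promote arithmetic,
# using hypothetical per-run detection maps instead of ComplianceResult.

def step_rate(step_id: str, runs: list[dict[str, bool]]) -> float:
    """Fraction of runs in which a step was detected."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.get(step_id)) / len(runs)

def steps_to_promote(required: list[str], runs: list[dict[str, bool]],
                     threshold: float) -> list[str]:
    """Required steps whose detection rate falls below the threshold."""
    return [s for s in required if step_rate(s, runs) < threshold]

# Three scenario runs: write_test detected in 1 of 3, run_test in all 3.
runs = [
    {"write_test": True, "run_test": True},
    {"write_test": False, "run_test": True},
    {"write_test": False, "run_test": True},
]
promote = steps_to_promote(["write_test", "run_test"], runs, threshold=0.6)
print(promote)  # write_test is at 33%, below the 60% threshold
```

A step that agents skip in most runs (here, writing the test first) is exactly the kind of behavior the report recommends enforcing mechanically via a hook.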
127
skills/skill-comply/scripts/run.py
Normal file
@@ -0,0 +1,127 @@
"""CLI entry point for skill-comply."""

from __future__ import annotations

import argparse
import logging
import sys
from pathlib import Path
from typing import Any

import yaml

from scripts.grader import grade
from scripts.report import generate_report
from scripts.runner import run_scenario
from scripts.scenario_generator import generate_scenarios
from scripts.spec_generator import generate_spec

logger = logging.getLogger(__name__)


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    parser = argparse.ArgumentParser(
        description="skill-comply: Measure skill compliance rates",
    )
    parser.add_argument(
        "skill",
        type=Path,
        help="Path to skill/rule file to test",
    )
    parser.add_argument(
        "--model",
        default="sonnet",
        help="Model for scenario execution (default: sonnet)",
    )
    parser.add_argument(
        "--gen-model",
        default="haiku",
        help="Model for spec/scenario generation (default: haiku)",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Generate spec and scenarios without executing",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=None,
        help="Output report path (default: results/<skill-name>.md)",
    )

    args = parser.parse_args()

    if not args.skill.is_file():
        logger.error("Error: Skill file not found: %s", args.skill)
        sys.exit(1)

    results_dir = Path(__file__).parent.parent / "results"
    results_dir.mkdir(exist_ok=True)

    # Step 1: Generate compliance spec
    logger.info("[1/4] Generating compliance spec from %s...", args.skill.name)
    spec = generate_spec(args.skill, model=args.gen_model)
    logger.info("  %d steps extracted", len(spec.steps))

    # Step 2: Generate scenarios
    spec_yaml = yaml.dump({
        "steps": [
            {"id": s.id, "description": s.description, "required": s.required}
            for s in spec.steps
        ]
    })
    logger.info("[2/4] Generating scenarios (3 prompt strictness levels)...")
    scenarios = generate_scenarios(args.skill, spec_yaml, model=args.gen_model)
    logger.info("  %d scenarios generated", len(scenarios))

    for s in scenarios:
        logger.info("  - %s: %s", s.level_name, s.description[:60])

    if args.dry_run:
        logger.info("\n[dry-run] Spec and scenarios generated. Skipping execution.")
        logger.info("\nSpec: %s (%d steps)", spec.id, len(spec.steps))
        for step in spec.steps:
            marker = "*" if step.required else " "
            logger.info("  [%s] %s: %s", marker, step.id, step.description)
        return

    # Step 3: Execute scenarios
    logger.info("[3/4] Executing scenarios (model=%s)...", args.model)
    graded_results: list[tuple[str, Any, list[Any]]] = []

    for scenario in scenarios:
        logger.info("  Running %s...", scenario.level_name)
        run = run_scenario(scenario, model=args.model)
        result = grade(spec, list(run.observations))
        graded_results.append((scenario.level_name, result, list(run.observations)))
        logger.info("  %s: %.0f%%", scenario.level_name, result.compliance_rate * 100)

    # Step 4: Generate report
    skill_name = args.skill.parent.name if args.skill.stem == "SKILL" else args.skill.stem
    output_path = args.output or results_dir / f"{skill_name}.md"
    logger.info("[4/4] Generating report...")

    report = generate_report(args.skill, spec, graded_results, scenarios=scenarios)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(report)
    logger.info("  Report saved to %s", output_path)

    # Summary
    if not graded_results:
        logger.warning("No scenarios were executed.")
        return
    overall = sum(r.compliance_rate for _, r, _obs in graded_results) / len(graded_results)
    logger.info("\n%s", "=" * 50)
    logger.info("Overall Compliance: %.0f%%", overall * 100)
    if overall < spec.threshold_promote_to_hook:
        logger.info(
            "Recommendation: Some steps have low compliance. "
            "Consider promoting them to hooks. See the report for details."
        )


if __name__ == "__main__":
    main()
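Step 4's report naming rule is easy to miss: a `SKILL.md` file reports under its parent directory's name, while any other file uses its own stem. A small sketch of that `pathlib` logic, with illustrative paths:

```python
from pathlib import Path

def derive_skill_name(skill: Path) -> str:
    """Mirror run.py's naming rule: a SKILL.md file reports under its
    parent directory's name; any other file uses its own stem."""
    return skill.parent.name if skill.stem == "SKILL" else skill.stem

# Hypothetical skill layouts (paths are illustrative, not from the repo).
print(derive_skill_name(Path("skills/tdd-workflow/SKILL.md")))  # tdd-workflow
print(derive_skill_name(Path("rules/no-force-push.md")))        # no-force-push
```

This keeps reports keyed by the skill's directory name even though every skill's entry file is literally named `SKILL.md`.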
161
skills/skill-comply/scripts/runner.py
Normal file
@@ -0,0 +1,161 @@
"""Run scenarios via claude -p and parse tool calls from stream-json output."""

from __future__ import annotations

import json
import re
import shlex
import shutil
import subprocess
from dataclasses import dataclass
from pathlib import Path

from scripts.parser import ObservationEvent
from scripts.scenario_generator import Scenario

SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
ALLOWED_MODELS = frozenset({"haiku", "sonnet", "opus"})


@dataclass(frozen=True)
class ScenarioRun:
    scenario: Scenario
    observations: tuple[ObservationEvent, ...]
    sandbox_dir: Path


def run_scenario(
    scenario: Scenario,
    model: str = "sonnet",
    max_turns: int = 30,
    timeout: int = 300,
) -> ScenarioRun:
    """Execute a scenario and extract tool calls from stream-json output."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"Unknown model: {model!r}. Allowed: {ALLOWED_MODELS}")

    sandbox_dir = _safe_sandbox_dir(scenario.id)
    _setup_sandbox(sandbox_dir, scenario)

    result = subprocess.run(
        [
            "claude", "-p", scenario.prompt,
            "--model", model,
            "--max-turns", str(max_turns),
            "--add-dir", str(sandbox_dir),
            "--allowedTools", "Read,Write,Edit,Bash,Glob,Grep",
            "--output-format", "stream-json",
            "--verbose",
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
        cwd=sandbox_dir,
    )

    if result.returncode != 0:
        raise RuntimeError(
            f"claude -p failed (rc={result.returncode}): {result.stderr[:500]}"
        )

    observations = _parse_stream_json(result.stdout)

    return ScenarioRun(
        scenario=scenario,
        observations=tuple(observations),
        sandbox_dir=sandbox_dir,
    )


def _safe_sandbox_dir(scenario_id: str) -> Path:
    """Sanitize scenario ID and ensure path stays within sandbox base."""
    safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
    path = SANDBOX_BASE / safe_id
    # Validate path stays within sandbox base (raises ValueError on traversal)
    path.resolve().relative_to(SANDBOX_BASE.resolve())
    return path


def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
    """Create sandbox directory and run setup commands."""
    if sandbox_dir.exists():
        shutil.rmtree(sandbox_dir)
    sandbox_dir.mkdir(parents=True)

    subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)

    for cmd in scenario.setup_commands:
        parts = shlex.split(cmd)
        subprocess.run(parts, cwd=sandbox_dir, capture_output=True)


def _parse_stream_json(stdout: str) -> list[ObservationEvent]:
    """Parse claude -p stream-json output into ObservationEvents.

    Stream-json format:
    - type=assistant with content[].type=tool_use → tool call (name, input)
    - type=user with content[].type=tool_result → tool result (output)
    """
    events: list[ObservationEvent] = []
    pending: dict[str, dict] = {}
    event_counter = 0

    for line in stdout.strip().splitlines():
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue

        msg_type = msg.get("type")

        if msg_type == "assistant":
            content = msg.get("message", {}).get("content", [])
            for block in content:
                if block.get("type") == "tool_use":
                    tool_use_id = block.get("id", "")
                    tool_input = block.get("input", {})
                    input_str = (
                        json.dumps(tool_input)[:5000]
                        if isinstance(tool_input, dict)
                        else str(tool_input)[:5000]
                    )
                    pending[tool_use_id] = {
                        "tool": block.get("name", "unknown"),
                        "input": input_str,
                        "order": event_counter,
                    }
                    event_counter += 1

        elif msg_type == "user":
            content = msg.get("message", {}).get("content", [])
            if isinstance(content, list):
                for block in content:
                    tool_use_id = block.get("tool_use_id", "")
                    if tool_use_id in pending:
                        info = pending.pop(tool_use_id)
                        output_content = block.get("content", "")
                        if isinstance(output_content, list):
                            output_str = json.dumps(output_content)[:5000]
                        else:
                            output_str = str(output_content)[:5000]

                        events.append(ObservationEvent(
                            timestamp=f"T{info['order']:04d}",
                            event="tool_complete",
                            tool=info["tool"],
                            session=msg.get("session_id", "unknown"),
                            input=info["input"],
                            output=output_str,
                        ))

    for _tool_use_id, info in pending.items():
        events.append(ObservationEvent(
            timestamp=f"T{info['order']:04d}",
            event="tool_complete",
            tool=info["tool"],
            session="unknown",
            input=info["input"],
            output="",
        ))

    return sorted(events, key=lambda e: e.timestamp)
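The core of `_parse_stream_json` is pairing each `tool_use` block with the later `tool_result` that carries its `tool_use_id`. A simplified, self-contained sketch of that pairing on a hypothetical two-line transcript (the real function also tracks ordering, truncation, and unmatched calls):

```python
import json

# Hypothetical two-line stream-json transcript: one tool call, one result.
transcript = "\n".join([
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "tu_1", "name": "Read",
         "input": {"file_path": "a.py"}}]}}),
    json.dumps({"type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "tu_1",
         "content": "print('hi')"}]}}),
])

pending: dict[str, str] = {}          # tool_use_id -> tool name
calls: list[tuple[str, str]] = []     # (tool name, result text)
for line in transcript.splitlines():
    msg = json.loads(line)
    for block in msg.get("message", {}).get("content", []):
        if block.get("type") == "tool_use":
            pending[block["id"]] = block.get("name", "unknown")
        elif block.get("tool_use_id") in pending:
            calls.append((pending.pop(block["tool_use_id"]),
                          str(block.get("content", ""))))

print(calls)  # [('Read', "print('hi')")]
```

Because results arrive in separate `user` messages, the `pending` map is what lets a single pass over the stream reassemble complete tool-call records.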
70
skills/skill-comply/scripts/scenario_generator.py
Normal file
@@ -0,0 +1,70 @@
"""Generate pressure scenarios from skill + spec using LLM."""

from __future__ import annotations

import subprocess
from dataclasses import dataclass
from pathlib import Path

import yaml

from scripts.utils import extract_yaml

PROMPTS_DIR = Path(__file__).parent.parent / "prompts"


@dataclass(frozen=True)
class Scenario:
    id: str
    level: int
    level_name: str
    description: str
    prompt: str
    setup_commands: tuple[str, ...]


def generate_scenarios(
    skill_path: Path,
    spec_yaml: str,
    model: str = "haiku",
) -> list[Scenario]:
    """Generate 3 scenarios with decreasing prompt strictness.

    Calls claude -p with the scenario_generator prompt, parses YAML output.
    """
    skill_content = skill_path.read_text()
    prompt_template = (PROMPTS_DIR / "scenario_generator.md").read_text()
    prompt = (
        prompt_template
        .replace("{skill_content}", skill_content)
        .replace("{spec_yaml}", spec_yaml)
    )

    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model, "--output-format", "text"],
        capture_output=True,
        text=True,
        timeout=120,
    )

    if result.returncode != 0:
        raise RuntimeError(f"claude -p failed: {result.stderr}")

    if not result.stdout.strip():
        raise RuntimeError("claude -p returned empty output")

    raw_yaml = extract_yaml(result.stdout)
    parsed = yaml.safe_load(raw_yaml)

    scenarios: list[Scenario] = []
    for s in parsed["scenarios"]:
        scenarios.append(Scenario(
            id=s["id"],
            level=s["level"],
            level_name=s["level_name"],
            description=s["description"],
            prompt=s["prompt"].strip(),
            setup_commands=tuple(s.get("setup_commands", [])),
        ))

    return sorted(scenarios, key=lambda s: s.level)
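The tail of `generate_scenarios` converts parsed YAML into frozen `Scenario` records sorted by strictness level. A simplified stdlib-only sketch of that conversion, using a dict literal standing in for `yaml.safe_load` output (field values are illustrative, and the `description` field is omitted for brevity):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    id: str
    level: int
    level_name: str
    prompt: str
    setup_commands: tuple[str, ...]

# Stand-in for yaml.safe_load(raw_yaml); contents are hypothetical.
parsed = {"scenarios": [
    {"id": "s3", "level": 3, "level_name": "bare",
     "prompt": "Fix the bug.\n"},
    {"id": "s1", "level": 1, "level_name": "strict",
     "prompt": "Follow TDD strictly.",
     "setup_commands": ["touch src/fib.py"]},
]}

scenarios = sorted(
    (Scenario(
        id=s["id"],
        level=s["level"],
        level_name=s["level_name"],
        prompt=s["prompt"].strip(),
        setup_commands=tuple(s.get("setup_commands", [])),
    ) for s in parsed["scenarios"]),
    key=lambda s: s.level,
)
print([s.id for s in scenarios])  # ['s1', 's3']
```

Sorting by `level` means downstream code can rely on scenarios running from strictest prompt to barest, and `tuple(...)` keeps the frozen dataclass fully immutable.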
72
skills/skill-comply/scripts/spec_generator.py
Normal file
@@ -0,0 +1,72 @@
"""Generate compliance specs from skill files using LLM."""

from __future__ import annotations

import subprocess
import tempfile
from pathlib import Path

import yaml

from scripts.parser import ComplianceSpec, parse_spec
from scripts.utils import extract_yaml

PROMPTS_DIR = Path(__file__).parent.parent / "prompts"


def generate_spec(
    skill_path: Path,
    model: str = "haiku",
    max_retries: int = 2,
) -> ComplianceSpec:
    """Generate a compliance spec from a skill/rule file.

    Calls claude -p with the spec_generator prompt, parses YAML output.
    Retries on YAML parse errors with error feedback.
    """
    skill_content = skill_path.read_text()
    prompt_template = (PROMPTS_DIR / "spec_generator.md").read_text()
    base_prompt = prompt_template.replace("{skill_content}", skill_content)

    last_error: Exception | None = None

    for attempt in range(max_retries + 1):
        prompt = base_prompt
        if attempt > 0 and last_error is not None:
            prompt += (
                f"\n\nPREVIOUS ATTEMPT FAILED with YAML parse error:\n"
                f"{last_error}\n\n"
                f"Please fix the YAML. Remember to quote all string values "
                f"that contain colons, e.g.: description: \"Use type: description format\""
            )

        result = subprocess.run(
            ["claude", "-p", prompt, "--model", model, "--output-format", "text"],
            capture_output=True,
            text=True,
            timeout=120,
        )

        if result.returncode != 0:
            raise RuntimeError(f"claude -p failed: {result.stderr}")

        raw_yaml = extract_yaml(result.stdout)

        tmp_path = None
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".yaml", delete=False,
        ) as f:
            f.write(raw_yaml)
            tmp_path = Path(f.name)

        try:
            return parse_spec(tmp_path)
        except (yaml.YAMLError, KeyError, TypeError) as e:
            last_error = e
            if attempt == max_retries:
                raise
        finally:
            if tmp_path is not None:
                tmp_path.unlink(missing_ok=True)

    raise RuntimeError("unreachable")
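The retry loop's key idea is feeding the previous parse error back into the next prompt. A minimal sketch of that pattern with a stubbed generator and parser in place of the `claude` subprocess call and `parse_spec` (all names here are illustrative):

```python
# Sketch of generate_spec's retry-with-error-feedback loop, with the LLM
# call replaced by a canned list of responses and a toy parser.

def parse(text: str) -> dict:
    """Toy parser: anything containing 'bad' fails, like malformed YAML."""
    if "bad" in text:
        raise ValueError("could not parse")
    return {"ok": True}

def generate_with_retries(base_prompt: str, responses: list[str],
                          max_retries: int = 2) -> dict:
    last_error: Exception | None = None
    for attempt in range(max_retries + 1):
        prompt = base_prompt
        if attempt > 0 and last_error is not None:
            # Feed the prior failure back so the next attempt can fix it.
            prompt += f"\nPREVIOUS ATTEMPT FAILED: {last_error}"
        raw = responses[attempt]  # stands in for calling the LLM with `prompt`
        try:
            return parse(raw)
        except ValueError as e:
            last_error = e
            if attempt == max_retries:
                raise
    raise RuntimeError("unreachable")

result = generate_with_retries("spec prompt", ["bad output", "steps: ok"])
print(result)  # {'ok': True}
```

The first response fails to parse, the error is appended to the prompt, and the second attempt succeeds; only after exhausting `max_retries` does the last error propagate.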
13
skills/skill-comply/scripts/utils.py
Normal file
@@ -0,0 +1,13 @@
"""Shared utilities for skill-comply scripts."""

from __future__ import annotations


def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, stripping markdown fences if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
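`extract_yaml` is pure and easy to exercise standalone. A quick demo, with the function inlined so the snippet is self-contained (the fenced input string is illustrative):

```python
def extract_yaml(text: str) -> str:
    """Inlined copy of scripts/utils.extract_yaml for a standalone demo."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

fenced = "```yaml\nid: tdd-workflow\nsteps: []\n```"
print(extract_yaml(fenced))  # the fences are gone, the YAML body remains
```

Unfenced input passes through unchanged, so callers can hand it raw LLM output without first checking whether the model wrapped its answer in a code block.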
137
skills/skill-comply/tests/test_grader.py
Normal file
@@ -0,0 +1,137 @@
"""Tests for grader module — compliance scoring with LLM classification."""

from pathlib import Path
from unittest.mock import patch

import pytest

from scripts.grader import ComplianceResult, StepResult, grade
from scripts.parser import parse_spec, parse_trace

FIXTURES = Path(__file__).parent.parent / "fixtures"


@pytest.fixture
def tdd_spec():
    return parse_spec(FIXTURES / "tdd_spec.yaml")


@pytest.fixture
def compliant_trace():
    return parse_trace(FIXTURES / "compliant_trace.jsonl")


@pytest.fixture
def noncompliant_trace():
    return parse_trace(FIXTURES / "noncompliant_trace.jsonl")


def _mock_compliant_classification(spec, trace, model="haiku"):  # noqa: ARG001
    """Simulate LLM correctly classifying a compliant trace."""
    return {
        "write_test": [0],
        "run_test_red": [1],
        "write_impl": [2],
        "run_test_green": [3],
        "refactor": [4],
    }


def _mock_noncompliant_classification(spec, trace, model="haiku"):
    """Simulate LLM classifying a noncompliant trace (impl before test)."""
    return {
        "write_impl": [0],  # src/fib.py written first
        "write_test": [1],  # test written second
        "run_test_green": [2],  # only a passing test run
    }


def _mock_empty_classification(spec, trace, model="haiku"):
    return {}


class TestGradeCompliant:
    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_returns_compliance_result(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        assert isinstance(result, ComplianceResult)

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_full_compliance(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        assert result.compliance_rate == 1.0

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_all_required_steps_detected(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        required_results = [s for s in result.steps if s.step_id in
                            ("write_test", "run_test_red", "write_impl", "run_test_green")]
        assert all(s.detected for s in required_results)

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_optional_step_detected(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        refactor = next(s for s in result.steps if s.step_id == "refactor")
        assert refactor.detected is True

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_no_hook_promotion_recommended(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        assert result.recommend_hook_promotion is False

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_step_evidence_not_empty(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        for step in result.steps:
            if step.detected:
                assert len(step.evidence) > 0


class TestGradeNoncompliant:
    @patch("scripts.grader.classify_events", side_effect=_mock_noncompliant_classification)
    def test_low_compliance(self, mock_cls, tdd_spec, noncompliant_trace) -> None:
        result = grade(tdd_spec, noncompliant_trace)
        assert result.compliance_rate < 1.0

    @patch("scripts.grader.classify_events", side_effect=_mock_noncompliant_classification)
    def test_write_test_fails_ordering(self, mock_cls, tdd_spec, noncompliant_trace) -> None:
        """write_test has before_step=write_impl, but test is written AFTER impl."""
        result = grade(tdd_spec, noncompliant_trace)
        write_test = next(s for s in result.steps if s.step_id == "write_test")
        assert write_test.detected is False

    @patch("scripts.grader.classify_events", side_effect=_mock_noncompliant_classification)
    def test_run_test_red_not_detected(self, mock_cls, tdd_spec, noncompliant_trace) -> None:
        result = grade(tdd_spec, noncompliant_trace)
        run_red = next(s for s in result.steps if s.step_id == "run_test_red")
        assert run_red.detected is False

    @patch("scripts.grader.classify_events", side_effect=_mock_noncompliant_classification)
    def test_hook_promotion_recommended(self, mock_cls, tdd_spec, noncompliant_trace) -> None:
        result = grade(tdd_spec, noncompliant_trace)
        assert result.recommend_hook_promotion is True

    @patch("scripts.grader.classify_events", side_effect=_mock_noncompliant_classification)
    def test_failure_reasons_present(self, mock_cls, tdd_spec, noncompliant_trace) -> None:
        result = grade(tdd_spec, noncompliant_trace)
        failed_steps = [s for s in result.steps if not s.detected and s.step_id != "refactor"]
        for step in failed_steps:
            assert step.failure_reason is not None


class TestGradeEdgeCases:
    @patch("scripts.grader.classify_events", side_effect=_mock_empty_classification)
    def test_empty_trace(self, mock_cls, tdd_spec) -> None:
        result = grade(tdd_spec, [])
        assert result.compliance_rate == 0.0
        assert result.recommend_hook_promotion is True

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_compliance_rate_is_ratio_of_required_only(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        assert result.compliance_rate == 1.0

    @patch("scripts.grader.classify_events", side_effect=_mock_compliant_classification)
    def test_spec_id_in_result(self, mock_cls, tdd_spec, compliant_trace) -> None:
        result = grade(tdd_spec, compliant_trace)
        assert result.spec_id == "tdd-workflow"
90
skills/skill-comply/tests/test_parser.py
Normal file
@@ -0,0 +1,90 @@
"""Tests for parser module — JSONL trace and YAML spec parsing."""

from pathlib import Path

import pytest

from scripts.parser import (
    ComplianceSpec,
    Detector,
    ObservationEvent,
    Step,
    parse_spec,
    parse_trace,
)

FIXTURES = Path(__file__).parent.parent / "fixtures"


class TestParseTrace:
    def test_parses_compliant_trace(self) -> None:
        events = parse_trace(FIXTURES / "compliant_trace.jsonl")
        assert len(events) == 5
        assert all(isinstance(e, ObservationEvent) for e in events)

    def test_events_sorted_by_timestamp(self) -> None:
        events = parse_trace(FIXTURES / "compliant_trace.jsonl")
        timestamps = [e.timestamp for e in events]
        assert timestamps == sorted(timestamps)

    def test_event_fields(self) -> None:
        events = parse_trace(FIXTURES / "compliant_trace.jsonl")
        first = events[0]
        assert first.tool == "Write"
        assert first.session == "sess-001"
        assert "test_fib.py" in first.input
        assert first.output == "File created"

    def test_parses_noncompliant_trace(self) -> None:
        events = parse_trace(FIXTURES / "noncompliant_trace.jsonl")
        assert len(events) == 3
        assert "src/fib.py" in events[0].input

    def test_empty_file_returns_empty_list(self, tmp_path: Path) -> None:
        empty = tmp_path / "empty.jsonl"
        empty.write_text("")
        events = parse_trace(empty)
        assert events == []

    def test_nonexistent_file_raises(self) -> None:
        with pytest.raises(FileNotFoundError):
            parse_trace(Path("/nonexistent/trace.jsonl"))


class TestParseSpec:
    def test_parses_tdd_spec(self) -> None:
        spec = parse_spec(FIXTURES / "tdd_spec.yaml")
        assert isinstance(spec, ComplianceSpec)
        assert spec.id == "tdd-workflow"
        assert len(spec.steps) == 5

    def test_step_fields(self) -> None:
        spec = parse_spec(FIXTURES / "tdd_spec.yaml")
        first = spec.steps[0]
        assert isinstance(first, Step)
        assert first.id == "write_test"
        assert first.required is True
        assert isinstance(first.detector, Detector)
        assert "test file" in first.detector.description
        assert first.detector.before_step == "write_impl"

    def test_optional_detector_fields(self) -> None:
        spec = parse_spec(FIXTURES / "tdd_spec.yaml")
        write_test = spec.steps[0]
        assert write_test.detector.after_step is None

        run_test_red = spec.steps[1]
        assert run_test_red.detector.after_step == "write_test"
        assert run_test_red.detector.before_step == "write_impl"

    def test_scoring_threshold(self) -> None:
        spec = parse_spec(FIXTURES / "tdd_spec.yaml")
        assert spec.threshold_promote_to_hook == 0.6

    def test_required_vs_optional_steps(self) -> None:
        spec = parse_spec(FIXTURES / "tdd_spec.yaml")
        required = [s for s in spec.steps if s.required]
        optional = [s for s in spec.steps if not s.required]
        assert len(required) == 4
        assert len(optional) == 1
        assert optional[0].id == "refactor"