From 0f40fd030c222f1c56ae9ef05d5115ac9fd140a0 Mon Sep 17 00:00:00 2001
From: Hidai Bar-Mor <31502796+hidai25@users.noreply.github.com>
Date: Wed, 1 Apr 2026 00:13:32 +0300
Subject: [PATCH] feat(skills): add evalview-agent-testing skill and MCP server (#828)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(skills): add evalview-agent-testing skill and MCP server

Add EvalView as a regression testing skill for AI agents. EvalView snapshots agent behavior (tool calls, parameters, output), then diffs against baselines after every change — catching regressions before they ship.

Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents

Also adds evalview MCP server to mcp-servers.json.

* fix(skills): pin action SHA and remove unvetted external links

- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)

Addresses cubic-dev-ai review feedback.

* fix(skills): replace third-party action with pip install + CLI

Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code.

Addresses cubic-dev-ai supply-chain review feedback.

* fix(skills): add destructive revert warning for gate_or_revert

Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash.

Addresses cubic-dev-ai review feedback.

* fix(skills): pin pip version range and document fail-on tradeoffs

- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through

Addresses greptile-apps review feedback.

* fix: use python3 -m evalview for venv compatibility in MCP config

Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.

* fix: align MCP install command with mcp-servers.json pattern

Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.

* fix: use evalview CLI entry point for MCP command

pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.

* fix: pin install version to match CI section

* fix: pin all pip install references consistently

* fix: add API key placeholder and pin install version in MCP config

Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional — deterministic checks work without it. Pin install version to match skill docs.

* fix: guard score_delta format for non-scored statuses

---------

Co-authored-by: Affaan Mustafa
---
 mcp-configs/mcp-servers.json           |   8 ++
 skills/evalview-agent-testing/SKILL.md | 160 +++++++++++++++++++++++++
 2 files changed, 168 insertions(+)
 create mode 100644 skills/evalview-agent-testing/SKILL.md

diff --git a/mcp-configs/mcp-servers.json b/mcp-configs/mcp-servers.json
index 8c0b02b8..c62d8866 100644
--- a/mcp-configs/mcp-servers.json
+++ b/mcp-configs/mcp-servers.json
@@ -152,6 +152,14 @@
         "CONFLUENCE_API_TOKEN": "YOUR_CONFLUENCE_TOKEN_HERE"
       },
       "description": "Confluence Cloud integration — search pages, retrieve content, explore spaces"
+    },
+    "evalview": {
+      "command": "python3",
+      "args": ["-m", "evalview", "mcp", "serve"],
+      "env": {
+        "OPENAI_API_KEY": "YOUR_OPENAI_API_KEY_HERE"
+      },
+      "description": "AI agent regression testing — snapshot behavior, detect regressions in tool calls and output quality. 8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report. API key optional — deterministic checks (tool diff, output hash) work without it. Install: pip install \"evalview>=0.5,<1\""
     }
   },
   "_comments": {
diff --git a/skills/evalview-agent-testing/SKILL.md b/skills/evalview-agent-testing/SKILL.md
new file mode 100644
index 00000000..326d7a1a
--- /dev/null
+++ b/skills/evalview-agent-testing/SKILL.md
@@ -0,0 +1,160 @@
+---
+name: evalview-agent-testing
+description: Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.
+origin: ECC
+tools: Bash, Read, Write
+---
+
+# EvalView Agent Testing
+
+Automated regression testing for AI agents. EvalView snapshots your agent's behavior (tool calls, parameters, sequence, output), then diffs against the baseline after every change. When something breaks, you know immediately — before it ships.
+
+## When to Activate
+
+- After modifying agent code, prompts, or tool definitions
+- After a model update or provider change
+- Before deploying an agent to production
+- When setting up CI/CD for an agent project
+- When an autonomous loop (OpenClaw, coding agents) needs a fitness function
+- When agent output changes unexpectedly and you need to identify what shifted
+
+## Core Workflow
+
+```bash
+# 1. Set up
+pip install "evalview>=0.5,<1"
+evalview init         # Detect agent, create starter test suite
+
+# 2. Baseline
+evalview snapshot     # Save current behavior as golden baseline
+
+# 3. Gate every change
+evalview check        # Diff against baseline — catches regressions
+
+# 4. Monitor in production
+evalview monitor --slack-webhook https://hooks.slack.com/services/...
+```
+
+## Understanding Check Results
+
+| Status | Meaning | Action |
+|--------|---------|--------|
+| `PASSED` | Behavior matches baseline | Ship with confidence |
+| `TOOLS_CHANGED` | Different tools called | Review the diff |
+| `OUTPUT_CHANGED` | Same tools, output shifted | Review the diff |
+| `REGRESSION` | Score dropped significantly | Fix before shipping |
+
+## Python API for Autonomous Loops
+
+Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:
+
+```python
+from evalview import gate, DiffStatus
+
+# Full evaluation
+result = gate(test_dir="tests/")
+if not result.passed:
+    for d in result.diffs:
+        if not d.passed:
+            delta = f" ({d.score_delta:+.1f})" if d.score_delta is not None else ""
+            print(f" {d.test_name}: {d.status.value}{delta}")
+
+# Quick mode — no LLM judge, $0, sub-second
+result = gate(test_dir="tests/", quick=True)
+```
+
+### Auto-Revert on Regression
+
+```python
+from evalview.openclaw import gate_or_revert
+
+# In an autonomous coding loop:
+make_code_change()
+if not gate_or_revert("tests/", quick=True):
+    # Change was automatically reverted
+    try_alternative_approach()
+```
+
+> **Warning:** `gate_or_revert` runs `git checkout -- .` when a regression is detected, discarding uncommitted changes. Commit or stash work-in-progress before entering the loop. You can also pass a custom revert command: `gate_or_revert("tests/", revert_cmd="git stash")`.
+
+## MCP Integration
+
+EvalView exposes 8 tools via MCP — works with Claude Code, Cursor, and any MCP client:
+
+```bash
+claude mcp add --transport stdio evalview -- evalview mcp serve
+```
+
+Tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`
+
+After connecting, Claude Code can proactively check for regressions after code changes:
+- "Did my refactor break anything?" triggers `run_check`
+- "Save this as the new baseline" triggers `run_snapshot`
+- "Add a test for the weather tool" triggers `create_test`
+
+## CI/CD Integration
+
+```yaml
+# .github/workflows/evalview.yml
+name: Agent Regression Check
+on: [pull_request, push]
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    env:
+      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+    steps:
+      - uses: actions/checkout@v4
+      - run: pip install "evalview>=0.5,<1"
+      - run: evalview check --fail-on REGRESSION
+```
+
+`--fail-on REGRESSION` gates on score drops only. For stricter gating that also catches tool sequence changes, use `--fail-on REGRESSION,TOOLS_CHANGED` or `--strict` (fails on any change).
+
+## Test Case Format
+
+```yaml
+name: refund-flow
+input:
+  query: "I need a refund for order #4812"
+expected:
+  tools: ["lookup_order", "check_refund_policy", "issue_refund"]
+  forbidden_tools: ["delete_order"]
+  output:
+    contains: ["refund", "processed"]
+    not_contains: ["error"]
+thresholds:
+  min_score: 70
+```
+
+Multi-turn tests are also supported:
+
+```yaml
+name: clarification-flow
+turns:
+  - query: "I want a refund"
+    expected:
+      output:
+        contains: ["order number"]
+  - query: "Order 4812"
+    expected:
+      tools: ["lookup_order", "issue_refund"]
+```
+
+## Best Practices
+
+- **Snapshot after every intentional change.** Baselines should reflect intended behavior.
+- **Use `--preview` before snapshotting.** `evalview snapshot --preview` shows what would change without saving.
+- **Quick mode for tight loops.** `gate(quick=True)` skips the LLM judge — free and fast for iterative development.
+- **Full evaluation for final validation.** Run without `quick=True` before deploying to get LLM-as-judge scoring.
+- **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
+- **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
+- **Monitor in production.** `evalview monitor` catches gradual drift that individual checks miss.
+
+## Installation
+
+```bash
+pip install "evalview>=0.5,<1"
+```
+
+Package: [evalview on PyPI](https://pypi.org/project/evalview/)
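
Reviewer note: the `--fail-on` tradeoff documented in the CI/CD section of the new SKILL.md can be modeled as a small policy function. The sketch below is an illustration only and is not EvalView's implementation. `Status`, `STRICT`, `parse_fail_on`, and `should_fail` are hypothetical names; the only facts taken from the patch are the four status names and the described behavior of `--fail-on REGRESSION`, `--fail-on REGRESSION,TOOLS_CHANGED`, and `--strict`.

```python
from __future__ import annotations

from enum import Enum
from typing import Iterable


class Status(Enum):
    """Check statuses listed in the SKILL.md results table."""

    PASSED = "PASSED"
    TOOLS_CHANGED = "TOOLS_CHANGED"
    OUTPUT_CHANGED = "OUTPUT_CHANGED"
    REGRESSION = "REGRESSION"


# Hypothetical model of --strict: every status other than PASSED fails the run.
STRICT = frozenset(s for s in Status if s is not Status.PASSED)


def parse_fail_on(spec: str) -> frozenset[Status]:
    """Turn a comma-separated spec such as 'REGRESSION,TOOLS_CHANGED' into a set."""
    return frozenset(Status[name.strip()] for name in spec.split(",") if name.strip())


def should_fail(statuses: Iterable[Status], fail_on: frozenset[Status]) -> bool:
    """Return True when any observed status belongs to the gating set."""
    return any(status in fail_on for status in statuses)


# One test passed and one called different tools than the baseline.
observed = [Status.PASSED, Status.TOOLS_CHANGED]
print(should_fail(observed, parse_fail_on("REGRESSION")))                # False
print(should_fail(observed, parse_fail_on("REGRESSION,TOOLS_CHANGED")))  # True
print(should_fail(observed, STRICT))                                     # True
```

Under this model, a change that only alters which tools are called passes with `--fail-on REGRESSION` but fails with either stricter setting, which is the tradeoff the commit message asks users to understand.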