From 44dfc35b16a832307279f1940a5953008e8bcda9 Mon Sep 17 00:00:00 2001
From: Affaan Mustafa
Date: Tue, 31 Mar 2026 14:27:09 -0700
Subject: [PATCH] =?UTF-8?q?fix(security):=20remove=20evalview-agent-testin?=
 =?UTF-8?q?g=20skill=20=E2=80=94=20external=20dependency?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Removed skills/evalview-agent-testing/ which required `pip install
evalview` from an unvetted third-party package. ECC skills must be
self-contained and not require installing external packages to
function. If we need agent regression testing, we build it natively
in ECC.
---
 skills/evalview-agent-testing/SKILL.md | 160 -------------------------
 1 file changed, 160 deletions(-)
 delete mode 100644 skills/evalview-agent-testing/SKILL.md

diff --git a/skills/evalview-agent-testing/SKILL.md b/skills/evalview-agent-testing/SKILL.md
deleted file mode 100644
index 326d7a1a..00000000
--- a/skills/evalview-agent-testing/SKILL.md
+++ /dev/null
@@ -1,160 +0,0 @@
----
-name: evalview-agent-testing
-description: Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.
-origin: ECC
-tools: Bash, Read, Write
----
-
-# EvalView Agent Testing
-
-Automated regression testing for AI agents. EvalView snapshots your agent's behavior (tool calls, parameters, sequence, output), then diffs against the baseline after every change. When something breaks, you know immediately — before it ships.
-
-## When to Activate
-
-- After modifying agent code, prompts, or tool definitions
-- After a model update or provider change
-- Before deploying an agent to production
-- When setting up CI/CD for an agent project
-- When an autonomous loop (OpenClaw, coding agents) needs a fitness function
-- When agent output changes unexpectedly and you need to identify what shifted
-
-## Core Workflow
-
-```bash
-# 1. Set up
-pip install "evalview>=0.5,<1"
-evalview init       # Detect agent, create starter test suite
-
-# 2. Baseline
-evalview snapshot   # Save current behavior as golden baseline
-
-# 3. Gate every change
-evalview check      # Diff against baseline — catches regressions
-
-# 4. Monitor in production
-evalview monitor --slack-webhook https://hooks.slack.com/services/...
-```
-
-## Understanding Check Results
-
-| Status | Meaning | Action |
-|--------|---------|--------|
-| `PASSED` | Behavior matches baseline | Ship with confidence |
-| `TOOLS_CHANGED` | Different tools called | Review the diff |
-| `OUTPUT_CHANGED` | Same tools, output shifted | Review the diff |
-| `REGRESSION` | Score dropped significantly | Fix before shipping |
-
-## Python API for Autonomous Loops
-
-Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:
-
-```python
-from evalview import gate, DiffStatus
-
-# Full evaluation
-result = gate(test_dir="tests/")
-if not result.passed:
-    for d in result.diffs:
-        if not d.passed:
-            delta = f" ({d.score_delta:+.1f})" if d.score_delta is not None else ""
-            print(f"  {d.test_name}: {d.status.value}{delta}")
-
-# Quick mode — no LLM judge, $0, sub-second
-result = gate(test_dir="tests/", quick=True)
-```
-
-### Auto-Revert on Regression
-
-```python
-from evalview.openclaw import gate_or_revert
-
-# In an autonomous coding loop:
-make_code_change()
-if not gate_or_revert("tests/", quick=True):
-    # Change was automatically reverted
-    try_alternative_approach()
-```
-
-> **Warning:** `gate_or_revert` runs `git checkout -- .` when a regression is detected, discarding uncommitted changes. Commit or stash work-in-progress before entering the loop. You can also pass a custom revert command: `gate_or_revert("tests/", revert_cmd="git stash")`.
-
-## MCP Integration
-
-EvalView exposes 8 tools via MCP — works with Claude Code, Cursor, and any MCP client:
-
-```bash
-claude mcp add --transport stdio evalview -- evalview mcp serve
-```
-
-Tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`
-
-After connecting, Claude Code can proactively check for regressions after code changes:
-- "Did my refactor break anything?" triggers `run_check`
-- "Save this as the new baseline" triggers `run_snapshot`
-- "Add a test for the weather tool" triggers `create_test`
-
-## CI/CD Integration
-
-```yaml
-# .github/workflows/evalview.yml
-name: Agent Regression Check
-on: [pull_request, push]
-jobs:
-  check:
-    runs-on: ubuntu-latest
-    env:
-      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
-    steps:
-      - uses: actions/checkout@v4
-      - run: pip install "evalview>=0.5,<1"
-      - run: evalview check --fail-on REGRESSION
-```
-
-`--fail-on REGRESSION` gates on score drops only. For stricter gating that also catches tool sequence changes, use `--fail-on REGRESSION,TOOLS_CHANGED` or `--strict` (fails on any change).
-
-## Test Case Format
-
-```yaml
-name: refund-flow
-input:
-  query: "I need a refund for order #4812"
-expected:
-  tools: ["lookup_order", "check_refund_policy", "issue_refund"]
-  forbidden_tools: ["delete_order"]
-  output:
-    contains: ["refund", "processed"]
-    not_contains: ["error"]
-thresholds:
-  min_score: 70
-```
-
-Multi-turn tests are also supported:
-
-```yaml
-name: clarification-flow
-turns:
-  - query: "I want a refund"
-    expected:
-      output:
-        contains: ["order number"]
-  - query: "Order 4812"
-    expected:
-      tools: ["lookup_order", "issue_refund"]
-```
-
-## Best Practices
-
-- **Snapshot after every intentional change.** Baselines should reflect intended behavior.
-- **Use `--preview` before snapshotting.** `evalview snapshot --preview` shows what would change without saving.
-- **Quick mode for tight loops.** `gate(quick=True)` skips the LLM judge — free and fast for iterative development.
-- **Full evaluation for final validation.** Run without `quick=True` before deploying to get LLM-as-judge scoring.
-- **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
-- **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
-- **Monitor in production.** `evalview monitor` catches gradual drift that individual checks miss.
-
-## Installation
-
-```bash
-pip install "evalview>=0.5,<1"
-```
-
-Package: [evalview on PyPI](https://pypi.org/project/evalview/)
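
The commit message says that if agent regression testing is needed, it will be built natively in ECC rather than installed from PyPI. As a rough sketch of what a dependency-free snapshot/check gate could look like, using only the standard library (the names `save_baseline` and `check_run` and the `golden.json` layout are hypothetical illustrations, not existing ECC or EvalView code):

```python
import json
from pathlib import Path

# Hypothetical sketch: snapshot an agent run (tool calls + final output)
# to a golden JSON file, then classify later runs against it. The status
# strings mirror the removed skill's check-result table.

def save_baseline(run: dict, path: str = "golden.json") -> None:
    # Persist the run as the golden baseline (sorted keys for stable diffs).
    Path(path).write_text(json.dumps(run, indent=2, sort_keys=True))

def check_run(run: dict, path: str = "golden.json") -> str:
    # Compare a new run against the baseline:
    # PASSED, TOOLS_CHANGED, or OUTPUT_CHANGED.
    golden = json.loads(Path(path).read_text())
    if run["tools"] != golden["tools"]:
        return "TOOLS_CHANGED"
    if run["output"] != golden["output"]:
        return "OUTPUT_CHANGED"
    return "PASSED"

baseline = {"tools": ["lookup_order", "issue_refund"],
            "output": "Your refund has been processed."}
save_baseline(baseline)
print(check_run(baseline))  # PASSED
print(check_run({"tools": ["lookup_order"],
                 "output": "Your refund has been processed."}))  # TOOLS_CHANGED
```

A real implementation would add score thresholds and sequence-aware diffing, but even this much is enough to gate a CI job without pulling in a third-party package.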