feat(skills): add agent-eval for head-to-head coding agent comparison (#540)

* feat(skills): add agent-eval for head-to-head coding agent comparison * fix(skills): address PR #540 review feedback for agent-eval skill - Remove duplicate "When to Use" section (kept "When to Activate") - Add Installation section with pip install instructions - Change origin from "community" to "ECC" per repo convention - Add commit field to YAML task example for reproducibility - Fix pass@k mislabeling to "pass rate across repeated runs" - Soften worktree isolation language to "reproducibility isolation" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Pin agent-eval install to specific commit hash Address PR review feedback: pin the VCS install to commit 6d062a2 to avoid supply-chain risk from unpinned external deps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Joaquin Hui Gomez <joaquinhui1995@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-30 03:51:27 +08:00 · 2026-03-20 07:20:18 +00:00
parent a6bd90713d
commit 27234fb790
1 changed files with 148 additions and 0 deletions
@@ -0,0 +1,148 @@
+---
+name: agent-eval
+description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
+origin: ECC
+tools: Read, Write, Edit, Bash, Grep, Glob
+---
+
+# Agent Eval Skill
+
+A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
+
+## When to Activate
+
+- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
+- Measuring agent performance before adopting a new tool or model
+- Running regression checks when an agent updates its model or tooling
+- Producing data-backed agent selection decisions for a team
+
+## Installation
+
+```bash
+# pinned to v0.1.0 — latest stable commit
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+```
+
+## Core Concepts
+
+### YAML Task Definitions
+
+Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
+
+```yaml
+name: add-retry-logic
+description: Add exponential backoff retry to the HTTP client
+repo: ./my-project
+files:
+  - src/http_client.py
+prompt: |
+  Add retry logic with exponential backoff to all HTTP requests.
+  Max 3 retries. Initial delay 1s, max delay 30s.
+judge:
+  - type: pytest
+    command: pytest tests/test_http_client.py -v
+  - type: grep
+    pattern: "exponential_backoff|retry"
+    files: src/http_client.py
+commit: "abc1234"  # pin to specific commit for reproducibility
+```
+
+### Git Worktree Isolation
+
+Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.
+
+### Metrics Collected
+
+| Metric | What It Measures |
+|--------|-----------------|
+| Pass rate | Did the agent produce code that passes the judge? |
+| Cost | API spend per task (when available) |
+| Time | Wall-clock seconds to completion |
+| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
+
+## Workflow
+
+### 1. Define Tasks
+
+Create a `tasks/` directory with YAML files, one per task:
+
+```bash
+mkdir tasks
+# Write task definitions (see template above)
+```
+
+### 2. Run Agents
+
+Execute agents against your tasks:
+
+```bash
+agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
+```
+
+Each run:
+1. Creates a fresh git worktree from the specified commit
+2. Hands the prompt to the agent
+3. Runs the judge criteria
+4. Records pass/fail, cost, and time
+
+### 3. Compare Results
+
+Generate a comparison report:
+
+```bash
+agent-eval report --format table
+```
+
+```
+Task: add-retry-logic (3 runs each)
+┌──────────────┬───────────┬────────┬────────┬─────────────┐
+│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
+├──────────────┼───────────┼────────┼────────┼─────────────┤
+│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
+│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
+└──────────────┴───────────┴────────┴────────┴─────────────┘
+```
+
+## Judge Types
+
+### Code-Based (deterministic)
+
+```yaml
+judge:
+  - type: pytest
+    command: pytest tests/ -v
+  - type: command
+    command: npm run build
+```
+
+### Pattern-Based
+
+```yaml
+judge:
+  - type: grep
+    pattern: "class.*Retry"
+    files: src/**/*.py
+```
+
+### Model-Based (LLM-as-judge)
+
+```yaml
+judge:
+  - type: llm
+    prompt: |
+      Does this implementation correctly handle exponential backoff?
+      Check for: max retries, increasing delays, jitter.
+```
+
+## Best Practices
+
+- **Start with 3-5 tasks** that represent your real workload, not toy examples
+- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
+- **Pin the commit** in your task YAML so results are reproducible across days/weeks
+- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
+- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
+- **Version your task definitions** — they are test fixtures, treat them as code
+
+## Links
+
+- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)