From 27234fb79037d1246d3a1c2b28cff22a77331ae5 Mon Sep 17 00:00:00 2001
From: Joaquin Hui <132194176+joaquinhuigomez@users.noreply.github.com>
Date: Fri, 20 Mar 2026 07:20:18 +0000
Subject: [PATCH] feat(skills): add agent-eval for head-to-head coding agent
 comparison (#540)

* feat(skills): add agent-eval for head-to-head coding agent comparison

* fix(skills): address PR #540 review feedback for agent-eval skill

- Remove duplicate "When to Use" section (kept "When to Activate")
- Add Installation section with pip install instructions
- Change origin from "community" to "ECC" per repo convention
- Add commit field to YAML task example for reproducibility
- Fix pass@k mislabeling to "pass rate across repeated runs"
- Soften worktree isolation language to "reproducibility isolation"

Co-Authored-By: Claude Opus 4.6

* Pin agent-eval install to specific commit hash

Address PR review feedback: pin the VCS install to commit 6d062a2 to
avoid supply-chain risk from unpinned external deps.

Co-Authored-By: Claude Opus 4.6

---------

Co-authored-by: Joaquin Hui Gomez
Co-authored-by: Claude Opus 4.6
---
 skills/agent-eval/SKILL.md | 148 +++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 skills/agent-eval/SKILL.md

diff --git a/skills/agent-eval/SKILL.md b/skills/agent-eval/SKILL.md
new file mode 100644
index 00000000..071fa141
--- /dev/null
+++ b/skills/agent-eval/SKILL.md
@@ -0,0 +1,148 @@
+---
+name: agent-eval
+description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
+origin: ECC
+tools: Read, Write, Edit, Bash, Grep, Glob
+---
+
+# Agent Eval Skill
+
+A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
+
+## When to Activate
+
+- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
+- Measuring agent performance before adopting a new tool or model
+- Running regression checks when an agent updates its model or tooling
+- Producing data-backed agent selection decisions for a team
+
+## Installation
+
+```bash
+# pinned to commit 6d062a2 (v0.1.0) to avoid unpinned VCS installs
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+```
+
+## Core Concepts
+
+### YAML Task Definitions
+
+Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
+
+```yaml
+name: add-retry-logic
+description: Add exponential backoff retry to the HTTP client
+repo: ./my-project
+files:
+  - src/http_client.py
+prompt: |
+  Add retry logic with exponential backoff to all HTTP requests.
+  Max 3 retries. Initial delay 1s, max delay 30s.
+judge:
+  - type: pytest
+    command: pytest tests/test_http_client.py -v
+  - type: grep
+    pattern: "exponential_backoff|retry"
+    files: src/http_client.py
+commit: "abc1234"  # pin to specific commit for reproducibility
+```
+
+### Git Worktree Isolation
+
+Each agent run gets its own git worktree — no Docker required. This isolates runs for reproducibility: agents cannot interfere with each other or corrupt the base repo.
+
+### Metrics Collected
+
+| Metric | What It Measures |
+|--------|-----------------|
+| Pass rate | Did the agent produce code that passes the judge? |
+| Cost | API spend per task (when available) |
+| Time | Wall-clock seconds to completion |
+| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
+
+## Workflow
+
+### 1. Define Tasks
+
+Create a `tasks/` directory with YAML files, one per task:
+
+```bash
+mkdir tasks
+# Write task definitions (see template above)
+```
+
+### 2. Run Agents
+
+Execute agents against your tasks:
+
+```bash
+agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
+```
+
+Each run:
+1. Creates a fresh git worktree from the specified commit
+2. Hands the prompt to the agent
+3. Runs the judge criteria
+4. Records pass/fail, cost, and time
+
+### 3. Compare Results
+
+Generate a comparison report:
+
+```bash
+agent-eval report --format table
+```
+
+```
+Task: add-retry-logic (3 runs each)
+┌──────────────┬───────────┬────────┬────────┬─────────────┐
+│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
+├──────────────┼───────────┼────────┼────────┼─────────────┤
+│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
+│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
+└──────────────┴───────────┴────────┴────────┴─────────────┘
+```
+
+## Judge Types
+
+### Code-Based (deterministic)
+
+```yaml
+judge:
+  - type: pytest
+    command: pytest tests/ -v
+  - type: command
+    command: npm run build
+```
+
+### Pattern-Based
+
+```yaml
+judge:
+  - type: grep
+    pattern: "class.*Retry"
+    files: src/**/*.py
+```
+
+### Model-Based (LLM-as-judge)
+
+```yaml
+judge:
+  - type: llm
+    prompt: |
+      Does this implementation correctly handle exponential backoff?
+      Check for: max retries, increasing delays, jitter.
+```
+
+## Best Practices
+
+- **Start with 3-5 tasks** that represent your real workload, not toy examples
+- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
+- **Pin the commit** in your task YAML so results are reproducible across days/weeks
+- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
+- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
+- **Version your task definitions** — they are test fixtures, treat them as code
+
+## Links
+
+- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
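## Appendix: Worktree Isolation, Sketched with Plain Git

The isolation model described under "Git Worktree Isolation" can be reproduced with plain git, independent of agent-eval. The sketch below is illustrative only: the repo and worktree paths are temporary and hypothetical, and it does not reflect agent-eval's internal code.

```shell
# Build a throwaway base repo with one empty commit
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=eval@example.com -c user.name=eval \
    commit -q --allow-empty -m "base"
commit=$(git -C "$repo" rev-parse HEAD)

# One detached worktree per agent run; both share the base repo's object store
git -C "$repo" worktree add -q "${repo}-claude-run1" "$commit"
git -C "$repo" worktree add -q "${repo}-aider-run1" "$commit"

# An edit in one worktree is invisible to the other and to the base repo
echo "change" > "${repo}-claude-run1/scratch.txt"
git -C "$repo" worktree list
```

Because all worktrees share a single object store, this is far cheaper than one clone per run, and `git worktree remove` discards a run without touching the base repo.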