From 27234fb79037d1246d3a1c2b28cff22a77331ae5 Mon Sep 17 00:00:00 2001
From: Joaquin Hui <132194176+joaquinhuigomez@users.noreply.github.com>
Date: Fri, 20 Mar 2026 07:20:18 +0000
Subject: [PATCH] feat(skills): add agent-eval for head-to-head coding agent
 comparison (#540)

* feat(skills): add agent-eval for head-to-head coding agent comparison

* fix(skills): address PR #540 review feedback for agent-eval skill

- Remove duplicate "When to Use" section (kept "When to Activate")
- Add Installation section with pip install instructions
- Change origin from "community" to "ECC" per repo convention
- Add commit field to YAML task example for reproducibility
- Fix pass@k mislabeling to "pass rate across repeated runs"
- Soften worktree isolation language to "reproducibility isolation"

Co-Authored-By: Claude Opus 4.6

* Pin agent-eval install to specific commit hash

Address PR review feedback: pin the VCS install to commit 6d062a2 to
avoid supply-chain risk from unpinned external deps.

Co-Authored-By: Claude Opus 4.6

---------

Co-authored-by: Joaquin Hui Gomez
Co-authored-by: Claude Opus 4.6
---
 skills/agent-eval/SKILL.md | 148 +++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 skills/agent-eval/SKILL.md

diff --git a/skills/agent-eval/SKILL.md b/skills/agent-eval/SKILL.md
new file mode 100644
index 00000000..071fa141
--- /dev/null
+++ b/skills/agent-eval/SKILL.md
@@ -0,0 +1,148 @@
+---
+name: agent-eval
+description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
+origin: ECC
+tools: Read, Write, Edit, Bash, Grep, Glob
+---
+
+# Agent Eval Skill
+
+A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
+
+## When to Activate
+
+- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
+- Measuring agent performance before adopting a new tool or model
+- Running regression checks when an agent updates its model or tooling
+- Producing data-backed agent selection decisions for a team
+
+## Installation
+
+```bash
+# pinned to commit 6d062a2 (v0.1.0) to avoid unpinned VCS installs
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+```
+
+## Core Concepts
+
+### YAML Task Definitions
+
+Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
+
+```yaml
+name: add-retry-logic
+description: Add exponential backoff retry to the HTTP client
+repo: ./my-project
+files:
+  - src/http_client.py
+prompt: |
+  Add retry logic with exponential backoff to all HTTP requests.
+  Max 3 retries. Initial delay 1s, max delay 30s.
+judge:
+  - type: pytest
+    command: pytest tests/test_http_client.py -v
+  - type: grep
+    pattern: "exponential_backoff|retry"
+    files: src/http_client.py
+commit: "abc1234"  # pin to specific commit for reproducibility
+```
+
+### Git Worktree Isolation
+
+Each agent run gets its own git worktree — no Docker required. This isolates runs for reproducibility: agents cannot interfere with each other or corrupt the base repo.
+
+### Metrics Collected
+
+| Metric | What It Measures |
+|--------|-----------------|
+| Pass rate | Did the agent produce code that passes the judge? |
+| Cost | API spend per task (when available) |
+| Time | Wall-clock seconds to completion |
+| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
+
+## Workflow
+
+### 1. Define Tasks
+
+Create a `tasks/` directory with YAML files, one per task:
+
+```bash
+mkdir tasks
+# Write task definitions (see template above)
+```
+
+### 2. Run Agents
+
+Execute agents against your tasks:
+
+```bash
+agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
+```
+
+Each run:
+1. Creates a fresh git worktree from the specified commit
+2. Hands the prompt to the agent
+3. Runs the judge criteria
+4. Records pass/fail, cost, and time
+
+### 3. Compare Results
+
+Generate a comparison report:
+
+```bash
+agent-eval report --format table
+```
+
+```
+Task: add-retry-logic (3 runs each)
+┌──────────────┬───────────┬────────┬────────┬─────────────┐
+│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
+├──────────────┼───────────┼────────┼────────┼─────────────┤
+│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
+│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
+└──────────────┴───────────┴────────┴────────┴─────────────┘
+```
+
+## Judge Types
+
+### Code-Based (deterministic)
+
+```yaml
+judge:
+  - type: pytest
+    command: pytest tests/ -v
+  - type: command
+    command: npm run build
+```
+
+### Pattern-Based
+
+```yaml
+judge:
+  - type: grep
+    pattern: "class.*Retry"
+    files: src/**/*.py
+```
+
+### Model-Based (LLM-as-judge)
+
+```yaml
+judge:
+  - type: llm
+    prompt: |
+      Does this implementation correctly handle exponential backoff?
+      Check for: max retries, increasing delays, jitter.
+```
+
+## Best Practices
+
+- **Start with 3-5 tasks** that represent your real workload, not toy examples
+- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
+- **Pin the commit** in your task YAML so results are reproducible across days/weeks
+- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
+- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
+- **Version your task definitions** — they are test fixtures, treat them as code
+
+## Links
+
+- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
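## Appendix: Worktree Isolation, Sketched with Plain Git

The isolation model described under "Git Worktree Isolation" can be reproduced with plain git, independent of agent-eval. The sketch below is illustrative only: the repo and worktree paths are temporary and hypothetical, and it does not reflect agent-eval's internal code.

```shell
# Build a throwaway base repo with one empty commit
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=eval@example.com -c user.name=eval \
    commit -q --allow-empty -m "base"
commit=$(git -C "$repo" rev-parse HEAD)

# One detached worktree per agent run; both share the base repo's object store
git -C "$repo" worktree add -q "${repo}-claude-run1" "$commit"
git -C "$repo" worktree add -q "${repo}-aider-run1" "$commit"

# An edit in one worktree is invisible to the other and to the base repo
echo "change" > "${repo}-claude-run1/scratch.txt"
git -C "$repo" worktree list
```

Because all worktrees share a single object store, this is far cheaper than one clone per run, and `git worktree remove` discards a run without touching the base repo.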