mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-03-30 13:43:26 +08:00
feat(skills): add agent-eval for head-to-head coding agent comparison (#540)
* feat(skills): add agent-eval for head-to-head coding agent comparison

* fix(skills): address PR #540 review feedback for agent-eval skill

  - Remove duplicate "When to Use" section (kept "When to Activate")
  - Add Installation section with pip install instructions
  - Change origin from "community" to "ECC" per repo convention
  - Add commit field to YAML task example for reproducibility
  - Fix pass@k mislabeling to "pass rate across repeated runs"
  - Soften worktree isolation language to "reproducibility isolation"

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Pin agent-eval install to specific commit hash

  Address PR review feedback: pin the VCS install to commit 6d062a2 to avoid supply-chain risk from unpinned external deps.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Joaquin Hui Gomez <joaquinhui1995@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
148
skills/agent-eval/SKILL.md
Normal file
@@ -0,0 +1,148 @@
---
name: agent-eval
description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool makes them systematic.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

```bash
# pinned to v0.1.0 (latest stable commit)
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

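To make the shape of a task definition concrete, here is a minimal Python sketch of how an already-parsed task mapping could be checked and held in memory. The `Task` and `Judge` classes, the `task_from_dict` helper, and the choice of which keys are required are all assumptions made for illustration; they are not taken from agent-eval's source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Judge:
    type: str  # "pytest", "command", "grep" or "llm" in this document's examples
    options: Dict[str, Any] = field(default_factory=dict)  # e.g. command, pattern, files


@dataclass
class Task:
    name: str
    description: str
    repo: str
    prompt: str
    judge: List[Judge]
    files: List[str] = field(default_factory=list)
    commit: Optional[str] = None  # None means "not pinned"


def task_from_dict(raw: Dict[str, Any]) -> Task:
    """Build a Task from a parsed YAML mapping, failing fast on missing keys."""
    # Hypothetical rule: these four keys are treated as mandatory in this sketch.
    missing = [key for key in ("name", "repo", "prompt", "judge") if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    judges = []
    for entry in raw["judge"]:
        # Everything except "type" is kept as judge-specific options.
        options = {k: v for k, v in entry.items() if k != "type"}
        judges.append(Judge(type=entry["type"], options=options))
    return Task(
        name=raw["name"],
        description=raw.get("description", ""),
        repo=raw["repo"],
        prompt=raw["prompt"],
        judge=judges,
        files=list(raw.get("files", [])),
        commit=raw.get("commit"),
    )
```

Validating up front like this means a typo in a task file surfaces before any agent time or API budget is spent.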
### Git Worktree Isolation

Each agent run gets its own git worktree (no Docker required). This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.

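As a rough illustration of the idea, the sketch below builds the `git worktree` commands that give one run its own checkout at a pinned commit. The helper names are hypothetical, and this is not necessarily how agent-eval invokes git internally; only the `git worktree add --detach` and `git worktree remove --force` commands themselves are standard git.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(path: str, commit: str) -> List[str]:
    """Command that checks out `commit` into a new detached worktree at `path`."""
    return ["git", "worktree", "add", "--detach", path, commit]


def worktree_remove_command(path: str) -> List[str]:
    """Command that deletes the worktree, even if the agent left uncommitted changes."""
    return ["git", "worktree", "remove", "--force", path]


def create_worktree(repo: str, path: str, commit: str) -> Path:
    """Create the worktree from inside the base repo and return its location."""
    subprocess.run(worktree_add_command(path, commit), cwd=repo, check=True)
    return Path(path)


def remove_worktree(repo: str, path: str) -> None:
    """Clean up after a run so later runs start from a fresh checkout."""
    subprocess.run(worktree_remove_command(path), cwd=repo, check=True)
```

Each worktree has its own working directory and index, so two agents editing the same file never see each other's changes, while the base checkout stays untouched.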
### Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |

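The metrics in the table can be derived from per-run records along the following lines. This is a sketch under assumptions: the `RunResult` record and `summarize` function are hypothetical names, and averaging cost and time across runs is one plausible aggregation rather than the tool's documented behavior.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class RunResult:
    passed: bool
    cost_usd: Optional[float]  # None when the agent does not report spend
    seconds: float


def summarize(runs: List[RunResult]) -> Dict[str, object]:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    passes = sum(1 for run in runs if run.passed)
    costs = [run.cost_usd for run in runs if run.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        # Consistency is the pass rate over repeated runs, as a percentage.
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": mean(costs) if costs else None,
        "mean_seconds": mean(run.seconds for run in runs),
    }
```

With two passes out of three runs, `summarize` reports a pass rate of `2/3` and a consistency of 67.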
## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:
1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time

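The four steps of a run can be sketched as a single function that takes each stage as a callable. Everything here is illustrative: `run_once` and its parameters are hypothetical, and the decision to time only the agent step (not worktree setup or judging) is an assumption of this sketch.

```python
import time
from typing import Callable, Dict, Optional


def run_once(
    prepare: Callable[[], str],
    run_agent: Callable[[str, str], Optional[float]],
    judge: Callable[[str], bool],
    prompt: str,
) -> Dict[str, object]:
    """Execute one isolated run and return its record."""
    workdir = prepare()                   # 1. fresh worktree at the pinned commit
    started = time.perf_counter()
    cost = run_agent(workdir, prompt)     # 2. hand the prompt to the agent; returns spend or None
    elapsed = time.perf_counter() - started
    passed = judge(workdir)               # 3. run the judge criteria against the result
    return {"passed": passed, "cost_usd": cost, "seconds": elapsed}  # 4. record the outcome
```

Passing the stages in as callables keeps the loop identical across agents, so differences in the results come from the agent rather than from the harness.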
### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

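A command-style judge can be approximated in a few lines of Python. This sketch assumes that a judge passes when its command exits with status 0, which matches how `pytest` and most build tools signal success; the `command_judge` name and the timeout default are hypothetical and not part of agent-eval's documented interface.

```python
import shlex
import subprocess


def command_judge(command: str, cwd: str = ".", timeout: float = 600.0) -> bool:
    """Return True when `command` exits with status 0 inside `cwd`."""
    try:
        completed = subprocess.run(
            shlex.split(command),  # split like a POSIX shell, without invoking one
            cwd=cwd,
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung command counts as a failed judgment.
        return False
    return completed.returncode == 0
```

For example, `command_judge("pytest tests/ -v", cwd=worktree_path)` would pass only if the whole test suite succeeds in that worktree, where `worktree_path` stands for the run's checkout directory.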
### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

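A pattern judge amounts to a regular-expression search over the files selected by a glob. The sketch below assumes the judge passes as soon as any matching file contains the pattern; `grep_judge` is a hypothetical helper, and the real tool may apply different matching rules.

```python
import re
from pathlib import Path


def grep_judge(pattern: str, files: str, root: str = ".") -> bool:
    """Return True if any file matching the glob `files` under `root` contains `pattern`."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files):
        # "." does not match newlines by default, so "class.*Retry" must match within one line.
        if path.is_file() and regex.search(path.read_text(errors="ignore")):
            return True
    return False
```

Pattern judges are cheap but shallow: they confirm that something named like a retry class exists, not that it behaves correctly, which is why they pair well with a test-based judge.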
### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

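One way to turn a free-text model answer into a pass/fail verdict is to ask for a fixed first-line token and parse it. The sketch below is entirely hypothetical: `llm_judge`, the PASS/FAIL convention, and the `ask` callable (a stand-in for whatever model client is in use) are assumptions, not agent-eval's actual protocol.

```python
from typing import Callable


def llm_judge(rubric: str, diff: str, ask: Callable[[str], str]) -> bool:
    """Ask a model to grade `diff` against `rubric`; pass only on an explicit PASS."""
    prompt = (
        f"{rubric.strip()}\n\n"
        "Answer with PASS or FAIL on the first line, then a short justification.\n\n"
        f"Diff under review:\n{diff}"
    )
    reply = ask(prompt).strip()
    # Anything other than a leading PASS (including an empty reply) is treated as a failure.
    first_line = reply.splitlines()[0].strip().upper() if reply else ""
    return first_line.startswith("PASS")
```

Defaulting to failure on ambiguous replies keeps a noisy judge from inflating pass rates.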
## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance, since agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task, because LLM judges add noise
- **Track cost alongside pass rate**: a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions**: they are test fixtures, so treat them as code

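To weigh cost against pass rate, one simple figure is the expected spend per passing result. The sketch below assumes failed runs are simply retried and that runs are independent, so the expected cost is the per-run cost divided by the pass rate; `cost_per_success` is a hypothetical helper, not an agent-eval metric.

```python
def cost_per_success(cost_per_run: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing run, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in the interval (0, 1]")
    return cost_per_run / pass_rate


# Figures from the sample report above: $0.12 at 3/3 and $0.08 at 2/3.
claude_code = cost_per_success(0.12, 3 / 3)
aider = cost_per_success(0.08, 2 / 3)
print(f"claude-code: ${claude_code:.2f} per pass, aider: ${aider:.2f} per pass")
```

On the sample numbers, both agents come out at roughly $0.12 per passing run, so the cheaper per-run price does not translate into a cheaper result.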
## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)