Files
everything-claude-code/skills/agent-eval/SKILL.md
Affaan Mustafa 6b2de1baff security: remove supply chain risks, external promotions, and unauthorized credits
- Remove zenith.chat references and @DRodriguezFX shoutout from README
- Remove Inspiration Credits section (already in CHANGELOG.md)
- Remove awesome-agent-skills reference from Links
- Remove Plankton H3 section by @alxfazio (skill stays in skills/)
- Remove brand names (InsAIts, VideoDB, Evos) from v1.9.0 notes
- Remove @ericcai0814 individual credit from README (kept in CHANGELOG)
- Add Security Guide to Links section
- Replace curl-pipe-to-bash in autonomous-loops with review warning
- Replace git clone in plankton-code-quality with review warning
- Replace pip install git+ in agent-eval with review warning
- Replace npm install -g in dmux-workflows with review warning
- Add commercial API notice to nutrient-document-processing
- Remove VideoDB maintainer credit from videodb skill
- Replace skill-creator.app link with ECC-Tools GitHub App reference
2026-03-21 18:10:05 -07:00

4.4 KiB

name, description, origin, tools
name description origin tools
agent-eval Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics ECC Read, Write, Edit, Bash, Grep, Glob

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

  • Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
  • Measuring agent performance before adopting a new tool or model
  • Running regression checks when an agent updates its model or tooling
  • Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility

Git Worktree Isolation

Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.

Metrics Collected

Metric What It Measures
Pass rate Did the agent produce code that passes the judge?
Cost API spend per task (when available)
Time Wall-clock seconds to completion
Consistency Pass rate across repeated runs (e.g., 3/3 = 100%)

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Execute agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

  1. Creates a fresh git worktree from the specified commit
  2. Hands the prompt to the agent
  3. Runs the judge criteria
  4. Records pass/fail, cost, and time

3. Compare Results

Generate a comparison report:

agent-eval report --format table
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.

Best Practices

  • Start with 3-5 tasks that represent your real workload, not toy examples
  • Run at least 3 trials per agent to capture variance — agents are non-deterministic
  • Pin the commit in your task YAML so results are reproducible across days/weeks
  • Include at least one deterministic judge (tests, build) per task — LLM judges add noise
  • Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
  • Version your task definitions — they are test fixtures, treat them as code