mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-03-30 21:53:28 +08:00
| description | agent |
|---|---|
| Run evaluation against acceptance criteria | build |
Eval Command
Evaluate implementation against acceptance criteria: $ARGUMENTS
Your Task
Run structured evaluation to verify the implementation meets requirements.
Evaluation Framework
Grader Types
- Binary Grader - Pass/Fail
  - Does it work? Yes/No
  - Good for: feature completion, bug fixes
- Scalar Grader - Score 0-100
  - How well does it work?
  - Good for: performance, quality metrics
- Rubric Grader - Category scores
  - Multiple dimensions evaluated
  - Good for: comprehensive review
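The three grader types above could be modeled roughly as follows. This is a minimal sketch, not part of the command itself: the names `GradeResult`, `binary_grade`, `scalar_grade`, and `rubric_grade` are hypothetical, and normalizing every grader to a 0-100 score (with the rubric overall taken as an unweighted mean) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GradeResult:
    grader: str                                  # "binary", "scalar", or "rubric"
    score: float                                 # normalized to 0-100 (assumption)
    detail: dict = field(default_factory=dict)   # grader-specific breakdown


def binary_grade(passed: bool) -> GradeResult:
    """Pass/Fail: 100 if it works, 0 otherwise."""
    return GradeResult("binary", 100.0 if passed else 0.0, {"passed": passed})


def scalar_grade(score: float) -> GradeResult:
    """Score 0-100; values outside the range are rejected."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    return GradeResult("scalar", float(score))


def rubric_grade(categories: dict[str, float]) -> GradeResult:
    """Category scores, each 0-100; overall is the unweighted mean (assumption)."""
    if not categories:
        raise ValueError("rubric needs at least one category")
    for name, value in categories.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    overall = sum(categories.values()) / len(categories)
    return GradeResult("rubric", overall, dict(categories))
```

Putting all three on one scale keeps them interchangeable when a single evaluation mixes grader types.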
Evaluation Process
Step 1: Define Criteria
Acceptance Criteria:
1. [Criterion 1] - [weight]
2. [Criterion 2] - [weight]
3. [Criterion 3] - [weight]
Step 2: Run Tests
For each criterion:
- Execute relevant test
- Collect evidence
- Score result
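The loop in Step 2 might look like the sketch below. The `CriterionResult` type, the `run_criteria` helper, and the convention that each check returns a `(score, evidence)` pair are all hypothetical; the 0-10 score range is assumed from the X/10 column in the report template.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriterionResult:
    name: str
    score: float    # 0-10 scale (assumption, taken from the report template)
    evidence: str   # log line, command output, or a path to a screenshot


def run_criteria(
    checks: dict[str, Callable[[], tuple[float, str]]],
) -> list[CriterionResult]:
    """Execute each check, collect its evidence, and record its score."""
    results = []
    for name, check in checks.items():
        try:
            score, evidence = check()
        except Exception as exc:
            # A check that crashes scores zero; the exception is the evidence.
            score, evidence = 0.0, f"check raised {type(exc).__name__}: {exc}"
        results.append(CriterionResult(name, score, evidence))
    return results
```

Recording a crashing check as a zero score, rather than aborting, keeps the remaining criteria in the report.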
Step 3: Calculate Score
Final Score = Σ (criterion_score × weight) / total_weight
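As a worked example of the formula above, here is a small sketch. The function name `final_score` and the sample scores are hypothetical. It also assumes criterion scores are on a 0-10 scale (as in the report template), so the weighted mean is multiplied by 10 to land on the 0-100 scale used for the overall score.

```python
def final_score(results: list[tuple[float, float]]) -> float:
    """Weighted mean of (criterion_score, weight) pairs.

    The result is on the same scale as the criterion scores.
    """
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(score * weight for score, weight in results) / total_weight


# Hypothetical scores on a 0-10 scale with weights 30 / 40 / 30.
mean_0_to_10 = final_score([(8, 30), (6, 40), (9, 30)])  # (240 + 240 + 270) / 100 = 7.5
overall_0_to_100 = mean_0_to_10 * 10                     # 75.0
```

Dividing by `total_weight` means the weights do not have to sum to 100.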
Step 4: Report
Evaluation Report
Overall: [PASS/FAIL] (Score: X/100)
Criterion Breakdown
| Criterion | Score | Weight | Weighted |
|---|---|---|---|
| [Criterion 1] | X/10 | 30% | X |
| [Criterion 2] | X/10 | 40% | X |
| [Criterion 3] | X/10 | 30% | X |
Evidence
Criterion 1: [Name]
- Test: [what was tested]
- Result: [outcome]
- Evidence: [screenshot, log, output]
Recommendations
[If not passing, what needs to change]
Pass@K Metrics
For non-deterministic evaluations:
- Run K times
- Calculate pass rate
- Report: "Pass@K = X/K"
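The three steps above can be sketched as a short helper. The name `pass_at_k` and the `run_eval` callable are hypothetical; the sketch follows this document's definition of Pass@K as the number of passing runs out of K.

```python
from typing import Callable


def pass_at_k(run_eval: Callable[[], bool], k: int) -> tuple[int, str]:
    """Run the evaluation k times and report how many runs passed."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    passes = sum(1 for _ in range(k) if run_eval())
    return passes, f"Pass@{k} = {passes}/{k}"


# Usage with a canned sequence standing in for a non-deterministic evaluation.
outcomes = iter([True, False, True, True, False])
passes, report = pass_at_k(lambda: next(outcomes), 5)
print(report)  # Pass@5 = 3/5
```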
TIP: Use eval for acceptance testing before marking features complete.