mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-23 00:21:27 +08:00
fix: address self-evaluation review comments

- Clarify that agent-evaluator reads skills/agent-self-evaluation/SKILL.md directly
- Standardize on Conciseness terminology, including helper names
- Remove invalid Stop hook matcher and avoid unsupported command-expression matcher examples
- Add explicit hook-integration reference path in SKILL.md
- Add summary and self-check fields to evaluate.py output, template, and agent spec
- Refactor evaluate.py clarity and input parsing helpers
- Remove unused task parameter from check_completeness

Validation:
- python3 -m py_compile skills/agent-self-evaluation/scripts/evaluate.py
- evaluate.py high/low example smoke tests
- node scripts/ci/validate-agents.js
- node scripts/ci/validate-skills.js
- node scripts/ci/validate-hooks.js
- node scripts/ci/validate-no-personal-paths.js
@@ -13,7 +13,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res
 - Every score below 5 MUST cite specific evidence from the output
 - Provide concrete, actionable improvement suggestions
 - Maintain objectivity — evaluate the output, not the agent's effort or intent
-- Load the `agent-self-evaluation` skill for the detailed scoring rubric
+- Read `skills/agent-self-evaluation/SKILL.md` for the detailed scoring rubric. Example input is a standard ECC `SKILL.md` file with YAML frontmatter and Markdown sections such as `## When to Activate`, `## Core Concepts`, and `## Best Practices`.

 - DO NOT re-perform the original task
 - DO NOT suggest alternative approaches unless the current approach is factually wrong
@@ -60,6 +60,7 @@ Use this exact format (matches `scripts/evaluate.py` output):
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.

 Accuracy █████ 5/5
 + [Evidence: passing tests, verified claims]
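The `Summary` line and the bar rows in this template could be produced along these lines. This is a minimal sketch, not the code in `scripts/evaluate.py`: the function names are hypothetical, the overall score is assumed to be the plain mean of the axis scores, and the bar is assumed to be one filled cell per point out of five, which is consistent with `█████ 5/5` and `██░░░ 2/5` in the examples. The axis names in the sample data are also assumptions.

```python
from statistics import mean

MAX_SCORE = 5
FULL, EMPTY = "█", "░"


def render_bar(score: int) -> str:
    """Return a five-cell bar, e.g. '██░░░' for a score of 2 (assumed layout)."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}, got {score}")
    return FULL * score + EMPTY * (MAX_SCORE - score)


def render_axis(name: str, score: int) -> str:
    """Return one report row such as 'Accuracy █████ 5/5'."""
    return f"{name} {render_bar(score)} {score}/{MAX_SCORE}"


def summary_line(scores: dict[str, int]) -> str:
    """Return the Summary line; the overall score is assumed to be the mean."""
    overall = mean(list(scores.values()))
    return (
        f"Summary: Overall score {overall:.1f}/{MAX_SCORE} "
        f"across {len(scores)} quality axes."
    )


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    sample = {
        "Accuracy": 5,
        "Completeness": 4,
        "Conciseness": 4,
        "Actionability": 5,
        "Clarity": 4,
    }
    print(summary_line(sample))
    for axis, score in sample.items():
        print(render_axis(axis, score))
```

Keeping the bar and the numeric score in one helper makes it harder for the two to drift apart when the template changes.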
@@ -87,6 +88,8 @@ CRITICAL ISSUES (axes ≤ 2):
 [Axis] Score N/5 — specific fix needed
 (or "None" if no axis ≤ 2)

+Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
+
 TOP IMPROVEMENTS:
 1. [Highest impact fix]
 2. [Second highest]
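Selecting the axes for the `CRITICAL ISSUES` section comes down to a threshold check at 2 or below, with `None` as the fallback. A minimal sketch, assuming scores arrive as a name-to-integer mapping; `critical_axes`, `critical_section`, and the sample data are hypothetical and not taken from `scripts/evaluate.py`:

```python
CRITICAL_THRESHOLD = 2  # the template flags axes scoring 2 or lower


def critical_axes(scores: dict[str, int]) -> list[tuple[str, int]]:
    """Return (axis, score) pairs at or below the threshold, in input order."""
    return [
        (axis, score)
        for axis, score in scores.items()
        if score <= CRITICAL_THRESHOLD
    ]


def critical_section(scores: dict[str, int]) -> list[str]:
    """Return the section body lines, or ['None'] when nothing is flagged.

    The fix description that follows each score in the template is left
    out here, because it has to come from the evaluator, not from the scores.
    """
    flagged = critical_axes(scores)
    if not flagged:
        return ["None"]
    return [f"[{axis}] Score {score}/5" for axis, score in flagged]


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    print("\n".join(critical_section({"Accuracy": 2, "Completeness": 3, "Actionability": 2})))
    print("\n".join(critical_section({"Accuracy": 5, "Completeness": 4})))
```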
@@ -96,7 +99,7 @@ VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]

 ## Output Format

-Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT".

 ## Examples

@@ -108,6 +111,7 @@ Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.

 Accuracy █████ 5/5
 + Tests passing
@@ -138,6 +142,8 @@ AGENT SELF-EVALUATION REPORT
 CRITICAL ISSUES (axes ≤ 2):
 None

+Self-check: Would the user agree with this assessment? Yes — the scores cite passing tests, grep verification, and the remaining gaps are minor.
+
 TOP IMPROVEMENTS:
 1. [Completeness] Add connection pool exhaustion to edge cases doc
 2. [Conciseness] Consolidate verification commands into a single script
@@ -153,6 +159,7 @@ Task: Same as above.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.

 Accuracy ██░░░ 2/5
 + Code block present
@@ -188,6 +195,8 @@ CRITICAL ISSUES (axes ≤ 2):
 [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
 [Actionability] Score 2/5 — No deliverable. Create a PR with test file.

+Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
+
 TOP IMPROVEMENTS:
 1. [Accuracy] Switch to httpx.Retry — grep the codebase first
 2. [Actionability] Create a PR with src/api_client.py + tests