mirror of https://github.com/affaan-m/everything-claude-code.git synced 2026-04-01 22:53:27 +08:00

haochen806 4cdfe709ab feat: add GAN-style generator-evaluator harness (#1029)

Implements Anthropic's March 2026 harness design pattern — a multi-agent
architecture that separates generation from evaluation, creating an
adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-authored-by: Hao Chen <haochen806@gmail.com>

2026-03-31 14:06:20 -07:00

README.md

GAN-Style Harness Examples

Examples showing how to use the Generator-Evaluator harness for different project types.

Quick Start

# Full-stack web app (uses all three agents)
./scripts/gan-harness.sh "Build a project management app with Kanban boards and team collaboration"

# Frontend design (skip planner, focus on design iterations)
GAN_SKIP_PLANNER=true ./scripts/gan-harness.sh "Create a stunning landing page for a crypto portfolio tracker"

# API-only (no browser testing needed)
GAN_EVAL_MODE=code-only ./scripts/gan-harness.sh "Build a REST API for a recipe sharing platform with search and ratings"

# Tight budget (fewer iterations, lower threshold)
GAN_MAX_ITERATIONS=5 GAN_PASS_THRESHOLD=6.5 ./scripts/gan-harness.sh "Build a todo app with categories and due dates"
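
The environment variables above can be pictured as overrides on top of fallback values. The sketch below shows one way a wrapper might resolve them; it is not the code of scripts/gan-harness.sh. The function name resolve_settings is hypothetical, and the fallback values are placeholders borrowed from the "Full-stack web app" row of the settings table, not the script's documented defaults.

```shell
#!/bin/sh
# Sketch only: resolve harness settings from the environment.
# ASSUMPTION: the fallback values are placeholders, not the real defaults
# of scripts/gan-harness.sh.
resolve_settings() {
  # ${VAR:-fallback} expands to the fallback when VAR is unset or empty.
  max_iterations="${GAN_MAX_ITERATIONS:-15}"
  pass_threshold="${GAN_PASS_THRESHOLD:-7.0}"
  eval_mode="${GAN_EVAL_MODE:-playwright}"
  skip_planner="${GAN_SKIP_PLANNER:-false}"

  printf 'max_iterations=%s pass_threshold=%s eval_mode=%s skip_planner=%s\n' \
    "$max_iterations" "$pass_threshold" "$eval_mode" "$skip_planner"
}

# Print the settings that the current environment would produce.
resolve_settings
```

Printing the resolved values before a long run is a cheap way to catch a mistyped variable name, since a typo silently falls back to the default.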

Example: Using the Command

# In Claude Code interactive mode:
/project:gan-build "Build a music streaming dashboard with playlists, visualizer, and social features"

# With options:
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5 --eval-mode screenshot

Example: Manual Three-Agent Run

For maximum control, run each agent separately:

# Step 1: Plan (produces spec.md)
claude -p --model opus "$(cat agents/gan-planner.md)

Your brief: 'Build a retro game maker with sprite editor and level designer'

Write the full spec to gan-harness/spec.md and eval rubric to gan-harness/eval-rubric.md."

# Step 2: Generate (iteration 1)
claude -p --model opus "$(cat agents/gan-generator.md)

Iteration 1. Read gan-harness/spec.md. Build the initial application.
Start dev server on port 3000. Commit as iteration-001."

# Step 3: Evaluate (iteration 1)
claude -p --model opus "$(cat agents/gan-evaluator.md)

Iteration 1. Read gan-harness/eval-rubric.md.
Test http://localhost:3000. Write feedback to gan-harness/feedback/feedback-001.md.
Be ruthlessly strict."

# Step 4: Generate (iteration 2 — reads feedback)
claude -p --model opus "$(cat agents/gan-generator.md)

Iteration 2. Read gan-harness/feedback/feedback-001.md FIRST.
Address every issue. Then read gan-harness/spec.md for remaining features.
Commit as iteration-002."

# Repeat steps 3-4 until satisfied
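
Repeating steps 3-4 by hand mostly means bumping the iteration number and pointing the generator at the previous feedback file. The dry-run sketch below builds the per-iteration prompts from the wording used in the steps above and prints them instead of calling claude. The helper names (iter_label, feedback_path, generator_prompt, evaluator_prompt) and the MAX_ITERATIONS variable are hypothetical, not part of the harness.

```shell
#!/bin/sh
# Dry-run sketch: print the generator/evaluator prompts for each iteration.
# For a real run, prepend the agent definition and pass the prompt to the CLI,
# as in the manual steps above, e.g.:
#   claude -p --model opus "$(cat agents/gan-generator.md)
#
#   $(generator_prompt 2)"

# Zero-padded iteration label: 1 -> 001
iter_label() { printf '%03d' "$1"; }

# Feedback file written by the evaluator for a given iteration.
feedback_path() { printf 'gan-harness/feedback/feedback-%s.md' "$(iter_label "$1")"; }

generator_prompt() {
  i="$1"
  if [ "$i" -eq 1 ]; then
    printf 'Iteration 1. Read gan-harness/spec.md. Build the initial application.\n'
    printf 'Start dev server on port 3000. Commit as iteration-001.'
  else
    # Later iterations read the previous iteration's feedback first.
    printf 'Iteration %d. Read %s FIRST.\n' "$i" "$(feedback_path $((i - 1)))"
    printf 'Address every issue. Then read gan-harness/spec.md for remaining features.\n'
    printf 'Commit as iteration-%s.' "$(iter_label "$i")"
  fi
}

evaluator_prompt() {
  printf 'Iteration %d. Read gan-harness/eval-rubric.md.\n' "$1"
  printf 'Test http://localhost:3000. Write feedback to %s.\n' "$(feedback_path "$1")"
  printf 'Be ruthlessly strict.'
}

# MAX_ITERATIONS is a hypothetical knob for this sketch only.
max_iterations="${MAX_ITERATIONS:-2}"
n=1
while [ "$n" -le "$max_iterations" ]; do
  printf '%s\n' "# generator, iteration $n"
  generator_prompt "$n"; printf '\n'
  printf '%s\n' "# evaluator, iteration $n"
  evaluator_prompt "$n"; printf '\n'
  n=$((n + 1))
done
```

Keeping the loop as a dry run makes it easy to review the prompts before spending model budget on them.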

Example: Custom Evaluation Criteria

For non-visual projects (APIs, CLIs, libraries), customize the rubric:

mkdir -p gan-harness
cat > gan-harness/eval-rubric.md << 'EOF'
# API Evaluation Rubric

### Correctness (weight: 0.4)
- Do all endpoints return expected data?
- Are edge cases handled (empty inputs, large payloads)?
- Do error responses have proper status codes?

### Performance (weight: 0.2)
- Response times under 100ms for simple queries?
- Database queries optimized (no N+1)?
- Pagination implemented for list endpoints?

### Security (weight: 0.2)
- Input validation on all endpoints?
- SQL injection prevention?
- Rate limiting implemented?
- Authentication properly enforced?

### Documentation (weight: 0.2)
- OpenAPI spec generated?
- All endpoints documented?
- Example requests/responses provided?
EOF

GAN_SKIP_PLANNER=true GAN_EVAL_MODE=code-only ./scripts/gan-harness.sh "Build a REST API for task management"
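
The rubric weights above sum to 1.0, so a natural reading is that the overall score is a weighted average of the four category scores. The sketch below works through that arithmetic and compares the result with a pass threshold. How the evaluator actually combines scores is an assumption here, and weighted_score and passes are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: combine per-category scores using the rubric weights above.
# ASSUMPTION: the overall score is a plain weighted sum; the evaluator agent
# may aggregate differently.

# Usage: weighted_score CORRECTNESS PERFORMANCE SECURITY DOCUMENTATION
weighted_score() {
  awk -v c="$1" -v p="$2" -v s="$3" -v d="$4" \
    'BEGIN { printf "%.1f\n", c * 0.4 + p * 0.2 + s * 0.2 + d * 0.2 }'
}

# Usage: passes SCORE THRESHOLD  (exit status 0 when SCORE >= THRESHOLD)
passes() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s >= t) }'
}

# Worked example: 8*0.4 + 6*0.2 + 7*0.2 + 9*0.2 = 3.2 + 1.2 + 1.4 + 1.8 = 7.6
score="$(weighted_score 8 6 7 9)"
if passes "$score" 7.0; then
  echo "PASS ($score)"
else
  echo "FAIL ($score)"
fi
```

Because Correctness carries twice the weight of any other category, a one-point gain there moves the total by 0.4, while the same gain elsewhere moves it by 0.2.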

Project Types and Recommended Settings

| Project Type | Eval Mode | Iterations | Threshold | Est. Cost |
|---|---|---|---|---|
| Full-stack web app | playwright | 10-15 | 7.0 | $100-200 |
| Landing page | screenshot | 5-8 | 7.5 | $30-60 |
| REST API | code-only | 5-8 | 7.0 | $30-60 |
| CLI tool | code-only | 3-5 | 6.5 | $15-30 |
| Data dashboard | playwright | 8-12 | 7.0 | $60-120 |
| Game | playwright | 10-15 | 7.0 | $100-200 |
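
The table can be turned into a small lookup so the settings do not have to be retyped for each run. The sketch below maps a project type to the environment variables used in Quick Start. The function name recommended_settings and the short type keys are hypothetical, and taking the upper end of each iteration range is a choice made for this sketch, not a recommendation from the harness.

```shell
#!/bin/sh
# Sketch only: look up the table's recommended settings by project type.
# ASSUMPTION: the upper bound of each "Iterations" range is used.
recommended_settings() {
  case "$1" in
    full-stack|game)
      echo "GAN_EVAL_MODE=playwright GAN_MAX_ITERATIONS=15 GAN_PASS_THRESHOLD=7.0" ;;
    landing-page)
      echo "GAN_EVAL_MODE=screenshot GAN_MAX_ITERATIONS=8 GAN_PASS_THRESHOLD=7.5" ;;
    rest-api)
      echo "GAN_EVAL_MODE=code-only GAN_MAX_ITERATIONS=8 GAN_PASS_THRESHOLD=7.0" ;;
    cli)
      echo "GAN_EVAL_MODE=code-only GAN_MAX_ITERATIONS=5 GAN_PASS_THRESHOLD=6.5" ;;
    dashboard)
      echo "GAN_EVAL_MODE=playwright GAN_MAX_ITERATIONS=12 GAN_PASS_THRESHOLD=7.0" ;;
    *)
      echo "unknown project type: $1" >&2
      return 1 ;;
  esac
}

# Print the settings for a REST API project.
recommended_settings rest-api

# Possible usage (not executed here):
#   env $(recommended_settings rest-api) ./scripts/gan-harness.sh "Build a REST API for task management"
```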

Understanding the Output

After each run, check:

- gan-harness/build-report.md: final summary with score progression
- gan-harness/feedback/: all evaluation feedback (useful for understanding how quality evolved)
- gan-harness/spec.md: the full spec (useful if you want to continue manually)
- Score progression: scores should rise steadily from one iteration to the next. If the score stops rising for several consecutive iterations, the model has likely hit its ceiling.
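
A plateau can be checked mechanically once the scores are copied out of the build report. The sketch below takes the scores as arguments, since the report's exact format is not described here, and flags a plateau when the gain over the last two iterations falls below a minimum. Both the plateaued helper and that definition of "plateau" are assumptions of this sketch.

```shell
#!/bin/sh
# Sketch only: flag a plateau in a score progression.
# ASSUMPTION: "plateau" means the score rose by less than MIN_GAIN over the
# last two iterations. Scores are passed in by hand.

# Usage: plateaued MIN_GAIN SCORE_1 SCORE_2 ... SCORE_N
plateaued() {
  min_gain="$1"
  shift
  printf '%s\n' "$@" | awk -v g="$min_gain" '
    { s[NR] = $1 }                      # one score per input line
    END {
      if (NR < 3) { print "no"; exit }  # too few iterations to judge
      if (s[NR] - s[NR - 2] < g) print "yes"; else print "no"
    }'
}

# Example: the score has not moved over the last two iterations.
plateaued 0.2 5.0 6.0 6.8 6.8 6.8
```

When the check reports a plateau below the pass threshold, further iterations are unlikely to be worth their cost without changing the spec or the rubric.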

Tips

- Start with a clear brief: "Build X with Y and Z" beats "make something cool".
- Don't go below 5 iterations for most projects: the first 2-3 iterations usually score below the threshold. Small CLI tools, listed at 3-5 iterations in the table above, are the exception.
- Use playwright mode for UI projects: screenshot-only evaluation misses interaction bugs.
- Review the feedback files: even if the final score passes, the feedback contains valuable insights.
- Iterate on the spec: if results are disappointing, improve spec.md and run again with --skip-planner (or GAN_SKIP_PLANNER=true when calling the shell script directly).