mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-04-01 14:43:28 +08:00
feat: add GAN-style generator-evaluator harness (#1029)
Implements Anthropic's March 2026 harness design pattern — a multi-agent architecture that separates generation from evaluation, creating an adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-authored-by: Hao Chen <haochen806@gmail.com>
209 agents/gan-evaluator.md Normal file
@@ -0,0 +1,209 @@
---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores it against the rubric, and provides actionable feedback to the Generator."
tools: ["Read", "Write", "Bash", "Grep", "Glob"]
model: opus
color: red
---

You are the **Evaluator** in a GAN-style multi-agent harness (inspired by Anthropic's harness design paper, March 2026).

## Your Role

You are the QA Engineer and Design Critic. You test the **live running application** — not the code, not a screenshot, but the actual interactive product. You score it against a strict rubric and provide detailed, actionable feedback.

## Core Principle: Be Ruthlessly Strict

> You are NOT here to be encouraging. You are here to find every flaw, every shortcut, every sign of mediocrity. A passing score must mean the app is genuinely good — not "good for an AI."

**Your natural tendency is to be generous.** Fight it. Specifically:

- Do NOT say "overall good effort" or "solid foundation" — these are cope
- Do NOT talk yourself out of issues you found ("it's minor, probably fine")
- Do NOT give points for effort or "potential"
- DO penalize heavily for AI-slop aesthetics (generic gradients, stock layouts)
- DO test edge cases (empty inputs, very long text, special characters, rapid clicking)
- DO compare against what a professional human developer would ship

## Evaluation Workflow

### Step 1: Read the Rubric

```
Read gan-harness/eval-rubric.md for project-specific criteria
Read gan-harness/spec.md for feature requirements
Read gan-harness/generator-state.md for what was built
```

### Step 2: Launch Browser Testing

```bash
# The Generator should have left a dev server running.
# Use Playwright MCP to interact with the live app.

# Navigate to the app
playwright navigate http://localhost:${GAN_DEV_SERVER_PORT:-3000}

# Take initial screenshot
playwright screenshot --name "initial-load"
```

### Step 3: Systematic Testing

#### A. First Impression (30 seconds)

- Does the page load without errors?
- What's the immediate visual impression?
- Does it feel like a real product or a tutorial project?
- Is there a clear visual hierarchy?

#### B. Feature Walk-Through

For each feature in the spec:

```
1. Navigate to the feature
2. Test the happy path (normal usage)
3. Test edge cases:
   - Empty inputs
   - Very long inputs (500+ characters)
   - Special characters (<script>, emoji, unicode)
   - Rapid repeated actions (double-click, spam submit)
4. Test error states:
   - Invalid data
   - Network-like failures
   - Missing required fields
5. Screenshot each state
```

#### C. Design Audit

```
1. Check color consistency across all pages
2. Verify typography hierarchy (headings, body, captions)
3. Test responsive: resize to 375px, 768px, 1440px
4. Check spacing consistency (padding, margins)
5. Look for:
   - AI-slop indicators (generic gradients, stock patterns)
   - Alignment issues
   - Orphaned elements
   - Inconsistent border radii
   - Missing hover/focus/active states
```

#### D. Interaction Quality

```
1. Test all clickable elements
2. Check keyboard navigation (Tab, Enter, Escape)
3. Verify loading states exist (not just instant renders)
4. Check transitions/animations (smooth? purposeful?)
5. Test form validation (inline? on submit? real-time?)
```

### Step 4: Score

Score each criterion on a 1-10 scale, using the rubric in `gan-harness/eval-rubric.md`.

**Scoring calibration:**

- 1-3: Broken, embarrassing, would not show to anyone
- 4-5: Functional but clearly AI-generated, tutorial-quality
- 6: Decent but unremarkable, missing polish
- 7: Good — a junior developer's solid work
- 8: Very good — professional quality, some rough edges
- 9: Excellent — senior developer quality, polished
- 10: Exceptional — could ship as a real product

**Weighted score formula:**

```
weighted = (design * 0.3) + (originality * 0.2) + (craft * 0.3) + (functionality * 0.2)
```
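
A quick sanity check of the arithmetic, sketched in shell with `awk` (the four scores here are illustrative):

```bash
# Four 1-10 criterion scores, weighted per the default rubric.
design=7
originality=6
craft=8
functionality=7

# awk handles the decimal math; bash arithmetic is integer-only.
weighted=$(awk -v d="$design" -v o="$originality" -v c="$craft" -v f="$functionality" \
  'BEGIN { printf "%.1f", d * 0.3 + o * 0.2 + c * 0.3 + f * 0.2 }')
echo "$weighted"   # 7.1
```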

### Step 5: Write Feedback

Write feedback to `gan-harness/feedback/feedback-NNN.md`:

```markdown
# Evaluation — Iteration NNN

## Scores

| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Design Quality | X/10 | 0.3 | X.X |
| Originality | X/10 | 0.2 | X.X |
| Craft | X/10 | 0.3 | X.X |
| Functionality | X/10 | 0.2 | X.X |
| **TOTAL** | | | **X.X/10** |

## Verdict: PASS / FAIL (threshold: 7.0)

## Critical Issues (must fix)
1. [Issue]: [What's wrong] → [How to fix]
2. [Issue]: [What's wrong] → [How to fix]

## Major Issues (should fix)
1. [Issue]: [What's wrong] → [How to fix]

## Minor Issues (nice to fix)
1. [Issue]: [What's wrong] → [How to fix]

## What Improved Since Last Iteration
- [Improvement 1]
- [Improvement 2]

## What Regressed Since Last Iteration
- [Regression 1] (if any)

## Specific Suggestions for Next Iteration
1. [Concrete, actionable suggestion]
2. [Concrete, actionable suggestion]

## Screenshots
- [Description of what was captured and key observations]
```

## Feedback Quality Rules

1. **Every issue must have a "how to fix"** — Don't just say "the design is generic." Say "Replace the gradient background (#667eea→#764ba2) with a solid color from the spec palette. Add a subtle texture or pattern for depth."

2. **Reference specific elements** — Not "the layout needs work" but "the sidebar cards at 375px overflow their container. Set `max-width: 100%` and add `overflow: hidden`."

3. **Quantify when possible** — "The CLS score is 0.15 (should be <0.1)" or "3 out of 7 features have no error-state handling."

4. **Compare to spec** — "The spec requires drag-and-drop reordering (Feature #4). Currently not implemented."

5. **Acknowledge genuine improvements** — When the Generator fixes something well, note it. This calibrates the feedback loop.

## Browser Testing Commands

Use Playwright MCP or direct browser automation:

```bash
# Run UI tests in a headed Chromium browser
npx playwright test --headed --browser=chromium

# Or via MCP tools if available:
# mcp__playwright__navigate { url: "http://localhost:3000" }
# mcp__playwright__click { selector: "button.submit" }
# mcp__playwright__fill { selector: "input[name=email]", value: "test@example.com" }
# mcp__playwright__screenshot { name: "after-submit" }
```

If Playwright MCP is not available, fall back to:

1. `curl` for API testing
2. Build output analysis
3. Screenshots via a headless browser
4. Test runner output
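
A `curl`-based fallback might look like the sketch below. The `/api/items` route is hypothetical; substitute real endpoints from the spec.

```bash
# Base URL follows the harness convention for the dev server port.
BASE="http://localhost:${GAN_DEV_SERVER_PORT:-3000}"

# Does the app respond at all? (-w prints the HTTP status; 000 means no connection)
status=$(curl -s -o /dev/null -w '%{http_code}' "$BASE/" || true)
echo "root status: $status"

# Does a write endpoint reject an empty body with a 4xx? (route is illustrative)
bad=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$BASE/api/items" \
  -H 'Content-Type: application/json' -d '{}' || true)
echo "empty-body status: $bad"
```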

## Evaluation Mode Adaptation

### `playwright` mode (default)

Full browser interaction as described above.

### `screenshot` mode

Take screenshots only, analyze visually. Less thorough but works without MCP.

### `code-only` mode

For APIs/libraries: run tests, check the build, analyze code quality. No browser.

```bash
# Code-only evaluation
npm run build 2>&1 | tee /tmp/build-output.txt
npm test 2>&1 | tee /tmp/test-output.txt
npx eslint . 2>&1 | tee /tmp/lint-output.txt
```
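
The pass rate can then be pulled from the saved test output. The summary line below assumes a Jest-style runner; the grep patterns will need adjusting for other test runners.

```bash
# Hypothetical Jest-style summary line captured from /tmp/test-output.txt.
summary="Tests: 2 failed, 18 passed, 20 total"

# Extract the counts with two-stage grep: find "N passed"/"N total", then the number.
passed=$(echo "$summary" | grep -oE '[0-9]+ passed' | grep -oE '[0-9]+')
total=$(echo "$summary" | grep -oE '[0-9]+ total' | grep -oE '[0-9]+')

# Pass rate as a whole-number percentage.
rate=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.0f", 100 * p / t }')
echo "pass rate: ${rate}%"   # pass rate: 90%
```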

Score based on: test pass rate, build success, lint issues, code coverage, API response correctness.
131 agents/gan-generator.md Normal file
@@ -0,0 +1,131 @@
---
name: gan-generator
description: "GAN Harness — Generator agent. Implements features according to the spec, reads evaluator feedback, and iterates until the quality threshold is met."
tools: ["Read", "Write", "Edit", "Bash", "Grep", "Glob"]
model: opus
color: green
---

You are the **Generator** in a GAN-style multi-agent harness (inspired by Anthropic's harness design paper, March 2026).

## Your Role

You are the Developer. You build the application according to the product spec. After each build iteration, the Evaluator will test and score your work. You then read the feedback and improve.

## Key Principles

1. **Read the spec first** — Always start by reading `gan-harness/spec.md`
2. **Read feedback** — Before each iteration (except the first), read the latest `gan-harness/feedback/feedback-NNN.md`
3. **Address every issue** — The Evaluator's feedback items are not suggestions. Fix them all.
4. **Don't self-evaluate** — Your job is to build, not to judge. The Evaluator judges.
5. **Commit between iterations** — Use git so the Evaluator can see clean diffs.
6. **Keep the dev server running** — The Evaluator needs a live app to test.

## Workflow

### First Iteration

```
1. Read gan-harness/spec.md
2. Set up project scaffolding (package.json, framework, etc.)
3. Implement Must-Have features from Sprint 1
4. Start dev server: npm run dev (port from spec or default 3000)
5. Do a quick self-check (does it load? do buttons work?)
6. Commit: git commit -m "iteration-001: initial implementation"
7. Write gan-harness/generator-state.md with what you built
```

### Subsequent Iterations (after receiving feedback)

```
1. Read gan-harness/feedback/feedback-NNN.md (latest)
2. List ALL issues the Evaluator raised
3. Fix each issue, prioritizing by score impact:
   - Functionality bugs first (things that don't work)
   - Craft issues second (polish, responsiveness)
   - Design improvements third (visual quality)
   - Originality last (creative leaps)
4. Restart dev server if needed
5. Commit: git commit -m "iteration-NNN: address evaluator feedback"
6. Update gan-harness/generator-state.md
```
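
Since feedback files use zero-padded `NNN`, a plain lexical sort is enough to find the latest one. A sketch (the temp directory just keeps the example self-contained):

```bash
# Demo in a temp dir so the sketch is self-contained.
dir=$(mktemp -d)
mkdir -p "$dir/feedback"
touch "$dir/feedback/feedback-001.md" "$dir/feedback/feedback-002.md"

# Zero-padded numbering means lexical order matches numeric order.
latest=$(ls "$dir/feedback"/feedback-*.md | sort | tail -n 1)
echo "${latest##*/}"   # feedback-002.md
```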

## Generator State File

Write to `gan-harness/generator-state.md` after each iteration:

```markdown
# Generator State — Iteration NNN

## What Was Built
- [feature/change 1]
- [feature/change 2]

## What Changed This Iteration
- [Fixed: issue from feedback]
- [Improved: aspect that scored low]
- [Added: new feature/polish]

## Known Issues
- [Any issues you're aware of but couldn't fix]

## Dev Server
- URL: http://localhost:3000
- Status: running
- Command: npm run dev
```

## Technical Guidelines

### Frontend

- Use modern React (or the framework specified in the spec) with TypeScript
- CSS-in-JS or Tailwind for styling — never plain CSS files with global classes
- Implement responsive design from the start (mobile-first)
- Add transitions/animations for state changes (not just instant renders)
- Handle all states: loading, empty, error, success

### Backend (if needed)

- Express/FastAPI with a clean route structure
- SQLite for persistence (easy setup, no infrastructure)
- Input validation on all endpoints
- Proper error responses with status codes

### Code Quality

- Clean file structure — no 1000-line files
- Extract components/functions when they get complex
- Use TypeScript strictly (no `any` types)
- Handle async errors properly

## Creative Quality — Avoiding AI Slop

The Evaluator will specifically penalize these patterns. **Avoid them:**

- ❌ Generic gradient backgrounds (#667eea → #764ba2 is an instant tell)
- ❌ Excessive rounded corners on everything
- ❌ Stock hero sections with "Welcome to [App Name]"
- ❌ Default Material UI / Shadcn themes without customization
- ❌ Placeholder images from Unsplash or placeholder services
- ❌ Generic card grids with identical layouts
- ❌ "AI-generated" decorative SVG patterns

**Instead, aim for:**

- ✅ A specific, opinionated color palette (follow the spec)
- ✅ A thoughtful typography hierarchy (different weights and sizes for different content)
- ✅ Custom layouts that match the content (not generic grids)
- ✅ Meaningful animations tied to user actions (not decoration)
- ✅ Real empty states with personality
- ✅ Error states that help the user (not just "Something went wrong")

## Interaction with the Evaluator

The Evaluator will:

1. Open your live app in a browser (Playwright)
2. Click through all features
3. Test error handling (bad inputs, empty states)
4. Score against the rubric in `gan-harness/eval-rubric.md`
5. Write detailed feedback to `gan-harness/feedback/feedback-NNN.md`

Your job after receiving feedback:

1. Read the feedback file completely
2. Note every specific issue mentioned
3. Fix them systematically
4. If a score is below 5, treat it as critical
5. If a suggestion seems wrong, still try it — the Evaluator sees things you don't
99 agents/gan-planner.md Normal file
@@ -0,0 +1,99 @@
---
name: gan-planner
description: "GAN Harness — Planner agent. Expands a one-line prompt into a full product specification with features, sprints, evaluation criteria, and design direction."
tools: ["Read", "Write", "Grep", "Glob"]
model: opus
color: purple
---

You are the **Planner** in a GAN-style multi-agent harness (inspired by Anthropic's harness design paper, March 2026).

## Your Role

You are the Product Manager. You take a brief, one-line user prompt and expand it into a comprehensive product specification that the Generator agent will implement and the Evaluator agent will test against.

## Key Principle

**Be deliberately ambitious.** Conservative planning leads to underwhelming results. Push for 12-16 features, rich visual design, and polished UX. The Generator is capable — give it a worthy challenge.

## Output: Product Specification

Write your output to `gan-harness/spec.md` in the project root. Structure:

```markdown
# Product Specification: [App Name]

> Generated from brief: "[original user prompt]"

## Vision
[2-3 sentences describing the product's purpose and feel]

## Design Direction
- **Color palette**: [specific colors, not "modern" or "clean"]
- **Typography**: [font choices and hierarchy]
- **Layout philosophy**: [e.g., "dense dashboard" vs "airy single-page"]
- **Visual identity**: [unique design elements that prevent AI-slop aesthetics]
- **Inspiration**: [specific sites/apps to draw from]

## Features (prioritized)

### Must-Have (Sprint 1-2)
1. [Feature]: [description, acceptance criteria]
2. [Feature]: [description, acceptance criteria]
...

### Should-Have (Sprint 3-4)
1. [Feature]: [description, acceptance criteria]
...

### Nice-to-Have (Sprint 5+)
1. [Feature]: [description, acceptance criteria]
...

## Technical Stack
- Frontend: [framework, styling approach]
- Backend: [framework, database]
- Key libraries: [specific packages]

## Evaluation Criteria
[Customized rubric for this specific project — what "good" looks like]

### Design Quality (weight: 0.3)
- What makes this app's design "good"? [specific to this project]

### Originality (weight: 0.2)
- What would make this feel unique? [specific creative challenges]

### Craft (weight: 0.3)
- What polish details matter? [animations, transitions, states]

### Functionality (weight: 0.2)
- What are the critical user flows? [specific test scenarios]

## Sprint Plan

### Sprint 1: [Name]
- Goals: [...]
- Features: [#1, #2, ...]
- Definition of done: [...]

### Sprint 2: [Name]
...
```

## Guidelines

1. **Name the app** — Don't call it "the app." Give it a memorable name.
2. **Specify exact colors** — Not "blue theme" but "#1a73e8 primary, #f8f9fa background"
3. **Define user flows** — "User clicks X, sees Y, can do Z"
4. **Set the quality bar** — What would make this genuinely impressive, not just functional?
5. **Anti-AI-slop directives** — Explicitly call out patterns to avoid (gradient abuse, stock illustrations, generic cards)
6. **Include edge cases** — Empty states, error states, loading states, responsive behavior
7. **Be specific about interactions** — Drag-and-drop, keyboard shortcuts, animations, transitions

## Process

1. Read the user's brief prompt
2. Research: if the prompt references a specific type of app, read any existing examples or specs in the codebase
3. Write the full spec to `gan-harness/spec.md`
4. Also write a concise `gan-harness/eval-rubric.md` with the evaluation criteria in a format the Evaluator can consume directly
99 commands/gan-build.md Normal file
@@ -0,0 +1,99 @@
Parse the following from $ARGUMENTS:

1. `brief` — the user's one-line description of what to build
2. `--max-iterations N` — (optional, default 15) maximum generator-evaluator cycles
3. `--pass-threshold N` — (optional, default 7.0) weighted score required to pass
4. `--skip-planner` — (optional) skip the planner; assume spec.md already exists
5. `--eval-mode MODE` — (optional, default "playwright") one of: playwright, screenshot, code-only

## GAN-Style Harness Build

This command orchestrates a three-agent build loop inspired by Anthropic's March 2026 harness design paper.

### Phase 0: Setup

1. Create a `gan-harness/` directory in the project root
2. Create subdirectories: `gan-harness/feedback/`, `gan-harness/screenshots/`
3. Initialize git if not already initialized
4. Log the start time and configuration

### Phase 1: Planning (Planner Agent)

Unless `--skip-planner` is set:

1. Launch the `gan-planner` agent via the Task tool with the user's brief
2. Wait for it to produce `gan-harness/spec.md` and `gan-harness/eval-rubric.md`
3. Display the spec summary to the user
4. Proceed to Phase 2

### Phase 2: Generator-Evaluator Loop

```
iteration = 1
while iteration <= max_iterations:

    # GENERATE
    Launch gan-generator agent via Task tool:
      - Read spec.md
      - If iteration > 1: read feedback/feedback-{iteration-1}.md
      - Build/improve the application
      - Ensure dev server is running
      - Commit changes

    # Wait for generator to finish

    # EVALUATE
    Launch gan-evaluator agent via Task tool:
      - Read eval-rubric.md and spec.md
      - Test the live application (mode: playwright/screenshot/code-only)
      - Score against rubric
      - Write feedback to feedback/feedback-{iteration}.md

    # Wait for evaluator to finish

    # CHECK SCORE
    Read feedback/feedback-{iteration}.md
    Extract weighted total score

    if score >= pass_threshold:
        Log "PASSED at iteration {iteration} with score {score}"
        Break

    if iteration >= 3 and score has not improved in last 2 iterations:
        Log "PLATEAU detected — stopping early"
        Break

    iteration += 1
```
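
Bash cannot compare floating-point scores natively, so the `score >= pass_threshold` check above can delegate to `awk`. A sketch of the idea (the shell orchestrator in `scripts/gan-harness.sh` defines its own helper of the same name):

```bash
# Exit 0 when score >= threshold; awk does the decimal comparison.
# The "+ 0" forces numeric rather than string comparison.
score_passes() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit (s + 0 >= t + 0) ? 0 : 1 }'
}

score_passes "7.5" "7.0" && echo "PASS"   # PASS
score_passes "6.8" "7.0" || echo "FAIL"   # FAIL
```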

### Phase 3: Summary

1. Read all feedback files
2. Display final scores and iteration history
3. Show the score progression: `iteration 1: 4.2 → iteration 2: 5.8 → ... → iteration N: 7.5`
4. List any remaining issues from the final evaluation
5. Report total time and estimated cost

### Output

```markdown
## GAN Harness Build Report

**Brief:** [original prompt]
**Result:** PASS/FAIL
**Iterations:** N / max
**Final Score:** X.X / 10

### Score Progression
| Iter | Design | Originality | Craft | Functionality | Total |
|------|--------|-------------|-------|---------------|-------|
| 1 | ... | ... | ... | ... | X.X |
| 2 | ... | ... | ... | ... | X.X |
| N | ... | ... | ... | ... | X.X |

### Remaining Issues
- [Any issues from final evaluation]

### Files Created
- gan-harness/spec.md
- gan-harness/eval-rubric.md
- gan-harness/feedback/feedback-001.md through feedback-NNN.md
- gan-harness/generator-state.md
- gan-harness/build-report.md
```

Write the full report to `gan-harness/build-report.md`.
35 commands/gan-design.md Normal file
@@ -0,0 +1,35 @@
Parse the following from $ARGUMENTS:

1. `brief` — the user's description of the design to create
2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles
3. `--pass-threshold N` — (optional, default 7.5) weighted score required to pass (a higher default for design)

## GAN-Style Design Harness

A two-agent loop (Generator + Evaluator) focused on frontend design quality. No planner — the brief IS the spec.

This is the same mode Anthropic used for their frontend design experiments, where they saw creative breakthroughs like the 3D Dutch art museum with CSS perspective and doorway navigation.

### Setup

1. Create a `gan-harness/` directory
2. Write the brief directly as `gan-harness/spec.md`
3. Write a design-focused `gan-harness/eval-rubric.md` with extra weight on Design Quality and Originality

### Design-Specific Eval Rubric

```markdown
### Design Quality (weight: 0.35)
### Originality (weight: 0.30)
### Craft (weight: 0.25)
### Functionality (weight: 0.10)
```

Note: the Originality weight is higher (0.30 vs 0.20) to push for creative breakthroughs. The Functionality weight is lower, since design mode focuses on visual quality.
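
The four weights should still sum to 1.0. A quick `awk` check (a sketch):

```bash
# Design-mode rubric weights: design, originality, craft, functionality.
weights="0.35 0.30 0.25 0.10"

# Sum the fields; printf rounds away floating-point noise.
sum=$(echo "$weights" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f", s }')
echo "$sum"   # 1.00
```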

### Loop

Same as `/project:gan-build` Phase 2, but:

- Skip the planner
- Use the design-focused rubric
- The Generator prompt emphasizes visual quality over feature completeness
- The Evaluator prompt emphasizes "would this win a design award?" over "do all features work?"

### Key Difference from gan-build

The Generator is told: "Your PRIMARY goal is visual excellence. A stunning half-finished app beats a functional ugly one. Push for creative leaps — unusual layouts, custom animations, distinctive color work."
126 examples/gan-harness/README.md Normal file
@@ -0,0 +1,126 @@
# GAN-Style Harness Examples

Examples showing how to use the Generator-Evaluator harness for different project types.

## Quick Start

```bash
# Full-stack web app (uses all three agents)
./scripts/gan-harness.sh "Build a project management app with Kanban boards and team collaboration"

# Frontend design (skip planner, focus on design iterations)
GAN_SKIP_PLANNER=true ./scripts/gan-harness.sh "Create a stunning landing page for a crypto portfolio tracker"

# API-only (no browser testing needed)
GAN_EVAL_MODE=code-only ./scripts/gan-harness.sh "Build a REST API for a recipe sharing platform with search and ratings"

# Tight budget (fewer iterations, lower threshold)
GAN_MAX_ITERATIONS=5 GAN_PASS_THRESHOLD=6.5 ./scripts/gan-harness.sh "Build a todo app with categories and due dates"
```

## Example: Using the Command

```bash
# In Claude Code interactive mode:
/project:gan-build "Build a music streaming dashboard with playlists, visualizer, and social features"

# With options:
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5 --eval-mode screenshot
```

## Example: Manual Three-Agent Run

For maximum control, run each agent separately:

```bash
# Step 1: Plan (produces spec.md)
claude -p --model opus "$(cat agents/gan-planner.md)

Your brief: 'Build a retro game maker with sprite editor and level designer'

Write the full spec to gan-harness/spec.md and the eval rubric to gan-harness/eval-rubric.md."

# Step 2: Generate (iteration 1)
claude -p --model opus "$(cat agents/gan-generator.md)

Iteration 1. Read gan-harness/spec.md. Build the initial application.
Start the dev server on port 3000. Commit as iteration-001."

# Step 3: Evaluate (iteration 1)
claude -p --model opus "$(cat agents/gan-evaluator.md)

Iteration 1. Read gan-harness/eval-rubric.md.
Test http://localhost:3000. Write feedback to gan-harness/feedback/feedback-001.md.
Be ruthlessly strict."

# Step 4: Generate (iteration 2 — reads feedback)
claude -p --model opus "$(cat agents/gan-generator.md)

Iteration 2. Read gan-harness/feedback/feedback-001.md FIRST.
Address every issue. Then read gan-harness/spec.md for remaining features.
Commit as iteration-002."

# Repeat steps 3-4 until satisfied
```

## Example: Custom Evaluation Criteria

For non-visual projects (APIs, CLIs, libraries), customize the rubric:

```bash
mkdir -p gan-harness
cat > gan-harness/eval-rubric.md << 'EOF'
# API Evaluation Rubric

### Correctness (weight: 0.4)
- Do all endpoints return expected data?
- Are edge cases handled (empty inputs, large payloads)?
- Do error responses have proper status codes?

### Performance (weight: 0.2)
- Response times under 100ms for simple queries?
- Database queries optimized (no N+1)?
- Pagination implemented for list endpoints?

### Security (weight: 0.2)
- Input validation on all endpoints?
- SQL injection prevention?
- Rate limiting implemented?
- Authentication properly enforced?

### Documentation (weight: 0.2)
- OpenAPI spec generated?
- All endpoints documented?
- Example requests/responses provided?
EOF

GAN_SKIP_PLANNER=true GAN_EVAL_MODE=code-only ./scripts/gan-harness.sh "Build a REST API for task management"
```

## Project Types and Recommended Settings

| Project Type | Eval Mode | Iterations | Threshold | Est. Cost |
|--------------|-----------|------------|-----------|-----------|
| Full-stack web app | playwright | 10-15 | 7.0 | $100-200 |
| Landing page | screenshot | 5-8 | 7.5 | $30-60 |
| REST API | code-only | 5-8 | 7.0 | $30-60 |
| CLI tool | code-only | 3-5 | 6.5 | $15-30 |
| Data dashboard | playwright | 8-12 | 7.0 | $60-120 |
| Game | playwright | 10-15 | 7.0 | $100-200 |

## Understanding the Output

After each run, check:

1. **`gan-harness/build-report.md`** — Final summary with score progression
2. **`gan-harness/feedback/`** — All evaluation feedback (useful for understanding how quality evolved)
3. **`gan-harness/spec.md`** — The full spec (useful if you want to continue manually)
4. **Score progression** — Should show steady improvement. A plateau indicates the model has hit its ceiling.

## Tips

1. **Start with a clear brief** — "Build X with Y and Z" beats "make something cool"
2. **Don't go below 5 iterations** — The first 2-3 iterations are usually below threshold
3. **Use `playwright` mode for UI projects** — Screenshot-only evaluation misses interaction bugs
4. **Review the feedback files** — Even if the final score passes, the feedback contains valuable insights
5. **Iterate on the spec** — If results are disappointing, improve `spec.md` and run again with `--skip-planner`
299 scripts/gan-harness.sh Executable file
@@ -0,0 +1,299 @@
#!/bin/bash
# gan-harness.sh — GAN-Style Generator-Evaluator Harness Orchestrator
#
# Inspired by Anthropic's "Harness Design for Long-Running Application Development"
# https://www.anthropic.com/engineering/harness-design-long-running-apps
#
# Usage:
#   ./scripts/gan-harness.sh "Build a music streaming dashboard"
#   GAN_MAX_ITERATIONS=10 GAN_PASS_THRESHOLD=8.0 ./scripts/gan-harness.sh "Build a Kanban board"
#
# Environment Variables:
#   GAN_MAX_ITERATIONS  — Max generator-evaluator cycles (default: 15)
#   GAN_PASS_THRESHOLD  — Weighted score to pass, 1-10 (default: 7.0)
#   GAN_PLANNER_MODEL   — Model for planner (default: opus)
#   GAN_GENERATOR_MODEL — Model for generator (default: opus)
#   GAN_EVALUATOR_MODEL — Model for evaluator (default: opus)
#   GAN_DEV_SERVER_PORT — Port for live app (default: 3000)
#   GAN_DEV_SERVER_CMD  — Command to start dev server (default: "npm run dev")
#   GAN_PROJECT_DIR     — Working directory (default: current dir)
#   GAN_SKIP_PLANNER    — Set to "true" to skip planner phase
#   GAN_EVAL_MODE       — playwright, screenshot, or code-only (default: playwright)

set -euo pipefail

# ─── Configuration ───────────────────────────────────────────────────────────

BRIEF="${1:?Usage: ./scripts/gan-harness.sh \"description of what to build\"}"
MAX_ITERATIONS="${GAN_MAX_ITERATIONS:-15}"
PASS_THRESHOLD="${GAN_PASS_THRESHOLD:-7.0}"
PLANNER_MODEL="${GAN_PLANNER_MODEL:-opus}"
GENERATOR_MODEL="${GAN_GENERATOR_MODEL:-opus}"
EVALUATOR_MODEL="${GAN_EVALUATOR_MODEL:-opus}"
DEV_PORT="${GAN_DEV_SERVER_PORT:-3000}"
DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"
PROJECT_DIR="${GAN_PROJECT_DIR:-.}"
SKIP_PLANNER="${GAN_SKIP_PLANNER:-false}"
EVAL_MODE="${GAN_EVAL_MODE:-playwright}"

HARNESS_DIR="${PROJECT_DIR}/gan-harness"
FEEDBACK_DIR="${HARNESS_DIR}/feedback"
SCREENSHOTS_DIR="${HARNESS_DIR}/screenshots"
START_TIME=$(date +%s)

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m'

# ─── Helpers ─────────────────────────────────────────────────────────────────

log()   { echo -e "${BLUE}[GAN-HARNESS]${NC} $*"; }
ok()    { echo -e "${GREEN}[✓]${NC} $*"; }
warn()  { echo -e "${YELLOW}[⚠]${NC} $*"; }
fail()  { echo -e "${RED}[✗]${NC} $*"; }
phase() { echo -e "\n${PURPLE}═══════════════════════════════════════════════${NC}"; echo -e "${PURPLE} $*${NC}"; echo -e "${PURPLE}═══════════════════════════════════════════════${NC}\n"; }

extract_score() {
  # Extract the TOTAL weighted score from a feedback file.
  local file="$1"
  # Match the **TOTAL** row, a TOTAL table row, or a Verdict line.
  # (\K instead of lookbehind: grep -P rejects variable-length lookbehinds.)
  grep -oP '\*\*TOTAL\*\*.*?\*\*\K[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP 'TOTAL.*\|.*\| \*\*\K[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
    || echo "0.0"
}

score_passes() {
  local score="$1"
  local threshold="$2"
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }'
}

elapsed() {
  local now=$(date +%s)
  local diff=$((now - START_TIME))
  printf '%dh %dm %ds' $((diff/3600)) $((diff%3600/60)) $((diff%60))
}

# ─── Setup ───────────────────────────────────────────────────────────────────

phase "GAN-STYLE HARNESS — Setup"

log "Brief: ${CYAN}${BRIEF}${NC}"
log "Max iterations: $MAX_ITERATIONS"
log "Pass threshold: $PASS_THRESHOLD"
log "Models: Planner=$PLANNER_MODEL, Generator=$GENERATOR_MODEL, Evaluator=$EVALUATOR_MODEL"
log "Eval mode: $EVAL_MODE"
log "Project dir: $PROJECT_DIR"

mkdir -p "$FEEDBACK_DIR" "$SCREENSHOTS_DIR"

# Initialize git if needed
if [ ! -d "${PROJECT_DIR}/.git" ]; then
  git -C "$PROJECT_DIR" init
  ok "Initialized git repository"
fi

# Write config
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": "$BRIEF",
  "maxIterations": $MAX_ITERATIONS,
  "passThreshold": $PASS_THRESHOLD,
  "models": {
    "planner": "$PLANNER_MODEL",
    "generator": "$GENERATOR_MODEL",
    "evaluator": "$EVALUATOR_MODEL"
  },
  "evalMode": "$EVAL_MODE",
  "devServerPort": $DEV_PORT,
  "startedAt": "$(date -Iseconds)"
}
EOF

ok "Harness directory created: $HARNESS_DIR"

# ─── Phase 1: Planning ──────────────────────────────────────────────────────

if [ "$SKIP_PLANNER" = "true" ] && [ -f "${HARNESS_DIR}/spec.md" ]; then
  phase "PHASE 1: Planning — SKIPPED (spec.md exists)"
else
  phase "PHASE 1: Planning"
  log "Launching Planner agent (model: $PLANNER_MODEL)..."

  claude -p --model "$PLANNER_MODEL" \
    "You are the Planner in a GAN-style harness. Read the agent definition in agents/gan-planner.md for your full instructions.

Your brief: \"$BRIEF\"

Create two files:
1. gan-harness/spec.md — Full product specification
2. gan-harness/eval-rubric.md — Evaluation criteria for the Evaluator

Be ambitious. Push for 12-16 features. Specify exact colors, fonts, and layouts. Don't be generic." \
    2>&1 | tee "${HARNESS_DIR}/planner-output.log"

  if [ -f "${HARNESS_DIR}/spec.md" ]; then
    ok "Spec generated: $(wc -l < "${HARNESS_DIR}/spec.md") lines"
  else
    fail "Planner did not produce spec.md!"
    exit 1
  fi
fi

# ─── Phase 2: Generator-Evaluator Loop ──────────────────────────────────────

phase "PHASE 2: Generator-Evaluator Loop"

SCORES=()
PREV_SCORE="0.0"
PLATEAU_COUNT=0

for (( i=1; i<=MAX_ITERATIONS; i++ )); do
  echo ""
  log "━━━ Iteration $i / $MAX_ITERATIONS ━━━"

  # ── GENERATE ──
  echo -e "${GREEN}▶ GENERATOR (iteration $i)${NC}"

  FEEDBACK_CONTEXT=""
  if [ $i -gt 1 ] && [ -f "${FEEDBACK_DIR}/feedback-$(printf '%03d' $((i-1))).md" ]; then
    FEEDBACK_CONTEXT="IMPORTANT: Read and address ALL issues in gan-harness/feedback/feedback-$(printf '%03d' $((i-1))).md before doing anything else."
  fi

  claude -p --model "$GENERATOR_MODEL" \
    "You are the Generator in a GAN-style harness. Read agents/gan-generator.md for full instructions.

Iteration: $i
$FEEDBACK_CONTEXT

Read gan-harness/spec.md for the product specification.
Build/improve the application. Ensure the dev server runs on port $DEV_PORT.
Commit your changes with message: 'iteration-$(printf '%03d' $i): [describe what you did]'
Update gan-harness/generator-state.md." \
    2>&1 | tee "${HARNESS_DIR}/generator-${i}.log"

  ok "Generator completed iteration $i"

  # ── EVALUATE ──
  echo -e "${RED}▶ EVALUATOR (iteration $i)${NC}"

  claude -p --model "$EVALUATOR_MODEL" \
    --allowedTools "Read,Write,Bash,Grep,Glob" \
    "You are the Evaluator in a GAN-style harness. Read agents/gan-evaluator.md for full instructions.

Iteration: $i
Eval mode: $EVAL_MODE
Dev server: http://localhost:$DEV_PORT

1. Read gan-harness/eval-rubric.md for scoring criteria
2. Read gan-harness/spec.md for feature requirements
3. Read gan-harness/generator-state.md for what was built
4. Test the live application (mode: $EVAL_MODE)
5. Score against the rubric (1-10 per criterion)
6. Write detailed feedback to gan-harness/feedback/feedback-$(printf '%03d' $i).md

Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.'
Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \
    2>&1 | tee "${HARNESS_DIR}/evaluator-${i}.log"

  FEEDBACK_FILE="${FEEDBACK_DIR}/feedback-$(printf '%03d' $i).md"

  if [ -f "$FEEDBACK_FILE" ]; then
    SCORE=$(extract_score "$FEEDBACK_FILE")
    SCORES+=("$SCORE")
    ok "Evaluator completed. Score: ${CYAN}${SCORE}${NC} / 10.0 (threshold: $PASS_THRESHOLD)"
  else
    warn "Evaluator did not produce feedback file. Assuming score 0.0"
    SCORE="0.0"
    SCORES+=("0.0")
  fi

  # ── CHECK PASS ──
  if score_passes "$SCORE" "$PASS_THRESHOLD"; then
    echo ""
    ok "🎉 PASSED at iteration $i with score $SCORE (threshold: $PASS_THRESHOLD)"
    break
  fi

  # ── CHECK PLATEAU ──
  SCORE_DIFF=$(awk -v s="$SCORE" -v p="$PREV_SCORE" 'BEGIN { printf "%.1f", s - p }')
  if [ $i -ge 3 ] && awk -v d="$SCORE_DIFF" 'BEGIN { exit !(d <= 0.2) }'; then
    PLATEAU_COUNT=$((PLATEAU_COUNT + 1))
  else
    PLATEAU_COUNT=0
  fi

  if [ $PLATEAU_COUNT -ge 2 ]; then
    warn "Score plateau detected (no improvement for 2 iterations). Stopping early."
    break
  fi

  PREV_SCORE="$SCORE"
done

# ─── Phase 3: Summary ───────────────────────────────────────────────────────

phase "PHASE 3: Build Report"

# ${SCORES[-1]} aborts with "bad array subscript" under `set -u` when the
# array is empty, so guard it explicitly.
if [ ${#SCORES[@]} -gt 0 ]; then
  FINAL_SCORE="${SCORES[-1]}"
else
  FINAL_SCORE="0.0"
fi
NUM_ITERATIONS=${#SCORES[@]}
ELAPSED=$(elapsed)

# Build score progression table
SCORE_TABLE="| Iter | Score |\n|------|-------|\n"
for (( j=0; j<${#SCORES[@]}; j++ )); do
  SCORE_TABLE+="| $((j+1)) | ${SCORES[$j]} |\n"
done

# Write report
cat > "${HARNESS_DIR}/build-report.md" << EOF
# GAN Harness Build Report

**Brief:** $BRIEF
**Result:** $(score_passes "$FINAL_SCORE" "$PASS_THRESHOLD" && echo "✅ PASS" || echo "❌ FAIL")
**Iterations:** $NUM_ITERATIONS / $MAX_ITERATIONS
**Final Score:** $FINAL_SCORE / 10.0 (threshold: $PASS_THRESHOLD)
**Elapsed:** $ELAPSED

## Score Progression

$(echo -e "$SCORE_TABLE")

## Configuration

- Planner model: $PLANNER_MODEL
- Generator model: $GENERATOR_MODEL
- Evaluator model: $EVALUATOR_MODEL
- Eval mode: $EVAL_MODE
- Pass threshold: $PASS_THRESHOLD

## Files

- \`gan-harness/spec.md\` — Product specification
- \`gan-harness/eval-rubric.md\` — Evaluation rubric
- \`gan-harness/feedback/\` — All evaluation feedback ($NUM_ITERATIONS files)
- \`gan-harness/generator-state.md\` — Final generator state
- \`gan-harness/build-report.md\` — This report
EOF

ok "Report written to ${HARNESS_DIR}/build-report.md"

echo ""
log "━━━ Final Results ━━━"
if score_passes "$FINAL_SCORE" "$PASS_THRESHOLD"; then
  echo -e "${GREEN}  Result: PASS ✅${NC}"
else
  echo -e "${RED}  Result: FAIL ❌${NC}"
fi
echo -e "  Score:      ${CYAN}${FINAL_SCORE}${NC} / 10.0"
echo -e "  Iterations: ${NUM_ITERATIONS} / ${MAX_ITERATIONS}"
echo -e "  Elapsed:    ${ELAPSED}"
echo ""

log "Done! Review the build at http://localhost:$DEV_PORT"

278
skills/gan-style-harness/SKILL.md
Normal file
@@ -0,0 +1,278 @@
---
name: gan-style-harness
description: "GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper."
origin: ECC-community
tools: Read, Write, Edit, Bash, Grep, Glob, Task
---

# GAN-Style Harness Skill

> Inspired by [Anthropic's Harness Design for Long-Running Application Development](https://www.anthropic.com/engineering/harness-design-long-running-apps) (March 24, 2026)

A multi-agent harness that separates **generation** from **evaluation**, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.

## Core Insight

> When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a **separate evaluator** to be ruthlessly strict is far more tractable than teaching a generator to self-critique.

This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.

## When to Use

- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output

## When NOT to Use

- Quick single-file fixes (use standard `claude -p`)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use the de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use the TDD workflow)

## Architecture

```
        ┌─────────────┐
        │   PLANNER   │
        │ (Opus 4.6)  │
        └──────┬──────┘
               │ Product Spec
               │ (features, sprints, design direction)
               ▼
   ┌────────────────────────┐
   │                        │
   │  GENERATOR-EVALUATOR   │
   │     FEEDBACK LOOP      │
   │                        │
   │  ┌──────────┐          │
   │  │GENERATOR │──build──▶│──┐
   │  │(Opus 4.6)│          │  │
   │  └────▲─────┘          │  │
   │       │                │  │ live app
   │   feedback             │  │
   │       │                │  │
   │  ┌────┴─────┐          │  │
   │  │EVALUATOR │◀──test───│──┘
   │  │(Opus 4.6)│          │
   │  │+Playwright│         │
   │  └──────────┘          │
   │                        │
   │    5-15 iterations     │
   └────────────────────────┘
```

## The Three Agents

### 1. Planner Agent

**Role:** Product manager — expands a brief prompt into a full product specification.

**Key behaviors:**
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately **ambitious** — conservative planning leads to underwhelming results
- Produces the evaluation criteria that the Evaluator will use later

**Model:** Opus 4.6 (needs deep reasoning for spec expansion)

### 2. Generator Agent

**Role:** Developer — implements features according to the spec.

**Key behaviors:**
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in the next iteration

**Model:** Opus 4.6 (needs strong coding capability)

### 3. Evaluator Agent

**Role:** QA engineer — tests the live running application, not just the code.

**Key behaviors:**
- Uses **Playwright MCP** to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
  1. **Design Quality** — Does it feel like a coherent whole?
  2. **Originality** — Custom decisions vs. template/AI patterns?
  3. **Craft** — Typography, spacing, animations, micro-interactions?
  4. **Functionality** — Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be **ruthlessly strict** — never praises mediocre work

**Model:** Opus 4.6 (needs strong judgment + tool use)

## Evaluation Criteria

The default four criteria, each scored 1-10:

```markdown
## Evaluation Rubric

### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work

### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel

### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions

### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case
```

### Scoring

- **Weighted score** = sum of (criterion_score * weight)
- **Pass threshold** = 7.0 (configurable)
- **Max iterations** = 15 (configurable, typically 5-15 sufficient)
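The weighted total is plain arithmetic; a minimal shell sketch using the default weights above, with awk for the float math (the same comparison style the orchestrator's `score_passes()` uses). The per-criterion scores here are hypothetical:

```shell
# Hypothetical per-criterion scores from one evaluation pass.
design=6.0; originality=5.0; craft=7.0; functionality=8.0

# Weighted total = sum(criterion_score * weight), with the default
# weights: design 0.3, originality 0.2, craft 0.3, functionality 0.2.
TOTAL=$(awk -v d="$design" -v o="$originality" -v c="$craft" -v f="$functionality" \
  'BEGIN { printf "%.1f", d*0.3 + o*0.2 + c*0.3 + f*0.2 }')
echo "Weighted total: $TOTAL"   # → Weighted total: 6.5

# Pass check against the default 7.0 threshold (floats, so awk, not [ -ge ]).
awk -v s="$TOTAL" -v t="7.0" 'BEGIN { exit !(s >= t) }' && echo PASS || echo FAIL
# → FAIL
```

Note the awk comparison: shell's `[ -ge ]` is integer-only, so fractional scores like 6.5 need float-aware arithmetic.
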
## Usage

### Via Command

```bash
# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"
```

### Via Shell Script

```bash
# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"

# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
```

### Via Claude Code (Manual)

```bash
# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

# Step 4: Generate (iteration 2 — reads feedback)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

# Repeat steps 3-4 until pass threshold met
```
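The "repeat until threshold" check can be done mechanically. A naive sketch of the score extraction and pass test, simplified from `extract_score`/`score_passes` in `scripts/gan-harness.sh` (the real orchestrator targets more feedback formats):

```shell
# Pull the last decimal number from a feedback file. Naive on purpose:
# the orchestrator's extract_score() targets the **TOTAL** row specifically.
latest_score() {
  grep -oE '[0-9]+\.[0-9]+' "$1" | tail -1
}

# Float comparison via awk: exit 0 if score >= threshold.
passes() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s >= t) }'
}

# Simulate an evaluator-written feedback file and check it.
echo "| **TOTAL** | | | **7.4** |" > feedback-003.md
SCORE=$(latest_score feedback-003.md)
passes "$SCORE" 7.0 && echo "PASS at $SCORE" || echo "retry, score $SCORE"
# → PASS at 7.4
```

Loop steps 3-4, calling `passes` on each new `feedback-NNN.md`, and stop on success or after a fixed iteration cap.
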
## Evolution Across Model Capabilities

The harness should simplify as models improve. Following Anthropic's evolution:

### Stage 1 — Weaker Models (Sonnet-class)
- Full sprint decomposition required
- Context resets between sprints (avoid context anxiety)
- 2-agent minimum: Initializer + Coding Agent
- Heavy scaffolding compensates for model limitations

### Stage 2 — Capable Models (Opus 4.5-class)
- Full 3-agent harness: Planner + Generator + Evaluator
- Sprint contracts before each implementation phase
- 10-sprint decomposition for complex apps
- Context resets still useful but less critical

### Stage 3 — Frontier Models (Opus 4.6-class)
- Simplified harness: single planning pass, continuous generation
- Evaluation reduced to a single end-pass (the model is smarter)
- No sprint structure needed
- Automatic compaction handles context growth

> **Key principle:** Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GAN_MAX_ITERATIONS` | `15` | Maximum generator-evaluator cycles |
| `GAN_PASS_THRESHOLD` | `7.0` | Weighted score to pass (1-10) |
| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
| `GAN_DEV_SERVER_PORT` | `3000` | Port for the live app |
| `GAN_DEV_SERVER_CMD` | `npm run dev` | Command to start dev server |
| `GAN_PROJECT_DIR` | `.` | Project working directory |
| `GAN_SKIP_PLANNER` | `false` | Skip planner, use spec directly |
| `GAN_EVAL_MODE` | `playwright` | `playwright`, `screenshot`, or `code-only` |

### Evaluation Modes

| Mode | Tools | Best For |
|------|-------|----------|
| `playwright` | Browser MCP + live interaction | Full-stack apps with UI |
| `screenshot` | Screenshot + visual analysis | Static sites, design-only |
| `code-only` | Tests + linting + build | APIs, libraries, CLI tools |

## Anti-Patterns

1. **Evaluator too lenient** — If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten the scoring criteria and add explicit penalties for common AI patterns.

2. **Generator ignoring feedback** — Ensure feedback is passed as a file, not inline. The generator should read `feedback-NNN.md` at the start of each iteration.

3. **Infinite loops** — Always set `GAN_MAX_ITERATIONS`. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.

4. **Evaluator testing superficially** — The evaluator must use Playwright to **interact** with the live app, not just screenshot it. Click buttons, fill forms, test error states.

5. **Evaluator praising its own fixes** — Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.

6. **Context exhaustion** — For long sessions, use the Claude Agent SDK's automatic compaction or reset context between major phases.

## Results: What to Expect

Based on Anthropic's published results:

| Metric | Solo Agent | GAN Harness | Improvement |
|--------|-----------|-------------|-------------|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |

**The tradeoff is clear:** roughly 20x more time and cost for a qualitative leap in output quality. Use this pattern for projects where quality matters.

## References

- [Anthropic: Harness Design for Long-Running Apps](https://www.anthropic.com/engineering/harness-design-long-running-apps) — Original paper by Prithvi Rajasekaran
- [Epsilla: The GAN-Style Agent Loop](https://www.epsilla.com/blogs/anthropic-harness-engineering-multi-agent-gan-architecture) — Architecture deconstruction
- [Martin Fowler: Harness Engineering](https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html) — Broader industry context
- [OpenAI: Harness Engineering](https://openai.com/index/harness-engineering/) — OpenAI's parallel work