feat: add GAN-style generator-evaluator harness (#1029)

Implements Anthropic's March 2026 harness design pattern — a multi-agent
architecture that separates generation from evaluation, creating an
adversarial feedback loop intended to produce production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-authored-by: Hao Chen <haochen806@gmail.com>
haochen806
2026-03-31 14:06:20 -07:00
committed by GitHub
parent 0c9b024746
commit 4cdfe709ab
8 changed files with 1276 additions and 0 deletions

commands/gan-build.md Normal file

@@ -0,0 +1,99 @@
Parse the following from $ARGUMENTS:
1. `brief` — the user's one-line description of what to build
2. `--max-iterations N` — (optional, default 15) maximum generator-evaluator cycles
3. `--pass-threshold N` — (optional, default 7.0) weighted score to pass
4. `--skip-planner` — (optional) skip planner, assume spec.md already exists
5. `--eval-mode MODE` — (optional, default "playwright") one of: playwright, screenshot, code-only
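
The command file is interpreted by the agent, so no parser ships with it. As an illustration of the intended semantics only, the flags and defaults listed above could be modelled with `argparse` roughly as follows. The function name `parse_gan_build_args` is hypothetical and not part of the harness.

```python
import argparse
import shlex


def parse_gan_build_args(raw: str) -> argparse.Namespace:
    """Parse a raw $ARGUMENTS string using the defaults listed above.

    Hypothetical helper: the real command leaves parsing to the agent.
    """
    parser = argparse.ArgumentParser(prog="gan-build")
    # The brief may be quoted or given as several bare words.
    parser.add_argument("brief", nargs="+")
    parser.add_argument("--max-iterations", type=int, default=15)
    parser.add_argument("--pass-threshold", type=float, default=7.0)
    parser.add_argument("--skip-planner", action="store_true")
    parser.add_argument(
        "--eval-mode",
        choices=["playwright", "screenshot", "code-only"],
        default="playwright",
    )
    args = parser.parse_args(shlex.split(raw))
    # Re-join the words of the brief into a single string.
    args.brief = " ".join(args.brief)
    return args


if __name__ == "__main__":
    print(parse_gan_build_args('"a kanban board" --max-iterations 5'))
```

In this sketch an unknown `--eval-mode` value is rejected outright rather than silently falling back to `playwright`; whether the agent should behave the same way is a design choice the command file does not state.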
## GAN-Style Harness Build
This command orchestrates a three-agent build loop (planner, generator, evaluator) inspired by Anthropic's March 2026 harness design article.
### Phase 0: Setup
1. Create `gan-harness/` directory in project root
2. Create subdirectories: `gan-harness/feedback/`, `gan-harness/screenshots/`
3. Initialize git if not already initialized
4. Log start time and configuration
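
A minimal sketch of the setup steps above, assuming a Python helper performs them (in the actual command the agent does). The function name `setup_harness`, the `init_git` switch, and the log file name `run-log.json` are assumptions for illustration; the directory names come from the steps above.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def setup_harness(project_root: Path, config: dict, init_git: bool = True) -> Path:
    """Create the gan-harness/ layout and record the start of the run."""
    harness = project_root / "gan-harness"
    for sub in ("feedback", "screenshots"):
        # parents=True also creates gan-harness/ itself on first use.
        (harness / sub).mkdir(parents=True, exist_ok=True)

    # Initialize git only when the project is not already a repository.
    if init_git and not (project_root / ".git").exists():
        subprocess.run(["git", "init"], cwd=project_root, check=True)

    # Hypothetical log location; the command only says "log start time
    # and configuration" without naming a file.
    log = {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "config": config,
    }
    (harness / "run-log.json").write_text(json.dumps(log, indent=2))
    return harness
```

Because every step is idempotent (`exist_ok=True`, the `.git` check), re-running setup on an existing harness directory does not destroy earlier feedback files.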
### Phase 1: Planning (Planner Agent)
Unless `--skip-planner` is set:
1. Launch the `gan-planner` agent via Task tool with the user's brief
2. Wait for it to produce `gan-harness/spec.md` and `gan-harness/eval-rubric.md`
3. Display the spec summary to the user
4. Proceed to Phase 2
### Phase 2: Generator-Evaluator Loop
```
iteration = 1
while iteration <= max_iterations:
    # GENERATE
    Launch gan-generator agent via Task tool:
        - Read spec.md
        - If iteration > 1: read feedback/feedback-{iteration-1}.md
        - Build/improve the application
        - Ensure dev server is running
        - Commit changes
    # Wait for generator to finish

    # EVALUATE
    Launch gan-evaluator agent via Task tool:
        - Read eval-rubric.md and spec.md
        - Test the live application (mode: playwright/screenshot/code-only)
        - Score against rubric
        - Write feedback to feedback/feedback-{iteration}.md
    # Wait for evaluator to finish

    # CHECK SCORE
    Read feedback/feedback-{iteration}.md
    Extract weighted total score
    if score >= pass_threshold:
        Log "PASSED at iteration {iteration} with score {score}"
        Break
    if iteration >= 3 and score has not improved in last 2 iterations:
        Log "PLATEAU detected — stopping early"
        Break
    iteration += 1
```
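
The control flow of the loop above can be sketched in Python with the two agents replaced by plain callables. This is an illustration under stated assumptions, not the harness implementation: `run_loop`, `generate`, and `evaluate` are hypothetical names, and "has not improved in last 2 iterations" is read here as "neither of the last two scores beats the best score before them", which is only one possible interpretation.

```python
from typing import Callable, List, Tuple


def run_loop(
    generate: Callable[[int], None],
    evaluate: Callable[[int], float],
    max_iterations: int = 15,
    pass_threshold: float = 7.0,
) -> Tuple[str, List[float]]:
    """Run generate/evaluate cycles until pass, plateau, or the cap."""
    scores: List[float] = []
    for iteration in range(1, max_iterations + 1):
        generate(iteration)          # stand-in for the gan-generator agent
        score = evaluate(iteration)  # stand-in for the gan-evaluator agent
        scores.append(score)

        if score >= pass_threshold:
            return "PASSED", scores

        # Plateau (assumed reading): the best of the last two scores is
        # no higher than the best score seen before them.
        if iteration >= 3 and max(scores[-2:]) <= max(scores[:-2]):
            return "PLATEAU", scores

    return "MAX_ITERATIONS", scores


if __name__ == "__main__":
    fake_scores = [4.2, 5.8, 7.5]
    result = run_loop(lambda i: None, lambda i: fake_scores[i - 1])
    print(result)
```

Passing the agents in as callables keeps the stopping logic testable without launching anything; the pass check runs before the plateau check, matching the order in the pseudocode.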
### Phase 3: Summary
1. Read all feedback files
2. Display final scores and iteration history
3. Show score progression: `iteration 1: 4.2 → iteration 2: 5.8 → ... → iteration N: 7.5`
4. List any remaining issues from the final evaluation
5. Report total time and estimated cost
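
As a small illustration of step 3, the score progression line could be produced from a list of per-iteration totals like this. The function name `format_progression` is hypothetical, and one decimal place is assumed from the example in that step.

```python
from typing import Sequence


def format_progression(scores: Sequence[float]) -> str:
    """Render totals as 'iteration 1: 4.2 → iteration 2: 5.8 → ...'."""
    return " → ".join(
        f"iteration {i}: {score:.1f}"
        for i, score in enumerate(scores, start=1)
    )


if __name__ == "__main__":
    print(format_progression([4.2, 5.8, 7.5]))
```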
### Output
```markdown
## GAN Harness Build Report
**Brief:** [original prompt]
**Result:** PASS/FAIL
**Iterations:** N / max
**Final Score:** X.X / 10
### Score Progression
| Iter | Design | Originality | Craft | Functionality | Total |
|------|--------|-------------|-------|---------------|-------|
| 1 | ... | ... | ... | ... | X.X |
| 2 | ... | ... | ... | ... | X.X |
| N | ... | ... | ... | ... | X.X |
### Remaining Issues
- [Any issues from final evaluation]
### Files Created
- gan-harness/spec.md
- gan-harness/eval-rubric.md
- gan-harness/feedback/feedback-001.md through feedback-NNN.md
- gan-harness/generator-state.md
- gan-harness/build-report.md
```
Write the full report to `gan-harness/build-report.md`.

commands/gan-design.md Normal file

@@ -0,0 +1,35 @@
Parse the following from $ARGUMENTS:
1. `brief` — the user's description of the design to create
2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles
3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design)
## GAN-Style Design Harness
A two-agent loop (Generator + Evaluator) focused on frontend design quality. No planner — the brief IS the spec.
This is the same mode Anthropic used for their frontend design experiments, where they saw creative breakthroughs like the 3D Dutch art museum with CSS perspective and doorway navigation.
### Setup
1. Create `gan-harness/` directory
2. Write the brief directly as `gan-harness/spec.md`
3. Write a design-focused `gan-harness/eval-rubric.md` with extra weight on Design Quality and Originality
### Design-Specific Eval Rubric
```markdown
### Design Quality (weight: 0.35)
### Originality (weight: 0.30)
### Craft (weight: 0.25)
### Functionality (weight: 0.10)
```
Note: Originality is weighted higher here (0.30, up from 0.20) to push for creative breakthroughs. Functionality is weighted lower because design mode focuses on visual quality.
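
To make the weighting concrete, here is a sketch of how a weighted total could be computed from the four criteria. It assumes each criterion is scored on a 0 to 10 scale and that the total is a plain weighted sum; the names `DESIGN_WEIGHTS` and `weighted_total` are hypothetical.

```python
from typing import Dict

# Weights from the design-specific rubric above; they sum to 1.0.
DESIGN_WEIGHTS: Dict[str, float] = {
    "design": 0.35,
    "originality": 0.30,
    "craft": 0.25,
    "functionality": 0.10,
}


def weighted_total(
    scores: Dict[str, float],
    weights: Dict[str, float] = DESIGN_WEIGHTS,
) -> float:
    """Return the weighted sum of per-criterion scores (assumed 0-10)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    return sum(weight * scores[name] for name, weight in weights.items())


if __name__ == "__main__":
    example = {"design": 8, "originality": 7, "craft": 6, "functionality": 9}
    # 0.35*8 + 0.30*7 + 0.25*6 + 0.10*9 = 2.8 + 2.1 + 1.5 + 0.9
    print(round(weighted_total(example), 2))
```

With these example scores the total comes to about 7.3, which falls short of the 7.5 default threshold for design mode even though functionality scored 9: the low functionality weight means strong features cannot compensate for middling craft.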
### Loop
Same as `/project:gan-build` Phase 2, but:
- Skip the planner
- Use the design-focused rubric
- Generator prompt emphasizes visual quality over feature completeness
- Evaluator prompt emphasizes "would this win a design award?" over "do all features work?"
### Key Difference from gan-build
The Generator is told: "Your PRIMARY goal is visual excellence. A stunning half-finished app beats a functional ugly one. Push for creative leaps — unusual layouts, custom animations, distinctive color work."