feat: add GAN-style generator-evaluator harness (#1029)

Implements Anthropic's March 2026 harness design pattern: a multi-agent architecture that separates generation from evaluation, creating an adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-authored-by: Hao Chen <haochen806@gmail.com>
2026-05-17 06:13:08 +08:00 · 2026-03-31 14:06:20 -07:00
parent 0c9b024746
commit 4cdfe709ab
8 changed files with 1276 additions and 0 deletions
--- a/commands/gan-build.md
+++ b/commands/gan-build.md
@@ -0,0 +1,99 @@
+Parse the following from $ARGUMENTS:
+1. `brief` — the user's one-line description of what to build
+2. `--max-iterations N` — (optional, default 15) maximum generator-evaluator cycles
+3. `--pass-threshold N` — (optional, default 7.0) weighted score to pass
+4. `--skip-planner` — (optional) skip planner, assume spec.md already exists
+5. `--eval-mode MODE` — (optional, default "playwright") one of: playwright, screenshot, code-only
+
+## GAN-Style Harness Build
+
+This command orchestrates a three-agent build loop inspired by Anthropic's March 2026 harness design article.
+
+### Phase 0: Setup
+1. Create `gan-harness/` directory in project root
+2. Create subdirectories: `gan-harness/feedback/`, `gan-harness/screenshots/`
+3. Initialize git if not already initialized
+4. Log start time and configuration
+
+### Phase 1: Planning (Planner Agent)
+Unless `--skip-planner` is set:
+1. Launch the `gan-planner` agent via Task tool with the user's brief
+2. Wait for it to produce `gan-harness/spec.md` and `gan-harness/eval-rubric.md`
+3. Display the spec summary to the user
+4. Proceed to Phase 2
+
+### Phase 2: Generator-Evaluator Loop
+```
+iteration = 1
+while iteration <= max_iterations:
+
+    # GENERATE
+    Launch gan-generator agent via Task tool:
+    - Read spec.md
+    - If iteration > 1: read feedback/feedback-{iteration-1}.md
+    - Build/improve the application
+    - Ensure dev server is running
+    - Commit changes
+
+    # Wait for generator to finish
+
+    # EVALUATE
+    Launch gan-evaluator agent via Task tool:
+    - Read eval-rubric.md and spec.md
+    - Test the live application (mode: playwright/screenshot/code-only)
+    - Score against rubric
+    - Write feedback to feedback/feedback-{iteration}.md
+
+    # Wait for evaluator to finish
+
+    # CHECK SCORE
+    Read feedback/feedback-{iteration}.md
+    Extract weighted total score
+
+    if score >= pass_threshold:
+        Log "PASSED at iteration {iteration} with score {score}"
+        Break
+
+    if iteration >= 3 and score has not improved in last 2 iterations:
+        Log "PLATEAU detected — stopping early"
+        Break
+
+    iteration += 1
+```
+
+### Phase 3: Summary
+1. Read all feedback files
+2. Display final scores and iteration history
+3. Show score progression: `iteration 1: 4.2 → iteration 2: 5.8 → ... → iteration N: 7.5`
+4. List any remaining issues from the final evaluation
+5. Report total time and estimated cost
+
+### Output
+
+```markdown
+## GAN Harness Build Report
+
+**Brief:** [original prompt]
+**Result:** PASS/FAIL
+**Iterations:** N / max
+**Final Score:** X.X / 10
+
+### Score Progression
+| Iter | Design | Originality | Craft | Functionality | Total |
+|------|--------|-------------|-------|---------------|-------|
+| 1 | ... | ... | ... | ... | X.X |
+| 2 | ... | ... | ... | ... | X.X |
+| N | ... | ... | ... | ... | X.X |
+
+### Remaining Issues
+- [Any issues from final evaluation]
+
+### Files Created
+- gan-harness/spec.md
+- gan-harness/eval-rubric.md
+- gan-harness/feedback/feedback-001.md through feedback-NNN.md (iteration number zero-padded to three digits)
+- gan-harness/generator-state.md
+- gan-harness/build-report.md
+```
+
+Write the full report to `gan-harness/build-report.md`.
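The stopping rules in the Phase 2 pseudocode (pass threshold, plateau, iteration cap) can be sketched in Python roughly as follows. This is a minimal illustration, not the harness's actual implementation: `run_iteration` is a hypothetical callable standing in for one generator run plus one evaluator run, and `plateaued` encodes one assumed reading of "score has not improved in last 2 iterations", namely that neither of the two most recent scores beats the best score seen before them.

```python
from typing import Callable, List, Tuple


def plateaued(history: List[float]) -> bool:
    """Assumed plateau rule: from iteration 3 onward, report a plateau when
    neither of the last two scores exceeds the best earlier score."""
    if len(history) < 3:
        return False
    return max(history[-2:]) <= max(history[:-2])


def run_loop(
    run_iteration: Callable[[int], float],
    max_iterations: int = 15,
    pass_threshold: float = 7.0,
) -> Tuple[str, List[float]]:
    """Run generate/evaluate cycles until pass, plateau, or the iteration cap.

    `run_iteration` is a hypothetical stand-in: it receives the 1-based
    iteration number and returns that iteration's weighted total score.
    """
    history: List[float] = []
    for iteration in range(1, max_iterations + 1):
        score = run_iteration(iteration)
        history.append(score)
        # Check the pass condition first, mirroring the pseudocode order.
        if score >= pass_threshold:
            return "PASS", history
        if plateaued(history):
            return "PLATEAU", history
    # The iteration cap was reached without passing.
    return "FAIL", history
```

Calling `run_loop` with a stub that returns the scores 4.2, 5.8, and 7.5 in turn stops at the third iteration with status `"PASS"`, and the returned history is what Phase 3 would use for the score progression line. A stricter plateau rule (for example, requiring two consecutive non-increasing scores) would only change the body of `plateaued`.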
--- a/commands/gan-design.md
+++ b/commands/gan-design.md
@@ -0,0 +1,35 @@
+Parse the following from $ARGUMENTS:
+1. `brief` — the user's description of the design to create
+2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles
+3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design)
+
+## GAN-Style Design Harness
+
+A two-agent loop (Generator + Evaluator) focused on frontend design quality. No planner — the brief IS the spec.
+
+This is the same mode Anthropic used for their frontend design experiments, where they saw creative breakthroughs like the 3D Dutch art museum with CSS perspective and doorway navigation.
+
+### Setup
+1. Create `gan-harness/` directory
+2. Write the brief directly as `gan-harness/spec.md`
+3. Write a design-focused `gan-harness/eval-rubric.md` with extra weight on Design Quality and Originality
+
+### Design-Specific Eval Rubric
+```markdown
+### Design Quality (weight: 0.35)
+### Originality (weight: 0.30)
+### Craft (weight: 0.25)
+### Functionality (weight: 0.10)
+```
+
+Note: Originality weight is higher than the default (0.30 vs 0.20) to push for creative breakthroughs. Functionality weight is lower since design mode focuses on visual quality.
+
+### Loop
+Same as `/project:gan-build` Phase 2, but:
+- Skip the planner
+- Use the design-focused rubric
+- Generator prompt emphasizes visual quality over feature completeness
+- Evaluator prompt emphasizes "would this win a design award?" over "do all features work?"
+
+### Key Difference from gan-build
+The Generator is told: "Your PRIMARY goal is visual excellence. A stunning half-finished app beats a functional ugly one. Push for creative leaps — unusual layouts, custom animations, distinctive color work."
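The design rubric weights listed earlier (0.35, 0.30, 0.25, 0.10) sum to 1.0, so the weighted score compared against `--pass-threshold` can plausibly be computed as a plain weighted sum. The sketch below assumes that reading and assumes each criterion is scored on a 0 to 10 scale; `DESIGN_WEIGHTS` and `weighted_total` are illustrative names, not identifiers from the harness.

```python
import math
from typing import Dict

# Weights copied from the design-specific rubric; the key names are illustrative.
DESIGN_WEIGHTS: Dict[str, float] = {
    "design": 0.35,
    "originality": 0.30,
    "craft": 0.25,
    "functionality": 0.10,
}


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of per-criterion scores (assumed 0-10 scale)."""
    # Guard against a rubric whose weights do not form a proper average.
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("rubric weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)
```

For example, criterion scores of 8 (design), 7 (originality), 8 (craft), and 6 (functionality) work out to 2.8 + 2.1 + 2.0 + 0.6 = 7.5, which sits right at the default design threshold. Because the result is a float, a comparison at the exact boundary may be sensitive to rounding.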