Mirror of https://github.com/affaan-m/everything-claude-code.git (synced 2026-04-01 22:53:27 +08:00)
feat: add GAN-style generator-evaluator harness (#1029)
Implements Anthropic's March 2026 harness design pattern: a multi-agent architecture that separates generation from evaluation, creating an adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-authored-by: Hao Chen <haochen806@gmail.com>
99
commands/gan-build.md
Normal file
@@ -0,0 +1,99 @@
Parse the following from $ARGUMENTS:
1. `brief`: the user's one-line description of what to build
2. `--max-iterations N`: (optional, default 15) maximum generator-evaluator cycles
3. `--pass-threshold N`: (optional, default 7.0) weighted score required to pass
4. `--skip-planner`: (optional) skip the planner and assume spec.md already exists
5. `--eval-mode MODE`: (optional, default "playwright") one of: playwright, screenshot, code-only
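
The argument list above could be parsed along these lines. This is only a sketch: the command itself is interpreted by the agent, not by a script, and the function name `parse_gan_build_args` is hypothetical. It assumes the brief is either quoted or written as consecutive words.

```python
import argparse
import shlex


def parse_gan_build_args(raw: str) -> argparse.Namespace:
    """Parse a raw $ARGUMENTS string into the five documented options."""
    parser = argparse.ArgumentParser(prog="gan-build")
    # The brief may arrive quoted (one token) or as several consecutive words.
    parser.add_argument("brief", nargs="+")
    parser.add_argument("--max-iterations", type=int, default=15)
    parser.add_argument("--pass-threshold", type=float, default=7.0)
    parser.add_argument("--skip-planner", action="store_true")
    parser.add_argument(
        "--eval-mode",
        choices=["playwright", "screenshot", "code-only"],
        default="playwright",
    )
    args = parser.parse_args(shlex.split(raw))
    args.brief = " ".join(args.brief)  # rejoin an unquoted brief
    return args


if __name__ == "__main__":
    print(parse_gan_build_args('"Build a todo app" --max-iterations 5'))
```

Omitted flags fall back to the documented defaults, so a bare brief yields 15 iterations, a 7.0 threshold, and `playwright` evaluation.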

## GAN-Style Harness Build

This command orchestrates a three-agent build loop inspired by Anthropic's March 2026 harness design article.

### Phase 0: Setup
1. Create `gan-harness/` directory in project root
2. Create subdirectories: `gan-harness/feedback/`, `gan-harness/screenshots/`
3. Initialize git if not already initialized
4. Log start time and configuration
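
As a rough illustration of the setup steps above, the following sketch creates the directories, initializes git on a best-effort basis, and appends a log line. The function name `setup_harness` and the log file `run.log` are assumptions, not part of the command. The `.git` check only looks at the project root, which is a simplification.

```python
import shutil
import subprocess
import time
from pathlib import Path


def setup_harness(project_root: Path, config: dict) -> Path:
    """Create the gan-harness/ layout and record the run configuration."""
    harness = project_root / "gan-harness"
    for sub in ("feedback", "screenshots"):
        (harness / sub).mkdir(parents=True, exist_ok=True)

    # Best-effort git init: only when no repo exists at the root and git is installed.
    if not (project_root / ".git").exists() and shutil.which("git"):
        subprocess.run(
            ["git", "init"], cwd=project_root, check=False, capture_output=True
        )

    # Log start time and configuration (run.log is a hypothetical file name).
    started = time.strftime("%Y-%m-%dT%H:%M:%S%z")
    with open(harness / "run.log", "a", encoding="utf-8") as log:
        log.write(f"start={started} config={config}\n")
    return harness
```

Because every step tolerates existing state, calling the function twice on the same project is harmless.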

### Phase 1: Planning (Planner Agent)
Unless `--skip-planner` is set:
1. Launch the `gan-planner` agent via the Task tool with the user's brief
2. Wait for it to produce `gan-harness/spec.md` and `gan-harness/eval-rubric.md`
3. Display the spec summary to the user
4. Proceed to Phase 2

### Phase 2: Generator-Evaluator Loop
```
iteration = 1
while iteration <= max_iterations:

    # GENERATE
    Launch gan-generator agent via Task tool:
        - Read spec.md
        - If iteration > 1: read feedback/feedback-{iteration-1}.md
        - Build/improve the application
        - Ensure dev server is running
        - Commit changes

    # Wait for generator to finish

    # EVALUATE
    Launch gan-evaluator agent via Task tool:
        - Read eval-rubric.md and spec.md
        - Test the live application (mode: playwright/screenshot/code-only)
        - Score against rubric
        - Write feedback to feedback/feedback-{iteration}.md

    # Wait for evaluator to finish

    # CHECK SCORE
    Read feedback/feedback-{iteration}.md
    Extract weighted total score

    if score >= pass_threshold:
        Log "PASSED at iteration {iteration} with score {score}"
        Break

    if iteration >= 3 and score has not improved in last 2 iterations:
        Log "PLATEAU detected: stopping early"
        Break

    iteration += 1
```
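
The control flow of the loop above can be sketched in runnable form with the two agents replaced by plain callables. `run_loop`, `plateaued`, and the status strings are hypothetical names. The plateau rule is one reading of "has not improved in last 2 iterations": neither of the last two scores exceeds the best score seen before them.

```python
from typing import Callable, List, Tuple


def plateaued(scores: List[float]) -> bool:
    """True when neither of the last two scores beats the best earlier score."""
    if len(scores) < 3:  # mirrors the `iteration >= 3` guard
        return False
    return max(scores[-2:]) <= max(scores[:-2])


def run_loop(
    generate: Callable[[int], None],
    evaluate: Callable[[int], float],
    max_iterations: int = 15,
    pass_threshold: float = 7.0,
) -> Tuple[str, List[float]]:
    """Alternate generator and evaluator until pass, plateau, or the cap."""
    scores: List[float] = []
    for iteration in range(1, max_iterations + 1):
        generate(iteration)  # stand-in for the gan-generator agent
        scores.append(evaluate(iteration))  # stand-in for the gan-evaluator agent
        if scores[-1] >= pass_threshold:
            return "PASSED", scores
        if plateaued(scores):
            return "PLATEAU", scores
    return "MAX_ITERATIONS", scores


if __name__ == "__main__":
    fake_scores = iter([4.2, 5.8, 7.5])
    print(run_loop(lambda i: None, lambda i: next(fake_scores)))
```

Keeping the stop conditions in a pure function makes the plateau rule easy to test independently of any agent calls.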

### Phase 3: Summary
1. Read all feedback files
2. Display final scores and iteration history
3. Show score progression: `iteration 1: 4.2 → iteration 2: 5.8 → ... → iteration N: 7.5`
4. List any remaining issues from the final evaluation
5. Report total time and estimated cost
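
One way to build the score progression from the feedback files is sketched below. It assumes each feedback file contains a line such as `Weighted Total: 7.5`; that line format, the regular expression, and the function names are all assumptions, since the evaluator's output format is not specified here. Files are ordered by the number in their name so that both `feedback-2.md` and `feedback-002.md` sort correctly.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical score line; the real evaluator output may look different.
SCORE_LINE = re.compile(r"weighted total[^0-9]*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)


def extract_score(text: str) -> Optional[float]:
    """Return the first weighted total found in a feedback file, if any."""
    match = SCORE_LINE.search(text)
    return float(match.group(1)) if match else None


def iteration_number(path: Path) -> int:
    """Pull the iteration number out of a name like feedback-003.md."""
    return int(re.search(r"(\d+)", path.stem).group(1))


def score_progression(feedback_dir: Path) -> str:
    """Format scores as 'iteration 1: 4.2 → iteration 2: 5.8 → ...'."""
    parts = []
    for path in sorted(feedback_dir.glob("feedback-*.md"), key=iteration_number):
        score = extract_score(path.read_text(encoding="utf-8"))
        shown = "?" if score is None else str(score)
        parts.append(f"iteration {iteration_number(path)}: {shown}")
    return " → ".join(parts)
```

A file with no recognizable score is shown as `?` instead of failing the whole summary.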

### Output

```markdown
## GAN Harness Build Report

**Brief:** [original prompt]
**Result:** PASS/FAIL
**Iterations:** N / max
**Final Score:** X.X / 10

### Score Progression
| Iter | Design | Originality | Craft | Functionality | Total |
|------|--------|-------------|-------|---------------|-------|
| 1 | ... | ... | ... | ... | X.X |
| 2 | ... | ... | ... | ... | X.X |
| N | ... | ... | ... | ... | X.X |

### Remaining Issues
- [Any issues from final evaluation]

### Files Created
- gan-harness/spec.md
- gan-harness/eval-rubric.md
- gan-harness/feedback/feedback-001.md through feedback-NNN.md
- gan-harness/generator-state.md
- gan-harness/build-report.md
```

Write the full report to `gan-harness/build-report.md`.
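
The Score Progression table in the report template could be rendered from per-iteration scores roughly as follows. The function name `render_score_table` and the dictionary layout are hypothetical; only the column names come from the template.

```python
from typing import Dict, List

COLUMNS = ["Design", "Originality", "Craft", "Functionality", "Total"]


def render_score_table(rows: List[Dict[str, float]]) -> str:
    """Render one markdown table row per iteration, in iteration order."""
    header = "| Iter | " + " | ".join(COLUMNS) + " |"
    divider = "|------|" + "|".join("-" * (len(c) + 2) for c in COLUMNS) + "|"
    lines = [header, divider]
    for i, row in enumerate(rows, start=1):
        cells = " | ".join(f"{row[c]:.1f}" for c in COLUMNS)
        lines.append(f"| {i} | {cells} |")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"Design": 4, "Originality": 3, "Craft": 5, "Functionality": 5, "Total": 4.2}
    print(render_score_table([sample]))
```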
35
commands/gan-design.md
Normal file
@@ -0,0 +1,35 @@
Parse the following from $ARGUMENTS:
1. `brief`: the user's description of the design to create
2. `--max-iterations N`: (optional, default 10) maximum design-evaluate cycles
3. `--pass-threshold N`: (optional, default 7.5) weighted score required to pass (higher default for design)

## GAN-Style Design Harness

A two-agent loop (Generator + Evaluator) focused on frontend design quality. There is no planner: the brief IS the spec.

This is the same mode Anthropic used for their frontend design experiments, where they saw creative breakthroughs like the 3D Dutch art museum with CSS perspective and doorway navigation.

### Setup
1. Create `gan-harness/` directory
2. Write the brief directly as `gan-harness/spec.md`
3. Write a design-focused `gan-harness/eval-rubric.md` with extra weight on Design Quality and Originality

### Design-Specific Eval Rubric
```markdown
### Design Quality (weight: 0.35)
### Originality (weight: 0.30)
### Craft (weight: 0.25)
### Functionality (weight: 0.10)
```

Note: Originality weight is higher (0.30 vs 0.20) to push for creative breakthroughs. Functionality weight is lower since design mode focuses on visual quality.
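
To make the effect of these weights concrete, here is a small sketch of how a weighted total could be computed. It assumes each criterion is scored on the same 0 to 10 scale as the final score; the dictionary keys and the function name `weighted_total` are hypothetical.

```python
from typing import Dict

# Weights taken from the design rubric above; they sum to 1.0.
DESIGN_WEIGHTS: Dict[str, float] = {
    "design": 0.35,
    "originality": 0.30,
    "craft": 0.25,
    "functionality": 0.10,
}


def weighted_total(
    scores: Dict[str, float], weights: Dict[str, float] = DESIGN_WEIGHTS
) -> float:
    """Combine per-criterion scores into one weighted score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return round(sum(weights[name] * scores[name] for name in weights), 2)


if __name__ == "__main__":
    example = {"design": 8, "originality": 7, "craft": 8, "functionality": 6}
    # 0.35*8 + 0.30*7 + 0.25*8 + 0.10*6 = 2.8 + 2.1 + 2.0 + 0.6 = 7.5
    print(weighted_total(example))
```

With these weights, a functionality score of 6 costs little: the example still reaches the default design pass threshold of 7.5.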

### Loop
Same as `/project:gan-build` Phase 2, but:
- Skip the planner
- Use the design-focused rubric
- Generator prompt emphasizes visual quality over feature completeness
- Evaluator prompt emphasizes "would this win a design award?" over "do all features work?"

### Key Difference from gan-build
The Generator is told: "Your PRIMARY goal is visual excellence. A stunning half-finished app beats a functional ugly one. Push for creative leaps: unusual layouts, custom animations, distinctive color work."