feat: add 6 gap-closing skills — browser QA, design system, product lens, canary watch, benchmark, safety guard

Closes competitive gaps with gstack:
- browser-qa: automated visual testing via browser MCP
- design-system: generate, audit, and detect AI slop in UI
- product-lens: product diagnostic, founder review, feature prioritization
- canary-watch: post-deploy monitoring with alert thresholds
- benchmark: performance baseline and regression detection
- safety-guard: prevent destructive operations in autonomous sessions

Author: Affaan Mustafa
Date: 2026-03-23 04:31:10 -07:00
Parent: bacc585b87
Commit: df4f2df297
6 changed files with 485 additions and 0 deletions

skills/benchmark/SKILL.md

@@ -0,0 +1,87 @@
# Benchmark — Performance Baseline & Regression Detection
## When to Use
- Before and after a PR to measure performance impact
- Setting up performance baselines for a project
- When users report "it feels slow"
- Before a launch — ensure you meet performance targets
- Comparing your stack against alternatives
## How It Works
### Mode 1: Page Performance
Measures real browser metrics via browser MCP:
```
1. Navigate to each target URL
2. Measure Core Web Vitals:
- LCP (Largest Contentful Paint) — target < 2.5s
- CLS (Cumulative Layout Shift) — target < 0.1
- INP (Interaction to Next Paint) — target < 200ms
- FCP (First Contentful Paint) — target < 1.8s
- TTFB (Time to First Byte) — target < 800ms
3. Measure resource sizes:
- Total page weight (target < 1MB)
- JS bundle size (target < 200KB gzipped)
- CSS size
- Image weight
- Third-party script weight
4. Count network requests
5. Check for render-blocking resources
```
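A minimal sketch of the Mode 1 measurement, assuming Playwright drives the browser directly rather than an MCP server; the URL and the 500ms settle delay are placeholders, and CLS/INP would need their own observers:
```ts
// Sketch only: collect TTFB, FCP, LCP, transfer weight, and request count for one URL.
// Assumes `npm i -D playwright`; the thresholds listed above are applied by the caller.
import { chromium } from "playwright";

async function measurePage(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });

  const metrics = await page.evaluate(
    () =>
      new Promise<Record<string, number>>((resolve) => {
        const nav = performance.getEntriesByType("navigation")[0] as PerformanceNavigationTiming;
        const fcp = performance.getEntriesByName("first-contentful-paint")[0];
        const resources = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
        let lcp = 0;
        // LCP is only exposed through an observer; `buffered: true` replays earlier entries.
        new PerformanceObserver((list) => {
          for (const entry of list.getEntries()) lcp = entry.startTime;
        }).observe({ type: "largest-contentful-paint", buffered: true });
        setTimeout(() => {
          resolve({
            ttfbMs: Math.round(nav.responseStart - nav.requestStart),
            fcpMs: Math.round(fcp?.startTime ?? 0),
            lcpMs: Math.round(lcp),
            transferKB: Math.round(resources.reduce((sum, r) => sum + r.transferSize, 0) / 1024),
            requests: resources.length,
          });
        }, 500); // give buffered observer entries a moment to arrive
      })
  );

  await browser.close();
  return metrics;
}

measurePage("http://localhost:3000").then(console.log);
```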
### Mode 2: API Performance
Benchmarks API endpoints:
```
1. Hit each endpoint 100 times
2. Measure: p50, p95, p99 latency
3. Track: response size, status codes
4. Test under load: 10 concurrent requests
5. Compare against SLA targets
```
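A rough sketch of the latency sampling, assuming Node 18+ for the global `fetch`; the endpoint and sample count are placeholders, and the load step would repeat this with `Promise.all` batches of 10:
```ts
// Sketch only: sample an endpoint and report p50/p95/p99 latency plus status codes.
async function benchEndpoint(url: string, samples = 100) {
  const latencies: number[] = [];
  const statuses: Record<number, number> = {};

  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    const res = await fetch(url);
    await res.arrayBuffer(); // include body download in the timing
    latencies.push(performance.now() - start);
    statuses[res.status] = (statuses[res.status] ?? 0) + 1;
  }

  latencies.sort((a, b) => a - b);
  const pct = (p: number) => latencies[Math.ceil((p / 100) * latencies.length) - 1];

  return {
    p50: `${pct(50).toFixed(1)}ms`,
    p95: `${pct(95).toFixed(1)}ms`,
    p99: `${pct(99).toFixed(1)}ms`,
    statuses,
  };
}

benchEndpoint("http://localhost:3000/api/health").then(console.log);
```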
### Mode 3: Build Performance
Measures development feedback loop:
```
1. Cold build time
2. Hot reload time (HMR)
3. Test suite duration
4. TypeScript check time
5. Lint time
6. Docker build time
```
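A bare-bones sketch of Mode 3, timing each step via a child process; the commands are placeholders for your project's actual scripts:
```ts
// Sketch only: time each feedback-loop step. Swap the commands for your project's scripts.
import { execSync } from "node:child_process";

const steps: Record<string, string> = {
  "cold build": "npm run build",
  "type check": "npx tsc --noEmit",
  "test suite": "npm test",
  "lint": "npm run lint",
};

for (const [name, cmd] of Object.entries(steps)) {
  const start = Date.now();
  execSync(cmd, { stdio: "ignore" }); // throws (and stops the run) if a step fails
  console.log(`${name.padEnd(12)} ${((Date.now() - start) / 1000).toFixed(1)}s`);
}
```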
### Mode 4: Before/After Comparison
Run before and after a change to measure impact:
```
/benchmark baseline # saves current metrics
# ... make changes ...
/benchmark compare # compares against baseline
```
Output:
```
| Metric | Before | After | Delta | Verdict |
|--------|--------|-------|-------|---------|
| LCP | 1.2s | 1.4s | +200ms | ⚠ WARN |
| Bundle | 180KB | 175KB | -5KB | ✓ BETTER |
| Build | 12s | 14s | +2s | ⚠ WARN |
```
## Output
Stores baselines in `.ecc/benchmarks/` as JSON. Git-tracked so the team shares baselines.
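One plausible shape for a stored baseline; the schema and field names are illustrative, not a contract:
```json
{
  "capturedAt": "2026-03-23T04:31:10-07:00",
  "commit": "df4f2df297",
  "page": { "lcpMs": 1200, "clsScore": 0.02, "inpMs": 89, "bundleKBGzipped": 180, "requests": 34 },
  "api": { "/api/health": { "p50Ms": 120, "p95Ms": 310 } },
  "build": { "coldBuildS": 12, "testSuiteS": 41 }
}
```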
## Integration
- CI: run `/benchmark compare` on every PR
- Pair with `/canary-watch` for post-deploy monitoring
- Pair with `/browser-qa` for full pre-ship checklist

skills/browser-qa/SKILL.md

@@ -0,0 +1,81 @@
# Browser QA — Automated Visual Testing & Interaction
## When to Use
- After deploying a feature to staging/preview
- When you need to verify UI behavior across pages
- Before shipping — confirm layouts, forms, interactions actually work
- When reviewing PRs that touch frontend code
- Accessibility audits and responsive testing
## How It Works
Uses the browser automation MCP (claude-in-chrome, Playwright, or Puppeteer) to interact with live pages like a real user.
### Phase 1: Smoke Test
```
1. Navigate to target URL
2. Check for console errors (filter noise: analytics, third-party)
3. Verify no 4xx/5xx in network requests
4. Screenshot above-the-fold on desktop + mobile viewport
5. Check Core Web Vitals: LCP < 2.5s, CLS < 0.1, INP < 200ms
```
### Phase 2: Interaction Test
```
1. Click every nav link — verify no dead links
2. Submit forms with valid data — verify success state
3. Submit forms with invalid data — verify error state
4. Test auth flow: login → protected page → logout
5. Test critical user journeys (checkout, onboarding, search)
```
### Phase 3: Visual Regression
```
1. Screenshot key pages at 3 breakpoints (375px, 768px, 1440px)
2. Compare against baseline screenshots (if stored)
3. Flag layout shifts > 5px, missing elements, overflow
4. Check dark mode if applicable
```
### Phase 4: Accessibility
```
1. Run axe-core or equivalent on each page
2. Flag WCAG AA violations (contrast, labels, focus order)
3. Verify keyboard navigation works end-to-end
4. Check screen reader landmarks
```
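A minimal sketch of the accessibility pass using `@axe-core/playwright` (an assumption; the skill can equally run axe through the browser MCP). The base URL and page list are placeholders:
```ts
// Sketch only: run axe's WCAG A/AA rules against a few pages and list violations.
// Assumes `npm i -D playwright @axe-core/playwright`.
import { chromium } from "playwright";
import AxeBuilder from "@axe-core/playwright";

async function a11yPass(baseUrl: string, paths: string[]) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  for (const path of paths) {
    await page.goto(baseUrl + path);
    const results = await new AxeBuilder({ page })
      .withTags(["wcag2a", "wcag2aa"]) // limit to WCAG A/AA rules
      .analyze();
    for (const v of results.violations) {
      console.log(`${path} [${v.impact}] ${v.id}: ${v.nodes.length} node(s)`);
    }
  }

  await browser.close();
}

a11yPass("http://localhost:3000", ["/", "/pricing", "/docs"]).catch(console.error);
```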
## Output Format
```markdown
## QA Report — [URL] — [timestamp]
### Smoke Test
- Console errors: 0 critical, 2 warnings (analytics noise)
- Network: all 200/304, no failures
- Core Web Vitals: LCP 1.2s ✓, CLS 0.02 ✓, INP 89ms ✓
### Interactions
- [✓] Nav links: 12/12 working
- [✗] Contact form: missing error state for invalid email
- [✓] Auth flow: login/logout working
### Visual
- [✗] Hero section overflows on 375px viewport
- [✓] Dark mode: all pages consistent
### Accessibility
- 2 AA violations: missing alt text on hero image, low contrast on footer links
### Verdict: SHIP WITH FIXES (2 issues, 0 blockers)
```
## Integration
Works with any browser MCP:
- `mcp__claude-in-chrome__*` tools (preferred — uses your actual Chrome)
- Playwright via `mcp__browserbase__*`
- Direct Puppeteer scripts
Pair with `/canary-watch` for post-deploy monitoring.

skills/canary-watch/SKILL.md

@@ -0,0 +1,93 @@
# Canary Watch — Post-Deploy Monitoring
## When to Use
- After deploying to production or staging
- After merging a risky PR
- When you want to verify a fix actually fixed it
- Continuous monitoring during a launch window
- After dependency upgrades
## How It Works
Monitors a deployed URL for regressions. Runs in a loop until stopped or until the watch window expires.
### What It Watches
```
1. HTTP Status — is the page returning 200?
2. Console Errors — new errors that weren't there before?
3. Network Failures — failed API calls, 5xx responses?
4. Performance — LCP/CLS/INP regression vs baseline?
5. Content — did key elements disappear? (h1, nav, footer, CTA)
6. API Health — are critical endpoints responding within SLA?
```
### Watch Modes
**Quick check** (default): single pass, report results
```
/canary-watch https://myapp.com
```
**Sustained watch**: check every N minutes for M hours
```
/canary-watch https://myapp.com --interval 5m --duration 2h
```
**Diff mode**: compare staging vs production
```
/canary-watch --compare https://staging.myapp.com https://myapp.com
```
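A stripped-down sketch of the sustained watch, assuming a plain HTTP probe; the full skill also pulls browser metrics and applies the thresholds below. Run it as an ES module on Node 18+:
```ts
// Sketch only: probe the URL on an interval until the window expires,
// flagging any non-200 immediately. URL, interval, and duration are placeholders.
const url = "https://myapp.com";
const intervalMs = 5 * 60 * 1000;              // --interval 5m
const endAt = Date.now() + 2 * 60 * 60 * 1000; // --duration 2h

async function probe(): Promise<void> {
  const start = performance.now();
  try {
    const res = await fetch(url, { redirect: "follow" });
    const ms = Math.round(performance.now() - start);
    const line = `${new Date().toISOString()} ${res.status} ${ms}ms`;
    console.log(res.status === 200 ? `ok       ${line}` : `CRITICAL ${line}`);
  } catch (err) {
    console.error(`CRITICAL ${new Date().toISOString()} fetch failed: ${err}`);
  }
}

while (Date.now() < endAt) {
  await probe();
  await new Promise((r) => setTimeout(r, intervalMs));
}
```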
### Alert Thresholds
```yaml
critical: # immediate alert
- HTTP status != 200
- Console error count > 5 (new errors only)
- LCP > 4s
- API endpoint returns 5xx
warning: # flag in report
- LCP increased > 500ms from baseline
- CLS > 0.1
- New console warnings
- Response time > 2x baseline
info: # log only
- Minor performance variance
- New network requests (third-party scripts added?)
```
### Notifications
When a critical threshold is crossed:
- Desktop notification (macOS/Linux)
- Optional: Slack/Discord webhook
- Log to `~/.claude/canary-watch.log`
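A sketch of what the critical-alert path could look like; the `CANARY_WEBHOOK_URL` variable and the dual `text`/`content` payload (one POST that satisfies both Slack and Discord incoming webhooks) are assumptions, not part of the skill's contract:
```ts
// Sketch only: log the alert, raise a desktop notification on macOS,
// and POST to a Slack/Discord incoming webhook if one is configured.
import { appendFileSync } from "node:fs";
import { execSync } from "node:child_process";
import { homedir } from "node:os";
import { join } from "node:path";

async function alertCritical(message: string): Promise<void> {
  const logLine = `${new Date().toISOString()} CRITICAL ${message}\n`;
  appendFileSync(join(homedir(), ".claude", "canary-watch.log"), logLine);

  if (process.platform === "darwin") {
    // Breaks if the message contains single quotes; good enough for a sketch.
    execSync(`osascript -e 'display notification ${JSON.stringify(message)} with title "canary-watch"'`);
  }

  const webhook = process.env.CANARY_WEBHOOK_URL; // hypothetical env var
  if (webhook) {
    await fetch(webhook, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: message, content: message }),
    });
  }
}
```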
## Output
```markdown
## Canary Report — myapp.com — 2026-03-23 03:15 PST
### Status: HEALTHY ✓
| Check | Result | Baseline | Delta |
|-------|--------|----------|-------|
| HTTP | 200 ✓ | 200 | — |
| Console errors | 0 ✓ | 0 | — |
| LCP | 1.8s ✓ | 1.6s | +200ms |
| CLS | 0.01 ✓ | 0.01 | — |
| API /health | 145ms ✓ | 120ms | +25ms |
### No regressions detected. Deploy is clean.
```
## Integration
Pair with:
- `/browser-qa` for pre-deploy verification
- Hooks: add as a PostToolUse hook on `git push` to auto-check after deploys
- CI: run in GitHub Actions after deploy step

skills/design-system/SKILL.md

@@ -0,0 +1,76 @@
# Design System — Generate & Audit Visual Systems
## When to Use
- Starting a new project that needs a design system
- Auditing an existing codebase for visual consistency
- Before a redesign — understand what you have
- When the UI looks "off" but you can't pinpoint why
- Reviewing PRs that touch styling
## How It Works
### Mode 1: Generate Design System
Analyzes your codebase and generates a cohesive design system:
```
1. Scan CSS/Tailwind/styled-components for existing patterns
2. Extract: colors, typography, spacing, border-radius, shadows, breakpoints
3. Research 3 competitor sites for inspiration (via browser MCP)
4. Propose a design token set (JSON + CSS custom properties)
5. Generate DESIGN.md with rationale for each decision
6. Create an interactive HTML preview page (self-contained, no deps)
```
Output: `DESIGN.md` + `design-tokens.json` + `design-preview.html`
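An illustrative slice of what `design-tokens.json` might contain; names and values are placeholders derived from the hypothetical scan, and each token would also be mirrored as a CSS custom property:
```json
{
  "color": { "bg": "#faf7f2", "ink": "#2b2722", "accent": "#8a6d3b" },
  "space": { "1": "4px", "2": "8px", "3": "16px" },
  "radius": { "sm": "4px", "lg": "12px" },
  "font": { "sans": "Inter, system-ui, sans-serif" },
  "shadow": { "card": "0 1px 3px rgb(0 0 0 / 0.08)" }
}
```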
### Mode 2: Visual Audit
Scores your UI across 10 dimensions (0-10 each):
```
1. Color consistency — are you using your palette or random hex values?
2. Typography hierarchy — clear h1 > h2 > h3 > body > caption?
3. Spacing rhythm — consistent scale (4px/8px/16px) or arbitrary?
4. Component consistency — do similar elements look similar?
5. Responsive behavior — fluid or broken at breakpoints?
6. Dark mode — complete or half-done?
7. Animation — purposeful or gratuitous?
8. Accessibility — contrast ratios, focus states, touch targets
9. Information density — cluttered or clean?
10. Polish — hover states, transitions, loading states, empty states
```
Each dimension gets a score, specific examples, and a fix with exact file:line.
### Mode 3: AI Slop Detection
Identifies generic AI-generated design patterns:
```
- Gratuitous gradients on everything
- Purple-to-blue defaults
- "Glass morphism" cards with no purpose
- Rounded corners on things that shouldn't be rounded
- Excessive animations on scroll
- Generic hero with centered text over stock gradient
- Sans-serif font stack with no personality
```
## Examples
**Generate for a SaaS app:**
```
/design-system generate --style minimal --palette earth-tones
```
**Audit existing UI:**
```
/design-system audit --url http://localhost:3000 --pages / /pricing /docs
```
**Check for AI slop:**
```
/design-system slop-check
```

skills/product-lens/SKILL.md

@@ -0,0 +1,79 @@
# Product Lens — Think Before You Build
## When to Use
- Before starting any feature — validate the "why"
- Weekly product review — are we building the right thing?
- When stuck choosing between features
- Before a launch — sanity check the user journey
- When converting a vague idea into a spec
## How It Works
### Mode 1: Product Diagnostic
Like YC office hours but automated. Asks the hard questions:
```
1. Who is this for? (specific person, not "developers")
2. What's the pain? (quantify: how often, how bad, what do they do today?)
3. Why now? (what changed that makes this possible/necessary?)
4. What's the 10-star version? (if money/time were unlimited)
5. What's the MVP? (smallest thing that proves the thesis)
6. What's the anti-goal? (what are you explicitly NOT building?)
7. How do you know it's working? (metric, not vibes)
```
Output: a `PRODUCT-BRIEF.md` with answers, risks, and a go/no-go recommendation.
### Mode 2: Founder Review
Reviews your current project through a founder lens:
```
1. Read README, CLAUDE.md, package.json, recent commits
2. Infer: what is this trying to be?
3. Score: product-market fit signals (0-10)
- Usage growth trajectory
- Retention indicators (repeat contributors, return users)
- Revenue signals (pricing page, billing code, Stripe integration)
- Competitive moat (what's hard to copy?)
4. Identify: the one thing that would 10x this
5. Flag: things you're building that don't matter
```
### Mode 3: User Journey Audit
Maps the actual user experience:
```
1. Clone/install the product as a new user
2. Document every friction point (confusing steps, errors, missing docs)
3. Time each step
4. Compare to competitor onboarding
5. Score: time-to-value (how long until the user gets their first win?)
6. Recommend: top 3 fixes for onboarding
```
### Mode 4: Feature Prioritization
When you have 10 ideas and need to pick 2:
```
1. List all candidate features
2. Score each on: impact (1-5) × confidence (1-5) ÷ effort (1-5)
3. Rank by ICE score
4. Apply constraints: runway, team size, dependencies
5. Output: prioritized roadmap with rationale
```
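A small sketch of the scoring arithmetic; the candidate features and ratings are placeholders:
```ts
// Sketch only: ICE = impact * confidence / effort, each rated 1-5.
interface Candidate {
  name: string;
  impact: number;     // 1-5: how much it moves the key metric
  confidence: number; // 1-5: how sure we are that it will
  effort: number;     // 1-5: relative cost to ship
}

const candidates: Candidate[] = [
  { name: "Onboarding checklist", impact: 4, confidence: 4, effort: 2 },
  { name: "Dark mode", impact: 2, confidence: 5, effort: 3 },
  { name: "Team billing", impact: 5, confidence: 3, effort: 5 },
];

const ranked = candidates
  .map((c) => ({ ...c, ice: (c.impact * c.confidence) / c.effort }))
  .sort((a, b) => b.ice - a.ice);

for (const c of ranked) {
  console.log(`${c.ice.toFixed(1)}  ${c.name}`);
}
// Prints: 8.0 Onboarding checklist, 3.3 Dark mode, 3.0 Team billing
```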
## Output
All modes output actionable docs, not essays. Every recommendation has a specific next step.
## Integration
Pair with:
- `/browser-qa` to verify the user journey audit findings
- `/design-system audit` for visual polish assessment
- `/canary-watch` for post-launch monitoring

skills/safety-guard/SKILL.md

@@ -0,0 +1,69 @@
# Safety Guard — Prevent Destructive Operations
## When to Use
- When working on production systems
- When agents are running autonomously (full-auto mode)
- When you want to restrict edits to a specific directory
- During sensitive operations (migrations, deploys, data changes)
## How It Works
Three modes of protection:
### Mode 1: Careful Mode
Intercepts destructive commands before execution and warns:
```
Watched patterns:
- rm -rf (especially /, ~, or project root)
- git push --force
- git reset --hard
- git checkout . (discard all changes)
- DROP TABLE / DROP DATABASE
- docker system prune
- kubectl delete
- chmod 777
- sudo rm
- npm publish (accidental publishes)
- Any command with --no-verify
```
When detected: shows what the command does, asks for confirmation, and suggests a safer alternative.
### Mode 2: Freeze Mode
Locks file edits to a specific directory tree:
```
/safety-guard freeze src/components/
```
Any Write/Edit outside `src/components/` is blocked with an explanation. Useful when you want an agent to focus on one area without touching unrelated code.
### Mode 3: Guard Mode (Careful + Freeze combined)
Both protections active. Maximum safety for autonomous agents.
```
/safety-guard guard --dir src/api/ --allow-read-all
```
Agents can read anything but only write to `src/api/`. Destructive commands are blocked everywhere.
### Unlock
```
/safety-guard off
```
## Implementation
Uses PreToolUse hooks to intercept Bash, Write, Edit, and MultiEdit tool calls. Checks the command/path against the active rules before allowing execution.
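A compressed sketch of what such a hook could look like, assuming Claude Code's hook interface (the tool call arrives as JSON on stdin; exit code 2 blocks the call and surfaces stderr to the agent) and a hypothetical `SAFETY_GUARD_FREEZE` environment variable for freeze mode:
```ts
// Sketch only: careful mode + freeze mode in one PreToolUse hook script (run with tsx/node).
import { readFileSync } from "node:fs";

const DESTRUCTIVE = [
  /rm\s+-rf\s+(\/|~|\.)(\s|$)/, // rm -rf on /, ~, or the project root
  /git\s+push\s+.*--force/,
  /git\s+reset\s+--hard/,
  /DROP\s+(TABLE|DATABASE)/i,
  /docker\s+system\s+prune/,
  /--no-verify/,
];

const event = JSON.parse(readFileSync(0, "utf8")); // hook payload arrives on stdin

if (event.tool_name === "Bash") {
  const cmd: string = event.tool_input?.command ?? "";
  const hit = DESTRUCTIVE.find((re) => re.test(cmd));
  if (hit) {
    console.error(`safety-guard: blocked destructive command (${hit}): ${cmd}`);
    process.exit(2); // block; the agent sees the reason and can propose an alternative
  }
}

const freezeDir = process.env.SAFETY_GUARD_FREEZE; // e.g. "src/components/"
if (freezeDir && ["Write", "Edit", "MultiEdit"].includes(event.tool_name)) {
  const file: string = event.tool_input?.file_path ?? "";
  if (!file.includes(freezeDir)) {
    console.error(`safety-guard: edits are frozen to ${freezeDir}; refused ${file}`);
    process.exit(2);
  }
}

process.exit(0); // allow everything else
```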
## Integration
- Enable by default for `codex -a never` sessions
- Pair with observability risk scoring in ECC 2.0
- Logs all blocked actions to `~/.claude/safety-guard.log`