feat: deliver v1.8.0 harness reliability and parity updates

Affaan Mustafa
2026-03-04 14:48:06 -08:00
parent 32e9c293f0
commit 48b883d741
84 changed files with 2990 additions and 725 deletions

@@ -0,0 +1,73 @@
---
name: agent-harness-construction
description: Design and optimize AI agent action spaces, tool definitions, and observation formatting for higher completion rates.
origin: ECC
---
# Agent Harness Construction
Use this skill when you are improving how an agent plans, calls tools, recovers from errors, and converges on completion.
## Core Model
Agent output quality is constrained by:
1. Action space quality
2. Observation quality
3. Recovery quality
4. Context budget quality
## Action Space Design
1. Use stable, explicit tool names.
2. Keep inputs schema-first and narrow.
3. Return deterministic output shapes.
4. Avoid catch-all tools unless isolation is impossible.
## Granularity Rules
- Use micro-tools for high-risk operations (deploy, migration, permissions).
- Use medium tools for common edit/read/search loops.
- Use macro-tools only when round-trip overhead is the dominant cost.
## Observation Design
Every tool response should include:
- `status`: success|warning|error
- `summary`: one-line result
- `next_actions`: actionable follow-ups
- `artifacts`: file paths / IDs
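As a rough illustration of the four fields above, a tool wrapper could emit every observation as one JSON line. This is a minimal sketch, not part of ECC: `emit_observation` is a hypothetical helper, and it assumes field values contain no double quotes, backslashes, or newlines (a real wrapper would need proper JSON escaping).

```bash
# Hypothetical helper: print one tool observation as a single JSON line.
# Assumption: values contain no characters that need JSON escaping.
emit_observation() {
  local status="$1" summary="$2" next_action="$3" artifact="$4"
  # Reject anything outside the documented status set.
  case "$status" in
    success|warning|error) ;;
    *) echo "invalid status: $status" >&2; return 2 ;;
  esac
  printf '{"status":"%s","summary":"%s","next_actions":["%s"],"artifacts":["%s"]}\n' \
    "$status" "$summary" "$next_action" "$artifact"
}

emit_observation success "3 files formatted" "run tests" "src/app.ts"
```

Keeping the shape identical on success and failure means the agent can parse every response the same way.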
## Error Recovery Contract
For every error path, include:
- root cause hint
- safe retry instruction
- explicit stop condition
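One way to make the three elements above concrete is a retry wrapper with a fixed budget. The sketch below is an assumption-laden example: `retry_with_budget` and its message wording are hypothetical, and it treats any non-zero exit as retryable, which a real harness would likely refine per error class.

```bash
# Hypothetical wrapper: run a command up to MAX_ATTEMPTS times, then stop.
# Usage: retry_with_budget MAX_ATTEMPTS command [args...]
retry_with_budget() {
  local max_attempts="$1" attempt=1
  shift
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "status=success attempts=$attempt"
      return 0
    fi
    # Root cause hint plus safe retry instruction go to stderr.
    echo "status=warning attempt=$attempt hint='command exited non-zero' retry='safe to re-run'" >&2
    attempt=$((attempt + 1))
  done
  # Explicit stop condition: the budget is spent, so do not loop again.
  echo "status=error stop='retry budget of $max_attempts exhausted; escalate to a human'"
  return 1
}
```

The final line is the stop condition: once the budget is exhausted the wrapper reports it and returns non-zero instead of retrying indefinitely.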
## Context Budgeting
1. Keep system prompt minimal and invariant.
2. Move large guidance into skills loaded on demand.
3. Prefer references to files over inlining long documents.
4. Compact at phase boundaries, not arbitrary token thresholds.
## Architecture Pattern Guidance
- ReAct: best for exploratory tasks with an uncertain path.
- Function-calling: best for structured deterministic flows.
- Hybrid (recommended): ReAct planning + typed tool execution.
## Benchmarking
Track:
- completion rate
- retries per task
- pass@1 and pass@3
- cost per successful task
## Anti-Patterns
- Too many tools with overlapping semantics.
- Opaque tool output with no recovery hints.
- Error-only output without next steps.
- Context overloading with irrelevant references.

@@ -0,0 +1,63 @@
---
name: agentic-engineering
description: Operate as an agentic engineer using eval-first execution, decomposition, and cost-aware model routing.
origin: ECC
---
# Agentic Engineering
Use this skill for engineering workflows where AI agents perform most implementation work and humans enforce quality and risk controls.
## Operating Principles
1. Define completion criteria before execution.
2. Decompose work into agent-sized units.
3. Route model tiers by task complexity.
4. Measure with evals and regression checks.
## Eval-First Loop
1. Define capability eval and regression eval.
2. Run baseline and capture failure signatures.
3. Execute implementation.
4. Re-run evals and compare deltas.
## Task Decomposition
Apply the 15-minute unit rule:
- each unit should be independently verifiable
- each unit should have a single dominant risk
- each unit should expose a clear done condition
## Model Routing
- Haiku: classification, boilerplate transforms, narrow edits
- Sonnet: implementation and refactors
- Opus: architecture, root-cause analysis, multi-file invariants
## Session Strategy
- Continue the session for closely coupled units.
- Start a fresh session after major phase transitions.
- Compact after milestone completion, not during active debugging.
## Review Focus for AI-Generated Code
Prioritize:
- invariants and edge cases
- error boundaries
- security and auth assumptions
- hidden coupling and rollout risk
Do not waste review cycles on style-only disagreements when automated formatting and linting already enforce style.
## Cost Discipline
Track per task:
- model
- token estimate
- retries
- wall-clock time
- success/failure
Escalate model tier only when lower tier fails with a clear reasoning gap.
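The tier list under Model Routing and the escalation rule above could be encoded roughly as follows. The task labels and both function names are hypothetical, and defaulting unknown tasks to the middle tier is an assumption rather than something the skill prescribes.

```bash
# Hypothetical router: the task labels are illustrative names for the
# categories listed under Model Routing.
route_model() {
  case "$1" in
    classification|boilerplate|narrow-edit) echo haiku ;;
    implementation|refactor) echo sonnet ;;
    architecture|root-cause|multi-file-invariant) echo opus ;;
    *) echo sonnet ;;  # assumed default: the middle tier
  esac
}

# Escalate one tier at a time; returns non-zero when no higher tier exists.
next_tier() {
  case "$1" in
    haiku) echo sonnet ;;
    sonnet) echo opus ;;
    *) return 1 ;;
  esac
}

route_model refactor
next_tier haiku
```

Calling `next_tier` only after a logged failure keeps escalation tied to evidence rather than habit.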

@@ -0,0 +1,51 @@
---
name: ai-first-engineering
description: Engineering operating model for teams where AI agents generate a large share of implementation output.
origin: ECC
---
# AI-First Engineering
Use this skill when designing process, reviews, and architecture for teams shipping with AI-assisted code generation.
## Process Shifts
1. Planning quality matters more than typing speed.
2. Eval coverage matters more than anecdotal confidence.
3. Review focus shifts from syntax to system behavior.
## Architecture Requirements
Prefer architectures that are agent-friendly:
- explicit boundaries
- stable contracts
- typed interfaces
- deterministic tests
Avoid implicit behavior spread across hidden conventions.
## Code Review in AI-First Teams
Review for:
- behavior regressions
- security assumptions
- data integrity
- failure handling
- rollout safety
Minimize time spent on style issues already covered by automation.
## Hiring and Evaluation Signals
Strong AI-first engineers:
- decompose ambiguous work cleanly
- define measurable acceptance criteria
- produce high-signal prompts and evals
- enforce risk controls under delivery pressure
## Testing Standard
Raise the testing bar for generated code:
- required regression coverage for touched domains
- explicit edge-case assertions
- integration checks for interface boundaries

@@ -6,6 +6,11 @@ origin: ECC
# Autonomous Loops Skill
> Compatibility note (v1.8.0): `autonomous-loops` is retained for one release.
> The canonical skill name is now `continuous-agent-loop`. New loop guidance
> should be authored there, while this skill remains available to avoid
> breaking existing workflows.
Patterns, architectures, and reference implementations for running Claude Code autonomously in loops. Covers everything from simple `claude -p` pipelines to full RFC-driven multi-agent DAG orchestration.
## When to Use

@@ -0,0 +1,45 @@
---
name: continuous-agent-loop
description: Patterns for continuous autonomous agent loops with quality gates, evals, and recovery controls.
origin: ECC
---
# Continuous Agent Loop
`continuous-agent-loop` is the canonical loop skill name from v1.8 onward. It supersedes `autonomous-loops`, which remains available for one release for compatibility.
## Loop Selection Flow
```text
Start
|
+-- Need strict CI/PR control? -- yes --> continuous-pr
|
+-- Need RFC decomposition? -- yes --> rfc-dag
|
+-- Need exploratory parallel generation? -- yes --> infinite
|
+-- default --> sequential
```
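The flow above reads as a first-match decision list, which can be sketched as a small function. `select_loop` is a hypothetical name, and the sketch assumes the questions are evaluated top to bottom, each answered `yes` or `no`.

```bash
# Hypothetical selector mirroring the flow above. Arguments are yes/no
# answers in the order the questions are asked; missing answers mean "no".
select_loop() {
  local strict_ci="${1:-no}" rfc="${2:-no}" exploratory="${3:-no}"
  if [ "$strict_ci" = "yes" ]; then
    echo continuous-pr
  elif [ "$rfc" = "yes" ]; then
    echo rfc-dag
  elif [ "$exploratory" = "yes" ]; then
    echo infinite
  else
    echo sequential
  fi
}

select_loop no yes no
```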
## Combined Pattern
Recommended production stack:
1. RFC decomposition (`ralphinho-rfc-pipeline`)
2. quality gates (`plankton-code-quality` + `/quality-gate`)
3. eval loop (`eval-harness`)
4. session persistence (`nanoclaw-repl`)
## Failure Modes
- loop churn without measurable progress
- repeated retries with same root cause
- merge queue stalls
- cost drift from unbounded escalation
## Recovery
- freeze loop
- run `/harness-audit`
- reduce scope to failing unit
- replay with explicit acceptance criteria

@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# Continuous Learning v2 - Observer background loop
set +e
unset CLAUDECODE
SLEEP_PID=""
USR1_FIRED=0
cleanup() {
[ -n "$SLEEP_PID" ] && kill "$SLEEP_PID" 2>/dev/null
if [ -f "$PID_FILE" ] && [ "$(cat "$PID_FILE" 2>/dev/null)" = "$$" ]; then
rm -f "$PID_FILE"
fi
exit 0
}
trap cleanup TERM INT
analyze_observations() {
if [ ! -f "$OBSERVATIONS_FILE" ]; then
return
fi
obs_count=$(wc -l < "$OBSERVATIONS_FILE" 2>/dev/null || echo 0)
if [ "$obs_count" -lt "$MIN_OBSERVATIONS" ]; then
return
fi
echo "[$(date)] Analyzing $obs_count observations for project ${PROJECT_NAME}..." >> "$LOG_FILE"
if [ "${CLV2_IS_WINDOWS:-false}" = "true" ] && [ "${ECC_OBSERVER_ALLOW_WINDOWS:-false}" != "true" ]; then
echo "[$(date)] Skipping claude analysis on Windows due to known non-interactive hang issue (#295). Set ECC_OBSERVER_ALLOW_WINDOWS=true to override." >> "$LOG_FILE"
return
fi
if ! command -v claude >/dev/null 2>&1; then
echo "[$(date)] claude CLI not found, skipping analysis" >> "$LOG_FILE"
return
fi
prompt_file="$(mktemp "${TMPDIR:-/tmp}/ecc-observer-prompt.XXXXXX")"
cat > "$prompt_file" <<PROMPT
Read ${OBSERVATIONS_FILE} and identify patterns for the project ${PROJECT_NAME} (user corrections, error resolutions, repeated workflows, tool preferences).
If you find 3+ occurrences of the same pattern, create an instinct file in ${INSTINCTS_DIR}/<id>.md.
CRITICAL: Every instinct file MUST use this exact format:
---
id: kebab-case-name
trigger: when <specific condition>
confidence: <0.3-0.85 based on frequency: 3-5 times=0.5, 6-10=0.7, 11+=0.85>
domain: <one of: code-style, testing, git, debugging, workflow, file-patterns>
source: session-observation
scope: project
project_id: ${PROJECT_ID}
project_name: ${PROJECT_NAME}
---
# Title
## Action
<what to do, one clear sentence>
## Evidence
- Observed N times in session <id>
- Pattern: <description>
- Last observed: <date>
Rules:
- Be conservative, only clear patterns with 3+ observations
- Use narrow, specific triggers
- Never include actual code snippets, only describe patterns
- If a similar instinct already exists in ${INSTINCTS_DIR}/, update it instead of creating a duplicate
- The YAML frontmatter (between --- markers) with id field is MANDATORY
- If a pattern seems universal (not project-specific), set scope to global instead of project
- Examples of global patterns: always validate user input, prefer explicit error handling
- Examples of project patterns: use React functional components, follow Django REST framework conventions
PROMPT
timeout_seconds="${ECC_OBSERVER_TIMEOUT_SECONDS:-120}"
exit_code=0
claude --model haiku --max-turns 3 --print < "$prompt_file" >> "$LOG_FILE" 2>&1 &
claude_pid=$!
(
sleep "$timeout_seconds"
if kill -0 "$claude_pid" 2>/dev/null; then
echo "[$(date)] Claude analysis timed out after ${timeout_seconds}s; terminating process" >> "$LOG_FILE"
kill "$claude_pid" 2>/dev/null || true
fi
) &
watchdog_pid=$!
wait "$claude_pid"
exit_code=$?
kill "$watchdog_pid" 2>/dev/null || true
rm -f "$prompt_file"
if [ "$exit_code" -ne 0 ]; then
echo "[$(date)] Claude analysis failed (exit $exit_code)" >> "$LOG_FILE"
fi
if [ -f "$OBSERVATIONS_FILE" ]; then
archive_dir="${PROJECT_DIR}/observations.archive"
mkdir -p "$archive_dir"
mv "$OBSERVATIONS_FILE" "$archive_dir/processed-$(date +%Y%m%d-%H%M%S)-$$.jsonl" 2>/dev/null || true
fi
}
on_usr1() {
[ -n "$SLEEP_PID" ] && kill "$SLEEP_PID" 2>/dev/null
SLEEP_PID=""
USR1_FIRED=1
analyze_observations
}
trap on_usr1 USR1
echo "$$" > "$PID_FILE"
echo "[$(date)] Observer started for ${PROJECT_NAME} (PID: $$)" >> "$LOG_FILE"
while true; do
sleep "$OBSERVER_INTERVAL_SECONDS" &
SLEEP_PID=$!
wait "$SLEEP_PID" 2>/dev/null
SLEEP_PID=""
if [ "$USR1_FIRED" -eq 1 ]; then
USR1_FIRED=0
else
analyze_observations
fi
done

@@ -23,6 +23,7 @@ set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SKILL_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
OBSERVER_LOOP_SCRIPT="${SCRIPT_DIR}/observer-loop.sh"
# Source shared project detection helper
# This sets: PROJECT_ID, PROJECT_NAME, PROJECT_ROOT, PROJECT_DIR
@@ -74,6 +75,13 @@ OBSERVER_INTERVAL_SECONDS=$((OBSERVER_INTERVAL_MINUTES * 60))
echo "Project: ${PROJECT_NAME} (${PROJECT_ID})"
echo "Storage: ${PROJECT_DIR}"
# Windows/Git-Bash detection (Issue #295)
UNAME_LOWER="$(uname -s 2>/dev/null | tr '[:upper:]' '[:lower:]')"
IS_WINDOWS=false
case "$UNAME_LOWER" in
*mingw*|*msys*|*cygwin*) IS_WINDOWS=true ;;
esac
case "${1:-start}" in
stop)
if [ -f "$PID_FILE" ]; then
@@ -135,8 +143,13 @@ case "${1:-start}" in
echo "Starting observer agent for ${PROJECT_NAME}..."
if [ ! -x "$OBSERVER_LOOP_SCRIPT" ]; then
echo "Observer loop script not found or not executable: $OBSERVER_LOOP_SCRIPT"
exit 1
fi
# The observer loop — fully detached with nohup, IO redirected to log.
# Variables passed safely via env to avoid shell injection from special chars in paths.
# Variables are passed via env; observer-loop.sh handles analysis/retry flow.
nohup env \
CONFIG_DIR="$CONFIG_DIR" \
PID_FILE="$PID_FILE" \
@@ -148,116 +161,8 @@ case "${1:-start}" in
PROJECT_ID="$PROJECT_ID" \
MIN_OBSERVATIONS="$MIN_OBSERVATIONS" \
OBSERVER_INTERVAL_SECONDS="$OBSERVER_INTERVAL_SECONDS" \
/bin/bash -c '
set +e
unset CLAUDECODE
SLEEP_PID=""
USR1_FIRED=0
cleanup() {
[ -n "$SLEEP_PID" ] && kill "$SLEEP_PID" 2>/dev/null
# Only remove PID file if it still belongs to this process
if [ -f "$PID_FILE" ] && [ "$(cat "$PID_FILE" 2>/dev/null)" = "$$" ]; then
rm -f "$PID_FILE"
fi
exit 0
}
trap cleanup TERM INT
analyze_observations() {
if [ ! -f "$OBSERVATIONS_FILE" ]; then
return
fi
obs_count=$(wc -l < "$OBSERVATIONS_FILE" 2>/dev/null || echo 0)
if [ "$obs_count" -lt "$MIN_OBSERVATIONS" ]; then
return
fi
echo "[$(date)] Analyzing $obs_count observations for project ${PROJECT_NAME}..." >> "$LOG_FILE"
# Use Claude Code with Haiku to analyze observations
# The prompt specifies project-scoped instinct creation
if command -v claude &> /dev/null; then
exit_code=0
claude --model haiku --max-turns 3 --print \
"Read $OBSERVATIONS_FILE and identify patterns for the project '${PROJECT_NAME}' (user corrections, error resolutions, repeated workflows, tool preferences).
If you find 3+ occurrences of the same pattern, create an instinct file in $INSTINCTS_DIR/<id>.md.
CRITICAL: Every instinct file MUST use this exact format:
---
id: kebab-case-name
trigger: \"when <specific condition>\"
confidence: <0.3-0.85 based on frequency: 3-5 times=0.5, 6-10=0.7, 11+=0.85>
domain: <one of: code-style, testing, git, debugging, workflow, file-patterns>
source: session-observation
scope: project
project_id: ${PROJECT_ID}
project_name: ${PROJECT_NAME}
---
# Title
## Action
<what to do, one clear sentence>
## Evidence
- Observed N times in session <id>
- Pattern: <description>
- Last observed: <date>
Rules:
- Be conservative, only clear patterns with 3+ observations
- Use narrow, specific triggers
- Never include actual code snippets, only describe patterns
- If a similar instinct already exists in $INSTINCTS_DIR/, update it instead of creating a duplicate
- The YAML frontmatter (between --- markers) with id field is MANDATORY
- If a pattern seems universal (not project-specific), set scope to 'global' instead of 'project'
- Examples of global patterns: 'always validate user input', 'prefer explicit error handling'
- Examples of project patterns: 'use React functional components', 'follow Django REST framework conventions'" \
>> "$LOG_FILE" 2>&1 || exit_code=$?
if [ "$exit_code" -ne 0 ]; then
echo "[$(date)] Claude analysis failed (exit $exit_code)" >> "$LOG_FILE"
fi
else
echo "[$(date)] claude CLI not found, skipping analysis" >> "$LOG_FILE"
fi
if [ -f "$OBSERVATIONS_FILE" ]; then
archive_dir="${PROJECT_DIR}/observations.archive"
mkdir -p "$archive_dir"
mv "$OBSERVATIONS_FILE" "$archive_dir/processed-$(date +%Y%m%d-%H%M%S)-$$.jsonl" 2>/dev/null || true
fi
}
on_usr1() {
# Kill pending sleep to avoid leak, then analyze
[ -n "$SLEEP_PID" ] && kill "$SLEEP_PID" 2>/dev/null
SLEEP_PID=""
USR1_FIRED=1
analyze_observations
}
trap on_usr1 USR1
echo "$$" > "$PID_FILE"
echo "[$(date)] Observer started for ${PROJECT_NAME} (PID: $$)" >> "$LOG_FILE"
while true; do
# Interruptible sleep — allows USR1 trap to fire immediately
sleep "$OBSERVER_INTERVAL_SECONDS" &
SLEEP_PID=$!
wait $SLEEP_PID 2>/dev/null
SLEEP_PID=""
# Skip scheduled analysis if USR1 already ran it
if [ "$USR1_FIRED" -eq 1 ]; then
USR1_FIRED=0
else
analyze_observations
fi
done
' >> "$LOG_FILE" 2>&1 &
CLV2_IS_WINDOWS="$IS_WINDOWS" \
"$OBSERVER_LOOP_SCRIPT" >> "$LOG_FILE" 2>&1 &
# Wait for PID file
sleep 2

@@ -116,4 +116,4 @@ Homunculus v2 takes a more sophisticated approach:
4. **Domain tagging** - code-style, testing, git, debugging, etc.
5. **Evolution path** - Cluster related instincts into skills/commands
See: `/Users/affoon/Documents/tasks/12-continuous-learning-v2.md` for full spec.
See: `docs/continuous-learning-v2-spec.md` for full spec.

@@ -0,0 +1,50 @@
---
name: enterprise-agent-ops
description: Operate long-lived agent workloads with observability, security boundaries, and lifecycle management.
origin: ECC
---
# Enterprise Agent Ops
Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.
## Operational Domains
1. runtime lifecycle (start, pause, stop, restart)
2. observability (logs, metrics, traces)
3. safety controls (scopes, permissions, kill switches)
4. change management (rollout, rollback, audit)
## Baseline Controls
- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions
## Metrics to Track
- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution
## Incident Pattern
When failure spikes:
1. freeze new rollout
2. capture representative traces
3. isolate failing route
4. patch with smallest safe change
5. run regression + security checks
6. resume gradually
## Deployment Integrations
This skill pairs with:
- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates

@@ -234,3 +234,37 @@ Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```
## Product Evals (v1.8)
Use product evals when behavior quality cannot be captured by unit tests alone.
### Grader Types
1. Code grader (deterministic assertions)
2. Rule grader (regex/schema constraints)
3. Model grader (LLM-as-judge rubric)
4. Human grader (manual adjudication for ambiguous outputs)
### pass@k Guidance
- `pass@1`: direct reliability
- `pass@3`: practical reliability under controlled retries
- `pass^3`: stability test (all 3 runs must pass)
Recommended thresholds:
- Capability evals: pass@3 >= 0.90
- Regression evals: pass^3 = 1.00 for release-critical paths
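To make the three metrics concrete, the sketch below computes them empirically from three recorded runs per task. The input format and the `eval_summary` name are assumptions for illustration: one line per task, `<task-id> <run1> <run2> <run3>`, where `1` is a pass and `0` a fail. Here pass@3 is simply the share of tasks with at least one passing run out of three, not a statistical estimator.

```bash
# Hypothetical input on stdin: "<task-id> <run1> <run2> <run3>", 1 = pass, 0 = fail.
eval_summary() {
  awk '
    NF >= 4 {
      n++
      if ($2 == 1) p1++                              # first run passed
      if ($2 == 1 || $3 == 1 || $4 == 1) p3++        # any of 3 runs passed
      if ($2 == 1 && $3 == 1 && $4 == 1) s3++        # all 3 runs passed
    }
    END {
      if (n == 0) { print "no tasks"; exit 1 }
      printf "pass@1=%.2f pass@3=%.2f pass^3=%.2f\n", p1 / n, p3 / n, s3 / n
    }
  '
}

printf '%s\n' "login 1 1 1" "search 0 1 1" "export 0 0 0" "upload 1 1 1" | eval_summary
# prints: pass@1=0.50 pass@3=0.75 pass^3=0.50
```

In this sample, pass@3 of 0.75 would miss the 0.90 capability threshold, and pass^3 of 0.50 would block a release-critical path.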
### Eval Anti-Patterns
- Overfitting prompts to known eval examples
- Measuring only happy-path outputs
- Ignoring cost and latency drift while chasing pass rates
- Allowing flaky graders in release gates
### Minimal Eval Artifact Layout
- `.claude/evals/<feature>.md` definition
- `.claude/evals/<feature>.log` run history
- `docs/releases/<version>/eval-summary.md` release snapshot

@@ -0,0 +1,33 @@
---
name: nanoclaw-repl
description: Operate and extend NanoClaw v2, ECC's zero-dependency session-aware REPL built on claude -p.
origin: ECC
---
# NanoClaw REPL
Use this skill when running or extending `scripts/claw.js`.
## Capabilities
- persistent markdown-backed sessions
- model switching with `/model`
- dynamic skill loading with `/load`
- session branching with `/branch`
- cross-session search with `/search`
- history compaction with `/compact`
- export to md/json/txt with `/export`
- session metrics with `/metrics`
## Operating Guidance
1. Keep sessions task-focused.
2. Branch before high-risk changes.
3. Compact after major milestones.
4. Export before sharing or archival.
## Extension Rules
- keep zero external runtime dependencies
- preserve markdown-as-database compatibility
- keep command handlers deterministic and local

@@ -194,3 +194,46 @@ Plankton's `.claude/hooks/config.json` controls all behavior:
- Plankton (credit: @alxfazio)
- Plankton REFERENCE.md — Full architecture documentation (credit: @alxfazio)
- Plankton SETUP.md — Detailed installation guide (credit: @alxfazio)
## ECC v1.8 Additions
### Copyable Hook Profile
Set strict quality behavior:
```bash
export ECC_HOOK_PROFILE=strict
export ECC_QUALITY_GATE_FIX=true
export ECC_QUALITY_GATE_STRICT=true
```
### Language Gate Table
- TypeScript/JavaScript: Biome preferred, Prettier fallback
- Python: Ruff format/check
- Go: gofmt
### Config Tamper Guard
During quality enforcement, flag changes to config files in the same iteration:
- `biome.json`, `.eslintrc*`, `prettier.config*`, `tsconfig.json`, `pyproject.toml`
If config is changed to suppress violations, require explicit review before merge.
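A guard along these lines could be a small filter over the list of changed files. The sketch below is hypothetical (`flag_config_changes` is not an ECC command) and assumes the caller supplies paths on stdin, for example from `git diff --name-only`. It only flags files for review; it does not decide whether a change actually suppresses violations.

```bash
# Hypothetical filter: read changed file paths on stdin and print the ones
# whose basename matches the guarded config list above.
flag_config_changes() {
  while IFS= read -r path; do
    case "${path##*/}" in
      biome.json|.eslintrc*|prettier.config*|tsconfig.json|pyproject.toml)
        echo "REVIEW REQUIRED: $path" ;;
    esac
  done
}

printf '%s\n' "src/app.ts" ".eslintrc.json" "packages/web/tsconfig.json" | flag_config_changes
# prints:
# REVIEW REQUIRED: .eslintrc.json
# REVIEW REQUIRED: packages/web/tsconfig.json
```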
### CI Integration Pattern
Use the same commands in CI as local hooks:
1. run formatter checks
2. run lint/type checks
3. fail fast in strict mode
4. publish remediation summary
### Health Metrics
Track:
- edits flagged by gates
- average remediation time
- repeat violations by category
- merge blocks due to gate failures

@@ -0,0 +1,67 @@
---
name: ralphinho-rfc-pipeline
description: RFC-driven multi-agent DAG execution pattern with quality gates, merge queues, and work unit orchestration.
origin: ECC
---
# Ralphinho RFC Pipeline
Inspired by [humanplane](https://github.com/humanplane)-style RFC decomposition patterns and multi-unit orchestration workflows.
Use this skill when a feature is too large for a single agent pass and must be split into independently verifiable work units.
## Pipeline Stages
1. RFC intake
2. DAG decomposition
3. Unit assignment
4. Unit implementation
5. Unit validation
6. Merge queue and integration
7. Final system verification
## Unit Spec Template
Each work unit should include:
- `id`
- `depends_on`
- `scope`
- `acceptance_tests`
- `risk_level`
- `rollback_plan`
## Complexity Tiers
- Tier 1: isolated file edits, deterministic tests
- Tier 2: multi-file behavior changes, moderate integration risk
- Tier 3: schema/auth/perf/security changes
## Quality Pipeline per Unit
1. research
2. implementation plan
3. implementation
4. tests
5. review
6. merge-ready report
## Merge Queue Rules
- Never merge a unit with unresolved dependency failures.
- Always rebase unit branches on the latest integration branch.
- Re-run integration tests after each queued merge.
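A merge order that respects `depends_on` can be derived with the standard `tsort` utility. This is a sketch under stated assumptions: `merge_order` is a hypothetical helper, the input is one `<unit-id> <depends_on>` pair per line, and a unit with no dependency lists itself twice (which `tsort` treats as a node with no ordering constraint).

```bash
# Hypothetical input on stdin: "<unit-id> <depends_on>" per line.
# tsort expects "<before> <after>", so the columns are swapped first.
merge_order() {
  awk '{ print $2, $1 }' | tsort
}

printf '%s\n' "schema schema" "api schema" "ui api" | merge_order
# prints:
# schema
# api
# ui
```

When several orders are valid, `tsort` picks one of them, so the output is a legal merge sequence rather than the only one.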
## Recovery
If a unit stalls:
- evict from active queue
- snapshot findings
- regenerate narrowed unit scope
- retry with updated constraints
## Outputs
- RFC execution log
- unit scorecards
- dependency graph snapshot
- integration risk summary