Removed skills/evalview-agent-testing/, which required installing an
unvetted third-party package (`pip install evalview`). ECC skills must be
self-contained and must not require external packages to function.
If we need agent regression testing, we will build it natively in ECC.
Adds an integration skill for ORCH (@oxgeneral/orch), a TypeScript CLI runtime
that coordinates Claude Code, OpenCode, Codex, and Cursor agents as a typed
engineering team with a formal state machine, auto-retry, and inter-agent messaging.
Use this skill when ECC tasks need to survive multiple sessions, require a review
gate before completion, or involve a persistent specialized agent team.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Affaan Mustafa <me@affaanmustafa.com>
* feat(skills): add evalview-agent-testing skill and MCP server
Add EvalView as a regression testing skill for AI agents. EvalView
snapshots agent behavior (tool calls, parameters, output), then diffs
against baselines after every change — catching regressions before they
ship.
Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents
Also adds evalview MCP server to mcp-servers.json.
* fix(skills): pin action SHA and remove unvetted external links
- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)
Addresses cubic-dev-ai review feedback.
* fix(skills): replace third-party action with pip install + CLI
Use plain pip install + evalview CLI instead of a third-party GitHub
Action. No external actions, no secrets passed to unvetted code.
Addresses cubic-dev-ai supply-chain review feedback.
* fix(skills): add destructive revert warning for gate_or_revert
Add prominent warning that gate_or_revert runs git checkout,
discarding uncommitted changes. Documents the revert_cmd override
for safer alternatives like git stash.
Addresses cubic-dev-ai review feedback.
* fix(skills): pin pip version range and document fail-on tradeoffs
- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users
understand what gates and what passes through
Addresses greptile-apps review feedback.
* fix: use python3 -m evalview for venv compatibility in MCP config
Follows the same pattern as the insaits entry. Resolves correctly even
when evalview is installed in a virtual environment that isn't on
the system PATH.
* fix: align MCP install command with mcp-servers.json pattern
Use python3 -m evalview mcp serve consistently across both the
skill docs and the MCP config catalog.
* fix: use evalview CLI entry point for MCP command
pip install evalview installs the evalview binary to PATH, so using
it directly is consistent with the install docs and avoids python3
version mismatch issues.
* fix: pin install version to match CI section
* fix: pin all pip install references consistently
* fix: add API key placeholder and pin install version in MCP config
Add OPENAI_API_KEY env placeholder matching other entries. Note that
the key is optional — deterministic checks work without it. Pin
install version to match skill docs.
* fix: guard score_delta format for non-scored statuses
---------
Co-authored-by: Affaan Mustafa <me@affaanmustafa.com>
* feat: add PRP workflow commands adapted from PRPs-agentic-eng
Add 5 new PRP workflow commands and extend 2 existing commands:
New commands:
- prp-prd.md: Interactive PRD generator with 8 phases
- prp-plan.md: Deep implementation planning with codebase analysis
- prp-implement.md: Plan executor with rigorous validation loops
- prp-commit.md: Quick commit with natural language file targeting
- prp-pr.md: GitHub PR creation from current branch
Extended commands:
- code-review.md: Added GitHub PR review mode alongside local review
- plan.md: Added cross-reference to /prp-plan for deeper planning
Adapted from PRPs-agentic-eng by Wirasm. Sub-agents remapped to
inline Claude instructions. ECC conventions applied throughout
(YAML frontmatter, Phase headings, tables, no XML tags).
Artifacts stored in .claude/PRPs/{prds,plans,reports,reviews}/.
* fix: address PR #848 review feedback
- Remove external URLs from all 6 command files (keep attribution text)
- Quote $ARGUMENTS in prp-implement.md to handle paths with spaces
- Fix empty git add expansion in prp-commit.md (use xargs -r)
- Rewrite sub-agent language in prp-prd.md as direct instructions
- Fix code-review.md: add full-file fetch for PR reviews, replace
|| fallback chains with project-type detection, use proper GitHub
API for inline review comments
- Fix nested backticks in prp-plan.md Plan Template (use 4-backtick fence)
- Clarify $ARGUMENTS parsing in prp-pr.md for base branch + flags
- Fix fragile integration test pattern in prp-implement.md (proper
PID tracking, wait-for-ready loop, clean shutdown)
* fix: address second-pass review feedback on PR #848
- Add required 'side' field to GitHub review comments API call (code-review.md)
- Replace GNU-only xargs -r with portable alternative (prp-commit.md)
- Add failure check after server readiness timeout (prp-implement.md)
- Fix unsafe word-splitting in file-fetch loop using read -r (code-review.md)
- Make git reset pathspec tolerant of zero matches (prp-commit.md)
- Quote PRD file path in cat command (prp-plan.md)
- Fix plan filename placeholder inconsistency (prp-plan.md)
- Add PR template directory scan before fixed-path fallbacks (prp-pr.md)
* perf(hooks): batch format+typecheck at Stop instead of per Edit
Fixes #735. The per-edit post:edit:format and post:edit:typecheck hooks
ran synchronously after every Edit call, adding 15-30s of latency per
file — up to 7.5 minutes for a 10-file refactor.
New approach:
- post-edit-accumulator.js (PostToolUse/Edit): lightweight hook that
records each edited JS/TS path to a session-scoped temp file in
os.tmpdir(). No formatters, no tsc — exits in microseconds.
- stop-format-typecheck.js (Stop): reads the accumulator once per
response, groups files by project root and runs the formatter in
one batched invocation per root, then groups .ts/.tsx files by
tsconfig dir and runs tsc once per tsconfig. Clears the accumulator
immediately on read so repeated Stop calls don't double-process.
For a 10-file refactor: was 10 × (15s + 30s) = 7.5 min overhead,
now 1 × (batch format + batch tsc) = ~5-30s total.
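The grouping step described above could look roughly like the sketch below. It
is not the code from stop-format-typecheck.js: the helper names
findTsconfigDir and groupByTsconfig and the injectable `exists` parameter are
assumptions made for illustration, and the actual tsc invocation is left out.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Walk upward from a file until a directory containing tsconfig.json is found.
// `exists` is injectable (hypothetical parameter) so the walk can be exercised
// without touching the real filesystem.
function findTsconfigDir(filePath, exists = fs.existsSync) {
  let dir = path.dirname(path.resolve(filePath));
  for (;;) {
    if (exists(path.join(dir, "tsconfig.json"))) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

// Group edited .ts/.tsx files by tsconfig directory so that a type checker
// can be run once per group instead of once per file.
function groupByTsconfig(files, exists = fs.existsSync) {
  const groups = new Map();
  for (const file of files) {
    if (!/\.tsx?$/.test(file)) continue; // only TypeScript sources are typechecked
    const root = findTsconfigDir(file, exists);
    if (root === null) continue; // no tsconfig found; skip the file
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(file);
  }
  return groups;
}
```

Each key of the returned Map would then correspond to one batched run.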
* fix(hooks): address race condition, spawn timeout, and Windows path guard
Three issues raised in code review:
1. Race condition: switched accumulator from non-atomic JSON
read-modify-write to appendFileSync (one path per line). Concurrent
Edit hook processes each append independently without clobbering each
other. Deduplication moved to the Stop hook at read time.
2. Effective timeout: added run() export to stop-format-typecheck.js so
run-with-flags.js uses the direct require() path instead of falling
through to spawnSync (which has a hardcoded 30s cap). The 120s
timeout in hooks.json now governs the full batch as intended.
3. Windows path guard: added spaces and parentheses to UNSAFE_PATH_CHARS
so paths like "C:\Users\John Doe\project\file.ts" are caught before
being passed to cmd.exe with shell: true.
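A minimal sketch of the append-then-dedupe idea from item 1, under stated
assumptions: recordEdit is a hypothetical name, and parseAccumulator here is
only a guess at the shape of the parsing step, not the shipped implementation.

```javascript
const fs = require("node:fs");

// Edit-time side: append one path per line. There is no read-modify-write
// cycle, so parallel hook processes cannot overwrite each other's entries.
function recordEdit(accumulatorFile, editedPath) {
  fs.appendFileSync(accumulatorFile, editedPath + "\n", "utf8");
}

// Stop-time side: split into lines, drop blank lines, and dedupe while
// preserving first-seen order (a Set keeps insertion order).
function parseAccumulator(raw) {
  const seen = new Set();
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed) seen.add(trimmed);
  }
  return [...seen];
}
```

Deduplicating at read time keeps the Edit-time hook to a single append call,
which is what makes it cheap enough to run after every edit.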
* fix(hooks): fix session fallback, stale comment, trim verbose comments
- Replace 'default' session ID fallback with a cwd-based sha1 hash so
concurrent sessions in different projects don't share the same
accumulator file when CLAUDE_SESSION_ID is unset
- Remove stale "JSON file" reference in accumulator header (format is
now newline-delimited plain text)
- Remove redundant/verbose inline comments throughout both files
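One way the cwd-based fallback could be written is sketched below. The
function name accumulatorPath, the "ecc-edited-" file prefix, and the 12-character
hash truncation are illustrative assumptions. The character filter mirrors the
session ID sanitization described in the next commit.

```javascript
const crypto = require("node:crypto");
const os = require("node:os");
const path = require("node:path");

// Build a session-scoped accumulator path inside os.tmpdir().
// File name prefix and hash length are hypothetical choices for this sketch.
function accumulatorPath(env = process.env, cwd = process.cwd()) {
  // Keep only characters that are safe inside a file name.
  const clean = (env.CLAUDE_SESSION_ID || "").replace(/[^a-zA-Z0-9_-]/g, "");
  // When no usable session ID exists, derive a stable ID from the working
  // directory so sessions in different projects get different files.
  const id = clean || crypto.createHash("sha1").update(cwd).digest("hex").slice(0, 12);
  return path.join(os.tmpdir(), `ecc-edited-${id}.txt`);
}
```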
* fix(hooks): sanitize session ID, fix Windows tsc, proportional timeouts
- Sanitize CLAUDE_SESSION_ID with /[^a-zA-Z0-9_-]/g before embedding in
the temp filename so crafted separators or '..' sequences cannot escape
os.tmpdir() (cubic P1)
- Fix typecheckBatch on Windows: npx.cmd requires shell:true like
formatBatch already does; use spawnSync and extract stdout/stderr from
the result object (coderabbit P1)
- Proportional per-batch timeouts: divide 270s budget across all format
and typecheck batches so sequential runs in monorepos stay within the
Stop hook wall-clock limit (greptile P2)
- Raise Stop hook timeout from 120s to 300s to give large monorepos
adequate headroom (cubic P2)
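The proportional timeout could be computed along these lines. This is a sketch
only: perBatchTimeoutMs is a hypothetical name, and an even split across
batches is an assumption about how the 270s budget is divided.

```javascript
// 270s budget from the commit message, kept below the 300s Stop hook limit.
const TOTAL_BUDGET_MS = 270_000;

// Split the budget evenly across all sequential batches so that the sum of
// per-batch timeouts never exceeds the budget.
function perBatchTimeoutMs(formatBatches, typecheckBatches, budgetMs = TOTAL_BUDGET_MS) {
  const batches = formatBatches + typecheckBatches;
  if (batches <= 0) return budgetMs; // nothing to run; return the full budget
  return Math.floor(budgetMs / batches);
}
```

For example, two format batches plus one typecheck batch would each get 90s
under this scheme.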
* fix(hooks): extend accumulator to Write|MultiEdit, fix tests
- Extend matcher from Edit to Edit|Write|MultiEdit so files created with
Write and all files in a MultiEdit batch are included in the Stop-time
format+typecheck pass (cubic P1)
- Handle tool_input.edits[] array in accumulator for MultiEdit support
- Rename misleading 'concurrent writes' test to clarify it tests append
preservation, not true concurrency (cubic P2)
- Add Stop hook dedup test: writes duplicate paths to accumulator and
verifies the hook clears it cleanly (cubic P2)
- Add Write and MultiEdit accumulation tests
* fix(hooks): move timeout to command level, add dedup unit tests
- Move timeout: 300 from the matcher object to the hook command object
where it is actually enforced; the previous position was a no-op
(cubic P2)
- Extract parseAccumulator() and export it so tests can assert dedup
behavior directly without relying only on side effects (cubic P2)
- Add two unit tests for parseAccumulator: deduplication and blank-line
handling; rename the integration test to match its scope
* fix(hooks): replace removed format/typecheck hooks with accumulator in cursor adapter
* fix(hooks): collapse multi-line commands in bash audit logs
Add gsub("\\n"; " ") to jq filters in bash audit log and cost-tracker
hooks so multi-line commands produce single-line log entries, preventing
breakage in downstream line-based parsing.
Fixes #734
* fix: forward stdin to downstream hooks using echo pattern
Addresses review feedback: PostToolUse hooks now preserve stdin
for subsequent hooks by echoing $INPUT back to stdout after
processing. Changed ; to && for proper error propagation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: make stdin passthrough unconditional and broaden secret redaction
- Use semicolons instead of && so printf passthrough always runs
even if jq fails
- Add || true after jq to prevent non-zero exit on parse errors
- Use printf '%s\n' instead of echo for safe binary passthrough
- Fix Authorization pattern to handle 'Bearer <token>' with space
- Add ASIA (STS temp credentials) alongside AKIA redaction
- Add GitHub token patterns (ghp_, gho_, ghs_, github_pat_)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use [: ]* instead of s* for Authorization whitespace matching
jq's Oniguruma regex engine interprets s* as zero or more literal 's'
characters, not as \s* (whitespace). As a result, 'Authorization: Bearer <token>'
had only 'Authorization:' redacted and the actual token leaked.
Using [: ]* avoids the JSON/jq double-escape issue entirely and
correctly matches both 'Authorization: Bearer xyz' and
'Authorization:xyz' patterns.
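The failure mode can be reproduced outside jq. The sketch below uses JavaScript
regular expressions rather than the jq filter itself, and both patterns are
hypothetical reconstructions, since the real filter is not quoted in this
commit. The point it illustrates is the same: an unescaped s* matches the
letter 's', while the [: ]* character class needs no escaping at all.

```javascript
// Hypothetical patterns for illustration; not the actual jq filter.
// Broken: "s*" means zero or more literal "s", so matching stops at the space.
const broken = /Authorization:s*[^\s"]*/g;
// Fixed: "[: ]*" consumes any mix of colons and spaces before the credential.
const fixed = /Authorization[: ]*(?:Bearer )?[^\s"]+/g;

function redact(command, pattern) {
  return command.replace(pattern, "Authorization: [REDACTED]");
}

const cmd = 'curl -H "Authorization: Bearer abc123" https://example.com';
const leaked = redact(cmd, broken); // token is still present in the result
const safe = redact(cmd, fixed);    // token is replaced
```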
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements Anthropic's March 2026 harness design pattern — a multi-agent
architecture that separates generation from evaluation, creating an
adversarial feedback loop that produces production-quality applications.
Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference
Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps
Co-authored-by: Hao Chen <haochen806@gmail.com>
The script lives inside .kiro/, so SCRIPT_DIR already resolves to the .kiro directory. Appending /.kiro again produced an invalid path (.kiro/.kiro) causing the installer to find no source files to copy.
* fix: filter session-start injection by cwd/project to prevent cross-project contamination
The SessionStart hook previously selected the most recent session file
purely by timestamp, ignoring the current working directory. This caused
Claude to receive a previous project's session context when switching
between projects, leading to incorrect file reads and project analysis.
session-end.js already writes **Project:** and **Worktree:** header
fields into each session file. This commit adds selectMatchingSession()
which uses those fields with the following priority:
1. Exact worktree (cwd) match — most recent
2. Same project name match — most recent
3. Fallback to overall most recent (preserves backward compatibility)
No new dependencies. Gracefully falls back to original behavior when
no matching session exists.
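The priority order above can be sketched as follows. This is not the shipped
selectMatchingSession(): the session object fields (`worktree`, `project`), the
`projectName` parameter, and the matchReason strings are assumptions, and the
input is assumed to be sorted newest first.

```javascript
// Assumes `sessions` is sorted newest-first and that each entry already
// carries the parsed **Worktree:** and **Project:** header values
// (hypothetical field names `worktree` and `project`).
function selectMatchingSession(sessions, cwd, projectName) {
  if (!sessions || sessions.length === 0) return null;

  // 1. Exact worktree (cwd) match; find() returns the most recent one.
  const byWorktree = sessions.find((s) => s.worktree && s.worktree === cwd);
  if (byWorktree) return { session: byWorktree, matchReason: "worktree" };

  // 2. Same project name; again the most recent match wins.
  const byProject = sessions.find((s) => projectName && s.project === projectName);
  if (byProject) return { session: byProject, matchReason: "project" };

  // 3. Fall back to the overall most recent session (original behavior).
  return { session: sessions[0], matchReason: "recent-fallback" };
}
```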
* fix: address review feedback — eliminate duplicate I/O, add null guards, improve docstrings
- Return { session, content, matchReason } from selectMatchingSession()
to avoid reading the same file twice (coderabbitai, greptile P2)
- Add empty array guard: return null when sessions.length === 0 (coderabbitai)
- Stop mutating input objects — no more session._matchReason (coderabbitai)
- Add null check on result before accessing properties (coderabbitai)
- Only log "selected" after confirming content is readable (cubic-dev-ai P3)
- Add full JSDoc with @param/@returns (docstring coverage)
* fix: track fallback session object to prevent session/content mismatch
When sessions[0] is unreadable, fallbackContent came from a later
session (e.g. sessions[1]) while the returned session object still
pointed to sessions[0]. This caused misleading logs and injected
content from the wrong session — the exact problem this PR fixes.
Now tracks fallbackSession alongside fallbackContent so the returned
pair is always consistent.
Addresses greptile-apps P1 review feedback.
* fix: normalize worktree paths to handle symlinks and case differences
On macOS /var is a symlink to /private/var, and on Windows paths may
differ in casing (C:\repo vs c:\repo). Use fs.realpathSync() to
resolve both sides before comparison so worktree matching is reliable
across symlinked and case-insensitive filesystems.
cwd is normalized once outside the loop to avoid repeated syscalls.
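A rough sketch of the normalization, assuming helper names normalizePath and
sameWorktree that do not come from the commit. The fallback to path.resolve()
for missing paths and the lowercasing on Windows are assumptions about how the
case difference might be handled, not confirmed behavior.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Resolve symlinks so /var/... and /private/var/... compare equal on macOS.
function normalizePath(p) {
  let resolved;
  try {
    resolved = fs.realpathSync(p);
  } catch {
    // Assumption: a recorded worktree may no longer exist, so fall back to a
    // purely lexical resolution instead of throwing.
    resolved = path.resolve(p);
  }
  // Assumption: fold case on Windows so C:\repo and c:\repo compare equal.
  return process.platform === "win32" ? resolved.toLowerCase() : resolved;
}

function sameWorktree(a, b) {
  return normalizePath(a) === normalizePath(b);
}
```

In a matching loop, normalizePath(cwd) would be computed once before iterating
and compared against the normalized worktree of each session.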
Addresses coderabbitai Major review feedback.
---------
Co-authored-by: kuqili <kuqili@tencent.com>
* feat(commands): add santa-loop adversarial review command
Adds /santa-loop, a convergence loop command built on the santa-method
skill. Two independent reviewers (Claude Opus + external model) must
both return NICE before code ships. Supports Codex CLI (GPT-5.4),
Gemini CLI (3.1 Pro), or Claude-only fallback. Fixes are committed
per round and the loop repeats until convergence or escalation.
* fix: address all PR review findings for santa-loop command
- Add YAML frontmatter with description (coderabbit)
- Add Purpose, Usage, Output sections per CONTRIBUTING.md template (coderabbit)
- Fix literal <prompt> placeholder in Gemini CLI invocation (greptile P1)
- Use mktemp for unique temp file instead of fixed /tmp path (greptile P1, cubic P1)
- Use --sandbox read-only instead of --full-auto to prevent repo mutation (cubic P1)
- Use git push -u origin HEAD instead of bare git push (greptile P2, cubic P1)
- Clarify verdict protocol: reviewers return PASS/FAIL, gate maps to NICE/NAUGHTY (greptile P2, coderabbit)
- Specify parallel execution mechanism via Agent tool (coderabbit nitpick)
- Add escalation format for max-iterations case (coderabbit nitpick)
- Fix model IDs: gpt-5.4 for Codex, gemini-2.5-pro for Gemini
Inline `node -e "..."` in hooks.json contained `!` characters (e.g.
`!org.isDirectory()`) that bash history expansion in certain shell
environments would misinterpret, producing syntax errors and the
"SessionStart:startup hook error" banner in the Claude Code CLI header.
Extract the bootstrap logic to `scripts/hooks/session-start-bootstrap.js`
so the shell never sees the JS source. Behaviour is identical: the script
reads stdin, resolves the ECC plugin root via CLAUDE_PLUGIN_ROOT or a set
of well-known fallback paths, then delegates to run-with-flags.js.
Update the test that asserted the old inline pattern to verify the new
file-based approach instead.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>