From 1031d312ccd16925c903174293b4cec4ecba1001 Mon Sep 17 00:00:00 2001 From: JongHyeok Park Date: Tue, 30 Jun 2026 07:50:41 +0900 Subject: [PATCH] feat(workflows): add orch-review native Workflow pilot (#2363) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * feat(workflows): add orch-review native Workflow pilot Port orch-pipeline Phase 5 (Review) to a native Claude Code Workflow script. The gated outer loop stays in the main conversation; this script owns only the autonomous review+verify segment between the two human gates: 1. Review — reviewers fan out in parallel: ecc:code-reviewer always, ecc:-reviewer when args.language maps, ecc:security-reviewer when the orch-pipeline security trigger matches the diff/paths. 2. Dedup — merge findings across dimensions keyed on the normalized evidence snippet, since independent reviewers flag the same line. 3. Verify — each unique CRITICAL/HIGH finding goes to an independent adversarial verifier; MEDIUM/LOW pass through as advisory. The Review->Verify barrier is deliberate: deduping before verification stops the verifier running N times on the same bug (local testing: 11 raw findings collapsed to 4 unique, ~halving verifier cost). Existing ECC reviewer subagents are reused via agentType; reviewer output is validated by JSON schema. args is accepted as an object or a JSON-encoded string. - workflows/orch-review.workflow.js — the workflow script - workflows/README.md — invocation contract, returns shape, follow-ups CI lint is scoped to scripts/ and tests/, so the script (validated with node --check) and the README (passes markdownlint) are untouched. * fix(workflows): fail closed on invalid args and lost review dimensions Addresses the two safety findings from the PR bot review: 1. Lost review dimension (Greptile P1 / CodeRabbit Major): a reviewer agent that returns null or rejects was silently dropped by filter(Boolean), so an unreviewed security dimension could still return APPROVE. Each dimension's outcome is now captured; failures land in failedDimensions and force CHANGES_REQUESTED (incomplete). 2. Invalid args (CodeRabbit Major): an empty diff returned APPROVE and bad JSON / non-array changedFiles threw inconsistently. Input is now validated up front and rejected with a clear error — the gate fails closed instead of approving an unreviewed payload. Docs (header contract + README) updated for the new return fields (incomplete, failedDimensions, stats.failed). Remaining bot nits (evidence minLength, verify-label collision, verified->confirmed rename, contract drift) deferred as follow-ups. * fix(workflows): address remaining orch-review review nits Follow-up to the bot review (deferred items from the safety pass): - evidence: require minLength 1 in the schema, and fall back to a title+line dedup key when evidence is empty, so empty-evidence findings in one file no longer collapse onto a single key and drop (CodeRabbit). - verify label: include a slice of the normalized evidence so two CRITICAL/HIGH findings from the same file get distinct labels and do not alias under resumability (Greptile). - stats.verified -> stats.confirmed to match the "confirmed" wording used in the log and avoid ambiguity vs the refuted count (Greptile); header contract and README updated to match. Verified by running the workflow on a synthetic vulnerable diff: dedup 12 raw -> 5 unique, stats.confirmed populated, fail-closed fields (incomplete/failedDimensions) intact. * fix(workflows): harden verify stage and diff-only verification Addresses the second-round bot review: - Verify stage now has the same failure guard as the review stage: a rejected verifier no longer nulls out its slot (which crashed the later filter). A null return is treated as unconfirmed; a rejection keeps the finding as blocking (fail closed) so an unverifiable CRITICAL is never silently demoted to advisory (CodeRabbit @221). - verifyPrompt now instructs the skeptic to judge solely from the provided diff text and not to refute merely because the referenced file is absent from the working tree (the diff may be an unapplied PR). Fixes the false-refute seen when testing on a synthetic diff. CodeRabbit @81 (evidence minLength) was already addressed in the prior commit; this is a stale re-post on the unresolved thread. * fix(workflows): keep unverifiable blockers blocking; stop leaking error text Second-round bot review (CodeRabbit): - @218 Treat a null/failed verifier as `unverified`, not refuted. A terminal verifier failure or skip no longer demotes a CRITICAL/HIGH to advisory; it stays in `blocking` tagged "could not be verified" (fail closed). Only a genuine isReal=false verdict is refuted. Adds stats.unverified. - @189 Do not return raw subagent error text. Review/verify failures now log the raw message for operators and return only a bounded label (failedDimensions[].error = "review agent failed"). Stale re-posts this round (@81 evidence minLength, @224 verify guard) were already fixed in prior commits. * docs(workflows): enumerate bounded failedDimensions.error labels CodeRabbit (trivial): the public contract implied callers get human-readable error text, but the implementation returns only bounded labels. Enumerate them in the README returns block. --- workflows/README.md | 59 +++++++ workflows/orch-review.workflow.js | 254 ++++++++++++++++++++++++++++++ 2 files changed, 313 insertions(+) create mode 100644 workflows/README.md create mode 100644 workflows/orch-review.workflow.js diff --git a/workflows/README.md b/workflows/README.md new file mode 100644 index 000000000..b67ed2e0f --- /dev/null +++ b/workflows/README.md @@ -0,0 +1,59 @@ +# ECC native workflows (pilot) + +Scripts in this directory are [Claude Code **Workflow** tool](https://docs.claude.com/en/docs/claude-code) scripts — deterministic, multi-agent orchestration that runs in the background and fans out to subagents. + +This is a **pilot**: ECC's orchestration (`orch-*`, `multi-*`, GAN/Santa loops) is currently hand-rolled on top of the `Task`/Agent tool. These scripts port the autonomous, fan-out-heavy segments to the native engine, which gives us barrier-free pipelining, automatic concurrency capping, structured-output validation, and resumability for free. + +## `orch-review.workflow.js` + +A native port of **orch-pipeline Phase 5 (Review)**. + +The gated outer loop (Gate 1 after Plan, Gate 2 before Commit) **stays in the main conversation** — native workflows run autonomously in the background and cannot pause for interactive approval. This script owns only the segment *between* the gates: + +1. **Review** — one reviewer agent per dimension, in parallel: + - `ecc:code-reviewer` (correctness & quality) — always + - the matching `ecc:-reviewer` — when `args.language` maps to one + - `ecc:security-reviewer` — only when the orch-pipeline security trigger matches the diff/paths +2. **Dedup** — independent reviewers routinely flag the same line, so findings are merged across dimensions keyed on the normalized `evidence` snippet (titles and line numbers drift per reviewer; the offending code does not). Each surviving finding records which `dimensions` reported it and keeps the strictest severity. +3. **Verify** — every *unique* `CRITICAL`/`HIGH` finding is handed to an independent adversarial verifier that defaults to *refuted* on uncertainty. `MEDIUM`/`LOW` pass through as advisory. + +The Review→Verify barrier is deliberate: deduping before verification is exactly the case the Workflow guidance calls a justified barrier — it stops the verifier running N times on the same bug (in local testing, 11 raw findings collapsed to 4 unique, roughly halving verifier cost). + +### Invocation + +The main loop computes the diff, then calls the Workflow tool: + +```jsonc +Workflow({ + scriptPath: "workflows/orch-review.workflow.js", + args: { + diff: "", // required + language: "typescript", // optional — selects a language reviewer + changedFiles: ["src/auth.ts"] // optional — feeds the security trigger + } +}) +``` + +Invalid input throws (the gate **fails closed**): a missing/empty `diff`, malformed JSON, or a non-array `changedFiles` is rejected with a clear error rather than silently approving an unreviewed payload. + +### Returns + +```jsonc +{ + "verdict": "APPROVE" | "CHANGES_REQUESTED", // CHANGES_REQUESTED if any blocker OR a dimension failed + "incomplete": false, // true when one or more review dimensions failed to run + "failedDimensions": [ /* { dimension, error } — error is a bounded label, never raw subagent text: + "agent returned null (terminal failure or skip)" | "review agent failed" */ ], + "blocking": [ /* confirmed CRITICAL/HIGH + unverifiable ones — must clear before Gate 2 */ ], + "advisory": [ /* MEDIUM/LOW + adversarially-refuted findings */ ], + "stats": { "dimensions": 3, "failed": 0, "raw": 11, "unique": 4, "confirmed": 4, "unverified": 0, "refuted": 0 } +} +``` + +The main loop presents `blocking` at Gate 2; the human still approves the commit. The gate fails closed at every stage: if a reviewer dies the dimension is recorded in `failedDimensions` (verdict never a clean `APPROVE`), and if a *verifier* dies or returns null the blocker is kept in `blocking` (tagged "could not be verified") rather than demoted to advisory — an unreviewed security dimension or an unverifiable CRITICAL must not pass as approved. + +## Not in this PR (follow-ups) + +- A `/orch-review` command + skill trigger (plus the mirrored i18n docs and surface tests ECC requires for a new command surface). +- Installer / manifest wiring so the script ships to `~/.claude/` on install. +- Porting the **Research** sweep and **Plan** judge-panel segments next. diff --git a/workflows/orch-review.workflow.js b/workflows/orch-review.workflow.js new file mode 100644 index 000000000..209f834ec --- /dev/null +++ b/workflows/orch-review.workflow.js @@ -0,0 +1,254 @@ +export const meta = { + name: 'orch-review', + description: + 'ECC Review phase as a native Claude Code workflow: multi-dimension review (quality + language + conditional security) then adversarial verification of every CRITICAL/HIGH finding. Returns blocking + advisory findings for Gate 2.', + phases: [ + { title: 'Review', detail: 'one reviewer agent per dimension, in parallel' }, + { title: 'Verify', detail: 'adversarially refute each CRITICAL/HIGH finding' } + ] +}; + +// --------------------------------------------------------------------------- +// Pilot port of orch-pipeline Phase 5 (Review). The gated outer loop stays in +// the main conversation; this script owns only the autonomous, fan-out-heavy +// review+verify segment between the two human gates. +// +// Caller contract — pass `args` (the main loop computes the diff and language): +// { +// diff: string, // unified `git diff` text to review (required) +// language?: string, // e.g. "typescript" — selects a language reviewer +// changedFiles?: string[], // paths touched, used for the security trigger +// } +// Invalid input (missing/empty diff, bad JSON, non-array changedFiles) throws — +// the gate fails closed rather than silently approving an unreviewed payload. +// +// Returns: +// { verdict: 'APPROVE' | 'CHANGES_REQUESTED', // CHANGES_REQUESTED if any blocker OR a dimension failed +// incomplete: boolean, // true when one or more review dimensions failed to run +// failedDimensions: { dimension, error }[], +// blocking: Finding[], // confirmed CRITICAL/HIGH + unverifiable ones — must clear before Gate 2 +// advisory: Finding[], // MEDIUM/LOW + refuted findings, informational +// stats: { dimensions, failed, raw, unique, confirmed, unverified, refuted } } +// --------------------------------------------------------------------------- + +// Language → ECC reviewer agent. Mirrors the agents present in agents/. +const LANGUAGE_REVIEWER = { + typescript: 'ecc:typescript-reviewer', + javascript: 'ecc:typescript-reviewer', + python: 'ecc:python-reviewer', + go: 'ecc:go-reviewer', + rust: 'ecc:rust-reviewer', + java: 'ecc:java-reviewer', + kotlin: 'ecc:kotlin-reviewer', + swift: 'ecc:swift-reviewer', + php: 'ecc:php-reviewer', + csharp: 'ecc:csharp-reviewer', + fsharp: 'ecc:fsharp-reviewer', + react: 'ecc:react-reviewer', + vue: 'ecc:vue-reviewer', + flutter: 'ecc:flutter-reviewer', + dart: 'ecc:flutter-reviewer', + django: 'ecc:django-reviewer', + fastapi: 'ecc:fastapi-reviewer', + cpp: 'ecc:cpp-reviewer' +}; + +// orch-pipeline security trigger: auth/authz, user input, db queries, fs paths, +// external calls, crypto, secrets. Matched against the diff text + file paths. +const SECURITY_TRIGGER = + /\b(auth|login|password|passwd|token|secret|credential|api[_-]?key|session|jwt|oauth|cookie|sql|query|exec|eval|crypto|cipher|hash|hmac|sign|fs\.|readFile|writeFile|fetch|axios|request|subprocess|os\.system)\b/i; + +// A reviewer agent must emit findings in this shape — validated at the tool layer. +const FINDINGS_SCHEMA = { + type: 'object', + additionalProperties: false, + required: ['verdict', 'findings'], + properties: { + verdict: { type: 'string', enum: ['APPROVE', 'CHANGES_REQUESTED'] }, + findings: { + type: 'array', + items: { + type: 'object', + additionalProperties: false, + required: ['title', 'severity', 'file', 'evidence'], + properties: { + title: { type: 'string' }, + severity: { type: 'string', enum: ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW'] }, + file: { type: 'string' }, + line: { type: ['integer', 'null'] }, + evidence: { type: 'string', minLength: 1, description: 'the offending snippet or exact location' }, + proof: { type: 'string', description: 'why it is a real problem (required for HIGH/CRITICAL)' }, + fix: { type: 'string', description: 'concrete suggested remediation' } + } + } + } + } +}; + +// Independent skeptic verdict for one finding. +const VERDICT_SCHEMA = { + type: 'object', + additionalProperties: false, + required: ['isReal', 'confidence', 'reasoning'], + properties: { + isReal: { type: 'boolean', description: 'true only if the finding genuinely holds against the diff' }, + confidence: { type: 'number', minimum: 0, maximum: 1 }, + reasoning: { type: 'string' } + } +}; + +const SEVERITY_RANK = { LOW: 0, MEDIUM: 1, HIGH: 2, CRITICAL: 3 }; +const isBlocking = f => f.severity === 'CRITICAL' || f.severity === 'HIGH'; +const normalize = s => (s || '').replace(/\s+/g, ' ').trim().toLowerCase(); + +function reviewPrompt(dimensionLabel, diff) { + return [ + `You are reviewing a unified diff along the "${dimensionLabel}" dimension.`, + 'Apply your standard checklist. Only report issues you are >80% sure are real problems.', + 'For any CRITICAL or HIGH finding you MUST supply concrete `evidence` and a `proof` of impact; if you cannot, demote it or drop it.', + 'Returning zero findings with verdict APPROVE is an acceptable and expected outcome for clean diffs.', + '', + 'DIFF:', + diff + ].join('\n'); +} + +function verifyPrompt(finding, diff) { + return [ + 'You are an independent skeptic. Try to REFUTE the finding below by checking it against the diff text provided here — and ONLY that text.', + 'The diff may be unapplied (a proposed PR), so the referenced file may not exist on disk yet. Do NOT refute a finding merely because the file is absent from the working tree; judge solely from the diff content.', + 'Default to isReal=false when you are uncertain or cannot locate supporting evidence in the diff text.', + '', + `Finding (${finding.severity}) in ${finding.file}: ${finding.title}`, + `Claimed evidence: ${finding.evidence}`, + finding.proof ? `Claimed proof: ${finding.proof}` : '', + '', + 'DIFF:', + diff + ].join('\n'); +} + +// --- main ----------------------------------------------------------------- + +// `args` arrives verbatim. Accept a JSON-encoded string too, so the workflow +// works whether the caller passes an object or a stringified payload. +// Fail CLOSED on invalid input: a review gate must never silently APPROVE a +// payload it could not actually review. +let input; +try { + input = typeof args === 'string' ? JSON.parse(args) : (args ?? {}); +} catch { + throw new Error('orch-review: args must be an object or valid JSON'); +} +if (typeof input !== 'object' || input === null) { + throw new Error('orch-review: args must be an object'); +} +if (typeof input.diff !== 'string' || input.diff.trim() === '') { + throw new Error('orch-review: args.diff must be a non-empty unified diff'); +} +if (input.changedFiles != null && !Array.isArray(input.changedFiles)) { + throw new Error('orch-review: args.changedFiles must be an array of paths'); +} + +const diff = input.diff; +const haystack = `${diff}\n${(input.changedFiles || []).join('\n')}`; + +// Build the review dimensions. Quality always runs; language + security are conditional. +const dimensions = [{ key: 'quality', label: 'correctness & quality', agentType: 'ecc:code-reviewer' }]; + +const langReviewer = input.language && LANGUAGE_REVIEWER[String(input.language).toLowerCase()]; +if (langReviewer) { + dimensions.push({ key: `lang:${input.language}`, label: `${input.language} idioms & pitfalls`, agentType: langReviewer }); +} + +if (SECURITY_TRIGGER.test(haystack)) { + dimensions.push({ key: 'security', label: 'security (OWASP, secrets, injection)', agentType: 'ecc:security-reviewer' }); + log('Security trigger matched — adding security-reviewer dimension.'); +} + +log(`Reviewing across ${dimensions.length} dimension(s): ${dimensions.map(d => d.key).join(', ')}`); + +// Stage 1 — every dimension reviews in parallel. This is a deliberate BARRIER: +// independent reviewers routinely flag the same line, so we need the full set +// before we can dedup. Verifying first and deduping later would waste verifier +// calls on duplicates (e.g. one SQL-injection bug reported by all 3 dimensions). +// A reviewer can fail two ways: agent() returns null on a terminal error/skip, +// or the thunk rejects. Capture both per-dimension so a lost dimension is never +// silently dropped — an unreviewed security dimension must not pass as APPROVE. +const reviews = await parallel( + dimensions.map( + d => () => + agent(reviewPrompt(d.label, diff), { agentType: d.agentType, phase: 'Review', label: `review:${d.key}`, schema: FINDINGS_SCHEMA }) + .then(r => (r === null ? { dim: d.key, ok: false, error: 'agent returned null (terminal failure or skip)', findings: [] } : { dim: d.key, ok: true, findings: r.findings || [] })) + // Log the raw error for operators; never return provider/runtime internals to the caller. + .catch(err => { + log(`Review dimension ${d.key} failed: ${String((err && err.message) || err)}`); + return { dim: d.key, ok: false, error: 'review agent failed', findings: [] }; + }) + ) +); + +const failedDimensions = reviews.filter(r => r && !r.ok).map(r => ({ dimension: r.dim, error: r.error })); +if (failedDimensions.length > 0) { + log(`WARNING: ${failedDimensions.length} review dimension(s) failed: ${failedDimensions.map(f => f.dimension).join(', ')}. Verdict will fail closed.`); +} + +// Dedup across dimensions. The evidence snippet (the offending code) is the most +// stable key — titles are phrased differently and line numbers drift per reviewer. +const tagged = reviews.filter(r => r && r.ok).flatMap(r => r.findings.map(f => ({ ...f, dimension: r.dim }))); +const byKey = new Map(); +for (const f of tagged) { + // Prefer the evidence snippet; fall back to title+line so empty-evidence + // findings in the same file don't all collapse onto one `${file}::` key. + const evidenceKey = normalize(f.evidence); + const key = evidenceKey ? `${f.file}::${evidenceKey}` : `${f.file}::${normalize(f.title)}::${f.line ?? 'na'}`; + const prev = byKey.get(key); + if (!prev) { + byKey.set(key, { ...f, dimensions: [f.dimension] }); + } else { + if (!prev.dimensions.includes(f.dimension)) prev.dimensions.push(f.dimension); + if (SEVERITY_RANK[f.severity] > SEVERITY_RANK[prev.severity]) prev.severity = f.severity; // keep the strictest + } +} +const unique = [...byKey.values()]; +log(`Reviews returned ${tagged.length} findings → ${unique.length} unique after dedup.`); + +// Stage 2 — adversarially verify each unique CRITICAL/HIGH. MEDIUM/LOW are advisory. +const advisory = unique.filter(f => !isBlocking(f)); +const verified = await parallel( + unique.filter(isBlocking).map( + f => () => + agent(verifyPrompt(f, diff), { phase: 'Verify', label: `verify:${f.file}:${normalize(f.evidence).slice(0, 40)}`, schema: VERDICT_SCHEMA }) + // A null return (terminal failure/skip) or a rejection means we could NOT + // verify the finding. Mark it `unverified` rather than refuted so it stays + // blocking (fail closed) — an unverifiable CRITICAL must never be demoted + // to advisory just because the verifier did not run. + .then(v => (v ? { ...f, verdict: v } : { ...f, unverified: true, verdict: { isReal: false, confidence: 0, reasoning: 'verifier returned null (terminal failure or skip)' } })) + .catch(err => { + log(`Verifier failed for ${f.file}: ${String((err && err.message) || err)}`); + return { ...f, unverified: true, verdict: { isReal: false, confidence: 0, reasoning: 'verifier error' } }; + }) + ) +); + +const verifiedClean = verified.filter(Boolean); +const confirmed = verifiedClean.filter(f => !f.unverified && f.verdict && f.verdict.isReal); +const unverified = verifiedClean.filter(f => f.unverified); +const refuted = verifiedClean.filter(f => !f.unverified && !(f.verdict && f.verdict.isReal)); + +// Unverifiable blockers stay in `blocking` (fail closed), tagged so the human +// at Gate 2 knows they were not independently confirmed. +const blocking = [...confirmed, ...unverified.map(f => ({ ...f, note: 'could not be verified — kept as blocking' }))]; + +log(`Done: ${confirmed.length} confirmed, ${unverified.length} unverified (kept blocking), ${refuted.length} refuted, ${advisory.length} advisory.`); + +// Fail closed: APPROVE only when every dimension ran AND nothing blocks. +const incomplete = failedDimensions.length > 0; +return { + verdict: blocking.length > 0 || incomplete ? 'CHANGES_REQUESTED' : 'APPROVE', + incomplete, + failedDimensions, + blocking, + advisory: [...advisory, ...refuted.map(f => ({ ...f, note: 'refuted by adversarial verifier' }))], + stats: { dimensions: dimensions.length, failed: failedDimensions.length, raw: tagged.length, unique: unique.length, confirmed: confirmed.length, unverified: unverified.length, refuted: refuted.length } +};