mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-05-15 13:23:13 +08:00
docs: add evaluator CI failure scenario (#1826)
This commit is contained in:
@@ -56,9 +56,10 @@ As of 2026-05-12:
|
|||||||
`2.0.0-rc.1`.
|
`2.0.0-rc.1`.
|
||||||
- `docs/architecture/evaluator-rag-prototype.md` and
|
- `docs/architecture/evaluator-rag-prototype.md` and
|
||||||
`examples/evaluator-rag-prototype/` define the first read-only
|
`examples/evaluator-rag-prototype/` define the first read-only
|
||||||
self-improving harness prototype: scenario spec, trace, report, candidate
|
self-improving harness prototype: scenario specs, traces, reports,
|
||||||
playbook, verifier result, accepted maintainer-salvage candidate, and
|
candidate playbooks, verifier results, accepted maintainer-salvage,
|
||||||
rejected blind-translation candidate.
|
billing-readiness, and CI-failure-diagnosis candidates, plus rejected
|
||||||
|
unsafe candidates.
|
||||||
- The npm package surface now excludes Python bytecode/cache artifacts through
|
- The npm package surface now excludes Python bytecode/cache artifacts through
|
||||||
package `files` negation rules and a publish-surface regression test.
|
package `files` negation rules and a publish-surface regression test.
|
||||||
- `docs/legacy-artifact-inventory.md` records that no `_legacy-documents-*`
|
- `docs/legacy-artifact-inventory.md` records that no `_legacy-documents-*`
|
||||||
@@ -199,7 +200,7 @@ is not complete unless the evidence column exists and has been freshly verified.
|
|||||||
| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
|
| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
|
||||||
| ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
|
| ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
|
||||||
| GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
|
| GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
|
||||||
| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage and billing-readiness scenarios with trace, report, playbook, and verifier result artifacts | Needs broader evaluator corpus |
|
| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, and CI-failure-diagnosis scenarios with trace, report, playbook, and verifier result artifacts | Needs broader evaluator corpus |
|
||||||
| Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
|
| Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
|
||||||
| Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
|
| Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
|
||||||
| Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
|
| Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
|
||||||
|
|||||||
@@ -10,7 +10,9 @@ The fixture set lives in
|
|||||||
It started with the May 2026 stale-PR cleanup and salvage lane because that
|
It started with the May 2026 stale-PR cleanup and salvage lane because that
|
||||||
lane has real inputs, real accepted work, and real rejected work. The corpus now
|
lane has real inputs, real accepted work, and real rejected work. The corpus now
|
||||||
also includes a billing/Marketplace readiness scenario so launch copy cannot
|
also includes a billing/Marketplace readiness scenario so launch copy cannot
|
||||||
treat dry-run release evidence or roadmap intent as live billing state.
|
treat dry-run release evidence or roadmap intent as live billing state. A
|
||||||
|
CI-failure diagnosis scenario adds the log-first workflow needed before an
|
||||||
|
agent proposes fixes for red checks.
|
||||||
|
|
||||||
## Reference Pressure
|
## Reference Pressure
|
||||||
|
|
||||||
@@ -96,6 +98,8 @@ Current corpus:
|
|||||||
maintainer-owned branches with attribution and validation.
|
maintainer-owned branches with attribution and validation.
|
||||||
- `billing-marketplace-readiness`: verifies billing, App, and Marketplace
|
- `billing-marketplace-readiness`: verifies billing, App, and Marketplace
|
||||||
launch claims before public copy says they are live.
|
launch claims before public copy says they are live.
|
||||||
|
- `ci-failure-diagnosis`: requires failed-job logs, changed-file scope, and a
|
||||||
|
named regression command before a CI fix playbook can be promoted.
|
||||||
|
|
||||||
## ECC Tools Mapping
|
## ECC Tools Mapping
|
||||||
|
|
||||||
@@ -129,6 +133,5 @@ A candidate can be promoted only when:
|
|||||||
|
|
||||||
The next evaluator/RAG corpus should add:
|
The next evaluator/RAG corpus should add:
|
||||||
|
|
||||||
- a CI-failure diagnosis scenario with captured logs and a known fix;
|
|
||||||
- a harness-config quality scenario covering MCP/plugin/hook drift;
|
- a harness-config quality scenario covering MCP/plugin/hook drift;
|
||||||
- an AgentShield policy exception scenario with SARIF and report evidence.
|
- an AgentShield policy exception scenario with SARIF and report evidence.
|
||||||
|
|||||||
@@ -0,0 +1,46 @@
|
|||||||
|
# CI Failure Diagnosis Playbook
|
||||||
|
|
||||||
|
Candidate id: `log-backed-minimal-fix`
|
||||||
|
|
||||||
|
Use this playbook when a PR, maintainer branch, or release-readiness branch has
|
||||||
|
one or more red GitHub Actions checks.
|
||||||
|
|
||||||
|
## Accepted Path
|
||||||
|
|
||||||
|
1. Capture PR and branch context:
|
||||||
|
- `gh pr view <pr-number> --json files,statusCheckRollup,headRefName,baseRefName`
|
||||||
|
- `gh run view <run-id> --json jobs`
|
||||||
|
2. Fetch the failed log evidence:
|
||||||
|
- `gh run view <run-id> --log-failed`
|
||||||
|
3. Record the failing job, step, OS, Node/Python/Rust version, package manager,
|
||||||
|
and shortest useful error excerpt.
|
||||||
|
4. Compare the failing step to the PR changed files.
|
||||||
|
5. Search current docs, tests, and prior PRs for a known matching failure mode.
|
||||||
|
6. Promote the smallest fix path only when it includes a local reproduction or
|
||||||
|
regression command.
|
||||||
|
7. After a separate implementation branch exists, rerun the focused local gate,
|
||||||
|
then wait for the full GitHub Actions matrix before merge.
|
||||||
|
|
||||||
|
## Rejected Path
|
||||||
|
|
||||||
|
Do not keep rerunning CI until a transient green result appears without
|
||||||
|
recording the original failure and why it is safe to ignore.
|
||||||
|
|
||||||
|
Do not weaken tests, skip matrix legs, or broaden the patch to unrelated files
|
||||||
|
just to make the check pass.
|
||||||
|
|
||||||
|
Do not claim release readiness from a branch with required checks still red.
|
||||||
|
|
||||||
|
## Minimum Validation
|
||||||
|
|
||||||
|
- `gh run view <run-id> --log-failed`
|
||||||
|
- Focused local command matching the failing surface, such as:
|
||||||
|
- `node tests/<matching-test>.js`
|
||||||
|
- `npm run harness:audit -- --format json`
|
||||||
|
- `npm run observability:ready`
|
||||||
|
- `cargo test`
|
||||||
|
- `git diff --check`
|
||||||
|
- Full required GitHub Actions matrix before merge
|
||||||
|
|
||||||
|
Record the failed-log excerpt and the chosen regression command in the
|
||||||
|
maintainer PR body or handoff before merging the fix.
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
{
|
||||||
|
"schema_version": "ecc.evaluator-rag.report.v1",
|
||||||
|
"scenario_id": "ci-failure-diagnosis",
|
||||||
|
"run_id": "2026-05-12-ci-failure-diagnosis-prototype",
|
||||||
|
"result": "prototype_passed",
|
||||||
|
"read_only": true,
|
||||||
|
"scores": {
|
||||||
|
"failure_evidence": 0.92,
|
||||||
|
"scope_control": 0.9,
|
||||||
|
"regression_specificity": 0.86,
|
||||||
|
"matrix_safety": 1,
|
||||||
|
"publication_safety": 1
|
||||||
|
},
|
||||||
|
"findings": [
|
||||||
|
{
|
||||||
|
"id": "log-first-required",
|
||||||
|
"severity": "warning",
|
||||||
|
"summary": "A CI fix candidate must start from the exact failed job, step, platform, runtime, package manager, and log excerpt rather than from a generic rerun."
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "changed-file-scope-needed",
|
||||||
|
"severity": "info",
|
||||||
|
"summary": "Changed-file context should narrow the fix to the surface that can affect the failing step, especially in a broad OS/runtime matrix."
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "regression-gate-needed",
|
||||||
|
"severity": "warning",
|
||||||
|
"summary": "A promoted fix playbook must name a local reproduction or regression command before the branch is allowed to merge."
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"recommended_next_action": {
|
||||||
|
"candidate_id": "log-backed-minimal-fix",
|
||||||
|
"action": "Use the promoted CI failure diagnosis playbook whenever a PR check goes red before implementing or rerunning fixes."
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,57 @@
|
|||||||
|
{
|
||||||
|
"schema_version": "ecc.evaluator-rag.scenario.v1",
|
||||||
|
"scenario_id": "ci-failure-diagnosis",
|
||||||
|
"title": "Diagnose CI failures from captured logs before proposing fixes",
|
||||||
|
"mode": "read_only_prototype",
|
||||||
|
"objective": "Given a failed CI run on a PR or maintainer branch, retrieve the exact failing job, captured log excerpt, changed-file context, and prior known-fix evidence before promoting a fix playbook.",
|
||||||
|
"sources": [
|
||||||
|
{
|
||||||
|
"kind": "repo_doc",
|
||||||
|
"path": "docs/ECC-2.0-GA-ROADMAP.md",
|
||||||
|
"purpose": "Records ECC-Tools CI failure-mode predictive follow-ups and the evaluator/RAG corpus expansion need"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"kind": "repo_doc",
|
||||||
|
"path": "docs/architecture/evaluator-rag-prototype.md",
|
||||||
|
"purpose": "Defines the artifact contract and promotion rules for evaluator/RAG scenarios"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"kind": "github_actions",
|
||||||
|
"command": "gh run view <run-id> --log-failed",
|
||||||
|
"purpose": "Primary evidence for the failing job, failing step, and deterministic error text"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"kind": "github_pr",
|
||||||
|
"command": "gh pr view <pr-number> --json files,statusCheckRollup,headRefName,baseRefName",
|
||||||
|
"purpose": "Changed-file and check-rollup context for scoping the fix"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"kind": "repo_test",
|
||||||
|
"command": "node tests/run-all.js",
|
||||||
|
"purpose": "Local regression gate after a candidate fix is implemented outside the read-only evaluator run"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"retrieval_questions": [
|
||||||
|
"Which job, step, platform, runtime, and package manager failed?",
|
||||||
|
"What is the smallest failing log excerpt that explains the failure?",
|
||||||
|
"Which changed files are plausibly connected to the failing step?",
|
||||||
|
"Is there a prior known-fix, troubleshooting note, or fixture that matches this failure mode?",
|
||||||
|
"Which local command reproduces or guards the failure before a fix can merge?"
|
||||||
|
],
|
||||||
|
"forbidden_actions": [
|
||||||
|
"rerunning CI until it passes without diagnosing the failure",
|
||||||
|
"pushing speculative fixes without a captured failing log excerpt",
|
||||||
|
"editing unrelated files to make the matrix green",
|
||||||
|
"weakening or deleting tests to silence a failure",
|
||||||
|
"merging or publishing while required checks are red",
|
||||||
|
"creating release tags or posting announcements from this evaluator run"
|
||||||
|
],
|
||||||
|
"acceptance_gates": [
|
||||||
|
"failing job and step are named",
|
||||||
|
"captured log excerpt is linked or summarized",
|
||||||
|
"changed-file context is compared to the failing step",
|
||||||
|
"known-fix or no-known-fix status is recorded",
|
||||||
|
"local reproduction or regression command is named",
|
||||||
|
"at least one rerun-only candidate is rejected"
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -0,0 +1,45 @@
|
|||||||
|
{
|
||||||
|
"schema_version": "ecc.evaluator-rag.trace.v1",
|
||||||
|
"scenario_id": "ci-failure-diagnosis",
|
||||||
|
"run_id": "2026-05-12-ci-failure-diagnosis-prototype",
|
||||||
|
"read_only": true,
|
||||||
|
"events": [
|
||||||
|
{
|
||||||
|
"phase": "observation",
|
||||||
|
"summary": "A PR or maintainer branch has a red GitHub Actions matrix. The evaluator records status without rerunning, merging, or editing code.",
|
||||||
|
"evidence": [
|
||||||
|
"gh pr view <pr-number> --json statusCheckRollup,files",
|
||||||
|
"gh run view <run-id> --json jobs"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"phase": "retrieval",
|
||||||
|
"summary": "Retrieved failed-job logs, changed-file context, current roadmap CI failure-mode requirements, and existing local regression commands.",
|
||||||
|
"evidence": [
|
||||||
|
"gh run view <run-id> --log-failed",
|
||||||
|
"docs/ECC-2.0-GA-ROADMAP.md",
|
||||||
|
"tests/run-all.js"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"phase": "proposal",
|
||||||
|
"summary": "Generated two candidate playbooks: log-backed minimal fix with regression coverage, and rerun-only optimism that treats CI flake as proven without evidence.",
|
||||||
|
"candidate_ids": [
|
||||||
|
"log-backed-minimal-fix",
|
||||||
|
"rerun-only-green-wait"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"phase": "verification",
|
||||||
|
"summary": "Accepted the log-backed minimal fix because it names failing evidence, scope, and validation. Rejected rerun-only waiting because it does not explain the failure or preserve a regression guard.",
|
||||||
|
"evidence": [
|
||||||
|
"examples/evaluator-rag-prototype/ci-failure-diagnosis/verifier-result.json"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"phase": "promotion",
|
||||||
|
"summary": "Promoted only the read-only CI triage playbook. The evaluator does not push a fix, rerun CI, merge, publish, or weaken checks.",
|
||||||
|
"promoted_candidate_id": "log-backed-minimal-fix"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
{
|
||||||
|
"schema_version": "ecc.evaluator-rag.verifier.v1",
|
||||||
|
"scenario_id": "ci-failure-diagnosis",
|
||||||
|
"run_id": "2026-05-12-ci-failure-diagnosis-prototype",
|
||||||
|
"read_only": true,
|
||||||
|
"candidates": [
|
||||||
|
{
|
||||||
|
"candidate_id": "log-backed-minimal-fix",
|
||||||
|
"decision": "accepted",
|
||||||
|
"score": 0.93,
|
||||||
|
"reasons": [
|
||||||
|
"requires failed job, step, platform, runtime, and log evidence",
|
||||||
|
"compares changed files to the failing surface before proposing a fix",
|
||||||
|
"names a focused local reproduction or regression command",
|
||||||
|
"keeps required checks intact",
|
||||||
|
"keeps merge, release, package, plugin, billing, and announcement actions approval-gated"
|
||||||
|
],
|
||||||
|
"rollback": "Revert the future implementation PR or restore the original failing test fixture; no code is changed by this read-only playbook."
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"candidate_id": "rerun-only-green-wait",
|
||||||
|
"decision": "rejected",
|
||||||
|
"score": 0.17,
|
||||||
|
"reasons": [
|
||||||
|
"does not preserve the failing log excerpt",
|
||||||
|
"does not identify job, step, platform, runtime, or package manager",
|
||||||
|
"does not compare failure surface to changed files",
|
||||||
|
"does not add or name a regression gate",
|
||||||
|
"risks merging a flaky or still-unexplained CI failure"
|
||||||
|
],
|
||||||
|
"rollback": "Do not treat this as a fix; restart diagnosis from captured failed logs and changed-file context."
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"promoted_candidate_id": "log-backed-minimal-fix"
|
||||||
|
}
|
||||||
@@ -175,6 +175,52 @@ test('billing readiness scenario rejects launch copy overclaims', () => {
|
|||||||
assert.ok(playbook.includes('https://github.com/marketplace/ecc-tools'));
|
assert.ok(playbook.includes('https://github.com/marketplace/ecc-tools'));
|
||||||
});
|
});
|
||||||
|
|
||||||
|
test('ci failure diagnosis scenario rejects rerun-only fixes', () => {
|
||||||
|
const scenario = readFixtureJson('ci-failure-diagnosis/scenario.json');
|
||||||
|
const trace = readFixtureJson('ci-failure-diagnosis/trace.json');
|
||||||
|
const report = readFixtureJson('ci-failure-diagnosis/report.json');
|
||||||
|
const verifier = readFixtureJson('ci-failure-diagnosis/verifier-result.json');
|
||||||
|
const playbook = read('examples/evaluator-rag-prototype/ci-failure-diagnosis/candidate-playbook.md');
|
||||||
|
|
||||||
|
assert.strictEqual(scenario.scenario_id, 'ci-failure-diagnosis');
|
||||||
|
assert.strictEqual(trace.scenario_id, scenario.scenario_id);
|
||||||
|
assert.strictEqual(report.scenario_id, scenario.scenario_id);
|
||||||
|
assert.strictEqual(verifier.scenario_id, scenario.scenario_id);
|
||||||
|
assert.strictEqual(trace.read_only, true);
|
||||||
|
assert.strictEqual(report.read_only, true);
|
||||||
|
assert.strictEqual(verifier.read_only, true);
|
||||||
|
|
||||||
|
for (const blocked of [
|
||||||
|
'rerunning CI until it passes without diagnosing the failure',
|
||||||
|
'pushing speculative fixes without a captured failing log excerpt',
|
||||||
|
'weakening or deleting tests to silence a failure',
|
||||||
|
'merging or publishing while required checks are red'
|
||||||
|
]) {
|
||||||
|
assert.ok(scenario.forbidden_actions.includes(blocked), `Missing CI forbidden action: ${blocked}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const required of [
|
||||||
|
'failing job and step are named',
|
||||||
|
'captured log excerpt is linked or summarized',
|
||||||
|
'changed-file context is compared to the failing step',
|
||||||
|
'local reproduction or regression command is named'
|
||||||
|
]) {
|
||||||
|
assert.ok(scenario.acceptance_gates.includes(required), `Missing CI acceptance gate: ${required}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
const accepted = verifier.candidates.find(candidate => candidate.candidate_id === 'log-backed-minimal-fix');
|
||||||
|
const rejected = verifier.candidates.find(candidate => candidate.candidate_id === 'rerun-only-green-wait');
|
||||||
|
|
||||||
|
assert.ok(accepted, 'Missing accepted log-backed CI candidate');
|
||||||
|
assert.ok(rejected, 'Missing rejected rerun-only CI candidate');
|
||||||
|
assert.strictEqual(accepted.decision, 'accepted');
|
||||||
|
assert.strictEqual(rejected.decision, 'rejected');
|
||||||
|
assert.strictEqual(verifier.promoted_candidate_id, accepted.candidate_id);
|
||||||
|
assert.ok(rejected.reasons.join('\n').includes('failing log excerpt'));
|
||||||
|
assert.ok(playbook.includes('gh run view <run-id> --log-failed'));
|
||||||
|
assert.ok(playbook.includes('Full required GitHub Actions matrix before merge'));
|
||||||
|
});
|
||||||
|
|
||||||
if (failed > 0) {
|
if (failed > 0) {
|
||||||
console.log(`\nFailed: ${failed}`);
|
console.log(`\nFailed: ${failed}`);
|
||||||
process.exit(1);
|
process.exit(1);
|
||||||
|
|||||||
Reference in New Issue
Block a user