Commit Graph

1934 Commits

Author SHA1 Message Date
Jamkris bcbd2acee2 fix(lib): retry rename on Windows EPERM/EACCES/EBUSY in writeBridgeAtomic
PR #1983 round 1 introduced unique-suffix tmp paths so two concurrent
writers no longer share a single `.tmp` file. That fix is correct
under POSIX semantics — `rename(2)` is atomic between source and
destination, so each writer renames onto the same target without
conflict.

Windows `MoveFileExW` is not the same. It fails with
EPERM / EACCES / EBUSY when the target is currently being renamed
by *another* process — a short race window that fires reliably under
this hook's PostToolUse + statusline concurrency. Round 1's CI run
made this visible:

  Test (windows-latest, Node 18.x, npm) — FAILURE
  Error: EPERM: operation not permitted, rename
    'C:\…\ecc-metrics-test-bridge-race-….json.9504.4aef575a.tmp' ->
    'C:\…\ecc-metrics-test-bridge-race-….json'
      at writeBridgeAtomic (scripts/lib/session-bridge.js:79:8)

All nine Windows matrix cells (Node 18 / 20 / 22 × npm / pnpm / yarn)
hit the same path. POSIX matrices (Linux + macOS) passed unchanged.

Fix: extract a `renameWithRetry(tmp, target)` helper that retries
`fs.renameSync` up to 5 times on EPERM / EACCES / EBUSY with
exponential backoff (20 ms → 320 ms total). Other error codes
(ENOENT, ENOSPC, EROFS, …) re-throw on the first attempt — they are
not transient. POSIX runs hit the first try and exit immediately.

The backoff uses `Atomics.wait` on a throwaway `SharedArrayBuffer`
so the retry path does not busy-spin the CPU; verified on Node ≥ 17
that this works on the main thread. There is a `try/catch` fallback
to a brief busy-wait for older runtimes where `Atomics.wait` is
restricted to workers.

`writeBridgeAtomic` calls the helper instead of `fs.renameSync` and
keeps its existing best-effort tmp cleanup on terminal failure.

`renameWithRetry` is added to `module.exports` so the companion
`writeWarnState` in `scripts/hooks/ecc-context-monitor.js` can
adopt the same retry policy without duplicating the helper. That
adoption lands in the next commit.

Local: `node tests/lib/session-bridge.test.js` 14/14, `yarn test`
green, `yarn lint` clean. The round-1 test (two concurrent child
writers, 200 iterations each) now passes on macOS without retrying
at all (POSIX path) and is expected to pass on Windows via the new
retry loop.
2026-05-19 11:01:10 +09:00
Jamkris 8149beac02 test(lib): make concurrent-write test actually concurrent + use regex matcher for assert.throws
Two round-1 review findings in `tests/lib/session-bridge.test.js`,
both about test correctness rather than the underlying fix:

1. **greptile P1 + coderabbitai Major + cubic P2 (all three): concurrent-write test ran sequentially.**

   The test spawned two child processes with two consecutive
   `spawnSync` calls. Because `spawnSync` blocks until the child
   exits, the second writer started *after* the first finished —
   the two writers never overlapped, so the rename race the fix
   targets was never actually exercised. The test would have passed
   with the old broken `${target}.tmp` suffix.

   Fix: introduce a one-off "race runner" helper that runs inside
   its own subprocess and uses async `spawn` to start both writers
   simultaneously. The runner waits for both to exit (the event
   loop is local to the runner subprocess, so this stays compatible
   with the synchronous test harness used elsewhere in this file)
   and reports both exit codes plus stderrs on stdout. The test
   then calls the runner via `spawnSync` and parses the result.
   Both writer children now overlap for the duration of their 200
   `writeBridgeAtomic` calls each, which is enough wall time to
   reliably trigger the rename race against the pre-fix code.

   Verified: with the fixed `${target}.${pid}.${nonce}.tmp` suffix,
   the test passes; with the old fixed `${target}.tmp` suffix
   reintroduced, it fails as expected (one writer hits ENOENT on
   roughly half its rename calls).

2. **greptile P2 + cubic P3: `assert.throws` used a string as the second argument.**

   Node deprecated passing a string as the second argument to
   `assert.throws` years ago: the string is silently treated as
   the assertion failure message (what to print when the function
   does *not* throw) rather than as an error matcher. The check
   passed for any thrown error, not just the rename failure.

   Fix: pass a regex matcher as the second arg and keep the
   explanatory text as the third. The regex matches `EISDIR`,
   `EPERM`, `ENOTDIR`, or `ENOENT` because `renameSync` of a
   regular tmp file onto an existing directory raises different
   codes on Linux / macOS / BSD — making the matcher portable
   across CI runners.

Test count unchanged at 14; `npm test` green; `npm run lint` clean.

The two helper files (`tests/__tmp_bridge_writer.js`,
`tests/__tmp_bridge_race_runner.js`) are written and unlinked
inside the test's try/finally so they never persist beyond the
test run.
2026-05-19 09:44:53 +09:00
Jamkris a2b98be4a3 test(lib): concurrent writeBridgeAtomic + tmp-cleanup regression
Two regression tests pin down the previous two commits' atomic-rename
fixes:

1. **concurrent writes don't throw ENOENT or corrupt the file** —
   spawns two child Node processes (`tests/__tmp_bridge_writer.js`
   created in-test, cleaned up in finally) that each call
   `writeBridgeAtomic(sid, …)` 200 times against the same session
   ID with independent payloads. Asserts both subprocesses exit 0
   (the previous implementation produced ENOENT on roughly 50% of
   rename calls, all swallowed by the in-test catch) and the final
   bridge file is parseable JSON belonging to one of the two writers
   (last-writer-wins is fine; the contract is *no corruption* and
   *no rename ENOENT*, not data preservation).

2. **tmp file cleanup on rename failure** — pre-creates a directory
   at the target bridge path so `renameSync(tmp, target)` fails,
   calls `writeBridgeAtomic`, asserts the call throws AND that no
   tmp file with the writer's `pid.<nonce>.tmp` prefix is left
   behind in `os.tmpdir()`. The previous code had no cleanup; the
   fix's `try/catch + unlinkSync` keeps tmpdir from accumulating
   orphan files across repeated rename failures.

The first test deliberately writes independent payloads from each
subprocess so this regression doesn't try to claim a property the
fix doesn't actually deliver (read-modify-write race in the caller
is a separate issue and out of scope per PR body).

Test count: 12 → 14 in `tests/lib/session-bridge.test.js`;
`npm test` green; `npm run lint` clean.
2026-05-19 09:31:35 +09:00
Jamkris 74fda58a9c fix(hooks): use unique tmp suffix in writeWarnState (ecc-context-monitor)
Mirror the previous commit's `writeBridgeAtomic` fix on the
companion `writeWarnState` in `ecc-context-monitor.js`. Same shape:
fixed `${target}.tmp` → `${target}.${process.pid}.${randomNonce}.tmp`,
plus best-effort cleanup of the tmp file on `renameSync` failure
(throws original error after cleanup).

`writeWarnState` debounces the context-monitor's threshold alarms
(`COST_NOTICE_USD`, `COST_WARNING_USD`, `COST_CRITICAL_USD`, plus the
context-remaining and loop-detection ones). Without unique suffixes,
two PostToolUse subprocesses racing on the warn-state file produce
either a corrupted JSON debounce-state on disk or an ENOENT throw
that the hook catches and swallows — either way the next warn-state
read returns the default `{callsSinceWarn: 0, lastSeverity: null}`
and the threshold alarms re-fire or stop firing erratically. Users
see warning messages flicker or vanish; debounce no longer works.

Three call sites in this repo now share the same atomic-write
contract:
- `writeBridgeAtomic` (scripts/lib/session-bridge.js) — primary
- `writeCostWarningIfChanged` (scripts/hooks/ecc-metrics-bridge.js) — cost cache
- `writeWarnState` (this file) — debounce state

`yarn lint` clean. Regression test covering both `writeBridgeAtomic`
and `writeWarnState` under concurrent load lands in the next commit.
2026-05-19 09:30:07 +09:00
Jamkris 95602cc27b fix(lib): use unique tmp suffix in writeBridgeAtomic to eliminate ENOENT race
`writeBridgeAtomic` wrote to a fixed `${target}.tmp` path before
calling `renameSync`. When two processes write to the same session
bridge concurrently (e.g. PostToolUse `ecc-metrics-bridge` + the
background `ecc-statusline`, both calling `writeBridgeAtomic(sessionId, ...)`),
the canonical atomic-rename race fires:

  1. Process A: writeFileSync(target.tmp, JSON_A) — tmp file exists.
  2. Process B: writeFileSync(target.tmp, JSON_B) — tmp file overwritten.
  3. Process A: renameSync(target.tmp, target) — succeeds; target = JSON_B
     (A's payload silently corrupted en-route).
  4. Process B: renameSync(target.tmp, target) — throws ENOENT (the
     rename consumed the file).

Every caller in the repo wraps `writeBridgeAtomic` in `try {} catch {}`,
so the ENOENT exception is swallowed and the user-visible symptom is
just "the bridge file occasionally contains the wrong process's
payload" with no diagnostic.

Reproduced before this commit:

  $ # two concurrent writers, each calling writeBridgeAtomic 500 times
  $ # against the same session ID
  [A] errors=244   # 244 ENOENT exceptions swallowed
  [B] errors=248   # ditto

After this commit the same workload reports 0 errors in both
subprocesses: tmp paths no longer collide.

Fix: change `${target}.tmp` to
`${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`,
matching the pattern already used by `writeCostWarningIfChanged` in
`scripts/hooks/ecc-metrics-bridge.js` (commit 9b1d8918). The pid +
4-byte nonce gives each writer process a distinct tmp path, so step 2
above no longer overwrites step 1's payload and step 4 no longer
races step 3.

Also added: on `renameSync` failure, attempt `fs.unlinkSync(tmp)` so
a writer that fails (disk full, permission, parent dir gone) does
not leak its tmp file. The cleanup is best-effort and the original
error is still re-thrown.

**Scope clarification.** This commit closes the atomic-rename
primitive's race only. The *read-modify-write* race in callers —
two writers each read the same bridge state, increment, and write
back, the second clobbering the first — is a separate concern that
needs locking or per-writer logs, and is intentionally out of scope
for this PR. The cost-tracker / metrics-bridge callers tolerate
last-writer-wins on their cumulative aggregates today and this
commit does not change that contract.

The companion `writeWarnState` in `ecc-context-monitor.js` has the
same fixed-suffix pattern and the same race; that fix lands in the
next commit so each can be reviewed against its own diff.
2026-05-19 09:29:09 +09:00
Affaan Mustafa cb81f1b0fe docs: narrow ecc tools billing blocker 2026-05-18 16:45:31 -04:00
Affaan Mustafa 7e2cdeaeb5 docs: refresh rc1 operator evidence 2026-05-18 16:27:09 -04:00
Affaan Mustafa 4470e2e670 docs: refresh rc1 publication evidence 2026-05-18 16:12:37 -04:00
Affaan Mustafa 67e63e63f9 docs: align publication readiness evidence 2026-05-18 15:36:39 -04:00
Affaan Mustafa fe7b4f2ba3 docs: regenerate operator readiness dashboard 2026-05-18 15:24:25 -04:00
Affaan Mustafa 0f1775e30b docs: refresh release blockers evidence 2026-05-18 15:23:48 -04:00
Affaan Mustafa 12ac22e674 docs: add discussion response playbook 2026-05-18 14:39:11 -04:00
Affaan Mustafa c032e07b1e docs: refresh may 18 release evidence 2026-05-18 14:24:50 -04:00
Affaan Mustafa 97567a91e7 test: normalize release workflow line endings 2026-05-18 13:53:26 -04:00
Affaan Mustafa 7911af4a39 security: scope release oidc publishing 2026-05-18 13:41:10 -04:00
Affaan Mustafa 386326df8e fix: treat MCP HTTP 406 probes as reachable 2026-05-18 12:48:52 -04:00
Affaan Mustafa b41e6fb3d0 docs: refresh publication readiness gate 2026-05-18 10:49:49 -04:00
Affaan Mustafa 99e01ded7d docs: refresh operator dashboard evidence 2026-05-18 10:32:26 -04:00
Affaan Mustafa 2ba0c62d8a docs: mirror agentshield fleet ticket evidence 2026-05-18 10:24:21 -04:00
Affaan Mustafa 9abe721bfe docs: refresh release readiness evidence 2026-05-18 09:30:14 -04:00
Affaan Mustafa 680aeff0fb test: enforce release publication checklist in readiness gates 2026-05-18 09:10:51 -04:00
Affaan Mustafa 6c0fbfb6c5 docs: add release plugin publication checklist 2026-05-18 08:56:17 -04:00
Affaan Mustafa 0e88e6a4dd docs: refresh zero queue dashboard 2026-05-18 06:37:10 -04:00
Affaan Mustafa cdc92de42a docs: finish owner queue cleanup 2026-05-18 06:35:44 -04:00
Affaan Mustafa 25dc518e1d docs: regenerate owner queue dashboard 2026-05-18 06:17:31 -04:00
Affaan Mustafa 08807e7fd6 docs: record owner-wide queue cleanup 2026-05-18 06:16:45 -04:00
Affaan Mustafa feeaa97511 docs: regenerate operator readiness dashboard 2026-05-18 05:38:44 -04:00
Affaan Mustafa 5e8f412cb5 docs: refresh ecc tools billing blocker evidence 2026-05-18 05:38:14 -04:00
Affaan Mustafa 4d6fc194ea fix: include blender skill in install manifest 2026-05-18 04:54:17 -04:00
Affaan Mustafa aae735d458 docs: regenerate operator readiness dashboard 2026-05-18 04:30:43 -04:00
Affaan Mustafa ff3eaff137 docs: refresh billing readback gate evidence 2026-05-18 04:30:09 -04:00
Da Wei 922d2d8f8b Add Blender motion state inspection skill
Adds the Blender motion state inspection skill with maintainer refinements for tools metadata, usage guidance, meter-scale threshold assumptions, and Blender interpreter notes.
2026-05-18 04:11:31 -04:00
Affaan Mustafa bf17737969 test: stabilize repair lifecycle on Windows 2026-05-18 03:48:51 -04:00
Affaan Mustafa f92f15199c docs: refresh target billing dashboard evidence 2026-05-18 03:28:36 -04:00
Affaan Mustafa fb4b0c8dce docs: mirror target billing readback gate 2026-05-18 03:27:42 -04:00
Affaan Mustafa aa634df9e5 docs: record clean preview pack smoke 2026-05-18 02:48:41 -04:00
Affaan Mustafa 742bc58d97 docs: refresh release evidence after ioc scanner hardening 2026-05-18 02:45:30 -04:00
Affaan Mustafa 04d4d81938 fix: ignore defensive ioc deny rules 2026-05-18 02:29:59 -04:00
Affaan Mustafa 99e9f118bd docs: refresh evidence head after billing mirror 2026-05-18 02:18:22 -04:00
Affaan Mustafa f010f78332 docs: refresh dashboard after wrangler billing mirror 2026-05-18 02:04:21 -04:00
Affaan Mustafa e53933de1b docs: refine billing readback dashboard blocker 2026-05-18 02:03:37 -04:00
Affaan Mustafa 10313d847a docs: mirror ecc tools wrangler billing readback 2026-05-18 02:00:46 -04:00
Affaan Mustafa aa4ae863f8 docs: refresh release evidence after provider guard merge 2026-05-18 01:30:51 -04:00
Affaan Mustafa 80f6c27957 Merge PR #1976 provider response guards 2026-05-18 01:05:37 -04:00
Affaan Mustafa eb0d893948 fix: harden openai-compatible provider responses 2026-05-18 01:04:28 -04:00
Your Name cc62e89152 fix: guard against empty choices in OpenAI and AstraFlow LLM providers
The OpenAI-compatible API can return HTTP 200 with an empty choices list
or choices[0].message = None (content-filtered responses on Gemini,
overwhelmed Ollama instances). Without a guard, both sites raise an
unhandled IndexError or AttributeError crashing the provider.

Added guard in OpenAIProvider.generate() and AstraFlowProvider.generate().
2026-05-17 23:49:00 -05:00
Affaan Mustafa 044d1863d0 test: skip insaits monitor subprocesses without python 2026-05-18 00:47:05 -04:00
Affaan Mustafa 43822b9c1a docs: refresh operator readiness dashboard 2026-05-18 00:36:30 -04:00
Affaan Mustafa c276639bc7 docs: mirror marketplace billing provenance gate 2026-05-18 00:36:01 -04:00
Affaan Mustafa 804f8ab79a docs: refresh dashboard for billing readback 2026-05-18 00:01:16 -04:00