Compare commits

...

6 Commits

Author SHA1 Message Date
Jamkris
1f6fc9c27d fix(hooks): use shared renameWithRetry in writeWarnState (ecc-context-monitor)
Mirror the previous commit's Windows-EPERM retry on the companion
`writeWarnState` in `scripts/hooks/ecc-context-monitor.js`. Same
race: two PostToolUse subprocesses writing concurrent debounce
state racing on `MoveFileExW`, target-in-use throwing EPERM on
Windows even though each writer's tmp path is now unique.

Implementation: import `renameWithRetry` from `scripts/lib/session-bridge.js`
(exported in the previous commit) instead of duplicating the helper.
The retry policy, backoff schedule, and main-thread `Atomics.wait`
strategy stay identical to `writeBridgeAtomic`.

Three writers in the repo now share the same atomic-write contract:
- `writeBridgeAtomic` (scripts/lib/session-bridge.js) — round 1 +
  this round's retry
- `writeWarnState` (this file) — round 1 + this round's retry via shared helper
- `writeCostWarningIfChanged` (scripts/hooks/ecc-metrics-bridge.js) —
  out of scope for this PR (already uses unique tmp suffix; a future
  consolidation could move it to the shared helper too).

Local: `yarn test` green, `yarn lint` clean. The companion test
suite for `ecc-context-monitor.js` does not currently exercise
concurrent `writeWarnState` writes, but the helper it now uses is
covered by the `tests/lib/session-bridge.test.js` concurrent-write
regression added in round 1's last commit.
2026-05-19 11:01:28 +09:00
Jamkris
bcbd2acee2 fix(lib): retry rename on Windows EPERM/EACCES/EBUSY in writeBridgeAtomic
PR #1983 round 1 introduced unique-suffix tmp paths so two concurrent
writers no longer share a single `.tmp` file. That fix is correct
under POSIX semantics — `rename(2)` is atomic between source and
destination, so each writer renames onto the same target without
conflict.

Windows `MoveFileExW` is not the same. It fails with
EPERM / EACCES / EBUSY when the target is currently being renamed
by *another* process — a short race window that fires reliably under
this hook's PostToolUse + statusline concurrency. Round 1's CI run
made this visible:

  Test (windows-latest, Node 18.x, npm) — FAILURE
  Error: EPERM: operation not permitted, rename
    'C:\…\ecc-metrics-test-bridge-race-….json.9504.4aef575a.tmp' ->
    'C:\…\ecc-metrics-test-bridge-race-….json'
      at writeBridgeAtomic (scripts/lib/session-bridge.js:79:8)

All nine Windows matrix cells (Node 18 / 20 / 22 × npm / pnpm / yarn)
hit the same path. POSIX matrices (Linux + macOS) passed unchanged.

Fix: extract a `renameWithRetry(tmp, target)` helper that retries
`fs.renameSync` up to 5 times on EPERM / EACCES / EBUSY with
exponential backoff (20 ms → 320 ms total). Other error codes
(ENOENT, ENOSPC, EROFS, …) re-throw on the first attempt — they are
not transient. POSIX runs hit the first try and exit immediately.

The backoff uses `Atomics.wait` on a throwaway `SharedArrayBuffer`
so the retry path does not busy-spin the CPU; verified on Node ≥ 17
that this works on the main thread. There is a `try/catch` fallback
to a brief busy-wait for older runtimes where `Atomics.wait` is
restricted to workers.

`writeBridgeAtomic` calls the helper instead of `fs.renameSync` and
keeps its existing best-effort tmp cleanup on terminal failure.

`renameWithRetry` is added to `module.exports` so the companion
`writeWarnState` in `scripts/hooks/ecc-context-monitor.js` can
adopt the same retry policy without duplicating the helper. That
adoption lands in the next commit.

Local: `node tests/lib/session-bridge.test.js` 14/14, `yarn test`
green, `yarn lint` clean. The round-1 test (two concurrent child
writers, 200 iterations each) now passes on macOS without retrying
at all (POSIX path) and is expected to pass on Windows via the new
retry loop.
2026-05-19 11:01:10 +09:00
Jamkris
8149beac02 test(lib): make concurrent-write test actually concurrent + use regex matcher for assert.throws
Two round-1 review findings in `tests/lib/session-bridge.test.js`,
both about test correctness rather than the underlying fix:

1. **greptile P1 + coderabbitai Major + cubic P2 (all three): concurrent-write test ran sequentially.**

   The test spawned two child processes with two consecutive
   `spawnSync` calls. Because `spawnSync` blocks until the child
   exits, the second writer started *after* the first finished —
   the two writers never overlapped, so the rename race the fix
   targets was never actually exercised. The test would have passed
   with the old broken `${target}.tmp` suffix.

   Fix: introduce a one-off "race runner" helper that runs inside
   its own subprocess and uses async `spawn` to start both writers
   simultaneously. The runner waits for both to exit (the event
   loop is local to the runner subprocess, so this stays compatible
   with the synchronous test harness used elsewhere in this file)
   and reports both exit codes plus stderrs on stdout. The test
   then calls the runner via `spawnSync` and parses the result.
   Both writer children now overlap for the duration of their 200
   `writeBridgeAtomic` calls each, which is enough wall time to
   reliably trigger the rename race against the pre-fix code.

   Verified: with the fixed `${target}.${pid}.${nonce}.tmp` suffix,
   the test passes; with the old fixed `${target}.tmp` suffix
   reintroduced, it fails as expected (one writer hits ENOENT on
   roughly half its rename calls).

2. **greptile P2 + cubic P3: `assert.throws` used a string as the second argument.**

   Node deprecated passing a string as the second argument to
   `assert.throws` years ago: the string is silently treated as
   the assertion failure message (what to print when the function
   does *not* throw) rather than as an error matcher. The check
   passed for any thrown error, not just the rename failure.

   Fix: pass a regex matcher as the second arg and keep the
   explanatory text as the third. The regex matches `EISDIR`,
   `EPERM`, `ENOTDIR`, or `ENOENT` because `renameSync` of a
   regular tmp file onto an existing directory raises different
   codes on Linux / macOS / BSD — making the matcher portable
   across CI runners.

Test count unchanged at 14; `npm test` green; `npm run lint` clean.

The two helper files (`tests/__tmp_bridge_writer.js`,
`tests/__tmp_bridge_race_runner.js`) are written and unlinked
inside the test's try/finally so they never persist beyond the
test run.
2026-05-19 09:44:53 +09:00
Jamkris
a2b98be4a3 test(lib): concurrent writeBridgeAtomic + tmp-cleanup regression
Two regression tests pin down the previous two commits' atomic-rename
fixes:

1. **concurrent writes don't throw ENOENT or corrupt the file** —
   spawns two child Node processes (`tests/__tmp_bridge_writer.js`
   created in-test, cleaned up in finally) that each call
   `writeBridgeAtomic(sid, …)` 200 times against the same session
   ID with independent payloads. Asserts both subprocesses exit 0
   (the previous implementation produced ENOENT on roughly 50% of
   rename calls, all swallowed by the in-test catch) and the final
   bridge file is parseable JSON belonging to one of the two writers
   (last-writer-wins is fine; the contract is *no corruption* and
   *no rename ENOENT*, not data preservation).

2. **tmp file cleanup on rename failure** — pre-creates a directory
   at the target bridge path so `renameSync(tmp, target)` fails,
   calls `writeBridgeAtomic`, asserts the call throws AND that no
   tmp file with the writer's `pid.<nonce>.tmp` prefix is left
   behind in `os.tmpdir()`. The previous code had no cleanup; the
   fix's `try/catch + unlinkSync` keeps tmpdir from accumulating
   orphan files across repeated rename failures.

The first test deliberately writes independent payloads from each
subprocess so this regression doesn't try to claim a property the
fix doesn't actually deliver (read-modify-write race in the caller
is a separate issue and out of scope per PR body).

Test count: 12 → 14 in `tests/lib/session-bridge.test.js`;
`npm test` green; `npm run lint` clean.
2026-05-19 09:31:35 +09:00
Jamkris
74fda58a9c fix(hooks): use unique tmp suffix in writeWarnState (ecc-context-monitor)
Mirror the previous commit's `writeBridgeAtomic` fix on the
companion `writeWarnState` in `ecc-context-monitor.js`. Same shape:
fixed `${target}.tmp` → `${target}.${process.pid}.${randomNonce}.tmp`,
plus best-effort cleanup of the tmp file on `renameSync` failure
(throws original error after cleanup).

`writeWarnState` debounces the context-monitor's threshold alarms
(`COST_NOTICE_USD`, `COST_WARNING_USD`, `COST_CRITICAL_USD`, plus the
context-remaining and loop-detection ones). Without unique suffixes,
two PostToolUse subprocesses racing on the warn-state file produce
either a corrupted JSON debounce-state on disk or an ENOENT throw
that the hook catches and swallows — either way the next warn-state
read returns the default `{callsSinceWarn: 0, lastSeverity: null}`
and the threshold alarms re-fire or stop firing erratically. Users
see warning messages flicker or vanish; debounce no longer works.

Three call sites in this repo now share the same atomic-write
contract:
- `writeBridgeAtomic` (scripts/lib/session-bridge.js) — primary
- `writeCostWarningIfChanged` (scripts/hooks/ecc-metrics-bridge.js) — cost cache
- `writeWarnState` (this file) — debounce state

`yarn lint` clean. Regression test covering both `writeBridgeAtomic`
and `writeWarnState` under concurrent load lands in the next commit.
2026-05-19 09:30:07 +09:00
Jamkris
95602cc27b fix(lib): use unique tmp suffix in writeBridgeAtomic to eliminate ENOENT race
`writeBridgeAtomic` wrote to a fixed `${target}.tmp` path before
calling `renameSync`. When two processes write to the same session
bridge concurrently (e.g. PostToolUse `ecc-metrics-bridge` + the
background `ecc-statusline`, both calling `writeBridgeAtomic(sessionId, ...)`),
the canonical atomic-rename race fires:

  1. Process A: writeFileSync(target.tmp, JSON_A) — tmp file exists.
  2. Process B: writeFileSync(target.tmp, JSON_B) — tmp file overwritten.
  3. Process A: renameSync(target.tmp, target) — succeeds; target = JSON_B
     (A's payload silently corrupted en-route).
  4. Process B: renameSync(target.tmp, target) — throws ENOENT (the
     rename consumed the file).

Every caller in the repo wraps `writeBridgeAtomic` in `try {} catch {}`,
so the ENOENT exception is swallowed and the user-visible symptom is
just "the bridge file occasionally contains the wrong process's
payload" with no diagnostic.

Reproduced before this commit:

  $ # two concurrent writers, each calling writeBridgeAtomic 500 times
  $ # against the same session ID
  [A] errors=244   # 244 ENOENT exceptions swallowed
  [B] errors=248   # ditto

After this commit the same workload reports 0 errors in both
subprocesses: tmp paths no longer collide.

Fix: change `${target}.tmp` to
`${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`,
matching the pattern already used by `writeCostWarningIfChanged` in
`scripts/hooks/ecc-metrics-bridge.js` (commit 9b1d8918). The pid +
4-byte nonce gives each writer process a distinct tmp path, so step 2
above no longer overwrites step 1's payload and step 4 no longer
races step 3.

Also added: on `renameSync` failure, attempt `fs.unlinkSync(tmp)` so
a writer that fails (disk full, permission, parent dir gone) does
not leak its tmp file. The cleanup is best-effort and the original
error is still re-thrown.

**Scope clarification.** This commit closes the atomic-rename
primitive's race only. The *read-modify-write* race in callers —
two writers each read the same bridge state, increment, and write
back, the second clobbering the first — is a separate concern that
needs locking or per-writer logs, and is intentionally out of scope
for this PR. The cost-tracker / metrics-bridge callers tolerate
last-writer-wins on their cumulative aggregates today and this
commit does not change that contract.

The companion `writeWarnState` in `ecc-context-monitor.js` has the
same fixed-suffix pattern and the same race; that fix lands in the
next commit so each can be reviewed against its own diff.
2026-05-19 09:29:09 +09:00
3 changed files with 214 additions and 7 deletions

View File

@@ -9,10 +9,11 @@
'use strict'; 'use strict';
const crypto = require('crypto');
const fs = require('fs'); const fs = require('fs');
const os = require('os'); const os = require('os');
const path = require('path'); const path = require('path');
const { sanitizeSessionId, readBridge } = require('../lib/session-bridge'); const { sanitizeSessionId, readBridge, renameWithRetry } = require('../lib/session-bridge');
const CONTEXT_WARNING_PCT = 35; const CONTEXT_WARNING_PCT = 35;
const CONTEXT_CRITICAL_PCT = 25; const CONTEXT_CRITICAL_PCT = 25;
@@ -61,15 +62,30 @@ function readWarnState(sessionId) {
} }
/** /**
* Write debounce state. * Write debounce state atomically (unique-suffix tmp then rename).
*
* The tmp path includes `process.pid` plus a random nonce so concurrent
* PostToolUse subprocesses writing to the same session's warn-state
* file do not clobber each other's tmp mid-write. Without the unique
* suffix, two writers race over a shared `${target}.tmp` and produce
* either a corrupted payload or an ENOENT throw on the second rename.
*
* Same pattern as `writeBridgeAtomic` in `scripts/lib/session-bridge.js`
* and `writeCostWarningIfChanged` in `scripts/hooks/ecc-metrics-bridge.js`.
*
* @param {string} sessionId * @param {string} sessionId
* @param {object} state * @param {object} state
*/ */
function writeWarnState(sessionId, state) { function writeWarnState(sessionId, state) {
const target = getWarnPath(sessionId); const target = getWarnPath(sessionId);
const tmp = `${target}.tmp`; const tmp = `${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`;
fs.writeFileSync(tmp, JSON.stringify(state), 'utf8'); fs.writeFileSync(tmp, JSON.stringify(state), 'utf8');
fs.renameSync(tmp, target); try {
renameWithRetry(tmp, target);
} catch (err) {
try { fs.unlinkSync(tmp); } catch { /* ignore */ }
throw err;
}
} }
/** /**

View File

@@ -8,6 +8,7 @@
* without scanning large JSONL logs on every invocation. * without scanning large JSONL logs on every invocation.
*/ */
const crypto = require('crypto');
const fs = require('fs'); const fs = require('fs');
const os = require('os'); const os = require('os');
const path = require('path'); const path = require('path');
@@ -51,15 +52,80 @@ function readBridge(sessionId) {
} }
/** /**
* Write bridge data atomically (write .tmp then rename). * Write bridge data atomically (write unique-suffix tmp then rename).
*
* The tmp path includes `process.pid` plus a random nonce so concurrent
* writers (e.g. PostToolUse `ecc-metrics-bridge` and the background
* `ecc-statusline`, both writing to the same session bridge) do not
* clobber each other's tmp file mid-write. With a fixed `.tmp` suffix
* two writers could both call `writeFileSync` against the same path
* before either reaches `renameSync`, causing one writer's payload to
* silently overwrite the other and the second `renameSync` to throw
* ENOENT once the rename consumes the file.
*
* Same pattern already used by `writeCostWarningIfChanged` in
* `scripts/hooks/ecc-metrics-bridge.js` (commit 9b1d8918) for the
* cost-warning cache; this commit applies it to the session-bridge
* primitive too.
*
* @param {string} sessionId - Already-sanitized session ID * @param {string} sessionId - Already-sanitized session ID
* @param {object} data * @param {object} data
*/ */
function writeBridgeAtomic(sessionId, data) { function writeBridgeAtomic(sessionId, data) {
const target = getBridgePath(sessionId); const target = getBridgePath(sessionId);
const tmp = `${target}.tmp`; const tmp = `${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`;
fs.writeFileSync(tmp, JSON.stringify(data), 'utf8'); fs.writeFileSync(tmp, JSON.stringify(data), 'utf8');
try {
renameWithRetry(tmp, target);
} catch (err) {
try { fs.unlinkSync(tmp); } catch { /* ignore */ }
throw err;
}
}
/**
* Replace a file via rename, retrying briefly on transient OS-level errors.
*
* POSIX `rename(2)` is atomic between source and destination, so concurrent
* writers each rename onto the same target without conflict. Windows
* `MoveFileExW` is different: it fails with EPERM/EACCES/EBUSY if the
* target is currently being renamed by *another* process — a short race
* window that fires reliably under our PostToolUse + statusline concurrency.
*
* To stay portable, retry up to 5 times with exponential backoff (20 ms,
* 40, 80, 160, 320) on the Windows-only transient codes. POSIX runs hit
* the first try and exit immediately. Other error codes (ENOENT, ENOSPC,
* EROFS, …) re-throw without retry — they are not transient.
*
* Sleep uses `Atomics.wait` on a throwaway SharedArrayBuffer so the
* retry path does not busy-spin the CPU. This works on the main thread
* in Node ≥ 17 (and on workers in earlier versions).
*
* @param {string} tmp
* @param {string} target
*/
function renameWithRetry(tmp, target) {
const RETRY_CODES = new Set(['EPERM', 'EACCES', 'EBUSY']);
const MAX_ATTEMPTS = 5;
for (let attempt = 0; ; attempt++) {
try {
fs.renameSync(tmp, target); fs.renameSync(tmp, target);
return;
} catch (err) {
if (attempt + 1 >= MAX_ATTEMPTS || !RETRY_CODES.has(err.code)) {
throw err;
}
const delayMs = 20 << attempt;
try {
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, delayMs);
} catch {
// Atomics.wait throws on the main thread in some older runtimes;
// fall back to a brief busy-wait so the retry path still has a delay.
const until = Date.now() + delayMs;
while (Date.now() < until) { /* spin */ }
}
}
}
} }
/** /**
@@ -76,6 +142,7 @@ module.exports = {
getBridgePath, getBridgePath,
readBridge, readBridge,
writeBridgeAtomic, writeBridgeAtomic,
renameWithRetry,
resolveSessionId, resolveSessionId,
MAX_SESSION_ID_LENGTH MAX_SESSION_ID_LENGTH
}; };

View File

@@ -135,6 +135,130 @@ function runTests() {
passed++; passed++;
else failed++; else failed++;
// Concurrency contract: two processes writing to the same session
// bridge must not throw ENOENT and must never leave a corrupt JSON
// file behind. The previous implementation used a fixed `${target}.tmp`
// suffix; with concurrent writers it raced over a shared tmp path,
// producing both ENOENT on rename and (occasionally) a half-written
// payload on the destination.
//
// This test exercises the atomic-rename primitive only — it does NOT
// attempt to defend against the read-modify-write race in callers,
// which is a separate concern. Each subprocess writes its own
// independent payload N times; we assert (a) every process exits 0
// (no ENOENT bubbled up) and (b) the final file is always parseable
// JSON whose contents match one of the two writers' last payloads.
if (
test('concurrent writeBridgeAtomic does not throw ENOENT or corrupt the bridge file', () => {
// Spawn two child processes that BOTH stay alive at the same time
// and call writeBridgeAtomic in a tight loop. `spawnSync` would
// run them sequentially (blocking on each), which would never
// exercise the race the fix targets. Instead a sync runner script
// launches both as async `spawn` children inside its own process,
// waits for both to exit, and reports their statuses on stdout —
// and the test calls *that* runner via `spawnSync`. The runner is
// the only place that needs the event loop.
const { spawnSync } = require('child_process');
const path = require('path');
const testId = `test-bridge-race-${Date.now()}-${process.pid}`;
const writerPath = path.join(__dirname, '..', '__tmp_bridge_writer.js');
const runnerPath = path.join(__dirname, '..', '__tmp_bridge_race_runner.js');
const bridgeLib = path.join(__dirname, '..', '..', 'scripts', 'lib', 'session-bridge');
fs.writeFileSync(
writerPath,
[
"const { writeBridgeAtomic } = require(" + JSON.stringify(bridgeLib) + ");",
"const [, , sid, tag] = process.argv;",
"for (let i = 0; i < 200; i++) {",
" writeBridgeAtomic(sid, { writer: tag, i });",
"}",
].join('\n'),
'utf8'
);
fs.writeFileSync(
runnerPath,
[
"'use strict';",
"const { spawn } = require('child_process');",
"const [, , writerPath, sid] = process.argv;",
"const c1 = spawn(process.execPath, [writerPath, sid, 'A'], { stdio: ['ignore','pipe','pipe'] });",
"const c2 = spawn(process.execPath, [writerPath, sid, 'B'], { stdio: ['ignore','pipe','pipe'] });",
"const exits = {};",
"const stderrs = { A: '', B: '' };",
"c1.stderr.on('data', chunk => { stderrs.A += chunk.toString(); });",
"c2.stderr.on('data', chunk => { stderrs.B += chunk.toString(); });",
"let done = 0;",
"function onExit(tag) { return function(code) { exits[tag] = code; if (++done === 2) finish(); }; }",
"c1.on('exit', onExit('A'));",
"c2.on('exit', onExit('B'));",
"function finish() {",
" process.stdout.write(JSON.stringify({ exits, stderrs }));",
" process.exit(0);",
"}",
].join('\n'),
'utf8'
);
try {
const result = spawnSync('node', [runnerPath, writerPath, testId], { encoding: 'utf8' });
assert.strictEqual(result.status, 0,
`race runner should exit 0, got ${result.status}: ${result.stderr}`);
const parsed = JSON.parse(result.stdout);
assert.strictEqual(parsed.exits.A, 0,
`writer A should exit 0 (no ENOENT), got ${parsed.exits.A}: ${parsed.stderrs.A}`);
assert.strictEqual(parsed.exits.B, 0,
`writer B should exit 0 (no ENOENT), got ${parsed.exits.B}: ${parsed.stderrs.B}`);
// Final file must be parseable JSON and belong to one of the writers.
const final = readBridge(testId);
assert.ok(final && typeof final === 'object',
`expected parseable JSON object, got: ${JSON.stringify(final)}`);
assert.ok(final.writer === 'A' || final.writer === 'B',
`expected last-writer-wins payload, got: ${JSON.stringify(final)}`);
} finally {
try { fs.unlinkSync(getBridgePath(testId)); } catch { /* ignore */ }
try { fs.unlinkSync(writerPath); } catch { /* ignore */ }
try { fs.unlinkSync(runnerPath); } catch { /* ignore */ }
}
})
)
passed++;
else failed++;
if (
test('writeBridgeAtomic cleans up its tmp file on renameSync failure', () => {
// Trigger renameSync failure by passing a sessionId whose path is
// already a directory. The tmp file exists at this point; the fix
// must not leak it behind.
const path = require('path');
const testId = `test-bridge-cleanup-${Date.now()}-${process.pid}`;
const target = getBridgePath(testId);
const os = require('os');
const tmpDir = os.tmpdir();
// Plant a directory at the target path so renameSync (target.tmp → target) fails.
fs.mkdirSync(target);
try {
assert.throws(
() => writeBridgeAtomic(testId, { x: 1 }),
// renameSync of a regular file onto an existing directory throws
// EISDIR on Linux, EPERM on macOS, ENOTDIR on some BSDs. Accept
// any of those so the test stays portable across CI runners.
/EISDIR|EPERM|ENOTDIR|ENOENT/,
'expected rename failure to surface'
);
// Count any leaked tmp files. The pid+nonce suffix is unique per
// call, so we look for any matching pattern under os.tmpdir().
const prefix = path.basename(target) + '.' + process.pid + '.';
const leaked = fs.readdirSync(tmpDir).filter(f => f.startsWith(prefix) && f.endsWith('.tmp'));
assert.strictEqual(leaked.length, 0,
`expected no leaked tmp files after rename failure, found: ${leaked.join(', ')}`);
} finally {
try { fs.rmdirSync(target); } catch { /* ignore */ }
}
})
)
passed++;
else failed++;
// resolveSessionId tests // resolveSessionId tests
console.log('\nresolveSessionId:'); console.log('\nresolveSessionId:');