From e3483fda159cce89ec81551030018eb7da86a876 Mon Sep 17 00:00:00 2001
From: Jamkris
Date: Tue, 19 May 2026 09:16:26 +0900
Subject: [PATCH] =?UTF-8?q?fix(ci):=20cover=20Unicode=20Tag=20block=20(U+E?=
 =?UTF-8?q?0000=E2=80=93U+E007F)=20in=20check-unicode-safety?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`isDangerousInvisibleCodePoint` enumerated seven ranges of invisible/
bidi/variation-selector code points but omitted the Unicode Tag block
(U+E0000–U+E007F). Tag characters were proposed for language tagging in
Unicode 3.1 and have been deprecated since Unicode 5.1, so no legitimate
text uses them.

They are the canonical vector for "ASCII Smuggling" / "Tag Smuggling"
LLM prompt injection: an attacker hides instructions inside an
ASCII-looking string, the model reads the tag bytes, the human reviewer
sees nothing. Demonstrated against multiple LLM assistants during
2024–2025.

`check-unicode-safety.js` is the repo's last line of defence before
contributor content reaches agent context; the same script also runs in
`--write` auto-sanitize mode on `.md` / `.mdx` / `.txt`. Today it
silently passes tag-block characters through unchanged in both detection
mode and `--write` mode.

Reproduced before this commit:

  $ mkdir -p /tmp/uni-test && node -e "
      const fs = require('fs');
      const hidden = [...Array(5)].map((_,i) =>
        String.fromCodePoint(0xE0041 + i)).join('');
      fs.writeFileSync('/tmp/uni-test/innocent.md',
        '# Title\\n\\nBenign text' + hidden + ' more.\\n');"
  $ ECC_UNICODE_SCAN_ROOT=/tmp/uni-test \
      node scripts/ci/check-unicode-safety.js
  Unicode safety check passed.
  $ echo $?
  0

Expected: tag-block characters reported as `dangerous-invisible`
violations (exit 1) and stripped under `--write`.
Actual: validator passes, `--write` leaves the bytes intact.

Fix: extend the denylist with one new range
`(codePoint >= 0xE0000 && codePoint <= 0xE007F)`. The change is purely
additive; the existing seven ranges are untouched.
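
For reviewers unfamiliar with the technique, the sketch below shows how an ASCII payload maps into the Tag block and why the added range catches it. It is illustrative only: `inTagBlock`, `toTagChars`, and `fromTagChars` are hypothetical helpers, not functions in `check-unicode-safety.js`; only the range comparison mirrors the fix.

```javascript
// Illustrative only: these helpers are hypothetical and do not exist in
// scripts/ci/check-unicode-safety.js.

// Same comparison the fix adds to the denylist.
function inTagBlock(codePoint) {
  return codePoint >= 0xE0000 && codePoint <= 0xE007F;
}

// Shift each ASCII character into the Tag block (U+E0000 + ASCII value).
// Assumes the input contains only code points below 0x80.
function toTagChars(ascii) {
  return [...ascii]
    .map((ch) => String.fromCodePoint(0xE0000 + ch.codePointAt(0)))
    .join('');
}

// Recover the ASCII payload hidden inside a string.
function fromTagChars(text) {
  return [...text]
    .map((ch) => ch.codePointAt(0))
    .filter(inTagBlock)
    .map((cp) => String.fromCodePoint(cp - 0xE0000))
    .join('');
}

// 'ABCDE' becomes U+E0041..U+E0045, the same code points as the repro.
const smuggled = 'Benign text' + toTagChars('ABCDE') + ' more.';

console.log(fromTagChars(smuggled)); // ABCDE
console.log([...smuggled].some((ch) => inTagBlock(ch.codePointAt(0)))); // true
```

Iterating with the spread operator walks the string by code point rather than by UTF-16 unit, which matters here because every Tag-block character is a surrogate pair.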

After this commit the same reproduction returns:

  $ ECC_UNICODE_SCAN_ROOT=/tmp/uni-test \
      node scripts/ci/check-unicode-safety.js
  Unicode safety violations detected:
  innocent.md:3:12 dangerous-invisible U+E0041
  innocent.md:3:14 dangerous-invisible U+E0042
  innocent.md:3:16 dangerous-invisible U+E0043
  innocent.md:3:18 dangerous-invisible U+E0044
  innocent.md:3:20 dangerous-invisible U+E0045
  exit=1

`--write` mode also strips the bytes (verified: file length 47 → 42
after sanitize, regex `/[\u{E0000}-\u{E007F}]/u` no longer matches).

Existing 5 unicode-safety tests still pass; `yarn lint` clean. The ECC
repo's own self-scan (`node scripts/ci/check-unicode-safety.js` with no
`ECC_UNICODE_SCAN_ROOT`) reports the same warnings as before this commit
and exits with the same status (no regressions on in-repo content).

A handful of other widely-cited invisible code points are missing from
the denylist (`U+180E`, `U+115F`, `U+1160`, `U+2061–U+2064`, `U+3164`);
those are addressed in the next commit so each fix remains independently
reviewable. Regression coverage for both fixes lands two commits later.
---
 scripts/ci/check-unicode-safety.js | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/scripts/ci/check-unicode-safety.js b/scripts/ci/check-unicode-safety.js
index 6c7893e7..c4f1740c 100644
--- a/scripts/ci/check-unicode-safety.js
+++ b/scripts/ci/check-unicode-safety.js
@@ -114,7 +114,15 @@ function isDangerousInvisibleCodePoint(codePoint) {
     (codePoint >= 0x202A && codePoint <= 0x202E) ||
     (codePoint >= 0x2066 && codePoint <= 0x2069) ||
     (codePoint >= 0xFE00 && codePoint <= 0xFE0F) ||
-    (codePoint >= 0xE0100 && codePoint <= 0xE01EF)
+    (codePoint >= 0xE0100 && codePoint <= 0xE01EF) ||
+    // Unicode Tag block (U+E0000–U+E007F). Tag characters were proposed
+    // for language tagging in Unicode 3.1 and have been deprecated since
+    // Unicode 5.1, so no legitimate text uses them. They are the canonical
+    // vector for "ASCII smuggling" / "Tag smuggling" prompt injection:
+    // an attacker hides instructions inside ASCII-looking strings (PR
+    // bodies, SKILL.md, frontmatter), the LLM consumes the tag bytes,
+    // and the human reviewer sees nothing.
+    (codePoint >= 0xE0000 && codePoint <= 0xE007F)
   );
 }
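
The `--write` behaviour described in the commit message can be approximated as below. This is a sketch under stated assumptions, not the script's actual sanitizer: `stripTagBlock` is a hypothetical name, and the pattern is the one quoted in the commit message with the `g` flag added so every occurrence is removed.

```javascript
// Hypothetical helper; the real --write path in check-unicode-safety.js
// may be implemented differently.
const TAG_BLOCK = /[\u{E0000}-\u{E007F}]/gu;

function stripTagBlock(text) {
  // The `u` flag makes the character class match whole code points
  // instead of individual surrogate halves.
  return text.replace(TAG_BLOCK, '');
}

// Rebuild the content of innocent.md from the reproduction.
const hidden = Array.from({ length: 5 }, (_, i) =>
  String.fromCodePoint(0xE0041 + i)).join('');
const before = '# Title\n\nBenign text' + hidden + ' more.\n';
const after = stripTagBlock(before);

console.log(after === '# Title\n\nBenign text more.\n'); // true
console.log(/[\u{E0000}-\u{E007F}]/u.test(after));       // false
```

One caveat for reviewers: emoji tag sequences (for example the flags of England, Scotland, and Wales) are built from U+E0020–U+E007F, so a blanket strip like this would also alter those. Whether that matters for the file types scanned here is a call for the maintainers.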