fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172) by sroussey · Pull Request #174 · workglow-dev/sec

sroussey · 2026-06-30T08:33:45Z

Summary

Two HIGH-priority follow-up fixes to the parallel hardening PRs #171 (XML entity bounds) and #172 (defang TAG_SHAPED widening). Both predecessor PRs left a closing seam that an adversarial filer could still drive through; this PR closes both.

The branch is based on #172's head (claude/wonderful-hypatia-bko1yr-defang-pool-fix) and merges in #171's head (claude/wonderful-hypatia-y6cb4l-entity-fix) as the first commit, so this PR carries both predecessor changes plus the two follow-up fixes on top. If #171 and #172 land first this PR rebases trivially; if not, it can ship them all together.

Fix 1 — Close residual Unicode-invisible defang bypass (follow-up to #172)

File: src/sec/forms/registration-statements/s1/sectionExtractors.ts

stripFormatChars previously stripped only 8 explicit BMP codepoints (ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E (Mongolian Vowel Separator), the math invisibles U+2061–U+2064, or any variation selector (BMP VS1–VS16 at U+FE00–U+FE0F, supplementary VS17–VS256 at U+E0100–U+E01EF) between the letters of UNTRUSTED_FILER_DOCUMENT. None of those codepoints appear in JS \s, and only U+FE0F is Mn (the rest are Cf), so they survived the strip — the downstream squashed.startsWith("UNTRUSTEDFILERDOCUMENT") check then failed because the residual non-letter codepoints stayed in the tag body after the squash to letters.

Widen the class to \p{Cf} (subsumes all 8 originals plus U+180E + math invisibles) joined with the explicit variation-selector ranges U+FE00–U+FE0F and U+E0100–U+E01EF. u flag is required for \p{...} and supplementary-plane escapes.

Also corrects misleading comments on two existing non-redaction tests: the actual rejection mechanism is the inner squashed.startsWith("UNTRUSTEDFILERDOCUMENT") check, not the case-insensitive [_A-Z] anchor (which the i flag makes irrelevant).

Fix 2 — Close stripDoctype bypass via leading XML comment / PI / quoted PUBLIC id (follow-up to #171)

File: src/sec/forms/Form.ts

The stripDoctype regex anchors at the start of the document and only tolerates an optional <?xml ...?> declaration before the DOCTYPE. A filer who slips a leading XML comment, a non-xml processing instruction, or a ]>/[ inside a quoted PUBLIC id defeats the regex and lets <!ENTITY ...> reach the parser. Bounded processEntities would still expand the declared entity up to its caps — surfacing filer-controlled bytes (PWNED, exfil URLs, etc.) in stored values.

Move the seal to the parser layer:

Flip processEntities to { enabled: false }. No entity ever expands. maxEntityCount / maxExpansionDepth / maxExpandedLength / maxTotalExpansions are removed as dead under enabled: false.
With expansion off the parser preserves every &...; byte sequence literally, including the five predefined ones — so round-tripping & would now break. Restore the round-trip with a new exported decodePredefinedEntities(value) walker that runs post-parse: single-pass regex over the predefined five, recursive over Arrays and plain objects (value.constructor === Object), untouched for number/boolean/null/undefined/Date/typed arrays. Returns new containers for pure semantics.
Single-pass contract is load-bearing: &lt; decodes to < (one regex match consumes the &; the literal lt; that follows is never re-scanned), not to <. This preserves the round-trip the bounded-processEntities mode had.
FormXmlParser.parse now returns decodePredefinedEntities(parser.parse(stripDoctype(xml))).
stripDoctype stays as best-effort hygiene; its JSDoc now spells out that it is no longer the security boundary — the parser-side enabled: false plus the post-walker is.

Tests added

sectionExtractors.injection.test.ts (+7 tests, all pass):

U+180E (Mongolian Vowel Separator) intra-tag obfuscation
U+2061 (FUNCTION APPLICATION) intra-tag obfuscation
U+2062 (INVISIBLE TIMES) intra-tag obfuscation
U+2063 (INVISIBLE SEPARATOR) intra-tag obfuscation
U+2064 (INVISIBLE PLUS) intra-tag obfuscation
U+FE0F (VS-16) intra-tag obfuscation
Combined adversarial mix (Mongolian VS + all 4 math invisibles + VS-16 + supplementary VS17)

Form.entity.test.ts (+10 tests, all pass):

DOCTYPE after leading  → entity stays literal
DOCTYPE after leading non-xml <?xml-stylesheet ...?> PI → entity stays literal
DOCTYPE with ]> inside quoted PUBLIC id → no PWNED
DOCTYPE with [ inside quoted PUBLIC id → no PWNED
All five predefined entities round-trip; &lt; → < (one-pass)
Unit block decodePredefinedEntities: decodes all five flat; recurses into nested object/array; leaves number/bool/null/undefined/Date untouched; does NOT double-decode; leaves typed arrays untouched

Existing billion-laughs test still passes under enabled: false.

Test plan

bun test src/sec/forms/registration-statements/s1/sectionExtractors.injection.test.ts — 36 pass
bun test src/sec/forms/Form.entity.test.ts — 17 pass
bun test src/sec/forms/ — 468 pass, 0 fail (the pre-existing JP-address normalization warnings are unrelated noise)
bun run build — clean (full build: bundle + tsc, no errors)

Notes

Two logical commits on top of the merge commit, one per fix.
New file copyright headers follow the repo convention (Copyright 2026 Steven Roussey <sroussey@gmail.com>, SPDX form). No new files were created — both fixes are edits to existing files.
No spec / plan / PR-number comments inside the source (per CLAUDE.md).

Generated by Claude Code

The prompt-injection seal around S-1/424 AI section extraction had two filer-controllable weak points: 1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a filer could pre-stage a matching closing tag and end the fence early. The defang scan was case-insensitive but flat — only a single literal tag-shape was rewritten. 2. A model-emitted `source_span` was capped at the verifier (post- normalization) but persisted raw, so an attacker who slipped any verifier-passing row could ship unbounded raw bytes through the provenance column. This patch deepens the seal: - The fence tag carries a per-call 64-bit random nonce. The `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing tag in the prospectus cannot match the call's actual fence. - Before defang, the section body is HTML-entity-decoded (multi-pass, up to a fixed point), NFKC-normalized, and stripped of zero-width / bidi format chars. The defang scan matches any tag-shaped token whose alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so obfuscations via `<`, fullwidth letters, ZWSP, intra-tag spaces, and case-mixing all collapse to `[redacted-fence-tag]`. - A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning `null` over the cap rather than truncating). A new `verifyRowSpan` rejects a span whose raw byte count exceeds the cap before the normalize-and-substring check runs, so a whitespace-inflated payload that would otherwise normalize under cap can no longer pass the gate. All `verifyRow:` callsites and `source_span:` persist sites in the S-1 storage and shared offering-sections layer route through these. - Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version to 1.2.0: prompt-shape changes drift confidence calibration, and the span-storage shape changes too. Operators should run startDev/promote to roll the new version into production. Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases, a 1500-raw-char whitespace-padded span dead-letter test in the storage layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case + zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness. Co-Authored-By: Claude <noreply@anthropic.com>

…ntity hardening Four hardening fixes around the Form 8-K event-storage path: 1. fast-xml-parser entity expansion is disabled (`processEntities: false`) on the shared Form XML parser. A filer-controlled SGML payload that declared a chain of nested entity references would otherwise expand into a multi-GB string ("billion laughs") and peg CPU at parse time. A regression test feeds a 10-level billion-laughs DOCTYPE through Form_8_K.parse and asserts the parse stays well under 50 ms. 2. EDGAR accession numbers cross trust boundaries unconstrained — the filing-task input schema and the Form 8-K event table both accepted any string, so an over-long or malformed accession could land in storage. Introduces `TypeAccessionNumber()` (20-char fixed length, `^\d{10}-\d{2}-\d{6}$`) and applies it at the ProcessAccessionDocFormTask input and the event row schema. 3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as its primary key — a re-extract under a new extractor version would overwrite the prior version's rows, erasing the time series. Switches the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit `(cik, accession_number, extractor_id, extractor_version, item_code)` UNIQUE natural-key index, mirroring the PersonObservation / CompanyObservation shape. Both extractor columns are now first-class so coverage / drop-previous ceremonies can target a single version. A one-shot legacy-schema migration drops the pre-versioned table on the SQLite and Postgres paths (the natural-key PK cannot be ALTERed away on either backend, and 8-K events are deterministic to re-extract). 4. `processForm8K` previously looped over items with one `put` per item, so a mid-loop crash left the row set torn between old and new items for the same (filing, version). Adds `Form8KEventRepo.replaceEvents` — DELETE all rows for `(cik, accession_number, extractor_id, extractor_version)` then bulk-insert the new set, wrapped in a real transaction on the SQLite (better-sqlite3 `db.transaction`) and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client) paths. The in-memory backend (tests only) is synchronous so a torn write cannot interleave. A failure-injection test seeds a row, then re-runs `replaceEvents` with a NOT NULL-violating second insert and asserts the prior baseline is intact after rollback. Also wires `extractor_id` + `extractor_version` through the task layer into `processForm8K` so the same writer can run under any version slot. Co-Authored-By: Claude <noreply@anthropic.com>

…B_TYPE token In CI the 21 Form_8_K tests failed with `no such table: form_8k_events` because `replaceForm8KEvents` was dispatching to `replaceSqlite` even though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an in-memory storage. The trigger was test-process global-DI contamination: `FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time, which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`. The ServiceRegistry has no unregister API, so once any earlier test in the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite branch in `replaceForm8KEvents` won and reached for `getDb()`, which either fell over on an uninitialized SQLite handle (locally) or write-attempted against a table that was never created (CI). Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns `false`; the production storages don't override it. When the resolved repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`. This makes the dispatch correct even when global config and the registered repo disagree, which is the steady-state in the test process. Reproduces locally via: bun test src/task/index/FetchDailyIndexTask.test.ts \\ src/sec/forms/miscellaneous-filings/Form_8_K.test.ts (without the fix: 25 Form_8_K fails; with the fix: all 29 pass). Co-Authored-By: Claude <noreply@anthropic.com>

Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy / redemption extraction) landing on main after this PR opened. Conflicts resolved: - src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts - Function signature combines both side's additive params: extractor_id, extractor_version (this PR), fullSubmissionText, model (#166). - Event writes go through replaceEvents() (this PR), threading extractor_id + extractor_version into the version-scoped delete-then-insert. - SPAC milestone mapping + redemption extraction blocks from #166 follow unchanged after the events are persisted. - src/task/forms/ProcessAccessionDocFormTask.ts - Keep TypeAccessionNumber import (this PR), processMergerProxy + hasRedemptionTriggerItem imports (#166). - 8-K dispatch call site passes both extractor_id/extractor_version and fullSubmissionText into processForm8K; merger-proxy case from #166 follows. - src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged cleanly by git but the new extractMergerDeal / extractRedemption functions still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant, which this PR removed. Both updated to the nonce-fence API (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce)) so the new SPAC AI extractors get the per-call nonce fence + multi-stage defang for free. Without this, the prompts would interpolate as "[object Object]" and the model receives garbage. Verification: - targeted: bun test src/sec/forms/miscellaneous-filings/ \ src/sec/forms/proxies-information-statements/ src/task/forms/ \ src/storage/spac/ src/storage/form-8k-event/ \ src/sec/forms/registration-statements/ -> 229 pass / 0 fail. - full: bun test -> 1410 pass / 7 fail. All 7 fails are pre-existing FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts unrelated to this PR (sandbox can't reach SEC.gov reliably). - bun run build -> clean (bun build + tsc, no errors). Co-Authored-By: Claude <noreply@anthropic.com>

…e-fence API The new SPAC extractors added in PR #166 (extractMergerDeal, extractRedemption) called the pre-PR wrapUntrusted shape (returning a string) and the removed UNTRUSTED_PREAMBLE constant. After this PR swapped wrapUntrusted to return { wrapped, nonce } and replaced the constant with buildUntrustedPreamble(nonce), the surviving call sites template-interpolated UNTRUSTED_PREAMBLE as a free identifier -> compile error (TS2552), and even if the type had survived the { wrapped, nonce } object would have rendered as "[object Object]" in the prompt -> the model receives garbage and silently returns nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null assertions in the post-merge run). Both extractors now use the same nonce-fence + multi-stage defang as the other section extractors -- a forced consequence of the merge, extending the per-call nonce + entity decode + NFKC + zero-width strip protection to the new SPAC AI extractors at no extra design cost. Co-Authored-By: Claude <noreply@anthropic.com>

…extractors The merger-proxy and redemption extractors that landed via PR #166 missed the new prompt-injection seal helpers introduced in PR #165. The seal — raw- byte verifyRowSpan at gate, boundSourceSpan at persist — is now applied to both extractors so an unbounded source_span can no longer ship through SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled DEFM14A or post-vote 8-K narrative. Also widen the fence defang to neutralize the </UNTRUSTED&Tab;FILER&Tab; DOCUMENT> family of bypasses: add whitespace named entities (Tab, NewLine, nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse numeric whitespace entities (	 /   etc.) to a single space before the TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the primary defense; this closes the layered defang gap. No extractor version bumps: prompt is unchanged in non-adversarial inputs, the gate change is normalization-only.

…ompute Two SPAC correctness issues: 1. processMergerProxy never wrote to extractor_runs. The outer ProcessAccessionDocFormTask records a run for the form's extractor id (DEFM14A), but the merger-proxy nested extractor id was uncovered, so `sec version coverage extractor merger-proxy` always read zero and `drop-previous` was a no-op. Mirrors the redemption recordRun pattern from PR #168: success at the end, PARSE_ERROR in the segmenter catch, PROVIDER_ERROR around runSection. 2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error between the delete and the final saveDeal corrupted the SPAC report row. New SpacDealReplace helper wraps the delete+upsert pass in a real transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/ ROLLBACK on a checked-out PG client. In-memory fallback retains the sequential semantics (no concurrency in tests). No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will simply start populating an empty table.

…ities + DOCTYPE strip PR #165 disabled entity processing entirely (processEntities: false) to defang billion-laughs payloads. That also silently corrupted every XML form value carrying one of the five predefined XML entities — e.g. `Mac Accounting Group & CPAs, LLP` was persisting as the literal four-character string `&` instead of the intended `&`. This restores the standard predefined-entity decode (so `&`, `<`, `>`, `"`, `'` round-trip normally) while keeping XXE / billion- laughs defenses in place: - fast-xml-parser's bounded processEntities config caps entity count, expansion depth, total expansions, and expanded length — a filer-declared chain bombs out at the limit instead of expanding geometrically. - A new stripDoctype() pass removes any leading <!DOCTYPE name [...]> block before parsing, so filer-declared entities never reach the parser at all. getParser() now returns a thin wrapper that runs stripDoctype() before parse(), keeping all callsites unchanged.

…loses 
 bypass) The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE the numeric-whitespace collapse pass, so a filer-controlled `
` (or `
` / `` / `` / `` / ``) got unwrapped to a literal `\n` / `\r` / `\v` / `\f` first — and the TAG_SHAPED middle character class `[\w \t-]` only admitted `\t` and space, so `</UNTRUSTED
FILER
DOCUMENT>` no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`. That left the lookalike intact and the fence un-redacted. Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace codepoint, so the squash-and-compare callback fires on every variant and the fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so benign lowercase / non-fence tag shapes are still left literal. Tests cover 
, 
, , , 	 (regression), , , a mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.

… avoid deadlock with outer lock When a caller already holds an outer Postgres transaction wrapping a critical section (e.g. PR #170's `withSpacCikLock` will issue `BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client from the shared pool to run its own BEGIN/COMMIT will deadlock the moment the pool is saturated by concurrent CIK locks — every connection in the pool holds an outer lock and waits on a second client that the pool can no longer hand out. Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs` (and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner hand its connection to the inner ops; the Postgres branch then runs DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/ ROLLBACK/release (the caller still owns those). When `pgClient` is undefined the defensive default path is unchanged: own pool checkout + own transaction wrap. Adds a test asserting both paths — the back-compat case still emits BEGIN…COMMIT and releases its client, and the caller-supplied case runs the INSERT on the provided client without issuing any txn or release calls.

The prior stripFormatChars only stripped 8 explicit BMP codepoints (ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E, the math invisibles U+2061-U+2064, or any variation selector (VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip — the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then failed because non-letter codepoints stayed in the tag body. Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus U+180E + math invisibles) joined with the explicit VS ranges U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF. Add one test per residual class and a combined adversarial case. Also correct the misleading comments on the existing non-redaction tests: the actual rejection mechanism is the inner squashed-letters check, not the case-insensitive `[_A-Z]` anchor.

The `stripDoctype` regex anchors at the start of the document and only tolerates an optional `<?xml ...?>` declaration before the DOCTYPE. A filer who slips a leading XML comment, a non-xml processing instruction, or a `]>`/`[` inside a quoted PUBLIC id defeats the regex and lets `<!ENTITY ...>` reach the parser. Bounded `processEntities` would still expand the declared entity up to its caps, surfacing filer-controlled bytes in stored values. Move the seal to the parser layer: flip `processEntities` to `{ enabled: false }` so no entity ever expands. With expansion off the parser preserves every `&...;` byte sequence literally, including the five predefined ones (`&` `<` `>` `"` `'`), so round-tripping a value like `&` would now break. Restore the round-trip with a post-parse `decodePredefinedEntities` walker: single-pass regex over the predefined five, recursive over plain arrays and `constructor === Object` objects, untouched for non-string primitives / `Date` / typed arrays. The single-pass contract is load-bearing: `&lt;` decodes to `<` (one match consumes the `&`), not to `<`. Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells out that it is no longer the security boundary. Tests pin the three bypass paths (leading comment, leading PI, quoted PUBLIC id with `]>` / `[`), the predefined-entity round-trip including the one-pass `&lt;` -> `<` rule, the billion-laughs bound, and the walker's recursion / primitive-preservation contracts.

sroussey · 2026-06-30T19:45:20Z

Superseded by #176. Both fixes here — the residual Unicode-invisible defang closure (\p{Cf} + variation selectors) and the stripDoctype leading-comment/PI seam moved to the parser layer (processEntities off + decodePredefinedEntities) — are carried forward there as the final form of the entity/defang handling. Closing in favor of #176.

Generated by Claude Code

claude added 13 commits June 23, 2026 08:31

merge: bring in #171 entity-fix branch as base for combined follow-up

a6b6f43

sroussey closed this Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174

fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174
sroussey wants to merge 13 commits into
mainfrom
claude/wonderful-hypatia-ebsdur

sroussey commented Jun 30, 2026

Uh oh!

sroussey commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sroussey commented Jun 30, 2026

Summary

Fix 1 — Close residual Unicode-invisible defang bypass (follow-up to #172)

Fix 2 — Close stripDoctype bypass via leading XML comment / PI / quoted PUBLIC id (follow-up to #171)

Tests added

Test plan

Notes

Uh oh!

sroussey commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants