fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174
Closed
sroussey wants to merge 13 commits into
Closed
fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174sroussey wants to merge 13 commits into
sroussey wants to merge 13 commits into
Conversation
The prompt-injection seal around S-1/424 AI section extraction had two filer-controllable weak points: 1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a filer could pre-stage a matching closing tag and end the fence early. The defang scan was case-insensitive but flat — only a single literal tag-shape was rewritten. 2. A model-emitted `source_span` was capped at the verifier (post- normalization) but persisted raw, so an attacker who slipped any verifier-passing row could ship unbounded raw bytes through the provenance column. This patch deepens the seal: - The fence tag carries a per-call 64-bit random nonce. The `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing tag in the prospectus cannot match the call's actual fence. - Before defang, the section body is HTML-entity-decoded (multi-pass, up to a fixed point), NFKC-normalized, and stripped of zero-width / bidi format chars. The defang scan matches any tag-shaped token whose alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so obfuscations via `<`, fullwidth letters, ZWSP, intra-tag spaces, and case-mixing all collapse to `[redacted-fence-tag]`. - A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning `null` over the cap rather than truncating). A new `verifyRowSpan` rejects a span whose raw byte count exceeds the cap before the normalize-and-substring check runs, so a whitespace-inflated payload that would otherwise normalize under cap can no longer pass the gate. All `verifyRow:` callsites and `source_span:` persist sites in the S-1 storage and shared offering-sections layer route through these. - Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version to 1.2.0: prompt-shape changes drift confidence calibration, and the span-storage shape changes too. Operators should run startDev/promote to roll the new version into production. Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases, a 1500-raw-char whitespace-padded span dead-letter test in the storage layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case + zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness. Co-Authored-By: Claude <noreply@anthropic.com>
…ntity hardening
Four hardening fixes around the Form 8-K event-storage path:
1. fast-xml-parser entity expansion is disabled (`processEntities: false`)
on the shared Form XML parser. A filer-controlled SGML payload that
declared a chain of nested entity references would otherwise expand
into a multi-GB string ("billion laughs") and peg CPU at parse time.
A regression test feeds a 10-level billion-laughs DOCTYPE through
Form_8_K.parse and asserts the parse stays well under 50 ms.
2. EDGAR accession numbers cross trust boundaries unconstrained — the
filing-task input schema and the Form 8-K event table both accepted
any string, so an over-long or malformed accession could land in
storage. Introduces `TypeAccessionNumber()` (20-char fixed length,
`^\d{10}-\d{2}-\d{6}$`) and applies it at the
ProcessAccessionDocFormTask input and the event row schema.
3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as
its primary key — a re-extract under a new extractor version would
overwrite the prior version's rows, erasing the time series. Switches
the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit
`(cik, accession_number, extractor_id, extractor_version, item_code)`
UNIQUE natural-key index, mirroring the PersonObservation /
CompanyObservation shape. Both extractor columns are now first-class
so coverage / drop-previous ceremonies can target a single version.
A one-shot legacy-schema migration drops the pre-versioned table on
the SQLite and Postgres paths (the natural-key PK cannot be ALTERed
away on either backend, and 8-K events are deterministic to re-extract).
4. `processForm8K` previously looped over items with one `put` per item,
so a mid-loop crash left the row set torn between old and new items
for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`
— DELETE all rows for `(cik, accession_number, extractor_id,
extractor_version)` then bulk-insert the new set, wrapped in a
real transaction on the SQLite (better-sqlite3 `db.transaction`)
and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client)
paths. The in-memory backend (tests only) is synchronous so a torn
write cannot interleave. A failure-injection test seeds a row,
then re-runs `replaceEvents` with a NOT NULL-violating second
insert and asserts the prior baseline is intact after rollback.
Also wires `extractor_id` + `extractor_version` through the task layer
into `processForm8K` so the same writer can run under any version slot.
Co-Authored-By: Claude <noreply@anthropic.com>
…B_TYPE token
In CI the 21 Form_8_K tests failed with `no such table: form_8k_events`
because `replaceForm8KEvents` was dispatching to `replaceSqlite` even
though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an
in-memory storage. The trigger was test-process global-DI contamination:
`FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time,
which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`.
The ServiceRegistry has no unregister API, so once any earlier test in
the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of
the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens
to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite
branch in `replaceForm8KEvents` won and reached for `getDb()`, which
either fell over on an uninitialized SQLite handle (locally) or
write-attempted against a table that was never created (CI).
Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns
`false`; the production storages don't override it. When the resolved
repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`.
This makes the dispatch correct even when global config and the
registered repo disagree, which is the steady-state in the test process.
Reproduces locally via:
bun test src/task/index/FetchDailyIndexTask.test.ts \\
src/sec/forms/miscellaneous-filings/Form_8_K.test.ts
(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).
Co-Authored-By: Claude <noreply@anthropic.com>
Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy / redemption extraction) landing on main after this PR opened. Conflicts resolved: - src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts - Function signature combines both side's additive params: extractor_id, extractor_version (this PR), fullSubmissionText, model (#166). - Event writes go through replaceEvents() (this PR), threading extractor_id + extractor_version into the version-scoped delete-then-insert. - SPAC milestone mapping + redemption extraction blocks from #166 follow unchanged after the events are persisted. - src/task/forms/ProcessAccessionDocFormTask.ts - Keep TypeAccessionNumber import (this PR), processMergerProxy + hasRedemptionTriggerItem imports (#166). - 8-K dispatch call site passes both extractor_id/extractor_version and fullSubmissionText into processForm8K; merger-proxy case from #166 follows. - src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged cleanly by git but the new extractMergerDeal / extractRedemption functions still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant, which this PR removed. Both updated to the nonce-fence API (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce)) so the new SPAC AI extractors get the per-call nonce fence + multi-stage defang for free. Without this, the prompts would interpolate as "[object Object]" and the model receives garbage. Verification: - targeted: bun test src/sec/forms/miscellaneous-filings/ \ src/sec/forms/proxies-information-statements/ src/task/forms/ \ src/storage/spac/ src/storage/form-8k-event/ \ src/sec/forms/registration-statements/ -> 229 pass / 0 fail. - full: bun test -> 1410 pass / 7 fail. All 7 fails are pre-existing FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts unrelated to this PR (sandbox can't reach SEC.gov reliably). - bun run build -> clean (bun build + tsc, no errors). Co-Authored-By: Claude <noreply@anthropic.com>
…e-fence API The new SPAC extractors added in PR #166 (extractMergerDeal, extractRedemption) called the pre-PR wrapUntrusted shape (returning a string) and the removed UNTRUSTED_PREAMBLE constant. After this PR swapped wrapUntrusted to return { wrapped, nonce } and replaced the constant with buildUntrustedPreamble(nonce), the surviving call sites template-interpolated UNTRUSTED_PREAMBLE as a free identifier -> compile error (TS2552), and even if the type had survived the { wrapped, nonce } object would have rendered as "[object Object]" in the prompt -> the model receives garbage and silently returns nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null assertions in the post-merge run). Both extractors now use the same nonce-fence + multi-stage defang as the other section extractors -- a forced consequence of the merge, extending the per-call nonce + entity decode + NFKC + zero-width strip protection to the new SPAC AI extractors at no extra design cost. Co-Authored-By: Claude <noreply@anthropic.com>
…extractors The merger-proxy and redemption extractors that landed via PR #166 missed the new prompt-injection seal helpers introduced in PR #165. The seal — raw- byte verifyRowSpan at gate, boundSourceSpan at persist — is now applied to both extractors so an unbounded source_span can no longer ship through SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled DEFM14A or post-vote 8-K narrative. Also widen the fence defang to neutralize the </UNTRUSTED	FILER	 DOCUMENT> family of bypasses: add whitespace named entities (Tab, NewLine, nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse numeric whitespace entities (	 /   etc.) to a single space before the TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the primary defense; this closes the layered defang gap. No extractor version bumps: prompt is unchanged in non-adversarial inputs, the gate change is normalization-only.
…ompute Two SPAC correctness issues: 1. processMergerProxy never wrote to extractor_runs. The outer ProcessAccessionDocFormTask records a run for the form's extractor id (DEFM14A), but the merger-proxy nested extractor id was uncovered, so `sec version coverage extractor merger-proxy` always read zero and `drop-previous` was a no-op. Mirrors the redemption recordRun pattern from PR #168: success at the end, PARSE_ERROR in the segmenter catch, PROVIDER_ERROR around runSection. 2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error between the delete and the final saveDeal corrupted the SPAC report row. New SpacDealReplace helper wraps the delete+upsert pass in a real transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/ ROLLBACK on a checked-out PG client. In-memory fallback retains the sequential semantics (no concurrency in tests). No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will simply start populating an empty table.
…ities + DOCTYPE strip PR #165 disabled entity processing entirely (processEntities: false) to defang billion-laughs payloads. That also silently corrupted every XML form value carrying one of the five predefined XML entities — e.g. `Mac Accounting Group & CPAs, LLP` was persisting as the literal four-character string `&` instead of the intended `&`. This restores the standard predefined-entity decode (so `&`, `<`, `>`, `"`, `'` round-trip normally) while keeping XXE / billion- laughs defenses in place: - fast-xml-parser's bounded processEntities config caps entity count, expansion depth, total expansions, and expanded length — a filer-declared chain bombs out at the limit instead of expanding geometrically. - A new stripDoctype() pass removes any leading <!DOCTYPE name [...]> block before parsing, so filer-declared entities never reach the parser at all. getParser() now returns a thin wrapper that runs stripDoctype() before parse(), keeping all callsites unchanged.
…loses bypass) The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE the numeric-whitespace collapse pass, so a filer-controlled ` ` (or `
` / ` ` / `
` / `` / ``) got unwrapped to a literal `\n` / `\r` / `\v` / `\f` first — and the TAG_SHAPED middle character class `[\w \t-]` only admitted `\t` and space, so `</UNTRUSTED FILER DOCUMENT>` no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`. That left the lookalike intact and the fence un-redacted. Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace codepoint, so the squash-and-compare callback fires on every variant and the fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so benign lowercase / non-fence tag shapes are still left literal. Tests cover , 
, , 
, 	 (regression), , , a mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.
… avoid deadlock with outer lock When a caller already holds an outer Postgres transaction wrapping a critical section (e.g. PR #170's `withSpacCikLock` will issue `BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client from the shared pool to run its own BEGIN/COMMIT will deadlock the moment the pool is saturated by concurrent CIK locks — every connection in the pool holds an outer lock and waits on a second client that the pool can no longer hand out. Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs` (and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner hand its connection to the inner ops; the Postgres branch then runs DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/ ROLLBACK/release (the caller still owns those). When `pgClient` is undefined the defensive default path is unchanged: own pool checkout + own transaction wrap. Adds a test asserting both paths — the back-compat case still emits BEGIN…COMMIT and releases its client, and the caller-supplied case runs the INSERT on the provided client without issuing any txn or release calls.
The prior stripFormatChars only stripped 8 explicit BMP codepoints
(ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E,
the math invisibles U+2061-U+2064, or any variation selector
(VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between
the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip —
the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then
failed because non-letter codepoints stayed in the tag body.
Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus
U+180E + math invisibles) joined with the explicit VS ranges
U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF.
Add one test per residual class and a combined adversarial case.
Also correct the misleading comments on the existing non-redaction
tests: the actual rejection mechanism is the inner squashed-letters
check, not the case-insensitive `[_A-Z]` anchor.
The `stripDoctype` regex anchors at the start of the document and
only tolerates an optional `<?xml ...?>` declaration before the
DOCTYPE. A filer who slips a leading XML comment, a non-xml
processing instruction, or a `]>`/`[` inside a quoted PUBLIC id
defeats the regex and lets `<!ENTITY ...>` reach the parser.
Bounded `processEntities` would still expand the declared entity
up to its caps, surfacing filer-controlled bytes in stored values.
Move the seal to the parser layer: flip `processEntities` to
`{ enabled: false }` so no entity ever expands. With expansion off
the parser preserves every `&...;` byte sequence literally,
including the five predefined ones (`&` `<` `>` `"`
`'`), so round-tripping a value like `&` would now break.
Restore the round-trip with a post-parse `decodePredefinedEntities`
walker: single-pass regex over the predefined five, recursive over
plain arrays and `constructor === Object` objects, untouched for
non-string primitives / `Date` / typed arrays. The single-pass
contract is load-bearing: `&lt;` decodes to `<` (one match
consumes the `&`), not to `<`.
Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells
out that it is no longer the security boundary. Tests pin the
three bypass paths (leading comment, leading PI, quoted PUBLIC id
with `]>` / `[`), the predefined-entity round-trip including the
one-pass `&lt;` -> `<` rule, the billion-laughs bound, and
the walker's recursion / primitive-preservation contracts.
This was referenced Jun 30, 2026
Contributor
Author
|
Superseded by #176. Both fixes here — the residual Unicode-invisible defang closure ( Generated by Claude Code |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two HIGH-priority follow-up fixes to the parallel hardening PRs #171 (XML entity bounds) and #172 (defang TAG_SHAPED widening). Both predecessor PRs left a closing seam that an adversarial filer could still drive through; this PR closes both.
The branch is based on #172's head (
claude/wonderful-hypatia-bko1yr-defang-pool-fix) and merges in #171's head (claude/wonderful-hypatia-y6cb4l-entity-fix) as the first commit, so this PR carries both predecessor changes plus the two follow-up fixes on top. If #171 and #172 land first this PR rebases trivially; if not, it can ship them all together.Fix 1 — Close residual Unicode-invisible defang bypass (follow-up to #172)
File:
src/sec/forms/registration-statements/s1/sectionExtractors.tsstripFormatCharspreviously stripped only 8 explicit BMP codepoints (ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E (Mongolian Vowel Separator), the math invisibles U+2061–U+2064, or any variation selector (BMP VS1–VS16 at U+FE00–U+FE0F, supplementary VS17–VS256 at U+E0100–U+E01EF) between the letters ofUNTRUSTED_FILER_DOCUMENT. None of those codepoints appear in JS\s, and only U+FE0F isMn(the rest areCf), so they survived the strip — the downstreamsquashed.startsWith("UNTRUSTEDFILERDOCUMENT")check then failed because the residual non-letter codepoints stayed in the tag body after the squash to letters.Widen the class to
\p{Cf}(subsumes all 8 originals plus U+180E + math invisibles) joined with the explicit variation-selector ranges U+FE00–U+FE0F and U+E0100–U+E01EF.uflag is required for\p{...}and supplementary-plane escapes.Also corrects misleading comments on two existing non-redaction tests: the actual rejection mechanism is the inner
squashed.startsWith("UNTRUSTEDFILERDOCUMENT")check, not the case-insensitive[_A-Z]anchor (which theiflag makes irrelevant).Fix 2 — Close stripDoctype bypass via leading XML comment / PI / quoted PUBLIC id (follow-up to #171)
File:
src/sec/forms/Form.tsThe
stripDoctyperegex anchors at the start of the document and only tolerates an optional<?xml ...?>declaration before the DOCTYPE. A filer who slips a leading XML comment, a non-xml processing instruction, or a]>/[inside a quoted PUBLIC id defeats the regex and lets<!ENTITY ...>reach the parser. BoundedprocessEntitieswould still expand the declared entity up to its caps — surfacing filer-controlled bytes (PWNED, exfil URLs, etc.) in stored values.Move the seal to the parser layer:
processEntitiesto{ enabled: false }. No entity ever expands.maxEntityCount/maxExpansionDepth/maxExpandedLength/maxTotalExpansionsare removed as dead underenabled: false.&...;byte sequence literally, including the five predefined ones — so round-tripping&would now break. Restore the round-trip with a new exporteddecodePredefinedEntities(value)walker that runs post-parse: single-pass regex over the predefined five, recursive overArrays and plain objects (value.constructor === Object), untouched fornumber/boolean/null/undefined/Date/typed arrays. Returns new containers for pure semantics.&lt;decodes to<(one regex match consumes the&; the literallt;that follows is never re-scanned), not to<. This preserves the round-trip the bounded-processEntitiesmode had.FormXmlParser.parsenow returnsdecodePredefinedEntities(parser.parse(stripDoctype(xml))).stripDoctypestays as best-effort hygiene; its JSDoc now spells out that it is no longer the security boundary — the parser-sideenabled: falseplus the post-walker is.Tests added
sectionExtractors.injection.test.ts(+7 tests, all pass):U+180E(Mongolian Vowel Separator) intra-tag obfuscationU+2061(FUNCTION APPLICATION) intra-tag obfuscationU+2062(INVISIBLE TIMES) intra-tag obfuscationU+2063(INVISIBLE SEPARATOR) intra-tag obfuscationU+2064(INVISIBLE PLUS) intra-tag obfuscationU+FE0F(VS-16) intra-tag obfuscationForm.entity.test.ts(+10 tests, all pass):<!-- comment -->→ entity stays literal<?xml-stylesheet ...?>PI → entity stays literal]>inside quoted PUBLIC id → no PWNED[inside quoted PUBLIC id → no PWNED&lt;→<(one-pass)decodePredefinedEntities: decodes all five flat; recurses into nested object/array; leaves number/bool/null/undefined/Date untouched; does NOT double-decode; leaves typed arrays untouchedExisting billion-laughs test still passes under
enabled: false.Test plan
bun test src/sec/forms/registration-statements/s1/sectionExtractors.injection.test.ts— 36 passbun test src/sec/forms/Form.entity.test.ts— 17 passbun test src/sec/forms/— 468 pass, 0 fail (the pre-existing JP-address normalization warnings are unrelated noise)bun run build— clean (full build: bundle + tsc, no errors)Notes
Copyright 2026 Steven Roussey <sroussey@gmail.com>, SPDX form). No new files were created — both fixes are edits to existing files.Generated by Claude Code