Skip to content

fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174

Closed
sroussey wants to merge 13 commits into
mainfrom
claude/wonderful-hypatia-ebsdur
Closed

fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172)#174
sroussey wants to merge 13 commits into
mainfrom
claude/wonderful-hypatia-ebsdur

Conversation

@sroussey

Copy link
Copy Markdown
Contributor

Summary

Two HIGH-priority follow-up fixes to the parallel hardening PRs #171 (XML entity bounds) and #172 (defang TAG_SHAPED widening). Both predecessor PRs left a closing seam that an adversarial filer could still drive through; this PR closes both.

The branch is based on #172's head (claude/wonderful-hypatia-bko1yr-defang-pool-fix) and merges in #171's head (claude/wonderful-hypatia-y6cb4l-entity-fix) as the first commit, so this PR carries both predecessor changes plus the two follow-up fixes on top. If #171 and #172 land first this PR rebases trivially; if not, it can ship them all together.

Fix 1 — Close residual Unicode-invisible defang bypass (follow-up to #172)

File: src/sec/forms/registration-statements/s1/sectionExtractors.ts

stripFormatChars previously stripped only 8 explicit BMP codepoints (ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E (Mongolian Vowel Separator), the math invisibles U+2061–U+2064, or any variation selector (BMP VS1–VS16 at U+FE00–U+FE0F, supplementary VS17–VS256 at U+E0100–U+E01EF) between the letters of UNTRUSTED_FILER_DOCUMENT. None of those codepoints appear in JS \s, and only U+FE0F is Mn (the rest are Cf), so they survived the strip — the downstream squashed.startsWith("UNTRUSTEDFILERDOCUMENT") check then failed because the residual non-letter codepoints stayed in the tag body after the squash to letters.

Widen the class to \p{Cf} (subsumes all 8 originals plus U+180E + math invisibles) joined with the explicit variation-selector ranges U+FE00–U+FE0F and U+E0100–U+E01EF. u flag is required for \p{...} and supplementary-plane escapes.

Also corrects misleading comments on two existing non-redaction tests: the actual rejection mechanism is the inner squashed.startsWith("UNTRUSTEDFILERDOCUMENT") check, not the case-insensitive [_A-Z] anchor (which the i flag makes irrelevant).

Fix 2 — Close stripDoctype bypass via leading XML comment / PI / quoted PUBLIC id (follow-up to #171)

File: src/sec/forms/Form.ts

The stripDoctype regex anchors at the start of the document and only tolerates an optional <?xml ...?> declaration before the DOCTYPE. A filer who slips a leading XML comment, a non-xml processing instruction, or a ]>/[ inside a quoted PUBLIC id defeats the regex and lets <!ENTITY ...> reach the parser. Bounded processEntities would still expand the declared entity up to its caps — surfacing filer-controlled bytes (PWNED, exfil URLs, etc.) in stored values.

Move the seal to the parser layer:

  • Flip processEntities to { enabled: false }. No entity ever expands. maxEntityCount / maxExpansionDepth / maxExpandedLength / maxTotalExpansions are removed as dead under enabled: false.
  • With expansion off the parser preserves every &...; byte sequence literally, including the five predefined ones — so round-tripping &amp; would now break. Restore the round-trip with a new exported decodePredefinedEntities(value) walker that runs post-parse: single-pass regex over the predefined five, recursive over Arrays and plain objects (value.constructor === Object), untouched for number/boolean/null/undefined/Date/typed arrays. Returns new containers for pure semantics.
  • Single-pass contract is load-bearing: &amp;lt; decodes to &lt; (one regex match consumes the &amp;; the literal lt; that follows is never re-scanned), not to <. This preserves the round-trip the bounded-processEntities mode had.
  • FormXmlParser.parse now returns decodePredefinedEntities(parser.parse(stripDoctype(xml))).
  • stripDoctype stays as best-effort hygiene; its JSDoc now spells out that it is no longer the security boundary — the parser-side enabled: false plus the post-walker is.

Tests added

sectionExtractors.injection.test.ts (+7 tests, all pass):

  • U+180E (Mongolian Vowel Separator) intra-tag obfuscation
  • U+2061 (FUNCTION APPLICATION) intra-tag obfuscation
  • U+2062 (INVISIBLE TIMES) intra-tag obfuscation
  • U+2063 (INVISIBLE SEPARATOR) intra-tag obfuscation
  • U+2064 (INVISIBLE PLUS) intra-tag obfuscation
  • U+FE0F (VS-16) intra-tag obfuscation
  • Combined adversarial mix (Mongolian VS + all 4 math invisibles + VS-16 + supplementary VS17)

Form.entity.test.ts (+10 tests, all pass):

  • DOCTYPE after leading <!-- comment --> → entity stays literal
  • DOCTYPE after leading non-xml <?xml-stylesheet ...?> PI → entity stays literal
  • DOCTYPE with ]> inside quoted PUBLIC id → no PWNED
  • DOCTYPE with [ inside quoted PUBLIC id → no PWNED
  • All five predefined entities round-trip; &amp;lt;&lt; (one-pass)
  • Unit block decodePredefinedEntities: decodes all five flat; recurses into nested object/array; leaves number/bool/null/undefined/Date untouched; does NOT double-decode; leaves typed arrays untouched

Existing billion-laughs test still passes under enabled: false.

Test plan

  • bun test src/sec/forms/registration-statements/s1/sectionExtractors.injection.test.ts — 36 pass
  • bun test src/sec/forms/Form.entity.test.ts — 17 pass
  • bun test src/sec/forms/ — 468 pass, 0 fail (the pre-existing JP-address normalization warnings are unrelated noise)
  • bun run build — clean (full build: bundle + tsc, no errors)

Notes

  • Two logical commits on top of the merge commit, one per fix.
  • New file copyright headers follow the repo convention (Copyright 2026 Steven Roussey <sroussey@gmail.com>, SPDX form). No new files were created — both fixes are edits to existing files.
  • No spec / plan / PR-number comments inside the source (per CLAUDE.md).

Generated by Claude Code

claude added 13 commits June 23, 2026 08:31
The prompt-injection seal around S-1/424 AI section extraction had two
filer-controllable weak points:

1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a
   filer could pre-stage a matching closing tag and end the fence early.
   The defang scan was case-insensitive but flat — only a single literal
   tag-shape was rewritten.
2. A model-emitted `source_span` was capped at the verifier (post-
   normalization) but persisted raw, so an attacker who slipped any
   verifier-passing row could ship unbounded raw bytes through the
   provenance column.

This patch deepens the seal:

- The fence tag carries a per-call 64-bit random nonce. The
  `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing
  tag in the prospectus cannot match the call's actual fence.
- Before defang, the section body is HTML-entity-decoded (multi-pass, up
  to a fixed point), NFKC-normalized, and stripped of zero-width / bidi
  format chars. The defang scan matches any tag-shaped token whose
  alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so
  obfuscations via `&lt;`, fullwidth letters, ZWSP, intra-tag spaces,
  and case-mixing all collapse to `[redacted-fence-tag]`.
- A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning
  `null` over the cap rather than truncating). A new `verifyRowSpan`
  rejects a span whose raw byte count exceeds the cap before the
  normalize-and-substring check runs, so a whitespace-inflated payload
  that would otherwise normalize under cap can no longer pass the gate.
  All `verifyRow:` callsites and `source_span:` persist sites in the S-1
  storage and shared offering-sections layer route through these.
- Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version
  to 1.2.0: prompt-shape changes drift confidence calibration, and the
  span-storage shape changes too. Operators should run startDev/promote
  to roll the new version into production.

Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases,
a 1500-raw-char whitespace-padded span dead-letter test in the storage
layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case +
zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness.

Co-Authored-By: Claude <noreply@anthropic.com>
…ntity hardening

Four hardening fixes around the Form 8-K event-storage path:

1. fast-xml-parser entity expansion is disabled (`processEntities: false`)
   on the shared Form XML parser. A filer-controlled SGML payload that
   declared a chain of nested entity references would otherwise expand
   into a multi-GB string ("billion laughs") and peg CPU at parse time.
   A regression test feeds a 10-level billion-laughs DOCTYPE through
   Form_8_K.parse and asserts the parse stays well under 50 ms.

2. EDGAR accession numbers cross trust boundaries unconstrained — the
   filing-task input schema and the Form 8-K event table both accepted
   any string, so an over-long or malformed accession could land in
   storage. Introduces `TypeAccessionNumber()` (20-char fixed length,
   `^\d{10}-\d{2}-\d{6}$`) and applies it at the
   ProcessAccessionDocFormTask input and the event row schema.

3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as
   its primary key — a re-extract under a new extractor version would
   overwrite the prior version's rows, erasing the time series. Switches
   the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit
   `(cik, accession_number, extractor_id, extractor_version, item_code)`
   UNIQUE natural-key index, mirroring the PersonObservation /
   CompanyObservation shape. Both extractor columns are now first-class
   so coverage / drop-previous ceremonies can target a single version.
   A one-shot legacy-schema migration drops the pre-versioned table on
   the SQLite and Postgres paths (the natural-key PK cannot be ALTERed
   away on either backend, and 8-K events are deterministic to re-extract).

4. `processForm8K` previously looped over items with one `put` per item,
   so a mid-loop crash left the row set torn between old and new items
   for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`
   — DELETE all rows for `(cik, accession_number, extractor_id,
   extractor_version)` then bulk-insert the new set, wrapped in a
   real transaction on the SQLite (better-sqlite3 `db.transaction`)
   and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client)
   paths. The in-memory backend (tests only) is synchronous so a torn
   write cannot interleave. A failure-injection test seeds a row,
   then re-runs `replaceEvents` with a NOT NULL-violating second
   insert and asserts the prior baseline is intact after rollback.

Also wires `extractor_id` + `extractor_version` through the task layer
into `processForm8K` so the same writer can run under any version slot.

Co-Authored-By: Claude <noreply@anthropic.com>
…B_TYPE token

In CI the 21 Form_8_K tests failed with `no such table: form_8k_events`
because `replaceForm8KEvents` was dispatching to `replaceSqlite` even
though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an
in-memory storage. The trigger was test-process global-DI contamination:
`FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time,
which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`.
The ServiceRegistry has no unregister API, so once any earlier test in
the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of
the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens
to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite
branch in `replaceForm8KEvents` won and reached for `getDb()`, which
either fell over on an uninitialized SQLite handle (locally) or
write-attempted against a table that was never created (CI).

Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns
`false`; the production storages don't override it. When the resolved
repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`.
This makes the dispatch correct even when global config and the
registered repo disagree, which is the steady-state in the test process.

Reproduces locally via:
  bun test src/task/index/FetchDailyIndexTask.test.ts \\
           src/sec/forms/miscellaneous-filings/Form_8_K.test.ts
(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).

Co-Authored-By: Claude <noreply@anthropic.com>
Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy /
redemption extraction) landing on main after this PR opened.

Conflicts resolved:

- src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts
  - Function signature combines both side's additive params:
    extractor_id, extractor_version (this PR), fullSubmissionText, model (#166).
  - Event writes go through replaceEvents() (this PR), threading
    extractor_id + extractor_version into the version-scoped delete-then-insert.
  - SPAC milestone mapping + redemption extraction blocks from #166 follow
    unchanged after the events are persisted.

- src/task/forms/ProcessAccessionDocFormTask.ts
  - Keep TypeAccessionNumber import (this PR), processMergerProxy +
    hasRedemptionTriggerItem imports (#166).
  - 8-K dispatch call site passes both extractor_id/extractor_version and
    fullSubmissionText into processForm8K; merger-proxy case from #166 follows.

- src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged
  cleanly by git but the new extractMergerDeal / extractRedemption functions
  still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant,
  which this PR removed. Both updated to the nonce-fence API
  (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce))
  so the new SPAC AI extractors get the per-call nonce fence + multi-stage
  defang for free. Without this, the prompts would interpolate as
  "[object Object]" and the model receives garbage.

Verification:
- targeted: bun test src/sec/forms/miscellaneous-filings/ \
    src/sec/forms/proxies-information-statements/ src/task/forms/ \
    src/storage/spac/ src/storage/form-8k-event/ \
    src/sec/forms/registration-statements/  -> 229 pass / 0 fail.
- full: bun test  -> 1410 pass / 7 fail. All 7 fails are pre-existing
  FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts
  unrelated to this PR (sandbox can't reach SEC.gov reliably).
- bun run build  -> clean (bun build + tsc, no errors).

Co-Authored-By: Claude <noreply@anthropic.com>
…e-fence API

The new SPAC extractors added in PR #166 (extractMergerDeal,
extractRedemption) called the pre-PR wrapUntrusted shape (returning a
string) and the removed UNTRUSTED_PREAMBLE constant. After this PR
swapped wrapUntrusted to return { wrapped, nonce } and replaced the
constant with buildUntrustedPreamble(nonce), the surviving call sites
template-interpolated UNTRUSTED_PREAMBLE as a free identifier ->
compile error (TS2552), and even if the type had survived the
{ wrapped, nonce } object would have rendered as "[object Object]"
in the prompt -> the model receives garbage and silently returns
nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null
assertions in the post-merge run).

Both extractors now use the same nonce-fence + multi-stage defang as
the other section extractors -- a forced consequence of the merge,
extending the per-call nonce + entity decode + NFKC + zero-width
strip protection to the new SPAC AI extractors at no extra design
cost.

Co-Authored-By: Claude <noreply@anthropic.com>
…extractors

The merger-proxy and redemption extractors that landed via PR #166 missed
the new prompt-injection seal helpers introduced in PR #165. The seal — raw-
byte verifyRowSpan at gate, boundSourceSpan at persist — is now applied to
both extractors so an unbounded source_span can no longer ship through
SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled
DEFM14A or post-vote 8-K narrative.

Also widen the fence defang to neutralize the </UNTRUSTED&Tab;FILER&Tab;
DOCUMENT> family of bypasses: add whitespace named entities (Tab, NewLine,
nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse
numeric whitespace entities (&#9; / &#x20; etc.) to a single space before the
TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the
primary defense; this closes the layered defang gap.

No extractor version bumps: prompt is unchanged in non-adversarial inputs,
the gate change is normalization-only.
…ompute

Two SPAC correctness issues:

1. processMergerProxy never wrote to extractor_runs. The outer
   ProcessAccessionDocFormTask records a run for the form's extractor id
   (DEFM14A), but the merger-proxy nested extractor id was uncovered, so
   `sec version coverage extractor merger-proxy` always read zero and
   `drop-previous` was a no-op. Mirrors the redemption recordRun pattern
   from PR #168: success at the end, PARSE_ERROR in the segmenter catch,
   PROVIDER_ERROR around runSection.
2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then
   wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error
   between the delete and the final saveDeal corrupted the SPAC report
   row. New SpacDealReplace helper wraps the delete+upsert pass in a real
   transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/
   ROLLBACK on a checked-out PG client. In-memory fallback retains the
   sequential semantics (no concurrency in tests).

No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will
simply start populating an empty table.
…ities + DOCTYPE strip

PR #165 disabled entity processing entirely (processEntities: false) to defang
billion-laughs payloads. That also silently corrupted every XML form value
carrying one of the five predefined XML entities — e.g.
`Mac Accounting Group &amp; CPAs, LLP` was persisting as the literal
four-character string `&amp;` instead of the intended `&`.

This restores the standard predefined-entity decode (so `&amp;`, `&lt;`,
`&gt;`, `&quot;`, `&apos;` round-trip normally) while keeping XXE / billion-
laughs defenses in place:

  - fast-xml-parser's bounded processEntities config caps entity count,
    expansion depth, total expansions, and expanded length — a filer-declared
    chain bombs out at the limit instead of expanding geometrically.
  - A new stripDoctype() pass removes any leading <!DOCTYPE name [...]>
    block before parsing, so filer-declared entities never reach the parser
    at all.

getParser() now returns a thin wrapper that runs stripDoctype() before
parse(), keeping all callsites unchanged.
…loses &#10; bypass)

The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE
the numeric-whitespace collapse pass, so a filer-controlled `&#10;` (or `&#xA;`
/ `&#13;` / `&#xD;` / `&#11;` / `&#12;`) got unwrapped to a literal `\n` /
`\r` / `\v` / `\f` first — and the TAG_SHAPED middle character class
`[\w \t-]` only admitted `\t` and space, so `</UNTRUSTED&#10;FILER&#10;DOCUMENT>`
no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`.
That left the lookalike intact and the fence un-redacted.

Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace
codepoint, so the squash-and-compare callback fires on every variant and the
fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so
benign lowercase / non-fence tag shapes are still left literal.

Tests cover &#10;, &#xA;, &#13;, &#xD;, &#9; (regression), &#11;, &#12;, a
mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two
negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.
… avoid deadlock with outer lock

When a caller already holds an outer Postgres transaction wrapping a
critical section (e.g. PR #170's `withSpacCikLock` will issue
`BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize
SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client
from the shared pool to run its own BEGIN/COMMIT will deadlock the
moment the pool is saturated by concurrent CIK locks — every connection
in the pool holds an outer lock and waits on a second client that the
pool can no longer hand out.

Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs`
(and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner
hand its connection to the inner ops; the Postgres branch then runs
DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/
ROLLBACK/release (the caller still owns those). When `pgClient` is
undefined the defensive default path is unchanged: own pool checkout +
own transaction wrap.

Adds a test asserting both paths — the back-compat case still emits
BEGIN…COMMIT and releases its client, and the caller-supplied case
runs the INSERT on the provided client without issuing any txn or
release calls.
The prior stripFormatChars only stripped 8 explicit BMP codepoints
(ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E,
the math invisibles U+2061-U+2064, or any variation selector
(VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between
the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip —
the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then
failed because non-letter codepoints stayed in the tag body.

Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus
U+180E + math invisibles) joined with the explicit VS ranges
U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF.

Add one test per residual class and a combined adversarial case.
Also correct the misleading comments on the existing non-redaction
tests: the actual rejection mechanism is the inner squashed-letters
check, not the case-insensitive `[_A-Z]` anchor.
The `stripDoctype` regex anchors at the start of the document and
only tolerates an optional `<?xml ...?>` declaration before the
DOCTYPE. A filer who slips a leading XML comment, a non-xml
processing instruction, or a `]>`/`[` inside a quoted PUBLIC id
defeats the regex and lets `<!ENTITY ...>` reach the parser.
Bounded `processEntities` would still expand the declared entity
up to its caps, surfacing filer-controlled bytes in stored values.

Move the seal to the parser layer: flip `processEntities` to
`{ enabled: false }` so no entity ever expands. With expansion off
the parser preserves every `&...;` byte sequence literally,
including the five predefined ones (`&amp;` `&lt;` `&gt;` `&quot;`
`&apos;`), so round-tripping a value like `&amp;` would now break.
Restore the round-trip with a post-parse `decodePredefinedEntities`
walker: single-pass regex over the predefined five, recursive over
plain arrays and `constructor === Object` objects, untouched for
non-string primitives / `Date` / typed arrays. The single-pass
contract is load-bearing: `&amp;lt;` decodes to `&lt;` (one match
consumes the `&amp;`), not to `<`.

Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells
out that it is no longer the security boundary. Tests pin the
three bypass paths (leading comment, leading PI, quoted PUBLIC id
with `]>` / `[`), the predefined-entity round-trip including the
one-pass `&amp;lt;` -> `&lt;` rule, the billion-laughs bound, and
the walker's recursion / primitive-preservation contracts.

Copy link
Copy Markdown
Contributor Author

Superseded by #176. Both fixes here — the residual Unicode-invisible defang closure (\p{Cf} + variation selectors) and the stripDoctype leading-comment/PI seam moved to the parser layer (processEntities off + decodePredefinedEntities) — are carried forward there as the final form of the entity/defang handling. Closing in favor of #176.


Generated by Claude Code

@sroussey sroussey closed this Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants