Skip to content

fix(sec): consolidated SPAC/S-1/8-K extractor hardening (supersedes #165, #168–#174)#176

Open
sroussey wants to merge 25 commits into
mainfrom
claude/great-keller-o9gbrc
Open

fix(sec): consolidated SPAC/S-1/8-K extractor hardening (supersedes #165, #168–#174)#176
sroussey wants to merge 25 commits into
mainfrom
claude/great-keller-o9gbrc

Conversation

@sroussey

Copy link
Copy Markdown
Contributor

What this is

A single, verified consolidation of the eight open hardening PRs against the SPAC / S-1 / 8-K / merger-proxy / redemption extractors — #165, #168, #169, #170, #171, #172, #173, #174 — rebuilt on top of current main. Those eight were a tangled stack (three independent chains off the same pre-#175 base); this branch merges their net effect, reconciled against what main has since gained in #175, and drops the parts #175 superseded. The eight source PRs should be closed in favor of this one.

Verified end-to-end (see Verification): tsc clean, 1537 pass / 7 fail, where all 7 fails are the pre-existing FetchDailyIndexTask / FetchQuarterlyIndexTask network-timeout tests (no EDGAR outbound in CI) on files this branch does not touch.

The stack, and how it maps here

The eight PRs were three independent chains off 7a0f271 (the pre-#175 main):

Chain PRs Tip merged
A — S-1/424 + 8-K + merger-proxy/redemption seal #165#169#172, and #165#171#174 #174
B — SPAC writer atomicity + redemption caps + partial outcome #170#173 #173
C — redemption persistence + idempotent backfill #168 #168

main advanced by exactly one commit since they forked — #175 ("Reap stale observations & serialize concurrent writes per CIK"), which independently added a per-CIK AsyncMutex (withCikLock) serialising every SpacReportWriter.record*, an observation UNIQUE-key + insert-race recovery, and full db-reset table coverage. That overlap is the whole reason a naive "merge all eight" would be wrong, and is reconciled below.

Reconciliation against main (#175)

Kept (genuinely additive, not in main):

Dropped (superseded by #175 or tied to the dropped lock):

Conflicts resolved by hand

Verification

  • bun run build — clean (bundle + tsc, 0 type errors).
  • bun test1537 pass / 7 fail; the 7 are FetchDailyIndexTask / FetchQuarterlyIndexTask (real EDGAR fetch, no outbound in CI) — files untouched by this branch.
  • Targeted suites all green: storage/spac (51), storage/versioning (103), storage/form-8k-event (13), task/forms (16), task/spac (8), sec/forms (483), sec/edgar (6), cli/queries (87), storage/observation (19).

Net change

46 files changed, ~+4753 / −321.


🤖 Generated with Claude Code


Generated by Claude Code

claude added 24 commits June 23, 2026 08:31
The prompt-injection seal around S-1/424 AI section extraction had two
filer-controllable weak points:

1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a
   filer could pre-stage a matching closing tag and end the fence early.
   The defang scan was case-insensitive but flat — only a single literal
   tag-shape was rewritten.
2. A model-emitted `source_span` was capped at the verifier (post-
   normalization) but persisted raw, so an attacker who slipped any
   verifier-passing row could ship unbounded raw bytes through the
   provenance column.

This patch deepens the seal:

- The fence tag carries a per-call 64-bit random nonce. The
  `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing
  tag in the prospectus cannot match the call's actual fence.
- Before defang, the section body is HTML-entity-decoded (multi-pass, up
  to a fixed point), NFKC-normalized, and stripped of zero-width / bidi
  format chars. The defang scan matches any tag-shaped token whose
  alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so
  obfuscations via `&lt;`, fullwidth letters, ZWSP, intra-tag spaces,
  and case-mixing all collapse to `[redacted-fence-tag]`.
- A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning
  `null` over the cap rather than truncating). A new `verifyRowSpan`
  rejects a span whose raw byte count exceeds the cap before the
  normalize-and-substring check runs, so a whitespace-inflated payload
  that would otherwise normalize under cap can no longer pass the gate.
  All `verifyRow:` callsites and `source_span:` persist sites in the S-1
  storage and shared offering-sections layer route through these.
- Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version
  to 1.2.0: prompt-shape changes drift confidence calibration, and the
  span-storage shape changes too. Operators should run startDev/promote
  to roll the new version into production.

Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases,
a 1500-raw-char whitespace-padded span dead-letter test in the storage
layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case +
zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness.

Co-Authored-By: Claude <noreply@anthropic.com>
…ntity hardening

Four hardening fixes around the Form 8-K event-storage path:

1. fast-xml-parser entity expansion is disabled (`processEntities: false`)
   on the shared Form XML parser. A filer-controlled SGML payload that
   declared a chain of nested entity references would otherwise expand
   into a multi-GB string ("billion laughs") and peg CPU at parse time.
   A regression test feeds a 10-level billion-laughs DOCTYPE through
   Form_8_K.parse and asserts the parse stays well under 50 ms.

2. EDGAR accession numbers cross trust boundaries unconstrained — the
   filing-task input schema and the Form 8-K event table both accepted
   any string, so an over-long or malformed accession could land in
   storage. Introduces `TypeAccessionNumber()` (20-char fixed length,
   `^\d{10}-\d{2}-\d{6}$`) and applies it at the
   ProcessAccessionDocFormTask input and the event row schema.

3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as
   its primary key — a re-extract under a new extractor version would
   overwrite the prior version's rows, erasing the time series. Switches
   the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit
   `(cik, accession_number, extractor_id, extractor_version, item_code)`
   UNIQUE natural-key index, mirroring the PersonObservation /
   CompanyObservation shape. Both extractor columns are now first-class
   so coverage / drop-previous ceremonies can target a single version.
   A one-shot legacy-schema migration drops the pre-versioned table on
   the SQLite and Postgres paths (the natural-key PK cannot be ALTERed
   away on either backend, and 8-K events are deterministic to re-extract).

4. `processForm8K` previously looped over items with one `put` per item,
   so a mid-loop crash left the row set torn between old and new items
   for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`
   — DELETE all rows for `(cik, accession_number, extractor_id,
   extractor_version)` then bulk-insert the new set, wrapped in a
   real transaction on the SQLite (better-sqlite3 `db.transaction`)
   and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client)
   paths. The in-memory backend (tests only) is synchronous so a torn
   write cannot interleave. A failure-injection test seeds a row,
   then re-runs `replaceEvents` with a NOT NULL-violating second
   insert and asserts the prior baseline is intact after rollback.

Also wires `extractor_id` + `extractor_version` through the task layer
into `processForm8K` so the same writer can run under any version slot.

Co-Authored-By: Claude <noreply@anthropic.com>
…B_TYPE token

In CI the 21 Form_8_K tests failed with `no such table: form_8k_events`
because `replaceForm8KEvents` was dispatching to `replaceSqlite` even
though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an
in-memory storage. The trigger was test-process global-DI contamination:
`FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time,
which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`.
The ServiceRegistry has no unregister API, so once any earlier test in
the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of
the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens
to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite
branch in `replaceForm8KEvents` won and reached for `getDb()`, which
either fell over on an uninitialized SQLite handle (locally) or
write-attempted against a table that was never created (CI).

Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns
`false`; the production storages don't override it. When the resolved
repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`.
This makes the dispatch correct even when global config and the
registered repo disagree, which is the steady-state in the test process.

Reproduces locally via:
  bun test src/task/index/FetchDailyIndexTask.test.ts \\
           src/sec/forms/miscellaneous-filings/Form_8_K.test.ts
(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).

Co-Authored-By: Claude <noreply@anthropic.com>
Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy /
redemption extraction) landing on main after this PR opened.

Conflicts resolved:

- src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts
  - Function signature combines both side's additive params:
    extractor_id, extractor_version (this PR), fullSubmissionText, model (#166).
  - Event writes go through replaceEvents() (this PR), threading
    extractor_id + extractor_version into the version-scoped delete-then-insert.
  - SPAC milestone mapping + redemption extraction blocks from #166 follow
    unchanged after the events are persisted.

- src/task/forms/ProcessAccessionDocFormTask.ts
  - Keep TypeAccessionNumber import (this PR), processMergerProxy +
    hasRedemptionTriggerItem imports (#166).
  - 8-K dispatch call site passes both extractor_id/extractor_version and
    fullSubmissionText into processForm8K; merger-proxy case from #166 follows.

- src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged
  cleanly by git but the new extractMergerDeal / extractRedemption functions
  still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant,
  which this PR removed. Both updated to the nonce-fence API
  (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce))
  so the new SPAC AI extractors get the per-call nonce fence + multi-stage
  defang for free. Without this, the prompts would interpolate as
  "[object Object]" and the model receives garbage.

Verification:
- targeted: bun test src/sec/forms/miscellaneous-filings/ \
    src/sec/forms/proxies-information-statements/ src/task/forms/ \
    src/storage/spac/ src/storage/form-8k-event/ \
    src/sec/forms/registration-statements/  -> 229 pass / 0 fail.
- full: bun test  -> 1410 pass / 7 fail. All 7 fails are pre-existing
  FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts
  unrelated to this PR (sandbox can't reach SEC.gov reliably).
- bun run build  -> clean (bun build + tsc, no errors).

Co-Authored-By: Claude <noreply@anthropic.com>
…e-fence API

The new SPAC extractors added in PR #166 (extractMergerDeal,
extractRedemption) called the pre-PR wrapUntrusted shape (returning a
string) and the removed UNTRUSTED_PREAMBLE constant. After this PR
swapped wrapUntrusted to return { wrapped, nonce } and replaced the
constant with buildUntrustedPreamble(nonce), the surviving call sites
template-interpolated UNTRUSTED_PREAMBLE as a free identifier ->
compile error (TS2552), and even if the type had survived the
{ wrapped, nonce } object would have rendered as "[object Object]"
in the prompt -> the model receives garbage and silently returns
nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null
assertions in the post-merge run).

Both extractors now use the same nonce-fence + multi-stage defang as
the other section extractors -- a forced consequence of the merge,
extending the per-call nonce + entity decode + NFKC + zero-width
strip protection to the new SPAC AI extractors at no extra design
cost.

Co-Authored-By: Claude <noreply@anthropic.com>
…eal yet

H-4: processRedemption8K previously early-returned and wrote nothing when
spacRepo.getDeals(cik) was empty. For SPACs where a 5.07 / 2.01 / 8.01
vote-results 8-K is ingested before any 1.01 definitive-agreement 8-K
("5.07-first ingestion"), the redemption was lost permanently — no row,
no dead-letter, no correlation when the missing 1.01 later landed.

Delete the deals.length === 0 guard. deriveDeals already reads the full
redemption-extraction set on every recompute and partitions the timeline
into deal windows (including the spacDealGrouping.redemption.test.ts:87-97
"completed-only deal" case), so an orphan extraction persisted here is
automatically correlated by any future write that mints a deal. No schema
change, no version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg
…kfill

C-1: processRedemption8K never wrote to extractor_runs, so the major-bump
coverage gate (`sec version coverage extractor redemption`) always read 0
and `sec version drop-previous extractor redemption` was a no-op. Wire
recordRun around both the well-defined PARSE_ERROR catch and the section
runner — success on return, failure on throw + rethrow. Trigger-item and
SPAC gates remain unrecorded (defensive dead branches in production).

H-1 / H-2: BackfillRedemptionsTask re-ran every known-SPAC trigger 8-K on
every invocation. Add a left-anti-join against extractor_runs via
hasSuccessfulRun(cik, accession, "redemption", activeVersion) so the
sweep is idempotent; a new --force flag opts back into reprocessing.
Replace the per-SPAC (form, cik) loop with two bulk filingRepo.query({
form }) calls + an in-memory CIK-set filter, matching the existing
UpdateAllFormsTask pattern.

H-3: emit a progress log every 100 processed and honor context.signal so
a long sweep can be aborted cleanly; the next invocation resumes
naturally since already-extracted filings are skipped.

CLI: `sec spac backfill-redemptions [--force] [--dry-run]` now reports
selected / processed / skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg
…ndidates

Two Copilot-review findings on #168:

- getRedemptionModel() throw used to bypass the surrounding try/catch and
  abort the whole 8-K processing, leaving no extractor_runs row. Move the
  recordRedemptionRun helper above the model-resolution call and wrap that
  call in a try/catch that dead-letters with reason_code MODEL_RESOLUTION_ERROR,
  records a failed run, and returns cleanly — mirroring the PARSE_ERROR path.
- BackfillRedemptionsTask's skip predicate ran one hasSuccessfulRun query per
  candidate AND used exact-semver matching, so a patch-only version bump would
  reprocess every previously-extracted filing. Switch to a single
  listFilingsWithoutSuccessfulRun call up front, which both narrows to one
  storage query and applies the codebase's standard major.minor.* gating.
…extractors

The merger-proxy and redemption extractors that landed via PR #166 missed
the new prompt-injection seal helpers introduced in PR #165. The seal — raw-
byte verifyRowSpan at gate, boundSourceSpan at persist — is now applied to
both extractors so an unbounded source_span can no longer ship through
SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled
DEFM14A or post-vote 8-K narrative.

Also widen the fence defang to neutralize the </UNTRUSTED&Tab;FILER&Tab;
DOCUMENT> family of bypasses: add whitespace named entities (Tab, NewLine,
nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse
numeric whitespace entities (&#9; / &#x20; etc.) to a single space before the
TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the
primary defense; this closes the layered defang gap.

No extractor version bumps: prompt is unchanged in non-adversarial inputs,
the gate change is normalization-only.
…ompute

Two SPAC correctness issues:

1. processMergerProxy never wrote to extractor_runs. The outer
   ProcessAccessionDocFormTask records a run for the form's extractor id
   (DEFM14A), but the merger-proxy nested extractor id was uncovered, so
   `sec version coverage extractor merger-proxy` always read zero and
   `drop-previous` was a no-op. Mirrors the redemption recordRun pattern
   from PR #168: success at the end, PARSE_ERROR in the segmenter catch,
   PROVIDER_ERROR around runSection.
2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then
   wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error
   between the delete and the final saveDeal corrupted the SPAC report
   row. New SpacDealReplace helper wraps the delete+upsert pass in a real
   transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/
   ROLLBACK on a checked-out PG client. In-memory fallback retains the
   sequential semantics (no concurrency in tests).

No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will
simply start populating an empty table.
SpacReportWriter.snapshot() derived valid_from from wall-clock
next.updated_at and only de-collided against the currently-open
history row, so clock-skew or a stale-replay could invert the chain
or back-date a history snapshot. rebuild() and snapshot() also
read-modify-write without any lock, so two concurrent writers on
the same CIK could leave two valid_to == null rows.

Anchor valid_from to the data: filingDate for non-stale writes, the
existing row's as_of for stale replays, with strict monotonicity
enforced against the max of all prior closed/open valid_to values.
Wrap the rebuild critical section in withSpacCikLock — SQLite
BEGIN IMMEDIATE, Postgres pg_advisory_xact_lock keyed on CIK,
in-memory keyed mutex fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
processRedemption8K joined the primary doc + every EX-99 exhibit
markdown unconditionally into runStructured, with MAX_TOKENS=4096
bounding only the model's completion. A multi-megabyte EX-99 ran
up token bills and widened the prompt-injection surface proportional
to filing size. Cap per-exhibit at 200k chars and total at 400k
chars; oversized exhibits are dropped (not truncated, since a partial
span breaks source-span verification). Full-drop records an
OVERSIZED_INPUT dead-letter without invoking the model; partial-drop
records an additional informational partial-letter so operators can
triage filings whose largest exhibit was skipped. Bump redemption
extractor version 1.0.0 -> 1.1.0 - the model now sees a different
prompt shape, so confidence calibration drifts; treat as a fresh
dev cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
makeRunSection catches MODEL_INVALID_OUTPUT / LOW_CONFIDENCE_ALL /
UNVERIFIED_SOURCE_SPAN, writes a dead-letter, and returns without
throwing. ProcessAccessionDocFormTask then recorded a success
extractor_run row even when every section dead-lettered, so
sec version coverage counted them as covered and drop-previous
purged the dead-letter rows operators needed for triage.

Add a three-state outcome column (success / partial / failure) to
extractor_runs. ProcessAccessionDocFormTask now queries the pending
section-level dead-letters for the filing it just stored and writes
outcome = partial when any exist. countSuccessfulAtVersion and
listFilingsWithoutSuccessfulRun count only outcome = success;
partial rows stay eligible for retry-dead-letters. Legacy rows
backfill outcome from the existing success boolean - partial
breakdown is unknowable for them; SQLite gets a one-shot
ADD COLUMN migration in setupAllDatabases for pre-existing
databases.

Also tightens SpacWriteLock's backend dispatch to test the
dealRepository class rather than the SEC_DB_TYPE token alone -
tests register the token as sqlite while binding in-memory
storages, so the env-only check spuriously opened a stray
SQLite file via getDb().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
…ities + DOCTYPE strip

PR #165 disabled entity processing entirely (processEntities: false) to defang
billion-laughs payloads. That also silently corrupted every XML form value
carrying one of the five predefined XML entities — e.g.
`Mac Accounting Group &amp; CPAs, LLP` was persisting as the literal
four-character string `&amp;` instead of the intended `&`.

This restores the standard predefined-entity decode (so `&amp;`, `&lt;`,
`&gt;`, `&quot;`, `&apos;` round-trip normally) while keeping XXE / billion-
laughs defenses in place:

  - fast-xml-parser's bounded processEntities config caps entity count,
    expansion depth, total expansions, and expanded length — a filer-declared
    chain bombs out at the limit instead of expanding geometrically.
  - A new stripDoctype() pass removes any leading <!DOCTYPE name [...]>
    block before parsing, so filer-declared entities never reach the parser
    at all.

getParser() now returns a thin wrapper that runs stripDoctype() before
parse(), keeping all callsites unchanged.
…loses &#10; bypass)

The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE
the numeric-whitespace collapse pass, so a filer-controlled `&#10;` (or `&#xA;`
/ `&#13;` / `&#xD;` / `&#11;` / `&#12;`) got unwrapped to a literal `\n` /
`\r` / `\v` / `\f` first — and the TAG_SHAPED middle character class
`[\w \t-]` only admitted `\t` and space, so `</UNTRUSTED&#10;FILER&#10;DOCUMENT>`
no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`.
That left the lookalike intact and the fence un-redacted.

Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace
codepoint, so the squash-and-compare callback fires on every variant and the
fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so
benign lowercase / non-fence tag shapes are still left literal.

Tests cover &#10;, &#xA;, &#13;, &#xD;, &#9; (regression), &#11;, &#12;, a
mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two
negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.
… avoid deadlock with outer lock

When a caller already holds an outer Postgres transaction wrapping a
critical section (e.g. PR #170's `withSpacCikLock` will issue
`BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize
SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client
from the shared pool to run its own BEGIN/COMMIT will deadlock the
moment the pool is saturated by concurrent CIK locks — every connection
in the pool holds an outer lock and waits on a second client that the
pool can no longer hand out.

Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs`
(and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner
hand its connection to the inner ops; the Postgres branch then runs
DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/
ROLLBACK/release (the caller still owns those). When `pgClient` is
undefined the defensive default path is unchanged: own pool checkout +
own transaction wrap.

Adds a test asserting both paths — the back-compat case still emits
BEGIN…COMMIT and releases its client, and the caller-supplied case
runs the INSERT on the provided client without issuing any txn or
release calls.
…oid concurrent BEGIN IMMEDIATE crash)

The SQLite branch issues `BEGIN IMMEDIATE` on the singleton
`better-sqlite3` connection that `getDb()` returns. The pre-existing
per-CIK keyed mutex (`withInProcessLock`) only serializes writers on
the same CIK — distinct CIKs race past the mutex and hit BEGIN
concurrently, and SQLite responds with
"cannot start a transaction within a transaction".

Wrapping the SQLite branch body in a process-wide gate keyed by a
sentinel value (`SQLITE_GLOBAL_LOCK_KEY = 0`) forces every SPAC writer
to queue at the connection regardless of CIK. The connection-level
SQLite database lock is single-writer anyway, so this matches the
backend's actual concurrency model. Postgres and the in-memory
fallback are unchanged (per-CIK serialization is correct there).

Test: five parallel `recordRegistration` calls across CIKs
[100, 200, 300, 400, 500] under a real SQLite backend all resolve
fulfilled, with zero throws containing "transaction within a
transaction". The fix is essential — reverting only the
SQLITE_GLOBAL_LOCK_KEY wrap causes this test to throw on the
second concurrent BEGIN.
… (informational only)

`processRedemption8K` records a `redemption-partial-oversized` dead-letter
when at least one exhibit was dropped over the per-exhibit cap but a non-empty
survivor set still ran through extraction. The entry exists so operators can
triage filings whose largest exhibit was elided.

It is purely informational — the drop is deterministic (the cap doesn't move
between runs) so no retry recovers the dropped exhibit. Today the entry sits
in the `pending` worklist forever and pollutes `sec extractor dead-letters
redemption` output with rows that no extractor-version bump can clear.

This call adds a `markResolved` immediately after the `record` so the entry
lands in the `resolved` state and is excluded from `listEligible`. The
`attempts` counter keeps incrementing on each replay (so the audit trail
of how many times the cap was hit for a given accession is preserved), and
`listPending` / `listEligible` queries never surface it again.

Test: a filing with one in-cap + one oversized exhibit runs through
`processRedemption8K` twice; after each run the entry exists at
status="resolved", `listEligible` filtered to its section returns 0 entries,
and `attempts` increments (1 then 2).
The prior stripFormatChars only stripped 8 explicit BMP codepoints
(ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E,
the math invisibles U+2061-U+2064, or any variation selector
(VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between
the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip —
the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then
failed because non-letter codepoints stayed in the tag body.

Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus
U+180E + math invisibles) joined with the explicit VS ranges
U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF.

Add one test per residual class and a combined adversarial case.
Also correct the misleading comments on the existing non-redaction
tests: the actual rejection mechanism is the inner squashed-letters
check, not the case-insensitive `[_A-Z]` anchor.
The `stripDoctype` regex anchors at the start of the document and
only tolerates an optional `<?xml ...?>` declaration before the
DOCTYPE. A filer who slips a leading XML comment, a non-xml
processing instruction, or a `]>`/`[` inside a quoted PUBLIC id
defeats the regex and lets `<!ENTITY ...>` reach the parser.
Bounded `processEntities` would still expand the declared entity
up to its caps, surfacing filer-controlled bytes in stored values.

Move the seal to the parser layer: flip `processEntities` to
`{ enabled: false }` so no entity ever expands. With expansion off
the parser preserves every `&...;` byte sequence literally,
including the five predefined ones (`&amp;` `&lt;` `&gt;` `&quot;`
`&apos;`), so round-tripping a value like `&amp;` would now break.
Restore the round-trip with a post-parse `decodePredefinedEntities`
walker: single-pass regex over the predefined five, recursive over
plain arrays and `constructor === Object` objects, untouched for
non-string primitives / `Date` / typed arrays. The single-pass
contract is load-bearing: `&amp;lt;` decodes to `&lt;` (one match
consumes the `&amp;`), not to `<`.

Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells
out that it is no longer the security boundary. Tests pin the
three bypass paths (leading comment, leading PI, quoted PUBLIC id
with `]>` / `[`), the predefined-entity round-trip including the
one-pass `&amp;lt;` -> `&lt;` rule, the billion-laughs bound, and
the walker's recursion / primitive-preservation contracts.
Resolved conflicts:
- redemption8k.ts: keep #169 prompt-injection seal (verifyRowSpan/
  boundSourceSpan) inside #168's extractor_runs try/catch recording.
- BackfillRedemptionsTask.ts: take #168's idempotent candidate-selection +
  listFilingsWithoutSuccessfulRun anti-join (subsumes main's per-(form,cik)
  query); use context.signal directly.
…er resolve

Reconciliation against main's #175 (per-CIK AsyncMutex + monotonic snapshot):
- DROP #170/#173 SpacWriteLock/withSpacCikLock: main's withCikLock already
  serialises per-CIK writes; the second lock double-locked and #173's SQLite
  BEGIN-IMMEDIATE gate is moot. Removed SpacWriteLock.ts + test.
- KEEP #170's monotonic snapshot (filing-date anchored, strict monotonicity vs
  all prior history rows, as_of stale-replay anchor) — additive correctness,
  orthogonal to the mutex.
- KEEP #170 redemption AI input caps + #173 dead-letter auto-resolve + #170
  partial-success extractor_runs outcome.
- ProcessAccessionDocFormTask: one listPending scan now drives both the #175
  reap-skip and the #170 partial/success outcome.
- recomputeAndSaveDeals: dropped dormant pgClient param (no outer-txn caller
  remains); SpacDealReplace keeps its own optional pgClient.
…tion

Correctness:
- redemption8k full-drop (OVERSIZED_INPUT): record a successful run before
  returning, so #168's listFilingsWithoutSuccessfulRun backfill sweep is
  idempotent and never re-fetches/re-drops the oversized submission forever
  (the OVERSIZED dead-letter stays pending for version-bump retry). This was a
  cross-stack gap: #170 added the full-drop, #168 added the anti-join.
- sectionExtractors defang: guard String.fromCodePoint with a Unicode
  code-point range check (0..0x10FFFF). Number.isFinite alone let a filer
  '&#x110000;' / '&#1114112;' through, throwing RangeError that aborted the
  defang and permanently dead-lettered the section. Three call sites.
- ProcessAccessionDocFormTask: derive the success/partial outcome from the same
  hasBlockingSectionFailure predicate as the reap gate (recency + exclude
  SECTION_NOT_FOUND / -partial), not a raw 'any pending section' scan — a stale
  prior-version entry or an absent section no longer marks a clean run partial
  and reprocesses it forever via the version-gated sweep.

Cleanup:
- SpacDealReplace: remove the dead caller-owned pgClient path (its only purpose
  was the removed withSpacCikLock outer transaction; no production caller
  remains) and delete its now-obsolete test.
- SpacReportWriter.snapshot: flatten the dead nested ternary in the stale anchor.
- accessionNumber: add the explicit return type CLAUDE.md requires.

Regression tests added for the RangeError crash and the full-drop idempotency.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants