fix(sec): consolidated SPAC/S-1/8-K extractor hardening (supersedes #165, #168–#174) by sroussey · Pull Request #176 · workglow-dev/sec

sroussey · 2026-06-30T19:44:29Z

What this is

A single, verified consolidation of the eight open hardening PRs against the SPAC / S-1 / 8-K / merger-proxy / redemption extractors — #165, #168, #169, #170, #171, #172, #173, #174 — rebuilt on top of current main. Those eight were a tangled stack (three independent chains off the same pre-#175 base); this branch merges their net effect, reconciled against what main has since gained in #175, and drops the parts #175 superseded. The eight source PRs should be closed in favor of this one.

Verified end-to-end (see Verification): tsc clean, 1537 pass / 7 fail, where all 7 fails are the pre-existing FetchDailyIndexTask / FetchQuarterlyIndexTask network-timeout tests (no EDGAR outbound in CI) on files this branch does not touch.

The stack, and how it maps here

The eight PRs were three independent chains off 7a0f271 (the pre-#175 main):

Chain	PRs	Tip merged
A — S-1/424 + 8-K + merger-proxy/redemption seal	#165 → #169 → #172, and #165 → #171 → #174	#174
B — SPAC writer atomicity + redemption caps + partial outcome	#170 → #173	#173
C — redemption persistence + idempotent backfill	#168	#168

main advanced by exactly one commit since they forked — #175 ("Reap stale observations & serialize concurrent writes per CIK"), which independently added a per-CIK AsyncMutex (withCikLock) serialising every SpacReportWriter.record*, an observation UNIQUE-key + insert-race recovery, and full db-reset table coverage. That overlap is the whole reason a naive "merge all eight" would be wrong, and is reconciled below.

Reconciliation against `main` (#175)

Kept (genuinely additive, not in main):

Prompt-injection seal (S-1/424 + merger-proxy + redemption): per-call nonce fence, multi-stage defang (HTML-entity decode → NFKC → format-char strip), and the raw-byte source_span cap (boundSourceSpan/verifyRowSpan). Includes the follow-up bypass closures: whitespace-mid-tag (an encoded newline inside the tag), Unicode-invisible (\p{Cf} + variation selectors), and the stripDoctype leading-comment/PI seam (#172/#174).
XML entity handling (final form from #174): parser-level processEntities disabled + post-parse decodePredefinedEntities walker, so billion-laughs is defanged and the five predefined entities still round-trip (the bug #171 caught, where &amp; was corrupted).
Form 8-K storage hardening: transactional replaceEvents, versioned PK, TypeAccessionNumber validation, legacy-table migration.
Atomic SPAC deal replace (SpacDealReplace.ts, from #169/#174): the delete-orphans + upsert pass now runs in one transaction. This is orthogonal to #175's mutex: the mutex prevents concurrent interleaving; the transaction prevents mid-pass-crash corruption.
Monotonic history snapshot (from #170): valid_from anchored to filing data (not wall-clock), strict monotonicity against all prior history rows, as_of stale-replay anchor. Also orthogonal to the mutex and stronger than main's de-collide.
Redemption AI input caps (#170), partial-success outcome on extractor_runs (#170), dead-letter auto-resolve for redemption-partial-oversized (#173).
Redemption persistence + operability (#168): persist extractions even before a spac_deal exists; extractor_runs recording; idempotent, bulk-query backfill with a listFilingsWithoutSuccessfulRun anti-join + --force.

Dropped (superseded by #175 or tied to the dropped lock):

SpacWriteLock.ts / withSpacCikLock (from #170) and its SQLite BEGIN IMMEDIATE gate fix (#173): main's withCikLock already serialises per-CIK writes; the second lock double-locked.
The pgClient threading through recomputeAndSaveDeals (#172 fix 2): it existed only to feed an outer withSpacCikLock transaction that no longer exists. SpacDealReplace keeps its own optional pgClient (exercised by its direct tests).

Conflicts resolved by hand

redemption8k.ts: kept the #169 seal (verifyRowSpan/boundSourceSpan) inside #168's extractor_runs recording try/catch; #170's input caps coexist.
BackfillRedemptionsTask.ts: took #168's idempotent candidate-selection (subsumes main's per-(form,cik) query) and used context.signal directly.
SpacReportWriter.ts: kept main's withCikLock + #170's snapshot + #169/#174's atomic recompute; removed the redundant inner lock and dormant pgClient param.
ProcessAccessionDocFormTask.ts — a single listPending scan now drives both #175's reap-skip and #170's success/partial outcome.
SpacReportWriter.test.ts — kept #175's concurrency test and #170's three history-monotonicity tests.

Verification

bun run build — clean (bundle + tsc, 0 type errors).
bun test — 1537 pass / 7 fail; the 7 are FetchDailyIndexTask / FetchQuarterlyIndexTask (real EDGAR fetch, no outbound in CI) — files untouched by this branch.
Targeted suites all green: storage/spac (51), storage/versioning (103), storage/form-8k-event (13), task/forms (16), task/spac (8), sec/forms (483), sec/edgar (6), cli/queries (87), storage/observation (19).

Net change

46 files changed, ~+4753 / −321.

🤖 Generated with Claude Code


The prompt-injection seal around S-1/424 AI section extraction had two filer-controllable weak points:

1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a filer could pre-stage a matching closing tag and end the fence early. The defang scan was case-insensitive but flat: only a single literal tag-shape was rewritten.
2. A model-emitted `source_span` was capped at the verifier (post-normalization) but persisted raw, so an attacker who got any row past the verifier could ship unbounded raw bytes through the provenance column.

This patch deepens the seal:

- The fence tag carries a per-call 64-bit random nonce. The `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing tag in the prospectus cannot match the call's actual fence.
- Before defang, the section body is HTML-entity-decoded (multi-pass, up to a fixed point), NFKC-normalized, and stripped of zero-width / bidi format chars. The defang scan matches any tag-shaped token whose alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so obfuscations via `&lt;`, fullwidth letters, ZWSP, intra-tag spaces, and case-mixing all collapse to `[redacted-fence-tag]`.
- A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning `null` over the cap rather than truncating). A new `verifyRowSpan` rejects a span whose raw byte count exceeds the cap before the normalize-and-substring check runs, so a whitespace-inflated payload that would otherwise normalize under cap can no longer pass the gate. All `verifyRow:` callsites and `source_span:` persist sites in the S-1 storage and shared offering-sections layer route through these.
- Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version to 1.2.0: prompt-shape changes drift confidence calibration, and the span-storage shape changes too. Operators should run startDev/promote to roll the new version into production.

Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases, a 1500-raw-char whitespace-padded span dead-letter test in the storage layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case + zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness.

Co-Authored-By: Claude <noreply@anthropic.com>
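As a rough illustration of the nonce fence and the span cap described in this commit, the sketch below shows one way the pieces could fit together. The function bodies are assumptions rather than the repository's code: the whitespace normalizer, the use of `String.length` as the raw-size measure, and the exact fence layout are all hypothetical. Only the names `wrapUntrusted`, `boundSourceSpan`, `verifyRowSpan`, the `{ wrapped, nonce }` return shape, the tag prefix, and the 1000-char cap come from the text.

```typescript
import { randomBytes } from "node:crypto";

// Cap taken from the commit message; measured here in UTF-16 code units (assumption).
const SOURCE_SPAN_CAP = 1000;

// Wrap untrusted filer text in a fence whose tag carries a fresh 64-bit nonce,
// so a closing tag pre-staged inside the document cannot match this call's fence.
function wrapUntrusted(body: string): { wrapped: string; nonce: string } {
  const nonce = randomBytes(8).toString("hex"); // 8 bytes = 64 bits = 16 hex chars
  const tag = `UNTRUSTED_FILER_DOCUMENT_NONCE_${nonce}`;
  return { wrapped: `<${tag}>\n${body}\n</${tag}>`, nonce };
}

// Persist-side bound: return the span as-is within the cap, null above it (never truncate).
function boundSourceSpan(span: string): string | null {
  return span.length <= SOURCE_SPAN_CAP ? span : null;
}

// Hypothetical normalizer: collapse whitespace runs before the substring check.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Gate-side check: reject on raw size first, so whitespace padding that would
// shrink under normalization cannot slip past the cap.
function verifyRowSpan(span: string, source: string): boolean {
  if (span.length > SOURCE_SPAN_CAP) return false;
  return normalizeWhitespace(source).includes(normalizeWhitespace(span));
}
```

Checking the raw length before normalizing is the point of the design: a 1500-char span that is mostly padding fails the gate even though its normalized form is short.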

…ntity hardening

Four hardening fixes around the Form 8-K event-storage path:

1. fast-xml-parser entity expansion is disabled (`processEntities: false`) on the shared Form XML parser. A filer-controlled SGML payload that declared a chain of nested entity references would otherwise expand into a multi-GB string ("billion laughs") and peg CPU at parse time. A regression test feeds a 10-level billion-laughs DOCTYPE through Form_8_K.parse and asserts the parse stays well under 50 ms.
2. EDGAR accession numbers cross trust boundaries unconstrained: the filing-task input schema and the Form 8-K event table both accepted any string, so an over-long or malformed accession could land in storage. Introduces `TypeAccessionNumber()` (20-char fixed length, `^\d{10}-\d{2}-\d{6}$`) and applies it at the ProcessAccessionDocFormTask input and the event row schema.
3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as its primary key, so a re-extract under a new extractor version would overwrite the prior version's rows, erasing the time series. Switches the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit `(cik, accession_number, extractor_id, extractor_version, item_code)` UNIQUE natural-key index, mirroring the PersonObservation / CompanyObservation shape. Both extractor columns are now first-class so coverage / drop-previous ceremonies can target a single version. A one-shot legacy-schema migration drops the pre-versioned table on the SQLite and Postgres paths (the natural-key PK cannot be ALTERed away on either backend, and 8-K events are deterministic to re-extract).
4. `processForm8K` previously looped over items with one `put` per item, so a mid-loop crash left the row set torn between old and new items for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`: DELETE all rows for `(cik, accession_number, extractor_id, extractor_version)` then bulk-insert the new set, wrapped in a real transaction on the SQLite (better-sqlite3 `db.transaction`) and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client) paths. The in-memory backend (tests only) is synchronous so a torn write cannot interleave. A failure-injection test seeds a row, then re-runs `replaceEvents` with a NOT NULL-violating second insert and asserts the prior baseline is intact after rollback.

Also wires `extractor_id` + `extractor_version` through the task layer into `processForm8K` so the same writer can run under any version slot.

Co-Authored-By: Claude <noreply@anthropic.com>
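The accession-number constraint in item 2 can be illustrated with a small standalone check. This is a sketch under assumptions: `isAccessionNumber` is a hypothetical helper name, not the `TypeAccessionNumber()` schema factory itself; only the 20-character length and the pattern are taken from the commit message.

```typescript
// Pattern and fixed length as quoted in the commit message: 10 digits, 2 digits, 6 digits.
const ACCESSION_NUMBER_PATTERN = /^\d{10}-\d{2}-\d{6}$/;
const ACCESSION_NUMBER_LENGTH = 20; // 10 + 1 + 2 + 1 + 6

// Hypothetical type guard: true only for well-formed EDGAR accession numbers.
function isAccessionNumber(value: unknown): value is string {
  return (
    typeof value === "string" &&
    value.length === ACCESSION_NUMBER_LENGTH &&
    ACCESSION_NUMBER_PATTERN.test(value)
  );
}
```

Applying a guard like this at the task input and again at the row schema means a malformed accession is rejected at both trust boundaries rather than only at one.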

…B_TYPE token

In CI the 21 Form_8_K tests failed with `no such table: form_8k_events` because `replaceForm8KEvents` was dispatching to `replaceSqlite` even though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an in-memory storage.

The trigger was test-process global-DI contamination: `FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time, which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`. The ServiceRegistry has no unregister API, so once any earlier test in the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite branch in `replaceForm8KEvents` won and reached for `getDb()`, which either fell over on an uninitialized SQLite handle (locally) or attempted a write against a table that was never created (CI).

Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns `false`; the production storages don't override it. When the resolved repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`. This makes the dispatch correct even when global config and the registered repo disagree, which is the steady-state in the test process.

Reproduces locally via:

    bun test src/task/index/FetchDailyIndexTask.test.ts \
      src/sec/forms/miscellaneous-filings/Form_8_K.test.ts

(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).

Co-Authored-By: Claude <noreply@anthropic.com>

Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy / redemption extraction) landing on main after this PR opened.

Conflicts resolved:

- src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts
  - Function signature combines both sides' additive params: extractor_id, extractor_version (this PR), fullSubmissionText, model (#166).
  - Event writes go through replaceEvents() (this PR), threading extractor_id + extractor_version into the version-scoped delete-then-insert.
  - SPAC milestone mapping + redemption extraction blocks from #166 follow unchanged after the events are persisted.
- src/task/forms/ProcessAccessionDocFormTask.ts
  - Keep TypeAccessionNumber import (this PR), processMergerProxy + hasRedemptionTriggerItem imports (#166).
  - 8-K dispatch call site passes both extractor_id/extractor_version and fullSubmissionText into processForm8K; merger-proxy case from #166 follows.
- src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged cleanly by git, but the new extractMergerDeal / extractRedemption functions still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant, which this PR removed). Both updated to the nonce-fence API (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce)) so the new SPAC AI extractors get the per-call nonce fence + multi-stage defang for free. Without this, the prompts would interpolate as "[object Object]" and the model receives garbage.

Verification:

- targeted: bun test src/sec/forms/miscellaneous-filings/ \
  src/sec/forms/proxies-information-statements/ src/task/forms/ \
  src/storage/spac/ src/storage/form-8k-event/ \
  src/sec/forms/registration-statements/ -> 229 pass / 0 fail.
- full: bun test -> 1410 pass / 7 fail. All 7 fails are pre-existing FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts unrelated to this PR (sandbox can't reach SEC.gov reliably).
- bun run build -> clean (bun build + tsc, no errors).

Co-Authored-By: Claude <noreply@anthropic.com>

…e-fence API The new SPAC extractors added in PR #166 (extractMergerDeal, extractRedemption) called the pre-PR wrapUntrusted shape (returning a string) and the removed UNTRUSTED_PREAMBLE constant. After this PR swapped wrapUntrusted to return { wrapped, nonce } and replaced the constant with buildUntrustedPreamble(nonce), the surviving call sites template-interpolated UNTRUSTED_PREAMBLE as a free identifier -> compile error (TS2552), and even if the type had survived the { wrapped, nonce } object would have rendered as "[object Object]" in the prompt -> the model receives garbage and silently returns nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null assertions in the post-merge run). Both extractors now use the same nonce-fence + multi-stage defang as the other section extractors -- a forced consequence of the merge, extending the per-call nonce + entity decode + NFKC + zero-width strip protection to the new SPAC AI extractors at no extra design cost. Co-Authored-By: Claude <noreply@anthropic.com>

…eal yet H-4: processRedemption8K previously early-returned and wrote nothing when spacRepo.getDeals(cik) was empty. For SPACs where a 5.07 / 2.01 / 8.01 vote-results 8-K is ingested before any 1.01 definitive-agreement 8-K ("5.07-first ingestion"), the redemption was lost permanently — no row, no dead-letter, no correlation when the missing 1.01 later landed. Delete the deals.length === 0 guard. deriveDeals already reads the full redemption-extraction set on every recompute and partitions the timeline into deal windows (including the spacDealGrouping.redemption.test.ts:87-97 "completed-only deal" case), so an orphan extraction persisted here is automatically correlated by any future write that mints a deal. No schema change, no version bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg

…kfill

C-1: processRedemption8K never wrote to extractor_runs, so the major-bump coverage gate (`sec version coverage extractor redemption`) always read 0 and `sec version drop-previous extractor redemption` was a no-op. Wire recordRun around both the well-defined PARSE_ERROR catch and the section runner: success on return, failure on throw + rethrow. Trigger-item and SPAC gates remain unrecorded (defensive dead branches in production).

H-1 / H-2: BackfillRedemptionsTask re-ran every known-SPAC trigger 8-K on every invocation. Add a left-anti-join against extractor_runs via hasSuccessfulRun(cik, accession, "redemption", activeVersion) so the sweep is idempotent; a new --force flag opts back into reprocessing. Replace the per-SPAC (form, cik) loop with two bulk filingRepo.query({ form }) calls + an in-memory CIK-set filter, matching the existing UpdateAllFormsTask pattern.

H-3: emit a progress log every 100 processed and honor context.signal so a long sweep can be aborted cleanly; the next invocation resumes naturally since already-extracted filings are skipped.

CLI: `sec spac backfill-redemptions [--force] [--dry-run]` now reports selected / processed / skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg

…ndidates

Two Copilot-review findings on #168:

- A getRedemptionModel() throw used to bypass the surrounding try/catch and abort the whole 8-K processing, leaving no extractor_runs row. Move the recordRedemptionRun helper above the model-resolution call and wrap that call in a try/catch that dead-letters with reason_code MODEL_RESOLUTION_ERROR, records a failed run, and returns cleanly, mirroring the PARSE_ERROR path.
- BackfillRedemptionsTask's skip predicate ran one hasSuccessfulRun query per candidate AND used exact-semver matching, so a patch-only version bump would reprocess every previously-extracted filing. Switch to a single listFilingsWithoutSuccessfulRun call up front, which both narrows to one storage query and applies the codebase's standard major.minor.* gating.
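The difference between exact-semver matching and major.minor.* gating can be shown with an in-memory stand-in for the anti-join. Everything here is hypothetical (`selectCandidates`, `coveredBy`, and the `RunRecord` shape are illustrative names, not the repository's API); the sketch only assumes that a successful run at the same major.minor as the active version counts as coverage.

```typescript
interface RunRecord {
  accession: string;
  version: string; // semver string such as "1.1.0"
  outcome: "success" | "failure";
}

// Reduce a semver string to its "major.minor" prefix.
function majorMinor(version: string): string {
  const [major, minor] = version.split(".");
  return `${major}.${minor}`;
}

// A run covers the active version when major and minor match; patch is ignored.
function coveredBy(runVersion: string, activeVersion: string): boolean {
  return majorMinor(runVersion) === majorMinor(activeVersion);
}

// Anti-join in memory: keep accessions with no successful covering run.
function selectCandidates(
  accessions: string[],
  runs: RunRecord[],
  activeVersion: string,
): string[] {
  const done = new Set(
    runs
      .filter((r) => r.outcome === "success" && coveredBy(r.version, activeVersion))
      .map((r) => r.accession),
  );
  return accessions.filter((a) => !done.has(a));
}
```

With an active version of 1.1.3, a filing extracted successfully at 1.1.0 is skipped, while one last extracted at 1.0.4 or one that only has a failed run is selected again.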

…extractors

The merger-proxy and redemption extractors that landed via PR #166 missed the new prompt-injection seal helpers introduced in PR #165. The seal (raw-byte verifyRowSpan at gate, boundSourceSpan at persist) is now applied to both extractors so an unbounded source_span can no longer ship through SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled DEFM14A or post-vote 8-K narrative.

Also widen the fence defang to neutralize the `</UNTRUSTED&Tab;FILER&Tab;DOCUMENT>` family of bypasses: add whitespace named entities (Tab, NewLine, nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse numeric whitespace entities (the numeric character references for tab, space, etc.) to a single space before the TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the primary defense; this closes the layered defang gap.

No extractor version bumps: the prompt is unchanged for non-adversarial inputs, and the gate change is normalization-only.

…ompute

Two SPAC correctness issues:

1. processMergerProxy never wrote to extractor_runs. The outer ProcessAccessionDocFormTask records a run for the form's extractor id (DEFM14A), but the merger-proxy nested extractor id was uncovered, so `sec version coverage extractor merger-proxy` always read zero and `drop-previous` was a no-op. Mirrors the redemption recordRun pattern from PR #168: success at the end, PARSE_ERROR in the segmenter catch, PROVIDER_ERROR around runSection.
2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error between the delete and the final saveDeal corrupted the SPAC report row. New SpacDealReplace helper wraps the delete+upsert pass in a real transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/ROLLBACK on a checked-out PG client. In-memory fallback retains the sequential semantics (no concurrency in tests).

No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will simply start populating an empty table.

SpacReportWriter.snapshot() derived valid_from from wall-clock next.updated_at and only de-collided against the currently-open history row, so clock-skew or a stale-replay could invert the chain or back-date a history snapshot. rebuild() and snapshot() also read-modify-write without any lock, so two concurrent writers on the same CIK could leave two valid_to == null rows. Anchor valid_from to the data: filingDate for non-stale writes, the existing row's as_of for stale replays, with strict monotonicity enforced against the max of all prior closed/open valid_to values. Wrap the rebuild critical section in withSpacCikLock — SQLite BEGIN IMMEDIATE, Postgres pg_advisory_xact_lock keyed on CIK, in-memory keyed mutex fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF

processRedemption8K joined the primary doc + every EX-99 exhibit markdown unconditionally into runStructured, with MAX_TOKENS=4096 bounding only the model's completion. A multi-megabyte EX-99 ran up token bills and widened the prompt-injection surface proportional to filing size. Cap per-exhibit at 200k chars and total at 400k chars; oversized exhibits are dropped (not truncated, since a partial span breaks source-span verification). Full-drop records an OVERSIZED_INPUT dead-letter without invoking the model; partial-drop records an additional informational partial-letter so operators can triage filings whose largest exhibit was skipped. Bump redemption extractor version 1.0.0 -> 1.1.0 - the model now sees a different prompt shape, so confidence calibration drifts; treat as a fresh dev cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
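A drop-not-truncate cap of this kind could look roughly like the sketch below. The 200k / 400k limits come from the commit message; the rest is assumed for illustration: `capDocuments` and `classifyCapResult` are hypothetical names, sizes are measured with `String.length`, and the rule that a document is dropped whole once it would push the running total over the limit is a guess at how the total cap is applied.

```typescript
const PER_EXHIBIT_CAP = 200000; // chars, per the commit message
const TOTAL_CAP = 400000; // chars, per the commit message

interface CapResult {
  kept: string[];
  dropped: number;
}

// Keep documents whole or drop them whole; a truncated document would break
// source-span verification because spans must appear verbatim in the input.
function capDocuments(docs: string[]): CapResult {
  const kept: string[] = [];
  let total = 0;
  let dropped = 0;
  for (const doc of docs) {
    if (doc.length > PER_EXHIBIT_CAP || total + doc.length > TOTAL_CAP) {
      dropped += 1;
      continue;
    }
    kept.push(doc);
    total += doc.length;
  }
  return { kept, dropped };
}

type CapOutcome = "ok" | "partial-drop" | "full-drop";

// full-drop: nothing survived, so the model is never invoked.
// partial-drop: extraction runs on the survivors and an informational entry is recorded.
function classifyCapResult(result: CapResult): CapOutcome {
  if (result.dropped === 0) return "ok";
  return result.kept.length === 0 ? "full-drop" : "partial-drop";
}
```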

makeRunSection catches MODEL_INVALID_OUTPUT / LOW_CONFIDENCE_ALL / UNVERIFIED_SOURCE_SPAN, writes a dead-letter, and returns without throwing. ProcessAccessionDocFormTask then recorded a success extractor_run row even when every section dead-lettered, so sec version coverage counted them as covered and drop-previous purged the dead-letter rows operators needed for triage. Add a three-state outcome column (success / partial / failure) to extractor_runs. ProcessAccessionDocFormTask now queries the pending section-level dead-letters for the filing it just stored and writes outcome = partial when any exist. countSuccessfulAtVersion and listFilingsWithoutSuccessfulRun count only outcome = success; partial rows stay eligible for retry-dead-letters. Legacy rows backfill outcome from the existing success boolean - partial breakdown is unknowable for them; SQLite gets a one-shot ADD COLUMN migration in setupAllDatabases for pre-existing databases. Also tightens SpacWriteLock's backend dispatch to test the dealRepository class rather than the SEC_DB_TYPE token alone - tests register the token as sqlite while binding in-memory storages, so the env-only check spuriously opened a stray SQLite file via getDb(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
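The three-state outcome can be pictured with a small sketch. The names `deriveOutcome`, `countSuccessful`, and the `DeadLetterEntry` shape are hypothetical; the sketch assumes only what the commit message states, namely that a run with pending section-level dead-letters is `partial` and that coverage counts `success` rows alone.

```typescript
type RunOutcome = "success" | "partial" | "failure";

interface DeadLetterEntry {
  section: string;
  status: "pending" | "resolved";
}

// failure: the extractor threw; partial: it returned but left pending
// section-level dead-letters; success: it returned with none pending.
function deriveOutcome(threw: boolean, deadLetters: DeadLetterEntry[]): RunOutcome {
  if (threw) return "failure";
  return deadLetters.some((d) => d.status === "pending") ? "partial" : "success";
}

// Coverage counts only clean successes, so partial runs stay eligible for retry.
function countSuccessful(outcomes: RunOutcome[]): number {
  return outcomes.filter((o) => o === "success").length;
}
```

Keeping `partial` distinct from `success` is what stops a coverage check from treating an all-dead-lettered filing as done.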

…ities + DOCTYPE strip

PR #165 disabled entity processing entirely (processEntities: false) to defang billion-laughs payloads. That also silently corrupted every XML form value carrying one of the five predefined XML entities, e.g. `Mac Accounting Group &amp; CPAs, LLP` was persisting with the literal five-character string `&amp;` instead of the intended `&`.

This restores the standard predefined-entity decode (so `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;` round-trip normally) while keeping XXE / billion-laughs defenses in place:

- fast-xml-parser's bounded processEntities config caps entity count, expansion depth, total expansions, and expanded length, so a filer-declared chain bombs out at the limit instead of expanding geometrically.
- A new stripDoctype() pass removes any leading <!DOCTYPE name [...]> block before parsing, so filer-declared entities never reach the parser at all.

getParser() now returns a thin wrapper that runs stripDoctype() before parse(), keeping all callsites unchanged.

…loses encoded-newline bypass)

The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE the numeric-whitespace collapse pass, so a filer-controlled numeric character reference for a line feed (or for a carriage return, vertical tab, or form feed) got unwrapped to a literal `\n` / `\r` / `\v` / `\f` first. The TAG_SHAPED middle character class `[\w \t-]` only admitted `\t` and space, so a closing fence tag with such references spliced between `UNTRUSTED`, `FILER`, and `DOCUMENT` no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`. That left the lookalike intact and the fence un-redacted.

Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace codepoint, so the squash-and-compare callback fires on every variant and the fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so benign lowercase / non-fence tag shapes are still left literal.

Tests cover the encoded line feed, carriage return, vertical tab, form feed, and tab (regression) variants, a mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.
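The widened tag-shape scan can be approximated as below. This is a sketch, not the repository's regex: the commit message gives only the `[_A-Z]` anchor, the `[\w\s-]` middle class, case-insensitivity, and the squash-and-compare idea, so the surrounding pattern and the `defangFenceTags` name are assumptions.

```typescript
// Tag-shaped token: "<" or "</", a letter or underscore, then word chars,
// any whitespace, or hyphens, then ">". The "i" flag makes [_A-Z] case-insensitive.
const TAG_SHAPED = /<\/?[_A-Z][\w\s-]*>/gi;

// Redact any tag-shaped token whose letters, once squashed together and
// upper-cased, begin with the fence name; leave every other token untouched.
function defangFenceTags(text: string): string {
  return text.replace(TAG_SHAPED, (token) => {
    const squashed = token.replace(/[^A-Za-z]/g, "").toUpperCase();
    return squashed.startsWith("UNTRUSTEDFILERDOCUMENT") ? "[redacted-fence-tag]" : token;
  });
}
```

Because `\s` covers `\n`, `\r`, `\v`, and `\f`, a tag split across lines is still matched as one token, and the squash step is what decides whether it is a fence lookalike.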

… avoid deadlock with outer lock

When a caller already holds an outer Postgres transaction wrapping a critical section (e.g. PR #170's `withSpacCikLock` will issue `BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client from the shared pool to run its own BEGIN/COMMIT will deadlock the moment the pool is saturated by concurrent CIK locks: every connection in the pool holds an outer lock and waits on a second client that the pool can no longer hand out.

Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs` (and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner hand its connection to the inner ops; the Postgres branch then runs DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/ROLLBACK/release (the caller still owns those). When `pgClient` is undefined the defensive default path is unchanged: own pool checkout + own transaction wrap.

Adds a test asserting both paths: the back-compat case still emits BEGIN…COMMIT and releases its client, and the caller-supplied case runs the INSERT on the provided client without issuing any txn or release calls.

…oid concurrent BEGIN IMMEDIATE crash) The SQLite branch issues `BEGIN IMMEDIATE` on the singleton `better-sqlite3` connection that `getDb()` returns. The pre-existing per-CIK keyed mutex (`withInProcessLock`) only serializes writers on the same CIK — distinct CIKs race past the mutex and hit BEGIN concurrently, and SQLite responds with "cannot start a transaction within a transaction". Wrapping the SQLite branch body in a process-wide gate keyed by a sentinel value (`SQLITE_GLOBAL_LOCK_KEY = 0`) forces every SPAC writer to queue at the connection regardless of CIK. The connection-level SQLite database lock is single-writer anyway, so this matches the backend's actual concurrency model. Postgres and the in-memory fallback are unchanged (per-CIK serialization is correct there). Test: five parallel `recordRegistration` calls across CIKs [100, 200, 300, 400, 500] under a real SQLite backend all resolve fulfilled, with zero throws containing "transaction within a transaction". The fix is essential — reverting only the SQLITE_GLOBAL_LOCK_KEY wrap causes this test to throw on the second concurrent BEGIN.

… (informational only) `processRedemption8K` records a `redemption-partial-oversized` dead-letter when at least one exhibit was dropped over the per-exhibit cap but a non-empty survivor set still ran through extraction. The entry exists so operators can triage filings whose largest exhibit was elided. It is purely informational — the drop is deterministic (the cap doesn't move between runs) so no retry recovers the dropped exhibit. Today the entry sits in the `pending` worklist forever and pollutes `sec extractor dead-letters redemption` output with rows that no extractor-version bump can clear. This call adds a `markResolved` immediately after the `record` so the entry lands in the `resolved` state and is excluded from `listEligible`. The `attempts` counter keeps incrementing on each replay (so the audit trail of how many times the cap was hit for a given accession is preserved), and `listPending` / `listEligible` queries never surface it again. Test: a filing with one in-cap + one oversized exhibit runs through `processRedemption8K` twice; after each run the entry exists at status="resolved", `listEligible` filtered to its section returns 0 entries, and `attempts` increments (1 then 2).

The prior stripFormatChars only stripped 8 explicit BMP codepoints (ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E, the math invisibles U+2061-U+2064, or any variation selector (VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip — the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then failed because non-letter codepoints stayed in the tag body. Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus U+180E + math invisibles) joined with the explicit VS ranges U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF. Add one test per residual class and a combined adversarial case. Also correct the misleading comments on the existing non-redaction tests: the actual rejection mechanism is the inner squashed-letters check, not the case-insensitive `[_A-Z]` anchor.
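A widened strip of this shape might look like the following sketch. The character class mirrors the commit message (`\p{Cf}` plus the two variation-selector ranges); the function body and the choice to build the pattern with the `RegExp` constructor are assumptions for illustration.

```typescript
// \p{Cf} covers format characters (ZWSP, ZWNJ, ZWJ, LRM, RLM, WJ, BOM, SHY,
// U+180E, U+2061-U+2064, ...). Variation selectors are category Mn, so both
// VS ranges are listed explicitly.
const INVISIBLE_CHARS = new RegExp(
  "[\\p{Cf}\\uFE00-\\uFE0F\\u{E0100}-\\u{E01EF}]",
  "gu",
);

// Remove invisible codepoints so they cannot hide inside a fence-tag lookalike.
function stripFormatChars(text: string): string {
  return text.replace(INVISIBLE_CHARS, "");
}
```

After the strip, a tag body such as `UN` + U+180E + `TRUSTED` squashes to plain letters again, so a `startsWith("UNTRUSTEDFILERDOCUMENT")` comparison sees the real payload.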

The `stripDoctype` regex anchors at the start of the document and only tolerates an optional `<?xml ...?>` declaration before the DOCTYPE. A filer who slips in a leading XML comment, a non-xml processing instruction, or a `]>`/`[` inside a quoted PUBLIC id defeats the regex and lets `<!ENTITY ...>` reach the parser. Bounded `processEntities` would still expand the declared entity up to its caps, surfacing filer-controlled bytes in stored values.

Move the seal to the parser layer: flip `processEntities` to `{ enabled: false }` so no entity ever expands. With expansion off the parser preserves every `&...;` byte sequence literally, including the five predefined ones (`&amp;` `&lt;` `&gt;` `&quot;` `&apos;`), so round-tripping a value like `&amp;` would now break. Restore the round-trip with a post-parse `decodePredefinedEntities` walker: single-pass regex over the predefined five, recursive over plain arrays and `constructor === Object` objects, untouched for non-string primitives / `Date` / typed arrays. The single-pass contract is load-bearing: `&amp;lt;` decodes to `&lt;` (one match consumes the `&amp;`), not to `<`.

Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells out that it is no longer the security boundary.

Tests pin the three bypass paths (leading comment, leading PI, quoted PUBLIC id with `]>` / `[`), the predefined-entity round-trip including the one-pass `&amp;lt;` -> `&lt;` rule, the billion-laughs bound, and the walker's recursion / primitive-preservation contracts.
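The single-pass walker contract can be sketched as follows. The function name matches the commit message, but the body is an assumption written from the described behaviour (one regex pass over the five predefined entities, recursion over plain arrays and plain objects, everything else returned untouched); whether the real walker copies or mutates in place is not stated, and this sketch copies.

```typescript
const PREDEFINED_ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&apos;": "'",
};
const PREDEFINED_PATTERN = /&(?:amp|lt|gt|quot|apos);/g;

function decodePredefinedEntities(value: unknown): unknown {
  if (typeof value === "string") {
    // One left-to-right pass: text produced by a replacement is never rescanned.
    return value.replace(PREDEFINED_PATTERN, (match) => PREDEFINED_ENTITIES[match] ?? match);
  }
  if (Array.isArray(value)) {
    return value.map((item) => decodePredefinedEntities(item));
  }
  if (value !== null && typeof value === "object" && value.constructor === Object) {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = decodePredefinedEntities(child);
    }
    return out;
  }
  // Numbers, booleans, null, Date, typed arrays, class instances: left as-is.
  return value;
}
```

A multi-pass decoder would turn a doubly escaped value into live markup, which is why the one-pass behaviour matters: an escaped ampersand followed by `lt;` must come out as the five characters of the `lt` entity, not as an angle bracket.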

… into claude/great-keller-o9gbrc

Resolved conflicts:

- redemption8k.ts: keep #169 prompt-injection seal (verifyRowSpan/boundSourceSpan) inside #168's extractor_runs try/catch recording.
- BackfillRedemptionsTask.ts: take #168's idempotent candidate-selection + listFilingsWithoutSuccessfulRun anti-join (subsumes main's per-(form,cik) query); use context.signal directly.

…er resolve

Reconciliation against main's #175 (per-CIK AsyncMutex + monotonic snapshot):

- DROP #170/#173 SpacWriteLock/withSpacCikLock: main's withCikLock already serialises per-CIK writes; the second lock double-locked and #173's SQLite BEGIN-IMMEDIATE gate is moot. Removed SpacWriteLock.ts + test.
- KEEP #170's monotonic snapshot (filing-date anchored, strict monotonicity vs all prior history rows, as_of stale-replay anchor): additive correctness, orthogonal to the mutex.
- KEEP #170 redemption AI input caps + #173 dead-letter auto-resolve + #170 partial-success extractor_runs outcome.
- ProcessAccessionDocFormTask: one listPending scan now drives both the #175 reap-skip and the #170 partial/success outcome.
- recomputeAndSaveDeals: dropped dormant pgClient param (no outer-txn caller remains); SpacDealReplace keeps its own optional pgClient.

…tion

Correctness:

- redemption8k full-drop (OVERSIZED_INPUT): record a successful run before returning, so #168's listFilingsWithoutSuccessfulRun backfill sweep is idempotent and never re-fetches/re-drops the oversized submission forever (the OVERSIZED dead-letter stays pending for version-bump retry). This was a cross-stack gap: #170 added the full-drop, #168 added the anti-join.
- sectionExtractors defang: guard String.fromCodePoint with a Unicode code-point range check (0..0x10FFFF). Number.isFinite alone let a filer '&#x110000;' / '&#1114112;' through, throwing RangeError that aborted the defang and permanently dead-lettered the section. Three call sites.
- ProcessAccessionDocFormTask: derive the success/partial outcome from the same hasBlockingSectionFailure predicate as the reap gate (recency + exclude SECTION_NOT_FOUND / -partial), not a raw 'any pending section' scan; a stale prior-version entry or an absent section no longer marks a clean run partial and reprocesses it forever via the version-gated sweep.

Cleanup:

- SpacDealReplace: remove the dead caller-owned pgClient path (its only purpose was the removed withSpacCikLock outer transaction; no production caller remains) and delete its now-obsolete test.
- SpacReportWriter.snapshot: flatten the dead nested ternary in the stale anchor.
- accessionNumber: add the explicit return type CLAUDE.md requires.

Regression tests added for the RangeError crash and the full-drop idempotency.
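The code-point range guard from the defang fix can be illustrated with a small sketch. The helper name `decodeNumericRef` and the choice to leave an out-of-range reference as literal text are assumptions made for this example; the commit message only states that the guard checks 0..0x10FFFF before calling `String.fromCodePoint`.

```typescript
// Hypothetical helper: decode one numeric character reference such as
// "&#x41;" or "&#65;", leaving anything malformed or out of range untouched.
function decodeNumericRef(ref: string): string {
  const m = /^&#(?:x([0-9a-fA-F]+)|([0-9]+));$/.exec(ref);
  if (m === null) return ref;
  const cp = m[1] !== undefined ? parseInt(m[1], 16) : parseInt(m[2] ?? "", 10);
  // Number.isFinite(cp) alone is not enough: 0x110000 is finite, but
  // String.fromCodePoint throws RangeError for anything above 0x10FFFF.
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) return ref;
  return String.fromCodePoint(cp);
}
```

For example, `decodeNumericRef("&#x110000;")` returns the input unchanged instead of throwing, so one hostile reference cannot abort the defang for the whole section.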

claude added 24 commits June 23, 2026 08:31

merge: bring in #171 entity-fix branch as base for combined follow-up

a6b6f43

Merge remote-tracking branch 'origin/claude/wonderful-hypatia-ebsdur'…

816d778

… into claude/great-keller-o9gbrc

This was referenced Jun 30, 2026

fix(sec): gate SQLite SPAC lock through in-process mutex + auto-resolve oversized dead-letter #173

Closed

fix(sec): close residual defang Unicode bypass + stripDoctype comment bypass (follow-ups to #171, #172) #174

Closed

sroussey wants to merge 25 commits into main from claude/great-keller-o9gbrc

sroussey commented Jun 30, 2026


2 participants


Reconciliation against `main` (#175)