fix(sec): consolidated SPAC/S-1/8-K extractor hardening (supersedes #165, #168–#174) #176
Open
sroussey wants to merge 25 commits into
The prompt-injection seal around S-1/424 AI section extraction had two filer-controllable weak points:

1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a filer could pre-stage a matching closing tag and end the fence early. The defang scan was case-insensitive but flat: only a single literal tag shape was rewritten.
2. A model-emitted `source_span` was capped at the verifier (post-normalization) but persisted raw, so an attacker who slipped any verifier-passing row through could ship unbounded raw bytes through the provenance column.

This patch deepens the seal:

- The fence tag carries a per-call 64-bit random nonce. The `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing tag in the prospectus cannot match the call's actual fence.
- Before defang, the section body is HTML-entity-decoded (multi-pass, up to a fixed point), NFKC-normalized, and stripped of zero-width / bidi format chars. The defang scan matches any tag-shaped token whose alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so obfuscations via `&lt;`, fullwidth letters, ZWSP, intra-tag spaces, and case-mixing all collapse to `[redacted-fence-tag]`.
- A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning `null` over the cap rather than truncating). A new `verifyRowSpan` rejects a span whose raw byte count exceeds the cap before the normalize-and-substring check runs, so a whitespace-inflated payload that would otherwise normalize under cap can no longer pass the gate. All `verifyRow:` callsites and `source_span:` persist sites in the S-1 storage and shared offering-sections layer route through these.
- Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version to 1.2.0: prompt-shape changes drift confidence calibration, and the span-storage shape changes too. Operators should run startDev/promote to roll the new version into production.

Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases, a 1500-raw-char whitespace-padded span dead-letter test in the storage layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case + zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness.

Co-Authored-By: Claude <noreply@anthropic.com>
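To make the nonce fence and the squash-and-compare defang concrete, here is a minimal sketch. It is not the PR's implementation: `wrapUntrusted` only mirrors the `{ wrapped, nonce }` return shape mentioned elsewhere in this thread, `defang` is a hypothetical helper name, the tag-shape regex is a guess, and the multi-pass HTML-entity decode is left out for brevity.

```typescript
import { randomBytes } from "node:crypto";

const REDACTED = "[redacted-fence-tag]";
const FENCE_PREFIX = "UNTRUSTEDFILERDOCUMENT";

// Hypothetical defang: NFKC-normalize, strip a few zero-width characters,
// then redact any tag-shaped token whose letters squash to the fence name.
// (The entity-decode pass described in the commit is omitted here.)
function defang(body: string): string {
  const cleaned = body
    .normalize("NFKC")
    .replace(/[\u200B-\u200D\u2060\uFEFF]/g, "");
  return cleaned.replace(/<\/?[\w\s-]+>/g, (tag: string): string => {
    const squashed = tag.replace(/[^A-Za-z]/g, "").toUpperCase();
    return squashed.startsWith(FENCE_PREFIX) ? REDACTED : tag;
  });
}

// Wrap untrusted filer text in a fence whose tag carries a per-call nonce,
// so a closing tag staged in the document cannot match the real fence.
function wrapUntrusted(body: string): { wrapped: string; nonce: string } {
  const nonce = randomBytes(8).toString("hex"); // 8 bytes = 64 bits
  const tag = `UNTRUSTED_FILER_DOCUMENT_NONCE_${nonce}`;
  return { wrapped: `<${tag}>\n${defang(body)}\n</${tag}>`, nonce };
}
```

The nonce alone defeats a pre-staged closing tag; the defang pass is the second layer, which is why lookalikes are redacted even though they could not match the real fence.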
…ntity hardening
Four hardening fixes around the Form 8-K event-storage path:
1. fast-xml-parser entity expansion is disabled (`processEntities: false`)
on the shared Form XML parser. A filer-controlled SGML payload that
declared a chain of nested entity references would otherwise expand
into a multi-GB string ("billion laughs") and peg CPU at parse time.
A regression test feeds a 10-level billion-laughs DOCTYPE through
Form_8_K.parse and asserts the parse stays well under 50 ms.
2. EDGAR accession numbers cross trust boundaries unconstrained — the
filing-task input schema and the Form 8-K event table both accepted
any string, so an over-long or malformed accession could land in
storage. Introduces `TypeAccessionNumber()` (20-char fixed length,
`^\d{10}-\d{2}-\d{6}$`) and applies it at the
ProcessAccessionDocFormTask input and the event row schema.
3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as
its primary key — a re-extract under a new extractor version would
overwrite the prior version's rows, erasing the time series. Switches
the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit
`(cik, accession_number, extractor_id, extractor_version, item_code)`
UNIQUE natural-key index, mirroring the PersonObservation /
CompanyObservation shape. Both extractor columns are now first-class
so coverage / drop-previous ceremonies can target a single version.
A one-shot legacy-schema migration drops the pre-versioned table on
the SQLite and Postgres paths (the natural-key PK cannot be ALTERed
away on either backend, and 8-K events are deterministic to re-extract).
4. `processForm8K` previously looped over items with one `put` per item,
so a mid-loop crash left the row set torn between old and new items
for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`
— DELETE all rows for `(cik, accession_number, extractor_id,
extractor_version)` then bulk-insert the new set, wrapped in a
real transaction on the SQLite (better-sqlite3 `db.transaction`)
and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client)
paths. The in-memory backend (tests only) is synchronous so a torn
write cannot interleave. A failure-injection test seeds a row,
then re-runs `replaceEvents` with a NOT NULL-violating second
insert and asserts the prior baseline is intact after rollback.
Also wires `extractor_id` + `extractor_version` through the task layer
into `processForm8K` so the same writer can run under any version slot.
Co-Authored-By: Claude <noreply@anthropic.com>
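The accession-number constraint from item 2 can be sketched as a small runtime guard. `isAccessionNumber` is a hypothetical name used only for illustration; the PR's `TypeAccessionNumber()` is a schema helper whose exact signature is not shown here.

```typescript
// Hypothetical runtime guard mirroring the constraints described for
// TypeAccessionNumber(): fixed 20 chars, shape NNNNNNNNNN-NN-NNNNNN.
const ACCESSION_NUMBER_RE = /^\d{10}-\d{2}-\d{6}$/;

function isAccessionNumber(value: unknown): value is string {
  return (
    typeof value === "string" &&
    value.length === 20 &&
    ACCESSION_NUMBER_RE.test(value)
  );
}
```

Checking the length separately is redundant with the anchored regex, but it keeps the fixed-width contract visible at the call site.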
…B_TYPE token
In CI the 21 Form_8_K tests failed with `no such table: form_8k_events`
because `replaceForm8KEvents` was dispatching to `replaceSqlite` even
though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an
in-memory storage. The trigger was test-process global-DI contamination:
`FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time,
which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`.
The ServiceRegistry has no unregister API, so once any earlier test in
the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of
the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens
to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite
branch in `replaceForm8KEvents` won and reached for `getDb()`, which
either fell over on an uninitialized SQLite handle (locally) or
write-attempted against a table that was never created (CI).
Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns
`false`; the production storages don't override it. When the resolved
repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`.
This makes the dispatch correct even when global config and the
registered repo disagree, which is the steady-state in the test process.
Reproduces locally via:
    bun test src/task/index/FetchDailyIndexTask.test.ts \
      src/sec/forms/miscellaneous-filings/Form_8_K.test.ts
(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).
Co-Authored-By: Claude <noreply@anthropic.com>
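The "trust the actual repo" rule can be sketched as below. Everything here is illustrative: `TabularRepo`, `chooseWritePath`, and the `"postgres"` value are assumptions, and only the `isDurable()` idea and the `"sqlite"` value come from the commit message.

```typescript
interface TabularRepo {
  // Assumed contract: durable backends return true, in-memory returns false.
  isDurable(): boolean;
}

type WritePath = "repo" | "sqlite" | "postgres";

// Pick the write path from the repo actually bound in DI first, and only
// consult the global SEC_DB_TYPE value when that repo is durable.
function chooseWritePath(repo: TabularRepo, secDbType: string): WritePath {
  if (!repo.isDurable()) return "repo";
  if (secDbType === "sqlite") return "sqlite";
  if (secDbType === "postgres") return "postgres";
  return "repo";
}
```

With this ordering a stale global `SEC_DB_TYPE` cannot route an in-memory test repo onto a SQL branch.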
Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy / redemption extraction) landing on main after this PR opened.

Conflicts resolved:

- src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts
  - Function signature combines both sides' additive params: extractor_id, extractor_version (this PR), fullSubmissionText, model (#166).
  - Event writes go through replaceEvents() (this PR), threading extractor_id + extractor_version into the version-scoped delete-then-insert.
  - SPAC milestone mapping + redemption extraction blocks from #166 follow unchanged after the events are persisted.
- src/task/forms/ProcessAccessionDocFormTask.ts
  - Keep TypeAccessionNumber import (this PR), processMergerProxy + hasRedemptionTriggerItem imports (#166).
  - 8-K dispatch call site passes both extractor_id/extractor_version and fullSubmissionText into processForm8K; merger-proxy case from #166 follows.
- src/sec/forms/registration-statements/s1/sectionExtractors.ts
  - Auto-merged cleanly by git, but the new extractMergerDeal / extractRedemption functions still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant, which this PR removed. Both updated to the nonce-fence API (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce)) so the new SPAC AI extractors get the per-call nonce fence + multi-stage defang for free. Without this, the prompts would interpolate as "[object Object]" and the model receives garbage.

Verification:

- targeted: bun test src/sec/forms/miscellaneous-filings/ \
  src/sec/forms/proxies-information-statements/ src/task/forms/ \
  src/storage/spac/ src/storage/form-8k-event/ \
  src/sec/forms/registration-statements/ -> 229 pass / 0 fail.
- full: bun test -> 1410 pass / 7 fail. All 7 fails are pre-existing FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts unrelated to this PR (sandbox can't reach SEC.gov reliably).
- bun run build -> clean (bun build + tsc, no errors).
Co-Authored-By: Claude <noreply@anthropic.com>
…e-fence API

The new SPAC extractors added in PR #166 (extractMergerDeal, extractRedemption) called the pre-PR wrapUntrusted shape (returning a string) and the removed UNTRUSTED_PREAMBLE constant. After this PR swapped wrapUntrusted to return { wrapped, nonce } and replaced the constant with buildUntrustedPreamble(nonce), the surviving call sites template-interpolated UNTRUSTED_PREAMBLE as a free identifier -> compile error (TS2552). Even if the type had survived, the { wrapped, nonce } object would have rendered as "[object Object]" in the prompt -> the model receives garbage and silently returns nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null assertions in the post-merge run).

Both extractors now use the same nonce-fence + multi-stage defang as the other section extractors. This is a forced consequence of the merge, and it extends the per-call nonce + entity decode + NFKC + zero-width strip protection to the new SPAC AI extractors at no extra design cost.

Co-Authored-By: Claude <noreply@anthropic.com>
…eal yet
H-4: processRedemption8K previously early-returned and wrote nothing when
spacRepo.getDeals(cik) was empty. For SPACs where a 5.07 / 2.01 / 8.01
vote-results 8-K is ingested before any 1.01 definitive-agreement 8-K
("5.07-first ingestion"), the redemption was lost permanently — no row,
no dead-letter, no correlation when the missing 1.01 later landed.
Delete the deals.length === 0 guard. deriveDeals already reads the full
redemption-extraction set on every recompute and partitions the timeline
into deal windows (including the spacDealGrouping.redemption.test.ts:87-97
"completed-only deal" case), so an orphan extraction persisted here is
automatically correlated by any future write that mints a deal. No schema
change, no version bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg
…kfill
C-1: processRedemption8K never wrote to extractor_runs, so the major-bump
coverage gate (`sec version coverage extractor redemption`) always read 0
and `sec version drop-previous extractor redemption` was a no-op. Wire
recordRun around both the well-defined PARSE_ERROR catch and the section
runner — success on return, failure on throw + rethrow. Trigger-item and
SPAC gates remain unrecorded (defensive dead branches in production).
H-1 / H-2: BackfillRedemptionsTask re-ran every known-SPAC trigger 8-K on
every invocation. Add a left-anti-join against extractor_runs via
hasSuccessfulRun(cik, accession, "redemption", activeVersion) so the
sweep is idempotent; a new --force flag opts back into reprocessing.
Replace the per-SPAC (form, cik) loop with two bulk filingRepo.query({
form }) calls + an in-memory CIK-set filter, matching the existing
UpdateAllFormsTask pattern.
H-3: emit a progress log every 100 processed and honor context.signal so
a long sweep can be aborted cleanly; the next invocation resumes
naturally since already-extracted filings are skipped.
CLI: `sec spac backfill-redemptions [--force] [--dry-run]` now reports
selected / processed / skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LfEFT4C5ayZkU7157sNwTg
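The idempotent sweep described in H-1 through H-3 can be sketched roughly as follows. This is a synchronous simplification under stated assumptions: `runBackfill`, `Filing`, `AbortLike`, and the callback parameters are hypothetical names, and the real task runs against storage and an `AbortSignal` rather than these stand-ins.

```typescript
interface Filing { cik: number; accession: string; }
interface AbortLike { readonly aborted: boolean; } // stand-in for AbortSignal
interface SweepReport { selected: number; processed: number; skipped: number; }

function runBackfill(
  candidates: readonly Filing[],
  hasSuccessfulRun: (filing: Filing) => boolean,
  processFiling: (filing: Filing) => void,
  opts: { force?: boolean; signal?: AbortLike; log?: (msg: string) => void } = {},
): SweepReport {
  const report: SweepReport = { selected: candidates.length, processed: 0, skipped: 0 };
  for (const filing of candidates) {
    if (opts.signal?.aborted) break; // stop cleanly; the next run resumes
    if (!opts.force && hasSuccessfulRun(filing)) {
      report.skipped += 1; // anti-join: already extracted at this version
      continue;
    }
    processFiling(filing);
    report.processed += 1;
    if (report.processed % 100 === 0) opts.log?.(`processed ${report.processed}`);
  }
  return report;
}
```

Because skipped filings are decided by recorded runs rather than by in-memory state, an aborted sweep needs no checkpoint file to resume.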
…ndidates

Two Copilot-review findings on #168:

- A getRedemptionModel() throw used to bypass the surrounding try/catch and abort the whole 8-K processing, leaving no extractor_runs row. Move the recordRedemptionRun helper above the model-resolution call and wrap that call in a try/catch that dead-letters with reason_code MODEL_RESOLUTION_ERROR, records a failed run, and returns cleanly, mirroring the PARSE_ERROR path.
- BackfillRedemptionsTask's skip predicate ran one hasSuccessfulRun query per candidate AND used exact-semver matching, so a patch-only version bump would reprocess every previously-extracted filing. Switch to a single listFilingsWithoutSuccessfulRun call up front, which both narrows to one storage query and applies the codebase's standard major.minor.* gating.
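As an illustration of major.minor.* gating versus exact-semver matching, a comparison helper might look like this. `sameMajorMinor` is a hypothetical name; the codebase's actual gating helper is not shown in this thread.

```typescript
// Hypothetical helper: two versions fall in the same slot when major and
// minor match; the patch component is ignored.
function sameMajorMinor(a: string, b: string): boolean {
  const [aMajor, aMinor] = a.split(".");
  const [bMajor, bMinor] = b.split(".");
  return (
    aMajor !== undefined &&
    aMinor !== undefined &&
    aMajor === bMajor &&
    aMinor === bMinor
  );
}
```

Under this rule a run recorded at 1.1.0 still counts as successful for an active version of 1.1.3, so a patch bump does not trigger a full reprocess.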
…extractors

The merger-proxy and redemption extractors that landed via PR #166 missed the new prompt-injection seal helpers introduced in PR #165. The seal (raw-byte verifyRowSpan at gate, boundSourceSpan at persist) is now applied to both extractors so an unbounded source_span can no longer ship through SpacMergerExtractionRepo / SpacRedemptionExtractionRepo via filer-controlled DEFM14A or post-vote 8-K narrative.

Also widen the fence defang to neutralize the family of bypasses that splice entity-encoded whitespace between the words of the closing fence tag (`</UNTRUSTED FILER DOCUMENT>` with the gaps written as tab or space entities): add whitespace named entities (Tab, NewLine, nbsp, ensp, emsp, thinsp, zwsp, zwnj, zwj) to NAMED_ENTITY_TABLE and collapse numeric whitespace entities to a single space before the TAG_SHAPED scan. The per-call 64-bit nonce on the real fence remains the primary defense; this closes the layered defang gap.

No extractor version bumps: the prompt is unchanged for non-adversarial inputs, and the gate change is normalization-only.
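The gate-then-persist seal can be sketched as below. The function names match the ones in this thread, but the bodies are illustrative assumptions: the cap is measured here in UTF-16 code units via `.length`, and `normalizeWhitespace` is a hypothetical stand-in for the real normalization.

```typescript
const MAX_SOURCE_SPAN_CHARS = 1000;

// Persist-side bound: return null over the cap rather than truncating.
function boundSourceSpan(span: string | null | undefined): string | null {
  if (span == null) return null;
  return span.length > MAX_SOURCE_SPAN_CHARS ? null : span;
}

// Hypothetical normalization: collapse whitespace runs and trim.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Gate-side check: reject on RAW length first, then normalize and compare,
// so whitespace padding cannot sneak an oversized span under the cap.
function verifyRowSpan(span: string, sectionText: string): boolean {
  if (span.length > MAX_SOURCE_SPAN_CHARS) return false;
  const needle = normalizeWhitespace(span);
  return needle.length > 0 && normalizeWhitespace(sectionText).includes(needle);
}
```

Checking the raw length before normalizing is the point of the design: a normalize-first gate would accept a 1500-character span that collapses to a dozen characters.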
…ompute

Two SPAC correctness issues:

1. processMergerProxy never wrote to extractor_runs. The outer ProcessAccessionDocFormTask records a run for the form's extractor id (DEFM14A), but the merger-proxy nested extractor id was uncovered, so `sec version coverage extractor merger-proxy` always read zero and `drop-previous` was a no-op. Mirrors the redemption recordRun pattern from PR #168: success at the end, PARSE_ERROR in the segmenter catch, PROVIDER_ERROR around runSection.
2. SpacReportWriter.recomputeAndSaveDeals deleted orphan deal rows then wrote new deals in a non-atomic loop. A crash, AbortSignal, or DB error between the delete and the final saveDeal corrupted the SPAC report row. New SpacDealReplace helper wraps the delete+upsert pass in a real transaction: better-sqlite3 `db.transaction` for SQLite, BEGIN/COMMIT/ROLLBACK on a checked-out PG client. In-memory fallback retains the sequential semantics (no concurrency in tests).

No extractor version bump: merger-proxy stays at 1.0.0; `coverage` will simply start populating an empty table.
SpacReportWriter.snapshot() derived valid_from from wall-clock next.updated_at and only de-collided against the currently-open history row, so clock-skew or a stale-replay could invert the chain or back-date a history snapshot. rebuild() and snapshot() also read-modify-write without any lock, so two concurrent writers on the same CIK could leave two valid_to == null rows.

Anchor valid_from to the data: filingDate for non-stale writes, the existing row's as_of for stale replays, with strict monotonicity enforced against the max of all prior closed/open valid_to values. Wrap the rebuild critical section in withSpacCikLock: SQLite BEGIN IMMEDIATE, Postgres pg_advisory_xact_lock keyed on CIK, in-memory keyed mutex fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
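One way to picture the monotonic anchor is the sketch below. It rests on assumptions that are not in the commit: timestamps are epoch milliseconds, `nextValidFrom` is a hypothetical name, and a collision is resolved by stepping 1 ms past the latest prior bound (the real writer may resolve it differently).

```typescript
// anchorMs: the filing date, or the existing row's as_of for a stale replay.
// priorValidToMs: every valid_to already present in the history chain.
function nextValidFrom(anchorMs: number, priorValidToMs: readonly number[]): number {
  let floor = Number.NEGATIVE_INFINITY;
  for (const t of priorValidToMs) {
    if (t > floor) floor = t;
  }
  // Assumption: on a collision, step 1 ms past the latest prior bound so the
  // chain stays strictly increasing.
  return anchorMs > floor ? anchorMs : floor + 1;
}
```

Because the anchor comes from the filing rather than the wall clock, replaying the same filings in the same order yields the same chain regardless of when the replay runs.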
processRedemption8K joined the primary doc + every EX-99 exhibit markdown unconditionally into runStructured, with MAX_TOKENS=4096 bounding only the model's completion. A multi-megabyte EX-99 ran up token bills and widened the prompt-injection surface proportional to filing size.

Cap per-exhibit at 200k chars and total at 400k chars; oversized exhibits are dropped (not truncated, since a partial span breaks source-span verification). Full-drop records an OVERSIZED_INPUT dead-letter without invoking the model; partial-drop records an additional informational partial-letter so operators can triage filings whose largest exhibit was skipped.

Bump redemption extractor version 1.0.0 -> 1.1.0: the model now sees a different prompt shape, so confidence calibration drifts; treat as a fresh dev cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
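The drop-not-truncate capping can be sketched as follows. The 200k / 400k defaults come from the commit message; `applyInputCaps`, the `CapResult` shape, and the choice to evaluate documents in order against a running total are assumptions made for illustration.

```typescript
interface CapResult {
  kept: string[];
  droppedCount: number;
  outcome: "ok" | "partial-drop" | "full-drop";
}

function applyInputCaps(
  docs: readonly string[],
  perDocCap = 200000,
  totalCap = 400000,
): CapResult {
  const kept: string[] = [];
  let total = 0;
  let droppedCount = 0;
  for (const doc of docs) {
    // Drop whole documents and never truncate, so any span the model cites
    // can still be verified against complete source text.
    if (doc.length > perDocCap || total + doc.length > totalCap) {
      droppedCount += 1;
      continue;
    }
    kept.push(doc);
    total += doc.length;
  }
  let outcome: CapResult["outcome"] = "ok";
  if (droppedCount > 0) outcome = kept.length === 0 ? "full-drop" : "partial-drop";
  return { kept, droppedCount, outcome };
}
```

In this sketch a `full-drop` outcome would correspond to the OVERSIZED_INPUT dead-letter path (no model call), and `partial-drop` to the informational entry.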
makeRunSection catches MODEL_INVALID_OUTPUT / LOW_CONFIDENCE_ALL / UNVERIFIED_SOURCE_SPAN, writes a dead-letter, and returns without throwing. ProcessAccessionDocFormTask then recorded a success extractor_run row even when every section dead-lettered, so sec version coverage counted them as covered and drop-previous purged the dead-letter rows operators needed for triage.

Add a three-state outcome column (success / partial / failure) to extractor_runs. ProcessAccessionDocFormTask now queries the pending section-level dead-letters for the filing it just stored and writes outcome = partial when any exist. countSuccessfulAtVersion and listFilingsWithoutSuccessfulRun count only outcome = success; partial rows stay eligible for retry-dead-letters. Legacy rows backfill outcome from the existing success boolean (the partial breakdown is unknowable for them); SQLite gets a one-shot ADD COLUMN migration in setupAllDatabases for pre-existing databases.

Also tightens SpacWriteLock's backend dispatch to test the dealRepository class rather than the SEC_DB_TYPE token alone: tests register the token as sqlite while binding in-memory storages, so the env-only check spuriously opened a stray SQLite file via getDb().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01V3e3m8cMRy5stFhDzGmZrF
…ities + DOCTYPE strip

PR #165 disabled entity processing entirely (processEntities: false) to defang billion-laughs payloads. That also silently corrupted every XML form value carrying one of the five predefined XML entities: a value such as `Mac Accounting Group & CPAs, LLP` was persisting with the literal entity `&amp;` in place of the intended `&`.

This restores the standard predefined-entity decode (so `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;` round-trip normally) while keeping XXE / billion-laughs defenses in place:

- fast-xml-parser's bounded processEntities config caps entity count, expansion depth, total expansions, and expanded length; a filer-declared chain bombs out at the limit instead of expanding geometrically.
- A new stripDoctype() pass removes any leading <!DOCTYPE name [...]> block before parsing, so filer-declared entities never reach the parser at all.

getParser() now returns a thin wrapper that runs stripDoctype() before parse(), keeping all callsites unchanged.
…loses bypass)

The prompt-injection defang scan ran the multi-pass HTML-entity decoder BEFORE the numeric-whitespace collapse pass, so a filer-controlled numeric character reference for a line feed, carriage return, vertical tab, or form feed got unwrapped to a literal `\n` / `\r` / `\v` / `\f` first. The TAG_SHAPED middle character class `[\w \t-]` only admitted `\t` and space, so a lookalike closing tag with entity-encoded line feeds between its words no longer matched the tag-shape regex once it decoded to `</UNTRUSTED\nFILER\nDOCUMENT>`. That left the lookalike intact and the fence un-redacted.

Widening the mid-class to `[\w\s-]` admits every ASCII/Unicode whitespace codepoint, so the squash-and-compare callback fires on every variant and the fence redaction reaches its target. The `[_A-Z]` anchor is unchanged, so benign lowercase / non-fence tag shapes are still left literal.

Tests cover the entity-encoded forms of each of those characters, the encoded tab (regression), a mixed encoded+raw obfuscation, raw \r\n, raw \n, raw \v / \f, and two negative cases (`<NotAFence\nfoo>` and `<\nFOO>`) that must not redact.
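The effect of widening the middle character class can be reproduced with two stand-in regexes. Both patterns are hypothetical reconstructions built from the character classes quoted above, not the PR's actual `TAG_SHAPED` constant, and `redactFenceLookalikes` is an illustrative name.

```typescript
// Only the middle character class differs between the two patterns.
const TAG_SHAPED_OLD = /<\/?[_A-Z][\w \t-]*>/gi;
const TAG_SHAPED_NEW = /<\/?[_A-Z][\w\s-]*>/gi;

function redactFenceLookalikes(text: string, tagShaped: RegExp): string {
  return text.replace(tagShaped, (tag: string): string => {
    // Squash-and-compare: keep letters only, then test the fence prefix.
    const squashed = tag.replace(/[^A-Za-z]/g, "").toUpperCase();
    return squashed.startsWith("UNTRUSTEDFILERDOCUMENT") ? "[redacted-fence-tag]" : tag;
  });
}
```

With the old class, a newline inside the tag stops the regex from matching at all, so the callback never runs; the squashed-letters check only gets a say once the tag shape itself matches.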
… avoid deadlock with outer lock

When a caller already holds an outer Postgres transaction wrapping a critical section (e.g. PR #170's `withSpacCikLock` will issue `BEGIN ... pg_advisory_xact_lock ...` on a checked-out client to serialize SPAC writes per CIK), `recomputeSpacDeals` checking out a *second* client from the shared pool to run its own BEGIN/COMMIT will deadlock the moment the pool is saturated by concurrent CIK locks: every connection in the pool holds an outer lock and waits on a second client that the pool can no longer hand out.

Threading an optional `pgClient: PoolClient` through `RecomputeSpacDealsArgs` (and through `SpacReportWriter.recomputeAndSaveDeals`) lets the lock owner hand its connection to the inner ops; the Postgres branch then runs DELETE/INSERT directly on that client and skips its own BEGIN/COMMIT/ROLLBACK/release (the caller still owns those). When `pgClient` is undefined the defensive default path is unchanged: own pool checkout + own transaction wrap.

Adds a test asserting both paths: the back-compat case still emits BEGIN…COMMIT and releases its client, and the caller-supplied case runs the INSERT on the provided client without issuing any txn or release calls.
…oid concurrent BEGIN IMMEDIATE crash)

The SQLite branch issues `BEGIN IMMEDIATE` on the singleton `better-sqlite3` connection that `getDb()` returns. The pre-existing per-CIK keyed mutex (`withInProcessLock`) only serializes writers on the same CIK, so distinct CIKs race past the mutex and hit BEGIN concurrently, and SQLite responds with "cannot start a transaction within a transaction".

Wrapping the SQLite branch body in a process-wide gate keyed by a sentinel value (`SQLITE_GLOBAL_LOCK_KEY = 0`) forces every SPAC writer to queue at the connection regardless of CIK. The connection-level SQLite database lock is single-writer anyway, so this matches the backend's actual concurrency model. Postgres and the in-memory fallback are unchanged (per-CIK serialization is correct there).

Test: five parallel `recordRegistration` calls across CIKs [100, 200, 300, 400, 500] under a real SQLite backend all resolve fulfilled, with zero throws containing "transaction within a transaction". The fix is essential: reverting only the SQLITE_GLOBAL_LOCK_KEY wrap causes this test to throw on the second concurrent BEGIN.
… (informational only)

`processRedemption8K` records a `redemption-partial-oversized` dead-letter when at least one exhibit was dropped over the per-exhibit cap but a non-empty survivor set still ran through extraction. The entry exists so operators can triage filings whose largest exhibit was elided. It is purely informational: the drop is deterministic (the cap doesn't move between runs), so no retry recovers the dropped exhibit. Today the entry sits in the `pending` worklist forever and pollutes `sec extractor dead-letters redemption` output with rows that no extractor-version bump can clear.

This call adds a `markResolved` immediately after the `record` so the entry lands in the `resolved` state and is excluded from `listEligible`. The `attempts` counter keeps incrementing on each replay (so the audit trail of how many times the cap was hit for a given accession is preserved), and `listPending` / `listEligible` queries never surface it again.

Test: a filing with one in-cap + one oversized exhibit runs through `processRedemption8K` twice; after each run the entry exists at status="resolved", `listEligible` filtered to its section returns 0 entries, and `attempts` increments (1 then 2).
The prior stripFormatChars only stripped 8 explicit BMP codepoints
(ZWSP/ZWNJ/ZWJ/LRM/RLM/WJ/BOM/SHY). A filer could splice U+180E,
the math invisibles U+2061-U+2064, or any variation selector
(VS1-VS16 at U+FE00-U+FE0F, VS17-VS256 at U+E0100-U+E01EF) between
the letters of `UNTRUSTED_FILER_DOCUMENT` and survive the strip —
the `squashed.startsWith("UNTRUSTEDFILERDOCUMENT")` check then
failed because non-letter codepoints stayed in the tag body.
Widen the class to `\p{Cf}` (subsumes the 8 original codepoints plus
U+180E + math invisibles) joined with the explicit VS ranges
U+FE00-U+FE0F (VS-16 is `Mn`, not `Cf`) and U+E0100-U+E01EF.
Add one test per residual class and a combined adversarial case.
Also correct the misleading comments on the existing non-redaction
tests: the actual rejection mechanism is the inner squashed-letters
check, not the case-insensitive `[_A-Z]` anchor.
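A minimal sketch of the widened strip class, assuming a runtime with Unicode property escapes. `stripFormatChars` matches the name in the commit message, but the regex here is an illustrative reconstruction of the described class rather than the PR's code.

```typescript
// \p{Cf} covers format characters (ZWSP, ZWJ, U+180E, U+2061-U+2064, ...).
// Variation selectors are category Mn, so their ranges are listed explicitly.
const INVISIBLES = /[\p{Cf}\u{FE00}-\u{FE0F}\u{E0100}-\u{E01EF}]/gu;

function stripFormatChars(text: string): string {
  return text.replace(INVISIBLES, "");
}
```

Once these codepoints are gone, a spliced tag such as `UN` + U+180E + `TRUSTED` squashes back to plain letters and the fence-prefix comparison can succeed.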
The `stripDoctype` regex anchors at the start of the document and
only tolerates an optional `<?xml ...?>` declaration before the
DOCTYPE. A filer who slips a leading XML comment, a non-xml
processing instruction, or a `]>`/`[` inside a quoted PUBLIC id
defeats the regex and lets `<!ENTITY ...>` reach the parser.
Bounded `processEntities` would still expand the declared entity
up to its caps, surfacing filer-controlled bytes in stored values.
Move the seal to the parser layer: flip `processEntities` to
`{ enabled: false }` so no entity ever expands. With expansion off
the parser preserves every `&...;` byte sequence literally,
including the five predefined ones (`&amp;` `&lt;` `&gt;` `&quot;`
`&apos;`), so round-tripping a value like `&amp;` would now break.
Restore the round-trip with a post-parse `decodePredefinedEntities`
walker: single-pass regex over the predefined five, recursive over
plain arrays and `constructor === Object` objects, untouched for
non-string primitives / `Date` / typed arrays. The single-pass
contract is load-bearing: `&amp;lt;` decodes to `&lt;` (one match
consumes the `&amp;`), not to `<`.
Keep `stripDoctype` as best-effort hygiene; its JSDoc now spells
out that it is no longer the security boundary. Tests pin the
three bypass paths (leading comment, leading PI, quoted PUBLIC id
with `]>` / `[`), the predefined-entity round-trip including the
one-pass `&amp;lt;` -> `&lt;` rule, the billion-laughs bound, and
the walker's recursion / primitive-preservation contracts.
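A post-parse walker with the single-pass contract might look like the sketch below. The name `decodePredefinedEntities` follows the commit message; the body, the `decodeString` helper, and the lookup table are illustrative assumptions.

```typescript
const PREDEFINED: Readonly<Record<string, string>> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&apos;": "'",
};
const PREDEFINED_RE = /&(?:amp|lt|gt|quot|apos);/g;

// Single pass: each match is replaced once and the output is not rescanned.
function decodeString(text: string): string {
  return text.replace(PREDEFINED_RE, (m: string): string => PREDEFINED[m] ?? m);
}

// Recurse through plain arrays and plain objects; leave everything else
// (numbers, Date, typed arrays, class instances) untouched.
function decodePredefinedEntities(value: unknown): unknown {
  if (typeof value === "string") return decodeString(value);
  if (Array.isArray(value)) {
    return value.map((item: unknown) => decodePredefinedEntities(item));
  }
  if (typeof value === "object" && value !== null && value.constructor === Object) {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = decodePredefinedEntities(source[key]);
    }
    return out;
  }
  return value;
}
```

A looping decoder would turn a doubly-escaped `&amp;lt;` into a raw `<`, which is exactly the kind of filer-controlled unwrapping the single-pass rule is meant to prevent.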
… into claude/great-keller-o9gbrc
Resolved conflicts:

- redemption8k.ts: keep #169 prompt-injection seal (verifyRowSpan/boundSourceSpan) inside #168's extractor_runs try/catch recording.
- BackfillRedemptionsTask.ts: take #168's idempotent candidate-selection + listFilingsWithoutSuccessfulRun anti-join (subsumes main's per-(form,cik) query); use context.signal directly.
…er resolve

Reconciliation against main's #175 (per-CIK AsyncMutex + monotonic snapshot):

- DROP #170/#173 SpacWriteLock/withSpacCikLock: main's withCikLock already serialises per-CIK writes; the second lock double-locked and #173's SQLite BEGIN-IMMEDIATE gate is moot. Removed SpacWriteLock.ts + test.
- KEEP #170's monotonic snapshot (filing-date anchored, strict monotonicity vs all prior history rows, as_of stale-replay anchor): additive correctness, orthogonal to the mutex.
- KEEP #170 redemption AI input caps + #173 dead-letter auto-resolve + #170 partial-success extractor_runs outcome.
- ProcessAccessionDocFormTask: one listPending scan now drives both the #175 reap-skip and the #170 partial/success outcome.
- recomputeAndSaveDeals: dropped dormant pgClient param (no outer-txn caller remains); SpacDealReplace keeps its own optional pgClient.
…tion

Correctness:

- redemption8k full-drop (OVERSIZED_INPUT): record a successful run before returning, so #168's listFilingsWithoutSuccessfulRun backfill sweep is idempotent and never re-fetches/re-drops the oversized submission forever (the OVERSIZED dead-letter stays pending for version-bump retry). This was a cross-stack gap: #170 added the full-drop, #168 added the anti-join.
- sectionExtractors defang: guard String.fromCodePoint with a Unicode code-point range check (0..0x10FFFF). Number.isFinite alone let a filer-supplied out-of-range numeric character reference through, throwing a RangeError that aborted the defang and permanently dead-lettered the section. Three call sites.
- ProcessAccessionDocFormTask: derive the success/partial outcome from the same hasBlockingSectionFailure predicate as the reap gate (recency + exclude SECTION_NOT_FOUND / -partial), not a raw 'any pending section' scan. A stale prior-version entry or an absent section no longer marks a clean run partial and reprocesses it forever via the version-gated sweep.

Cleanup:

- SpacDealReplace: remove the dead caller-owned pgClient path (its only purpose was the removed withSpacCikLock outer transaction; no production caller remains) and delete its now-obsolete test.
- SpacReportWriter.snapshot: flatten the dead nested ternary in the stale anchor.
- accessionNumber: add the explicit return type CLAUDE.md requires.

Regression tests added for the RangeError crash and the full-drop idempotency.
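The code-point guard can be illustrated with a small decoder. `decodeNumericReference` is a hypothetical helper that takes only the digits of a numeric character reference; the range check is the part that corresponds to the fix described above.

```typescript
// Decode the digits of one numeric character reference (no "&#" or ";").
// Returns null instead of throwing when the value is not a valid code point.
function decodeNumericReference(digits: string, radix: 10 | 16): string | null {
  const codePoint = Number.parseInt(digits, radix);
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    return null;
  }
  return String.fromCodePoint(codePoint);
}
```

`String.fromCodePoint` throws a `RangeError` for values above 0x10FFFF, so a finiteness check alone is not enough; the caller can leave the reference literal when this returns `null`.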
What this is

A single, verified consolidation of the eight open hardening PRs against the SPAC / S-1 / 8-K / merger-proxy / redemption extractors (#165, #168, #169, #170, #171, #172, #173, #174), rebuilt on top of current `main`. Those eight were a tangled stack (three independent chains off the same pre-#175 base); this branch merges their net effect, reconciled against what `main` has since gained in #175, and drops the parts #175 superseded. The eight source PRs should be closed in favor of this one.

Verified end-to-end (see Verification): `tsc` clean, 1537 pass / 7 fail, where all 7 fails are the pre-existing `FetchDailyIndexTask` / `FetchQuarterlyIndexTask` network-timeout tests (no EDGAR outbound in CI) on files this branch does not touch.

The stack, and how it maps here

The eight PRs were three independent chains off `7a0f271` (the pre-#175 `main`).

`main` advanced by exactly one commit since they forked: #175 ("Reap stale observations & serialize concurrent writes per CIK"), which independently added a per-CIK `AsyncMutex` (`withCikLock`) serialising every `SpacReportWriter.record*`, an observation UNIQUE-key + insert-race recovery, and full db-reset table coverage. That overlap is the whole reason a naive "merge all eight" would be wrong, and is reconciled below.

Reconciliation against `main` (#175)

Kept (genuinely additive, not in `main`):

- `source_span` cap (`boundSourceSpan` / `verifyRowSpan`). Includes the follow-up bypass closures: whitespace-mid-tag, Unicode-invisible (`\p{Cf}` + variation selectors), and the `stripDoctype` leading-comment/PI seam (#172/#174).
- `processEntities` disabled + post-parse `decodePredefinedEntities` walker, so billion-laughs is defanged and the five predefined entities still round-trip (the bug #171 caught where `&` was corrupted).
- `replaceEvents`, versioned PK, `TypeAccessionNumber` validation, legacy-table migration.
- `SpacDealReplace.ts` (from #169/#174): the delete-orphans + upsert pass now runs in one transaction. This is orthogonal to #175's mutex: the mutex prevents concurrent interleaving; the transaction prevents mid-pass-crash corruption.
- `snapshot` (from #170): `valid_from` anchored to filing data (not wall-clock), strict monotonicity against all prior history rows, `as_of` stale-replay anchor. Also orthogonal to the mutex and stronger than `main`'s de-collide.
- `outcome` on `extractor_runs` (#170), dead-letter auto-resolve for `redemption-partial-oversized` (#173).
- …`spac_deal` exists; `extractor_runs` recording; idempotent, bulk-query backfill with a `listFilingsWithoutSuccessfulRun` anti-join + `--force`.

Dropped (superseded by #175 or tied to the dropped lock):

- `SpacWriteLock.ts` / `withSpacCikLock` (from #170) and its SQLite `BEGIN IMMEDIATE` gate fix (#173): `main`'s `withCikLock` already serialises per-CIK writes; the second lock double-locked.
- `pgClient` threading through `recomputeAndSaveDeals` (#172 fix 2): it existed only to feed an outer `withSpacCikLock` transaction that no longer exists. `SpacDealReplace` keeps its own optional `pgClient` (exercised by its direct tests).

Conflicts resolved by hand

- `redemption8k.ts`: kept the #169 seal (`verifyRowSpan` / `boundSourceSpan`) inside #168's `extractor_runs` recording try/catch; #170's input caps coexist.
- `BackfillRedemptionsTask.ts`: took #168's idempotent candidate-selection (subsumes `main`'s per-`(form,cik)` query) and used `context.signal` directly.
- `SpacReportWriter.ts`: kept `main`'s `withCikLock` + #170's `snapshot` + #169/#174's atomic recompute; removed the redundant inner lock and dormant `pgClient` param.
- `ProcessAccessionDocFormTask.ts`: a single `listPending` scan now drives both #175's reap-skip and #170's success/partial outcome.
- `SpacReportWriter.test.ts`: kept #175's concurrency test and #170's three history-monotonicity tests.

Verification

- `bun run build`: clean (bundle + `tsc`, 0 type errors).
- `bun test`: 1537 pass / 7 fail; the 7 are `FetchDailyIndexTask` / `FetchQuarterlyIndexTask` (real EDGAR fetch, no outbound in CI), files untouched by this branch.
- `storage/spac` (51), `storage/versioning` (103), `storage/form-8k-event` (13), `task/forms` (16), `task/spac` (8), `sec/forms` (483), `sec/edgar` (6), `cli/queries` (87), `storage/observation` (19).

Net change

46 files changed, ~+4753 / −321.
🤖 Generated with Claude Code