
Fix incorrect chapter numbers returned from UsfmStructureExtractor #430

Merged: pmachapman merged 1 commit into master from fix_get_chapters on Jun 24, 2026

pmachapman (Collaborator) commented Jun 18, 2026

Fixes #434

Port of sillsdev/machine.py#315

Original issue: sillsdev/machine.py#308



@pmachapman pmachapman requested review from Enkidu93 and ddaspit June 18, 2026 01:51
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.18%. Comparing base (c1bd2cc) to head (3b7d2a7).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #430   +/-   ##
=======================================
  Coverage   73.18%   73.18%           
=======================================
  Files         440      440           
  Lines       36882    36882           
  Branches     5075     5075           
=======================================
  Hits        26991    26991           
  Misses       8778     8778           
  Partials     1113     1113           

Enkidu93 (Collaborator) left a comment

:lgtm:

@Enkidu93 reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ddaspit).

ddaspit (Contributor) left a comment

:lgtm:

@ddaspit reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

pmachapman merged commit f9ba7bb into master Jun 24, 2026 (4 checks passed)
pmachapman deleted the fix_get_chapters branch June 24, 2026 22:21
johnml1135 added a commit that referenced this pull request Jun 29, 2026
…ling

Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the
dead compile-time SINGLE_THREADED flag with a runtime knob across all three
within-word parallel sites (synthesis, Unordered analysis cascade, affix-template
unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize
across words without nested oversubscription.
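
A minimal sketch of how a runtime cap can select a sequential path, assuming a hypothetical `ParallelismSketch.Process` helper; it is not the Morpher's code. `ParallelOptions.MaxDegreeOfParallelism` is the standard .NET setting, and everything else here is illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelismSketch
{
    // Hypothetical helper: applies `work` to every item while honoring a runtime cap.
    // A cap of 1 takes a plain sequential loop, so no parallel section is entered.
    public static List<int> Process(IReadOnlyList<int> items, Func<int, int> work, int maxDegreeOfParallelism)
    {
        if (maxDegreeOfParallelism == 1)
            return items.Select(work).ToList();

        var results = new ConcurrentBag<int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        Parallel.ForEach(items, options, item => results.Add(work(item)));

        // Parallel completion order is unspecified, so sort for a stable result.
        return results.OrderBy(x => x).ToList();
    }
}
```

A caller that already parallelizes across words would pass 1 here, so the inner work never competes for the same cores.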

Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count,
analysis/synthesis phase timing, parallel-section counter (proves the sequential
path runs under degree-1), and a corpus benchmark (Explicit) that reports
GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar.

Profiling the real Sena grammar showed ~8,793 Word.Clone and ~371 MB allocated
per word (the combinatorial unapplication search). First allocation win:
Shape.CopyTo builds the src->dest node map inline instead of
.Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0).
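
The inline-map idea can be sketched as follows, assuming hypothetical `Node` and `CopySketch` types rather than the real `Shape`/`ShapeNode` classes. The point is only that the source-to-copy map is filled in the same loop that creates each copy, so the sequences are never enumerated a second time.

```csharp
using System.Collections.Generic;

public sealed class Node
{
    public string Label { get; }
    public Node(string label) => Label = label;
}

public static class CopySketch
{
    // Copies `source` into `dest` and returns the source -> copy map,
    // built in the same pass that creates the copies.
    public static Dictionary<Node, Node> CopyWithMap(IReadOnlyList<Node> source, List<Node> dest)
    {
        var map = new Dictionary<Node, Node>(source.Count);
        foreach (Node src in source)
        {
            var copy = new Node(src.Label);
            dest.Add(copy);
            map[src] = copy;
        }
        return map;
    }
}
```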

Tests: 62 HermitCrab + 790 SIL.Machine pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: memoize multiApp cascade re-expansion; measure GC under parallel load

CombinationRuleCascade: in multiApp mode a word's expansion depends only on the
word, so memoize already-expanded words and skip re-descending them (collapses the
combinatorial re-exploration to a DAG; output set unchanged). Output-identical:
62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come
from the phonological/synthesis layers, not morphological re-expansion) but it
bounds pathological re-expansion blow-up at no correctness cost.
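
As a rough illustration of the memoization, here is a sketch that expands words with a "seen" set so each distinct word is expanded once. `MemoSketch`, the string word type, and the `expand` delegate are all assumptions standing in for the cascade and its rules; seeding the set with the starting word follows the later review fix about cycles back to the input.

```csharp
using System;
using System.Collections.Generic;

public static class MemoSketch
{
    // Expands `start` depth-first; `expand` stands in for applying the rules to one word.
    // `seen` is seeded with the start word, so a cycle back to it is not re-expanded.
    public static (HashSet<string> Outputs, int Expansions) ExpandAll(
        string start, Func<string, IEnumerable<string>> expand)
    {
        var seen = new HashSet<string> { start };
        var outputs = new HashSet<string>();
        var stack = new Stack<string>();
        stack.Push(start);
        int expansions = 0;

        while (stack.Count > 0)
        {
            string word = stack.Pop();
            expansions++;
            foreach (string next in expand(word))
            {
                outputs.Add(next);
                if (seen.Add(next)) // only descend into words not expanded before
                    stack.Push(next);
            }
        }
        return (outputs, expansions);
    }
}
```

The output set is the same as without memoization; only the number of expansions is bounded by the number of distinct words.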

Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words
load and report Server vs Workstation GC — this is where alloc/GC contention
actually bites, unlike a single-threaded run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC perf plan: record Sena optimization results + Server-GC dominance finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: COW design study — 3 scoped plans to cut the FeatureStruct clone firehose

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: add copy-on-write safety-net tests before the COW refactor

- FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every
  mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/
  Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and
  ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw").
- Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged.
- Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load).

All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Plan A: copy-on-write FeatureStruct.Clone for frozen structs

Clone() of a FROZEN feature struct now returns a shell that borrows the source's
immutable backing dictionary; the first mutation (EnsureWritable, replacing
CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the
mutation nor any recursion into children can touch shared frozen data. Clone() of an
unfrozen FS still deep-copies. Single-file change; no public API change.

Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on
the real Sena grammar: -11% managed allocation/word and ~-29% wall on the 16-way
parallel pass (less GC contention). 801 core + 63 HC tests pass, including the new
COW safety-net tests.
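
The copy-on-write behaviour described above can be sketched with a much simpler container. `CowBag` is hypothetical: it holds flat integer values, so a shallow dictionary copy stands in for the deep copy the commit performs via `CloneImpl`, and only the `EnsureWritable` name is taken from the commit message.

```csharp
using System;
using System.Collections.Generic;

public sealed class CowBag
{
    private Dictionary<string, int> _values;
    private bool _borrowed; // true while _values is shared with a frozen source

    public bool IsFrozen { get; private set; }

    public CowBag() => _values = new Dictionary<string, int>();

    private CowBag(Dictionary<string, int> values, bool borrowed)
    {
        _values = values;
        _borrowed = borrowed;
    }

    public void Freeze() => IsFrozen = true;

    // Clone of a frozen bag is an O(1) shell that borrows the backing dictionary;
    // clone of an unfrozen bag copies eagerly.
    public CowBag Clone() =>
        IsFrozen
            ? new CowBag(_values, borrowed: true)
            : new CowBag(new Dictionary<string, int>(_values), borrowed: false);

    public int Get(string key) => _values[key];

    public void Set(string key, int value)
    {
        EnsureWritable();
        _values[key] = value;
    }

    // Name taken from the commit message; the body is an assumption of this sketch.
    private void EnsureWritable()
    {
        if (IsFrozen)
            throw new InvalidOperationException("The bag is frozen.");
        if (_borrowed)
        {
            _values = new Dictionary<string, int>(_values); // inflate a private copy
            _borrowed = false;
        }
    }
}
```

A clone that is never written stays a shell; the first `Set` pays for the copy, and the frozen source is never touched.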

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: out-of-process Server-GC parser (worker host + reusable client)

New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get
Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher
in a child process:

- HermitCrabServerHost: loads a compiled HC config, serves analyze requests over
  stdin/stdout (newline-delimited JSON), parses each word single-threaded with
  parallelism across the batch. Launched with DOTNET_gcServer=1.
- HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the
  worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the
  boundary as DTOs that implement IMorpheme, so the client needs no grammar load.
- Shared protocol DTOs guarantee the two ends agree.

Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed
at process startup — so a worker subprocess is the only way to scope Server GC to the
parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a
thin IParser adapter mapping morph Properties -> LCM.

End-to-end on the real Sena grammar: out-of-process results match in-process; worker
runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: apply CSharpier formatting + braces to satisfy CI (formatting + code-style)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: address Copilot review comments

- CombinationRuleCascade: seed the memoization set with the initial input so a cycle
  back to it (A->B->A) doesn't re-expand it.
- Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already
  materialized and Synthesize no longer drains it).
- Server host/client: handle null JsonSerializer.Deserialize results with a clear
  protocol error instead of an NRE.
- MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't
  throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0.
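
The clamp in the last item can be sketched like this. `DopSketch.Resolve` is a hypothetical helper, and modelling the `HC_ACROSS_DOP` override as a nullable `requested` argument is an assumption about how the benchmark reads it.

```csharp
using System;

public static class DopSketch
{
    // Resolves the across-word degree of parallelism; never returns less than 1.
    // `requested` stands in for an optional override; null means "derive from cores".
    public static int Resolve(int? requested, int processorCount)
    {
        int dop = requested ?? processorCount - 1;
        return Math.Max(1, dop);
    }
}
```

In real use, `processorCount` would come from `Environment.ProcessorCount`.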

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove copy-on-write FeatureStruct; keep deep-clone

Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior.
The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first
write) measured ~-11% allocation but is being held back from this performance PR
to keep it scoped to the single-threaded option + instrumentation + out-of-process
Server-GC parser. COW can return as its own focused change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC: address Copilot review comments (round 2)

- Honor Morpher.MaxDegreeOfParallelism cap in the two within-word parallel
  sites that previously ran at the default scheduler degree:
  ParallelCombinationRuleCascade (new MaxDegreeOfParallelism property, wired
  from AnalysisStratumRule) and AnalysisAffixTemplateRule.ParallelApplySlots.
- Server host: catch JsonException on a malformed request line and reply with
  an empty response instead of terminating the worker.
- Server client: kill+dispose the worker process if it fails to report READY
  (no leaked process on the startup-failure path).
- Server client: validate the worker returns exactly one result per requested
  word; fail fast with a clear error instead of misaligning/indexing out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: plan for data-oriented C# perf work on HermitCrab

Capture Rust's memory-architecture wins (pooling, struct-of-arrays, Span,
indices-not-pointers) in C# to attack the measured allocation/GC bottleneck,
piece by piece with a measurement after each change. One engine, no native lib.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 2: copy-on-write FeatureStruct (-20% bytes/word on en-hc)

Re-apply the COW FeatureStruct (reverts 892816f): Clone() of a frozen feature
struct borrows the immutable backing and inflates (deep-copies) only on first
mutation. Inflate only reads the shared frozen backing, so it is thread-safe;
guarded by AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic.

Measured (en-hc toy grammar, 439 forms): managed allocated 106.5 -> 84.8 KB/word
(-20%), Gen0 3 -> 2, single-thread 91 -> 79 ms. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record Sena-too-slow measurement finding + harness strategy

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: fast single-pass Sena allocation probe (SenaQuick)

Budget-bounded, Console-flushed, single-pass probe usable on the real Sena
grammar (2789 words/20s) where the multi-pass MorpherBenchmark is too slow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: COW confirmed on real Sena grammar (-14% bytes/word, +9.5% throughput)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1/4: pool the per-clone Shape.CopyTo mapping dictionary

Reuse a [ThreadStatic] src->dest node map across CopyTo calls instead of
allocating one per Word.Clone. The map is fully consumed before CopyTo returns
and CopyTo is not reentrant, so per-thread reuse is safe.

Measured (Sena, SenaQuick): 11,997 -> 11,943 KB/word, Gen0 2621 -> 2561.
Small (the per-clone ShapeNode/Annotation objects, not the map, are the bulk);
kept as a safe step. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: finish the plan — Phase 6 decision gate + status

Record the mapping-pool result, mark phase statuses (Phase 2 done; Phase 1
partial/scoped; Phase 5 deferred to FW integration), and write the Phase 6
decision: continue capturing Rust's memory architecture in C# (COW shipped at
-14% Sena / -20% en-hc; next chunk = per-thread pooling of Word/ShapeNode/FST
buffers, now measurable via SenaQuick) rather than adopt Rust's runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: measure 16-thread throughput (SenaParallel) — the parallel answer

Add SenaParallel: one shared serial-within-word morpher, same word set at
dop=1/4/8/16, wall-clock words/sec + scaling.

Measured (800 Sena words, 20-core box):
  Workstation GC: 3.4x @4, 3.55x @8 (peak), 3.14x @16 -> REGRESSES (GC ceiling,
    gen0 ~580 regardless of threads). Allocation is the parallel ceiling.
  Server GC:      5.7x @4, 8.1x @8, 10.3x @16 (gen0 ~88) -> ~11x vs 1-thread WS.

Confirms: the out-of-process Server-GC worker (PR #438) already delivers the
16-thread win; RUSTIFY pooling is what lifts the in-process Workstation curve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: pinpoint allocation split — 20% Word.Clone, 80% FST traversal

Add an opt-in per-thread AllocationProbe hook (set from the net10 test via
GC.GetAllocatedBytesForCurrentThread) to attribute Word.Clone's allocation.

Measured (Sena, SenaQuick): of ~11.8 MB/word, ~20% is Word.Clone (Shape deep
copy) and ~80% is the FST traversal/cascade (a fresh TraversalMethod + List +
Queue + register snapshots + FstResults per rule application). Redirects the
plan: the FST traversal (esp. reusing the instance cache across Transduce
calls) is the high-ROI lever, Word/Shape pooling the secondary one.

Probe is zero-overhead when disabled and behavior-identical when no probe is
set. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1 (FST): pool traversal method per-thread under Server GC

Reuse one traversal method per thread per Fst (via a new Reset()) so its
instance free-list survives across the thousands of Transduce calls per parse,
instead of allocating + discarding a fresh traversal method + instance pool on
every rule application (measured: ~80% of parse allocation is the FST traversal).

Gated to Server GC (cached GCSettings.IsServerGC), because pooling trades
transient garbage for a larger LIVE working set: under Workstation GC that
triggers stop-the-world Gen2 pauses that serialize threads and REGRESS parallel
scaling (16T 3.1x -> 1.5x). Under Server GC it is a clear win.

Measured (Sena, 800 words, SenaParallel):
  Server GC 16T: 10.3x -> 11.2x; allocation -16% (7.0->5.9 GB); Gen0 88 -> 42.
  Workstation 16T: 3.16x unchanged (per-call path retained).
803 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: conclusion — GC no longer dominates at 16 threads (Server GC, 11.2x)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: back out FST traversal pooling (restore pre-pooling FST)

Removing the per-thread traversal-method pool: it only paid off under Server GC
and complicates the engine. Reverting to the original allocate-per-call FST
before restructuring to bit-packed feature vectors (the better lever).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3: bit-packed feature-vector unify fast path

Add a flat ulong-per-feature vector to FeatureStruct and a bitwise IsUnifiable
fast path in Input.Matches for the common phonological case (no defaults, no
negation, fully-symbolic arc input). Gated so the arc INPUT must be fully
bit-packable while the SEGMENT may carry ignorable non-symbolic features (FLEx
stamps a StringFeatureValue on every segment); FlatIndex is globally unique
across feature systems and assigned lazily. FeatureStruct.FlatUnifyEnabled
toggles it for A/B.
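
One way to picture a bitwise unifiability check is below. The representation is an assumption made for illustration, not the commit's layout: one `ulong` per feature, each bit marking an allowed symbol, and 0 meaning the feature is unspecified.

```csharp
using System;

public static class FlatUnifySketch
{
    // Each slot holds the set of symbols allowed for one feature as a bit mask;
    // 0 means "feature not specified" (an assumption of this sketch).
    // Two vectors unify when every feature specified on both sides shares a symbol.
    public static bool IsUnifiable(ulong[] a, ulong[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            if (a[i] != 0 && b[i] != 0 && (a[i] & b[i]) == 0)
                return false;
        }
        return true;
    }
}
```

The check is a handful of AND operations with no allocation, which is the property the fast path relies on.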

Correct: 63 HermitCrab + 806 SIL.Machine tests pass; parity assertion found zero
divergence on en/Sena/Indonesian.

Measured (single-thread, SenaQuick):
  Indonesian: 12,463 -> 11,268 KB/word (-9.7%), Gen0 44->40, 100% fast coverage.
  Sena:        9,053 ->  9,018 KB/word (neutral), 22% coverage (Bantu agreement
               uses variable arcs that fall back) -- no regression.

Next lever for variable-heavy grammars: bit-pack variable bindings too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: remove the out-of-process Server-GC worker/client architecture

Delete SIL.Machine.Morphology.HermitCrab.Server (worker host, HermitCrabServerClient,
protocol, Program) + its tests, and the .sln / test-project references. It was a
workaround for the in-process Workstation-GC parallel ceiling (separate Server-GC
process: ~100 MB worker, .NET 10 runtime dependency, a richer protocol + FieldWorks
adapter still to build). The RUSTIFY direction supersedes it: drive allocation low
enough in-process (COW + bit-packed unify + arena work) that plain .NET needs no
Server GC. Server GC stays available as a runtimeconfig flag if ever wanted.

63 HermitCrab tests pass; solution builds without the project.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: per-word FST-traversal arena (off by default) + key parallel finding

Add a per-thread arena that reuses traversal methods + instance free-lists across
a word (FstThreadPool, reset per word from Morpher.ParseWord) via a Reset() on the
traversal methods. Gated by Fst.TraversalPoolEnabled, DEFAULT OFF.

Measured (Sena, A/B same load): single-thread allocation -13%, BUT 16-thread
scaling collapses 2.87x -> 1.29x. Confirmed across 4 pooling variants. Cause:
under Workstation GC, pooled objects live across the word -> survive Gen0 ->
promote -> stop-the-world Gen2 serializes the threads. Short-lived (Gen0-only)
allocation is actually BETTER for parallel. So object-pooling is the wrong tool
for the no-Server-GC-at-16-threads goal; the right arena is struct/Span/stackalloc
(no GC retention). Kept off-by-default as a single-thread/Server-GC opt-in.

63 HermitCrab + 806 SIL.Machine tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: struct/Span FST traversal is blocked on the data model (record path)

Verified blocker: the FST offset type for HermitCrab is ShapeNode (a class), so
Register<ShapeNode> and the traversal instances are managed -> cannot stackalloc
or hold in a stack Span, and pooling them recreates the Phase 1b Gen2 regression
(Advance is also an iterator, forbidding stackalloc). The struct/Span no-GC
traversal therefore requires the foundational change: represent the shape as a
flat array with int-index offsets so Register<int> is unmanaged -> value-type
register/instance buffers, zero GC-heap allocation in the traversal, Gen0
pressure drops, parallel scales without Server GC. Large, foundational rewrite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3c: FstStatistics per-category allocation breakdown harness

Adds FstStatistics (SIL.Machine) to decompose the "80% FST scaffolding" into
four named buckets — VarBindings.Clone, Registers.Clone, per-Transduce Scaffold,
and TraversalMethod creation — so the flat-buffer investment can be gated on real
numbers from Sena (not theory).

Key findings from en-hc + WEB-PT run (439 words, 82.9 KB/word):
  Word.Clone         21%
  Pure scaffold       1%  (Register[], HashSet, List per Transduce)
  VarBindings         1%  (negligible on English; will be larger on Sena)
  Registers           0.1%
  TraversalMethod     0.7%
  Other (cascade)    55%  (MarkMorph/Annotation/stratum-rule overhead, NOT FST)

Flat-buffer addresses ~22% on the toy grammar; Sena breakdown needed to decide
whether to pursue the full int-offset Shape rewrite. See RUSTIFY.md § Phase 3c.

RustifyBenchmark now falls back to en-hc + WEB-PT when HC_GRAMMAR/HC_WORDS are
not set, so the breakdown harness is immediately runnable without a FLEx grammar.
63 HC tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HC: expand cascade breakdown harness (Segment, Word.ctor, MarkMorph, analysis window)

Adds four new allocation probes to fully decompose the 55% 'Other' bucket:
- MorpherStatistics.SegmentBytes: wraps Segment() (initial Shape/ShapeNode creation)
- MorpherStatistics.WordCtorBytes: wraps new Word(stratum, shape) construction
- MorpherStatistics.MarkMorphBytes: wraps Word.MarkMorph() annotation allocation
- MorpherStatistics.AnalysisCascadeBytes: wraps _analysisRule.Apply().ToList() (superset)

English toy grammar result (439 words, 35 MB total):
  Segment (initial Shape)    7.2%   Scaffold (pure FST) ≈ 0%
  Word.ctor(new)             9.6%   Rule-chain machinery ~40.7%
  Word.Clone                21.3%   Synthesis + other  ~18.8%
  Scaffold (incl. clones)   21.9%
→ analysis window superset  64.4%

Key finding: MarkMorph ≈ 0%; pure FST scaffold ≈ 0%; dominant costs are
Word.Clone (21%), Word.ctor+Segment (17%), and rule-chain LINQ/FstResult (~41%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

RUSTIFY: move Word.ctor allocation probe into the Word constructor

The Word.ctor probe lived in Morpher.AnalyzeWord, so it only measured the
single initial construction per word, not the cascade-created Words. Move it
into Word(Stratum, Shape) itself (gated on MorpherStatistics.Enabled, off in
production) and add WordCtorCount so the breakdown reports calls as well as
bytes. Harness-only; no production-path behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4a: hot-loop allocation eliminations (safe, no retention)

Four pure-elimination changes in the FST traversal + analysis cascade. Each
removes an allocation outright without extending any object lifetime, so none
can trigger the Phase-1b parallel regression (pooling promotes to Gen2 ->
serializes threads). All validated: 803 SIL.Machine + 63 HermitCrab tests
green; SenaParallel scaling unchanged.

1. ITraversalMethod.Traverse returns List<FstResult> (was IEnumerable) -> drop
   the redundant .ToList() in Fst.Transduce. All four concrete Traverse impls
   already return the curResults List; the interface was needlessly widened.
2. Remove redundant .Distinct(FreezableEqualityComparer<Word>.Default) x2 in
   AnalysisStratumRule.ApplyMorphologicalRules/ApplyTemplates. Both _mrulesRule
   and _templatesRule are built with that same comparer and return a HashSet
   already deduped by it, so the Distinct pass is a no-op DistinctIterator.
3. Skip the DistinctIterator for trivial result sets in Fst.Transduce:
   (allMatches && resultList.Count > 1) ? resultList.Distinct() : resultList.
   resultList is non-null Count>=1 there; Count==1 Distinct is identity.
4. TraversalMethodBase.Reset: replace the per-Transduce GetNodesDepthFirst
   yield iterator (heap state machine per top annotation, thousands/word) with
   the allocation-free PreorderTraverse(action) form; delegate cached as a
   field (allocated once in ctor, not per call).

Measured (en-hc, SenaQuick, 439 words): Other 38.5% -> 36.2%
(14,145KB -> 12,884KB), KB/word 83.6 -> 81.1. Toy-grammar deltas are small;
real magnitude needs the Sena grammar. See RUSTIFY.md Phase 4a/4b (4b documents
the rejected scaffold-buffer ThreadStatic pooling: re-entrant via
acceptInfo.Acceptable, lifetime extension against the thesis, unmeasurable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4c: single-hash traversed.Add in nondet FST traversal

In NondeterministicFstTraversalMethod.Traverse (both epsilon and input-match
branches), the dedup check hashed the expensive structural key (state +
annotation index + register array + outputs array) twice: once for
Contains(key), then again for Add(key). HashSet.Add already returns false when
the element is present, so collapse to `if (traversed.Add(key)) Push(newInst);`
— a single hash/lookup in the innermost traversal loop. Byte-identical.
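
A small sketch of the single-lookup pattern, using string keys in place of the structural traversal key. `DedupSketch` is hypothetical; the behaviour of `HashSet<T>.Add` (returns false when the element is already present) is standard .NET.

```csharp
using System.Collections.Generic;

public static class DedupSketch
{
    // Keeps each key the first time it is seen. HashSet<T>.Add returns false for
    // a key that is already present, so no separate Contains call is needed.
    public static List<string> PushUnique(IEnumerable<string> keys)
    {
        var traversed = new HashSet<string>();
        var pushed = new List<string>();
        foreach (string key in keys)
        {
            if (traversed.Add(key)) // one hash + lookup instead of Contains then Add
                pushed.Add(key);
        }
        return pushed;
    }
}
```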

CPU-only cleanup (no allocation change), so KB/word is flat on the toy grammar;
the structural-hash cost it removes is not resolvable there. Gated:
803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: drop per-call List<int> in TraversalMethodBase.Advance

Advance collected the same-offset annotation window into `var anns = new
List<int>()` and then iterated it. That window is a contiguous index range
[nextIndex, annsEnd) (the build loop adds every consecutive i whose start
offset matches), so track the end bound and iterate the range directly,
eliminating one List allocation per arc match (one of the hottest paths in
the traversal). cloneOutputs/first flow is unchanged.
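
The range-instead-of-list change can be illustrated as below. `RangeSketch` and its parameters are hypothetical simplifications; the real `Advance` works on annotations rather than a plain list of start offsets.

```csharp
using System;
using System.Collections.Generic;

public static class RangeSketch
{
    // Visits every consecutive index from nextIndex whose start offset equals the
    // first one's, tracking only the end bound instead of collecting indices in a list.
    // Returns the exclusive end of the window.
    public static int VisitSameOffsetWindow(IReadOnlyList<int> startOffsets, int nextIndex, Action<int> visit)
    {
        int annsEnd = nextIndex;
        while (annsEnd < startOffsets.Count && startOffsets[annsEnd] == startOffsets[nextIndex])
            annsEnd++;

        for (int i = nextIndex; i < annsEnd; i++)
            visit(i);
        return annsEnd;
    }
}
```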

Measured (en-hc toy, SenaQuick, 439 words): KB/word 80.3 -> 79.0, totalMB
34 -> 33. Gated: 803 SIL.Machine + 63 HermitCrab tests green; toy under-
measures so treat the delta as directional, no-regression is the bar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: collapse identity-map LINQ in TraversalInstance.CopyTo

Both Deterministic and Nondeterministic CopyTo built an `outputMappings`
dictionary by zipping this.Output's node sequence with ITSELF — a
deterministic Queue-based BFS enumeration paired element-for-element, i.e. the
identity map — then projected _mappings through it. Since outputMappings[v]==v,
the entire block reduces to copying _mappings unchanged. Replace with
`other.Mappings.AddRange(_mappings)`, removing a Dictionary + two
SelectMany(GetNodesBreadthFirst, each allocating a Queue + yield iterator) +
Zip + Select per instance copy. CopyTo runs on every branch of nondeterministic
traversal, so this is allocation-heavy at scale (Sena ~276 clones/word) though
the toy grammar (2 clones/word, few branches) can't resolve it.

Byte-identical (provable identity-map reduction); other.Mappings is empty pre-
AddRange (GetCachedInstance -> Clear). Removed now-unused usings (System.Linq,
SIL.Machine.DataStructures) to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: paired-walk clone mapping in InitializeStack (Det + Nondet)

Both InitializeStack methods built inst.Mappings (source annotation -> clone)
by zipping two BFS node sequences:
  Data.Annotations.SelectMany(GetNodesBreadthFirst)
    .Zip(inst.Output.Annotations.SelectMany(GetNodesBreadthFirst), KVP)
allocating, per Transduce, a Queue per top annotation (BFS), two SelectMany
state machines, a Zip state machine, and one KeyValuePair per node.

Data and inst.Output are isomorphic (inst.Output = Data.Clone()), and the
resulting dictionary is independent of traversal order, so replace with a new
allocation-light helper DataStructuresExtensions.PairedPreorderTraverse that
walks the two forests in lockstep (preorder) and writes pairs straight into the
dict via a static (closure-free) callback. Debug.Asserts guard the isomorphism
invariant (root/leaf/child-count) so any future violation fails loudly instead
of silently truncating like Zip.
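
A lockstep walk over two isomorphic trees might look like the sketch below. `TreeNode` and `PairedWalkSketch` are hypothetical and do not reproduce the signature of `DataStructuresExtensions.PairedPreorderTraverse`; this version also throws on a mismatch where the commit uses `Debug.Assert`.

```csharp
using System;
using System.Collections.Generic;

public sealed class TreeNode
{
    public string Name { get; }
    public List<TreeNode> Children { get; } = new List<TreeNode>();
    public TreeNode(string name) => Name = name;
}

public static class PairedWalkSketch
{
    // Walks two isomorphic trees in lockstep (preorder) and records source -> clone pairs.
    // Throws on a child-count mismatch rather than silently truncating as Zip would.
    public static void PairedPreorder(TreeNode source, TreeNode clone, Dictionary<TreeNode, TreeNode> map)
    {
        if (source.Children.Count != clone.Children.Count)
            throw new InvalidOperationException("Trees are not isomorphic.");

        map[source] = clone;
        for (int i = 0; i < source.Children.Count; i++)
            PairedPreorder(source.Children[i], clone.Children[i], map);
    }
}
```

No queue, iterator state machine, or intermediate pair object is created; the only allocation is growth of the dictionary itself.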

Runs once per Transduce (thousands/word). Toy grammar has tiny annotation trees
so the delta is below its resolution; the win compounds on Sena's long words.
Removed now-unused usings (System.Linq, SIL.Extensions) in the deterministic
method to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green (incl. the concurrent-
determinism test); SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: precompute initializer partition at Fst.Freeze

Fst.Transduce rebuilt a List<TagMapCommand> on every call (and every outer
annIndex iteration), filtering _initializers into Dest!=0 (the per-call cmds
list) vs Dest==0 (which drive a per-annotation SetOffset). The partition is
identical every call for a frozen FST, and cmds is read-only downstream
(Initialize -> ExecuteCommands only iterate it).

Partition _initializers once in Freeze() into _zeroDestInitializers /
_nonZeroDestInitializers (built into locals, gating field published last so a
reader never sees a half-filled list). Transduce reuses the shared read-only
_nonZeroDestInitializers as cmds and walks _zeroDestInitializers for the
SetOffsets, eliminating the per-call list allocation + filter loop. When the FST
isn't frozen the fields are null and Transduce falls back to the exact inline
build, so unfrozen callers are unaffected. The frozen FST is shared read-only
across parsing threads, so concurrent reads of the shared cmds list are safe.
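
The freeze-time partition with a fallback for unfrozen callers can be sketched as follows. `InitializerTable` is hypothetical and uses plain integers for the destinations; marking the gating field `volatile` is this sketch's choice for safe publication, not something the commit message states.

```csharp
#nullable enable
using System.Collections.Generic;

public sealed class InitializerTable
{
    private readonly List<int> _dests = new List<int>();
    private List<int>? _zeroDest;             // null until frozen
    private volatile List<int>? _nonZeroDest; // gating field, published last

    public void Add(int dest) => _dests.Add(dest);

    // Partitions once, building into locals so a reader never sees a half-filled list.
    public void Freeze()
    {
        var zero = new List<int>();
        var nonZero = new List<int>();
        foreach (int d in _dests)
            (d == 0 ? zero : nonZero).Add(d);

        _zeroDest = zero;
        _nonZeroDest = nonZero; // readers test this field, so it is assigned last
    }

    // Frozen: returns the shared read-only list. Unfrozen: builds a fresh list per call.
    public IReadOnlyList<int> GetNonZero()
    {
        List<int>? cached = _nonZeroDest;
        if (cached != null)
            return cached;

        var built = new List<int>();
        foreach (int d in _dests)
        {
            if (d != 0)
                built.Add(d);
        }
        return built;
    }
}
```

After `Freeze()`, every call returns the same list instance, which is what removes the per-call allocation.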

Measured (en-hc toy, SenaQuick, 439 words): Scaffold 22.4% -> 21.0%
(7897KB -> 7261KB), KB/word 79.0 -> 78.8. SenaParallel: scaling and allocation
unchanged (no parallel regression from sharing the list). Gated: 803
SIL.Machine + 63 HermitCrab tests green, incl. the concurrent-determinism test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4c (five safe no-retention FST eliminations)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: strip unrelated USFM work + remove dead GC/pooling machinery

Audit of master..hc-rustify identified work to drop now that allocation is
driven down and the branch is being squashed:

1. Strip the unrelated USFM/versification change set (3 commits: #430, #432,
   normalization port) — restored src/SIL.Machine/Corpora,
   src/SIL.Machine/PunctuationAnalysis and their tests to master, removed the
   USFM-added files. None of it is perf work; it belongs in its own PR.

2. Remove the dead FST traversal pooling (measured to REGRESS parallel parsing,
   Phase 1b): the FstThreadPool class, Fst.TraversalPoolEnabled, the pooling
   branch in Fst.Transduce, FstThreadPool.Reset() in Morpher.ParseWord, and the
   HC_ARENA toggle in RustifyBenchmark. Transduce now always uses a fresh
   (die-in-Gen0) traversal method, which is the right tradeoff once allocation
   is low.

3. Remove the Shape.CopyTo [ThreadStatic] clone-map pool, keeping the
   value-added inline mapping build (no second GetNodes().Zip().ToDictionary()
   pass) — just a plain per-call Dictionary.

4. Revert Machine.sln to master (leftover x64/x86 platform configs) and remove
   three superseded planning docs (HERMITCRAB_ALLOCATION_STRATEGIES /
   COW_PLANS / PERF_PLAN), now consolidated in RUSTIFY.md.

Kept (per request): the allocation instrumentation (MorpherStatistics,
FstStatistics, probes) + both benchmarks for before/after measurement; the
MaxDegreeOfParallelism API + Synthesize refactor; COW FeatureStruct; bit-packed
unify; Phase 4a/4c eliminations.

Gated: 801 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4d cleanup (pooling/arena removed)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: restore the safe Shape.CopyTo [ThreadStatic] clone-map pool

Re-examination showed this pool is NOT the regressive kind removed elsewhere.
The Phase-1b parallel regression came from objects retained ACROSS a word
(promoted to Gen2). Shape.CopyTo's CloneMapping is cleared and fully consumed
WITHIN each call (contents die immediately; only a small empty buffer persists),
so it cannot promote parse data to Gen2 — and it still buys a small allocation
win (~0.45% on Sena; on the toy grammar removing it had pushed Word.Clone
22.4% -> 23.6% and KB/word 78.8 -> 80.2). Restored, with RUSTIFY.md Phase 4d
note corrected to record it as KEPT (safe pool) rather than removed.

Gated: 801 SIL.Machine + 63 HermitCrab tests green; SenaQuick KB/word back to
78.8, Word.Clone 22.4%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
johnml1135 added a commit that referenced this pull request Jun 30, 2026
…ling

Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the
dead compile-time SINGLE_THREADED flag with a runtime knob across all three
within-word parallel sites (synthesis, Unordered analysis cascade, affix-template
unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize
across words without nested oversubscription.

Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count,
analysis/synthesis phase timing, parallel-section counter (proves the sequential
path runs under degree-1), and a corpus benchmark (Explicit) that reports
GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar.

Profiling the real Sena grammar showed ~8,793 Word.Clone and ~371 MB allocated
per word (the combinatorial unapplication search). First allocation win:
Shape.CopyTo builds the src->dest node map inline instead of
.Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0).

Tests: 62 HermitCrab + 790 SIL.Machine pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: memoize multiApp cascade re-expansion; measure GC under parallel load

CombinationRuleCascade: in multiApp mode a word's expansion depends only on the
word, so memoize already-expanded words and skip re-descending them (collapses the
combinatorial re-exploration to a DAG; output set unchanged). Output-identical:
62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come
from the phonological/synthesis layers, not morphological re-expansion) but it
bounds pathological re-expansion blow-up at no correctness cost.

Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words
load and report Server vs Workstation GC — this is where alloc/GC contention
actually bites, unlike a single-threaded run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC perf plan: record Sena optimization results + Server-GC dominance finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: COW design study — 3 scoped plans to cut the FeatureStruct clone firehose

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: add copy-on-write safety-net tests before the COW refactor

- FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every
  mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/
  Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and
  ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw").
- Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged.
- Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load).

All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Plan A: copy-on-write FeatureStruct.Clone for frozen structs

Clone() of a FROZEN feature struct now returns a shell that borrows the source's
immutable backing dictionary; the first mutation (EnsureWritable, replacing
CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the
mutation nor any recursion into children can touch shared frozen data. Clone() of an
unfrozen FS still deep-copies. Single-file change; no public API change.

Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on
the real Sena grammar: -11% managed allocation/word and roughly -29% wall-clock
time on the 16-way parallel pass (less GC contention). 801 core + 63 HC tests
pass, including the new
COW safety-net tests.
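
The borrow-then-inflate behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FeatureStruct implementation: `CowBag`, `SharesBackingWith`, and the string-to-int backing are hypothetical stand-ins, and only `EnsureWritable` borrows its name from the commit message.

```csharp
using System;
using System.Collections.Generic;

var source = new CowBag(new Dictionary<string, int> { ["voice"] = 1 });
source.Freeze();

var clone = source.Clone();                          // O(1): borrows the frozen backing
Console.WriteLine(clone.SharesBackingWith(source));  // True
clone.Set("voice", 0);                               // first write inflates a private copy
Console.WriteLine(clone.SharesBackingWith(source));  // False
Console.WriteLine(source.Get("voice"));              // 1 (source is untouched)

// Hypothetical, simplified stand-in for a feature struct (not the SIL.Machine type).
sealed class CowBag
{
    private Dictionary<string, int> _values;
    private bool _borrowed; // true while _values belongs to a frozen source

    public bool IsFrozen { get; private set; }

    public CowBag(Dictionary<string, int> values) { _values = values; }

    public void Freeze() => IsFrozen = true;

    public CowBag Clone()
    {
        // Clone of a frozen instance: share the immutable backing.
        if (IsFrozen)
            return new CowBag(_values) { _borrowed = true };
        // Clone of an unfrozen instance: still a full copy.
        return new CowBag(new Dictionary<string, int>(_values));
    }

    public int Get(string key) => _values[key];

    public void Set(string key, int value)
    {
        EnsureWritable();
        _values[key] = value;
    }

    private void EnsureWritable()
    {
        if (IsFrozen)
            throw new InvalidOperationException("Instance is frozen.");
        if (_borrowed)
        {
            // Inflate only reads the shared backing, then swaps in a private copy.
            _values = new Dictionary<string, int>(_values);
            _borrowed = false;
        }
    }

    public bool SharesBackingWith(CowBag other) => ReferenceEquals(_values, other._values);
}
```

A clone that is never written stays a thin shell over the source's dictionary, which is where the allocation saving would come from.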

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: out-of-process Server-GC parser (worker host + reusable client)

New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get
Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher
in a child process:

- HermitCrabServerHost: loads a compiled HC config, serves analyze requests over
  stdin/stdout (newline-delimited JSON), parses each word single-threaded with
  parallelism across the batch. Launched with DOTNET_gcServer=1.
- HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the
  worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the
  boundary as DTOs that implement IMorpheme, so the client needs no grammar load.
- Shared protocol DTOs guarantee the two ends agree.

Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed
at process startup — so a worker subprocess is the only way to scope Server GC to the
parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a
thin IParser adapter mapping morph Properties -> LCM.

End-to-end on the real Sena grammar: out-of-process results match in-process; worker
runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: apply CSharpier formatting + braces to satisfy CI (formatting + code-style)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: address Copilot review comments

- CombinationRuleCascade: seed the memoization set with the initial input so a cycle
  back to it (A->B->A) doesn't re-expand it.
- Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already
  materialized and Synthesize no longer drains it).
- Server host/client: handle null JsonSerializer.Deserialize results with a clear
  protocol error instead of an NRE.
- MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't
  throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove copy-on-write FeatureStruct; keep deep-clone

Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior.
The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first
write) measured roughly -11% allocation but is being held back from this performance PR
to keep it scoped to the single-threaded option + instrumentation + out-of-process
Server-GC parser. COW can return as its own focused change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC: address Copilot review comments (round 2)

- Honor Morpher.MaxDegreeOfParallelism cap in the two within-word parallel
  sites that previously ran at the default scheduler degree:
  ParallelCombinationRuleCascade (new MaxDegreeOfParallelism property, wired
  from AnalysisStratumRule) and AnalysisAffixTemplateRule.ParallelApplySlots.
- Server host: catch JsonException on a malformed request line and reply with
  an empty response instead of terminating the worker.
- Server client: kill+dispose the worker process if it fails to report READY
  (no leaked process on the startup-failure path).
- Server client: validate the worker returns exactly one result per requested
  word; fail fast with a clear error instead of misaligning results or indexing
  out of range.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: plan for data-oriented C# perf work on HermitCrab

Capture Rust's memory-architecture wins (pooling, struct-of-arrays, Span,
indices-not-pointers) in C# to attack the measured allocation/GC bottleneck,
piece by piece with a measurement after each change. One engine, no native lib.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 2: copy-on-write FeatureStruct (-20% bytes/word on en-hc)

Re-apply the COW FeatureStruct (reverts 892816f2): Clone() of a frozen feature
struct borrows the immutable backing and inflates (deep-copies) only on first
mutation. Inflate only reads the shared frozen backing, so it is thread-safe;
guarded by AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic.

Measured (en-hc toy grammar, 439 forms): managed allocated 106.5 -> 84.8 KB/word
(-20%), Gen0 3 -> 2, single-thread 91 -> 79 ms. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record Sena-too-slow measurement finding + harness strategy

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: fast single-pass Sena allocation probe (SenaQuick)

Budget-bounded, Console-flushed, single-pass probe usable on the real Sena
grammar (2789 words/20s) where the multi-pass MorpherBenchmark is too slow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: COW confirmed on real Sena grammar (-14% bytes/word, +9.5% throughput)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1/4: pool the per-clone Shape.CopyTo mapping dictionary

Reuse a [ThreadStatic] src->dest node map across CopyTo calls instead of
allocating one per Word.Clone. The map is fully consumed before CopyTo returns
and CopyTo is not reentrant, so per-thread reuse is safe.

Measured (Sena, SenaQuick): 11,997 -> 11,943 KB/word, Gen0 2621 -> 2561.
Small (the per-clone ShapeNode/Annotation objects, not the map, are the bulk);
kept as a safe step. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: finish the plan — Phase 6 decision gate + status

Record the mapping-pool result, mark phase statuses (Phase 2 done; Phase 1
partial/scoped; Phase 5 deferred to FW integration), and write the Phase 6
decision: continue capturing Rust's memory architecture in C# (COW shipped at
-14% Sena / -20% en-hc; next chunk = per-thread pooling of Word/ShapeNode/FST
buffers, now measurable via SenaQuick) rather than adopt Rust's runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: measure 16-thread throughput (SenaParallel) — the parallel answer

Add SenaParallel: one shared serial-within-word morpher, same word set at
dop=1/4/8/16, wall-clock words/sec + scaling.

Measured (800 Sena words, 20-core box):
  Workstation GC: 3.4x @4, 3.55x @8 (peak), 3.14x @16 -> REGRESSES (GC ceiling,
    gen0 ~580 regardless of threads). Allocation is the parallel ceiling.
  Server GC:      5.7x @4, 8.1x @8, 10.3x @16 (gen0 ~88) -> ~11x vs 1-thread WS.

Confirms: the out-of-process Server-GC worker (PR #438) already delivers the
16-thread win; RUSTIFY pooling is what lifts the in-process Workstation curve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: pinpoint allocation split — 20% Word.Clone, 80% FST traversal

Add an opt-in per-thread AllocationProbe hook (set from the net10 test via
GC.GetAllocatedBytesForCurrentThread) to attribute Word.Clone's allocation.

Measured (Sena, SenaQuick): of ~11.8 MB/word, ~20% is Word.Clone (Shape deep
copy) and ~80% is the FST traversal/cascade (a fresh TraversalMethod + List +
Queue + register snapshots + FstResults per rule application). Redirects the
plan: the FST traversal (esp. reusing the instance cache across Transduce
calls) is the high-ROI lever, Word/Shape pooling the secondary one.

Probe is zero-overhead when disabled and behavior-identical when no probe is
set. 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1 (FST): pool traversal method per-thread under Server GC

Reuse one traversal method per thread per Fst (via a new Reset()) so its
instance free-list survives across the thousands of Transduce calls per parse,
instead of allocating + discarding a fresh traversal method + instance pool on
every rule application (measured: ~80% of parse allocation is the FST traversal).

Gated to Server GC (cached GCSettings.IsServerGC), because pooling trades
transient garbage for a larger LIVE working set: under Workstation GC that
triggers stop-the-world Gen2 pauses that serialize threads and REGRESS parallel
scaling (16T 3.1x -> 1.5x). Under Server GC it is a clear win.

Measured (Sena, 800 words, SenaParallel):
  Server GC 16T: 10.3x -> 11.2x; allocation -16% (7.0->5.9 GB); Gen0 88 -> 42.
  Workstation 16T: 3.16x unchanged (per-call path retained).
803 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: conclusion — GC no longer dominates at 16 threads (Server GC, 11.2x)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: back out FST traversal pooling (restore pre-pooling FST)

Removing the per-thread traversal-method pool: it only paid off under Server GC
and complicates the engine. Reverting to the original allocate-per-call FST
before restructuring to bit-packed feature vectors (the better lever).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3: bit-packed feature-vector unify fast path

Add a flat ulong-per-feature vector to FeatureStruct and a bitwise IsUnifiable
fast path in Input.Matches for the common phonological case (no defaults, no
negation, fully-symbolic arc input). Gated so the arc INPUT must be fully
bit-packable while the SEGMENT may carry ignorable non-symbolic features (FLEx
stamps a StringFeatureValue on every segment); FlatIndex is globally unique
across feature systems and assigned lazily. FeatureStruct.FlatUnifyEnabled
toggles it for A/B.

Correct: 63 HermitCrab + 806 SIL.Machine tests pass; parity assertion found zero
divergence on en/Sena/Indonesian.

Measured (single-thread, SenaQuick):
  Indonesian: 12,463 -> 11,268 KB/word (-9.7%), Gen0 44->40, 100% fast coverage.
  Sena:        9,053 ->  9,018 KB/word (neutral), 22% coverage (Bantu agreement
               uses variable arcs that fall back) -- no regression.

Next lever for variable-heavy grammars: bit-pack variable bindings too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: remove the out-of-process Server-GC worker/client architecture

Delete SIL.Machine.Morphology.HermitCrab.Server (worker host, HermitCrabServerClient,
protocol, Program) + its tests, and the .sln / test-project references. It was a
workaround for the in-process Workstation-GC parallel ceiling (separate Server-GC
process: ~100 MB worker, .NET 10 runtime dependency, a richer protocol + FieldWorks
adapter still to build). The RUSTIFY direction supersedes it: drive allocation low
enough in-process (COW + bit-packed unify + arena work) that plain .NET needs no
Server GC. Server GC stays available as a runtimeconfig flag if ever wanted.

63 HermitCrab tests pass; solution builds without the project.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: per-word FST-traversal arena (off by default) + key parallel finding

Add a per-thread arena that reuses traversal methods + instance free-lists across
a word (FstThreadPool, reset per word from Morpher.ParseWord) via a Reset() on the
traversal methods. Gated by Fst.TraversalPoolEnabled, DEFAULT OFF.

Measured (Sena, A/B same load): single-thread allocation -13%, BUT 16-thread
scaling collapses 2.87x -> 1.29x. Confirmed across 4 pooling variants. Cause:
under Workstation GC, pooled objects live across the word -> survive Gen0 ->
promote -> stop-the-world Gen2 serializes the threads. Short-lived (Gen0-only)
allocation is actually BETTER for parallel. So object-pooling is the wrong tool
for the no-Server-GC-at-16-threads goal; the right arena is struct/Span/stackalloc
(no GC retention). Kept off-by-default as a single-thread/Server-GC opt-in.

63 HermitCrab + 806 SIL.Machine tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: struct/Span FST traversal is blocked on the data model (record path)

Verified blocker: the FST offset type for HermitCrab is ShapeNode (a class), so
Register<ShapeNode> and the traversal instances are managed -> cannot stackalloc
or hold in a stack Span, and pooling them recreates the Phase 1b Gen2 regression
(Advance is also an iterator, forbidding stackalloc). The struct/Span no-GC
traversal therefore requires the foundational change: represent the shape as a
flat array with int-index offsets so Register<int> is unmanaged -> value-type
register/instance buffers, zero GC-heap allocation in the traversal, Gen0
pressure drops, parallel scales without Server GC. Large, foundational rewrite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3c: FstStatistics per-category allocation breakdown harness

Adds FstStatistics (SIL.Machine) to decompose the "80% FST scaffolding" into
four named buckets — VarBindings.Clone, Registers.Clone, per-Transduce Scaffold,
and TraversalMethod creation — so the flat-buffer investment can be gated on real
numbers from Sena (not theory).

Key findings from en-hc + WEB-PT run (439 words, 82.9 KB/word):
  Word.Clone         21%
  Pure scaffold       1%  (Register[], HashSet, List per Transduce)
  VarBindings         1%  (negligible on English; will be larger on Sena)
  Registers           0.1%
  TraversalMethod     0.7%
  Other (cascade)    55%  (MarkMorph/Annotation/stratum-rule overhead, NOT FST)

Flat-buffer addresses ~22% on the toy grammar; Sena breakdown needed to decide
whether to pursue the full int-offset Shape rewrite. See RUSTIFY.md § Phase 3c.

RustifyBenchmark now falls back to en-hc + WEB-PT when HC_GRAMMAR/HC_WORDS are
not set, so the breakdown harness is immediately runnable without a FLEx grammar.
63 HC tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HC: expand cascade breakdown harness (Segment, Word.ctor, MarkMorph, analysis window)

Adds four new allocation probes to fully decompose the 55% 'Other' bucket:
- MorpherStatistics.SegmentBytes: wraps Segment() (initial Shape/ShapeNode creation)
- MorpherStatistics.WordCtorBytes: wraps new Word(stratum, shape) construction
- MorpherStatistics.MarkMorphBytes: wraps Word.MarkMorph() annotation allocation
- MorpherStatistics.AnalysisCascadeBytes: wraps _analysisRule.Apply().ToList() (superset)

English toy grammar result (439 words, 35 MB total):
  Segment (initial Shape)    7.2%   Scaffold (pure FST) ≈ 0%
  Word.ctor(new)             9.6%   Rule-chain machinery ~40.7%
  Word.Clone                21.3%   Synthesis + other  ~18.8%
  Scaffold (incl. clones)   21.9%
→ analysis window superset  64.4%

Key finding: MarkMorph ≈ 0%; pure FST scaffold ≈ 0%; dominant costs are
Word.Clone (21%), Word.ctor+Segment (17%), and rule-chain LINQ/FstResult (~41%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

RUSTIFY: move Word.ctor allocation probe into the Word constructor

The Word.ctor probe lived in Morpher.AnalyzeWord, so it only measured the
single initial construction per word, not the cascade-created Words. Move it
into Word(Stratum, Shape) itself (gated on MorpherStatistics.Enabled, off in
production) and add WordCtorCount so the breakdown reports calls as well as
bytes. Harness-only; no production-path behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4a: hot-loop allocation eliminations (safe, no retention)

Four pure-elimination changes in the FST traversal + analysis cascade. Each
removes an allocation outright without extending any object lifetime, so none
can trigger the Phase-1b parallel regression (pooling promotes to Gen2 ->
serializes threads). All validated: 803 SIL.Machine + 63 HermitCrab tests
green; SenaParallel scaling unchanged.

1. ITraversalMethod.Traverse returns List<FstResult> (was IEnumerable) -> drop
   the redundant .ToList() in Fst.Transduce. All four concrete Traverse impls
   already return the curResults List; the interface was needlessly widened.
2. Remove redundant .Distinct(FreezableEqualityComparer<Word>.Default) x2 in
   AnalysisStratumRule.ApplyMorphologicalRules/ApplyTemplates. Both _mrulesRule
   and _templatesRule are built with that same comparer and return a HashSet
   already deduped by it, so the Distinct pass is a no-op DistinctIterator.
3. Skip the DistinctIterator for trivial result sets in Fst.Transduce:
   (allMatches && resultList.Count > 1) ? resultList.Distinct() : resultList.
   resultList is non-null with Count >= 1 there, and Distinct over a single
   element is the identity.
4. TraversalMethodBase.Reset: replace the per-Transduce GetNodesDepthFirst
   yield iterator (heap state machine per top annotation, thousands/word) with
   the allocation-free PreorderTraverse(action) form; delegate cached as a
   field (allocated once in ctor, not per call).

Measured (en-hc, SenaQuick, 439 words): Other 38.5% -> 36.2%
(14,145KB -> 12,884KB), KB/word 83.6 -> 81.1. Toy-grammar deltas are small;
real magnitude needs the Sena grammar. See RUSTIFY.md Phase 4a/4b (4b documents
the rejected scaffold-buffer ThreadStatic pooling: re-entrant via
acceptInfo.Acceptable, lifetime extension against the thesis, unmeasurable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4c: single-hash traversed.Add in nondet FST traversal

In NondeterministicFstTraversalMethod.Traverse (both epsilon and input-match
branches), the dedup check hashed the expensive structural key (state +
annotation index + register array + outputs array) twice: once for
Contains(key), then again for Add(key). HashSet.Add already returns false when
the element is present, so collapse to `if (traversed.Add(key)) Push(newInst);`
— a single hash/lookup in the innermost traversal loop. Byte-identical.

CPU-only cleanup (no allocation change), so KB/word is flat on the toy grammar;
the structural-hash cost it removes is not resolvable there. Gated:
803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.
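
A minimal sketch of the single-lookup pattern, assuming a tuple key as a hypothetical stand-in for the real structural key (which would be far costlier to hash). `PushTwoLookups` and `PushOneLookup` are illustrative names, not methods from the codebase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the traversal key (state + annotation index only).
var traversed = new HashSet<(int State, int AnnIndex)>();
var stack = new Stack<(int State, int AnnIndex)>();

// Before: the key is hashed twice for every new entry (Contains, then Add).
void PushTwoLookups((int State, int AnnIndex) key)
{
    if (!traversed.Contains(key))
    {
        traversed.Add(key);
        stack.Push(key);
    }
}

// After: HashSet<T>.Add returns false when the key is already present,
// so a single hash/lookup both tests and inserts.
void PushOneLookup((int State, int AnnIndex) key)
{
    if (traversed.Add(key))
        stack.Push(key);
}

PushOneLookup((1, 0));
PushOneLookup((1, 0)); // duplicate: not pushed again
PushTwoLookups((2, 0)); // same observable behavior, one extra hash
Console.WriteLine(stack.Count); // 2
```

Both forms leave the set and the stack in the same state; the difference is only how many times the key is hashed.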

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: drop per-call List<int> in TraversalMethodBase.Advance

Advance collected the same-offset annotation window into `var anns = new
List<int>()` and then iterated it. That window is a contiguous index range
[nextIndex, annsEnd) (the build loop adds every consecutive i whose start
offset matches), so track the end bound and iterate the range directly,
eliminating one List allocation per arc match (one of the hottest paths in
the traversal). cloneOutputs/first flow is unchanged.

Measured (en-hc toy, SenaQuick, 439 words): KB/word 80.3 -> 79.0, totalMB
34 -> 33. Gated: 803 SIL.Machine + 63 HermitCrab tests green. The toy grammar
under-measures, so treat the delta as directional; no regression is the bar.
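
The list-to-range change can be illustrated with a small sketch. The `startOffsets` array is hypothetical data standing in for the annotations' start offsets; only `nextIndex`, `anns`, and `annsEnd` echo names from the commit message, and the real Advance loop does more work per annotation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data: the start offset of each annotation, in list order.
int[] startOffsets = { 0, 0, 0, 1, 2 };
int nextIndex = 0;

// Before: materialize the same-offset window into a list, then iterate it.
var anns = new List<int>();
for (int i = nextIndex; i < startOffsets.Length && startOffsets[i] == startOffsets[nextIndex]; i++)
    anns.Add(i);

// After: the window is the contiguous range [nextIndex, annsEnd), so only
// the end bound needs tracking and no list is allocated.
int annsEnd = nextIndex;
while (annsEnd < startOffsets.Length && startOffsets[annsEnd] == startOffsets[nextIndex])
    annsEnd++;

int visited = 0;
for (int i = nextIndex; i < annsEnd; i++)
    visited++; // the real loop body would process annotation i here

Console.WriteLine($"{anns.Count} {annsEnd - nextIndex} {visited}"); // 3 3 3
```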

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: collapse identity-map LINQ in TraversalInstance.CopyTo

Both Deterministic and Nondeterministic CopyTo built an `outputMappings`
dictionary by zipping this.Output's node sequence with ITSELF — a
deterministic Queue-based BFS enumeration paired element-for-element, i.e. the
identity map — then projected _mappings through it. Since outputMappings[v]==v,
the entire block reduces to copying _mappings unchanged. Replace with
`other.Mappings.AddRange(_mappings)`, removing a Dictionary + two
SelectMany(GetNodesBreadthFirst, each allocating a Queue + yield iterator) +
Zip + Select per instance copy. CopyTo runs on every branch of nondeterministic
traversal, so this is allocation-heavy at scale (Sena ~276 clones/word) though
the toy grammar (2 clones/word, few branches) can't resolve it.

Byte-identical (provable identity-map reduction); other.Mappings is empty pre-
AddRange (GetCachedInstance -> Clear). Removed now-unused usings (System.Linq,
SIL.Machine.DataStructures) to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: paired-walk clone mapping in InitializeStack (Det + Nondet)

Both InitializeStack methods built inst.Mappings (source annotation -> clone)
by zipping two BFS node sequences:
  Data.Annotations.SelectMany(GetNodesBreadthFirst)
    .Zip(inst.Output.Annotations.SelectMany(GetNodesBreadthFirst), KVP)
allocating, per Transduce, a Queue per top annotation (BFS), two SelectMany
state machines, a Zip state machine, and one KeyValuePair per node.

Data and inst.Output are isomorphic (inst.Output = Data.Clone()), and the
resulting dictionary is independent of traversal order, so replace with a new
allocation-light helper DataStructuresExtensions.PairedPreorderTraverse that
walks the two forests in lockstep (preorder) and writes pairs straight into the
dict via a static (closure-free) callback. Debug.Asserts guard the isomorphism
invariant (root/leaf/child-count) so any future violation fails loudly instead
of silently truncating like Zip.

Runs once per Transduce (thousands/word). Toy grammar has tiny annotation trees
so the delta is below its resolution; the win compounds on Sena's long words.
Removed now-unused usings (System.Linq, SIL.Extensions) in the deterministic
method to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green (incl. the concurrent-
determinism test); SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: precompute initializer partition at Fst.Freeze

Fst.Transduce rebuilt a List<TagMapCommand> on every call (and every outer
annIndex iteration), filtering _initializers into Dest!=0 (the per-call cmds
list) vs Dest==0 (which drive a per-annotation SetOffset). The partition is
identical every call for a frozen FST, and cmds is read-only downstream
(Initialize -> ExecuteCommands only iterate it).

Partition _initializers once in Freeze() into _zeroDestInitializers /
_nonZeroDestInitializers (built into locals, gating field published last so a
reader never sees a half-filled list). Transduce reuses the shared read-only
_nonZeroDestInitializers as cmds and walks _zeroDestInitializers for the
SetOffsets, eliminating the per-call list allocation + filter loop. When the FST
isn't frozen the fields are null and Transduce falls back to the exact inline
build, so unfrozen callers are unaffected. The frozen FST is shared read-only
across parsing threads, so concurrent reads of the shared cmds list are safe.

Measured (en-hc toy, SenaQuick, 439 words): Scaffold 22.4% -> 21.0%
(7897KB -> 7261KB), KB/word 79.0 -> 78.8. SenaParallel: scaling and allocation
unchanged (no parallel regression from sharing the list). Gated: 803
SIL.Machine + 63 HermitCrab tests green, incl. the concurrent-determinism test.
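
A single-threaded sketch of the freeze-time partition with the unfrozen fallback. The tuple type is a hypothetical stand-in for TagMapCommand, and `GetNonZeroCommands` is an illustrative name. The sketch shows only the assignment order, not the memory-ordering guarantees the real code would need for cross-thread publication.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for TagMapCommand: only Dest matters for the partition.
var initializers = new List<(int Dest, int Src)> { (0, 1), (2, 3), (0, 4), (5, 6) };

// Both stay null until Freeze() runs.
List<(int Dest, int Src)> zeroDest = null;
List<(int Dest, int Src)> nonZeroDest = null;

void Freeze()
{
    // Build into locals so no reader ever sees a half-filled list.
    var zero = new List<(int Dest, int Src)>();
    var nonZero = new List<(int Dest, int Src)>();
    foreach (var cmd in initializers)
        (cmd.Dest == 0 ? zero : nonZero).Add(cmd);
    zeroDest = zero;
    nonZeroDest = nonZero; // the field readers test is assigned last
}

List<(int Dest, int Src)> GetNonZeroCommands()
{
    // Frozen: reuse the shared, read-only partition (no per-call allocation).
    if (nonZeroDest != null)
        return nonZeroDest;
    // Unfrozen: fall back to the inline per-call build.
    var cmds = new List<(int Dest, int Src)>();
    foreach (var cmd in initializers)
        if (cmd.Dest != 0)
            cmds.Add(cmd);
    return cmds;
}

bool sharedBefore = ReferenceEquals(GetNonZeroCommands(), GetNonZeroCommands());
Freeze();
bool sharedAfter = ReferenceEquals(GetNonZeroCommands(), GetNonZeroCommands());
Console.WriteLine($"{sharedBefore} {sharedAfter} {zeroDest.Count} {nonZeroDest.Count}"); // False True 2 2
```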

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4c (five safe no-retention FST eliminations)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: strip unrelated USFM work + remove dead GC/pooling machinery

Audit of master..hc-rustify identified work to drop now that allocation is
driven down and the branch is being squashed:

1. Strip the unrelated USFM/versification change set (3 commits: #430, #432,
   normalization port) — restored src/SIL.Machine/Corpora,
   src/SIL.Machine/PunctuationAnalysis and their tests to master, removed the
   USFM-added files. None of it is perf work; it belongs in its own PR.

2. Remove the dead FST traversal pooling (measured to REGRESS parallel parsing,
   Phase 1b): the FstThreadPool class, Fst.TraversalPoolEnabled, the pooling
   branch in Fst.Transduce, FstThreadPool.Reset() in Morpher.ParseWord, and the
   HC_ARENA toggle in RustifyBenchmark. Transduce now always uses a fresh
   (die-in-Gen0) traversal method, which is the right tradeoff once allocation
   is low.

3. Remove the Shape.CopyTo [ThreadStatic] clone-map pool, keeping the
   value-added inline mapping build (no second GetNodes().Zip().ToDictionary()
   pass) — just a plain per-call Dictionary.

4. Revert Machine.sln to master (leftover x64/x86 platform configs) and remove
   three superseded planning docs (HERMITCRAB_ALLOCATION_STRATEGIES /
   COW_PLANS / PERF_PLAN), now consolidated in RUSTIFY.md.

Kept (per request): the allocation instrumentation (MorpherStatistics,
FstStatistics, probes) + both benchmarks for before/after measurement; the
MaxDegreeOfParallelism API + Synthesize refactor; COW FeatureStruct; bit-packed
unify; Phase 4a/4c eliminations.

Gated: 801 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4d cleanup (pooling/arena removed)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: restore the safe Shape.CopyTo [ThreadStatic] clone-map pool

Re-examination showed this pool is NOT the regressive kind removed elsewhere.
The Phase-1b parallel regression came from objects retained ACROSS a word
(promoted to Gen2). Shape.CopyTo's CloneMapping is cleared and fully consumed
WITHIN each call (contents die immediately; only a small empty buffer persists),
so it cannot promote parse data to Gen2 — and it still buys a small allocation
win (~0.45% on Sena; on the toy grammar removing it had pushed Word.Clone
22.4% -> 23.6% and KB/word 78.8 -> 80.2). Restored, with RUSTIFY.md Phase 4d
note corrected to record it as KEPT (safe pool) rather than removed.

Gated: 801 SIL.Machine + 63 HermitCrab tests green; SenaQuick KB/word back to
78.8, Word.Clone 22.4%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: staged implementation plan for the flat int-index shape

Records the chosen direction (go flat) + corrected feasibility findings (no
TOffset constraints; int-offset engine already tested; ShapeNode contained to
~95 in-repo refs), the accepted cost (ShapeNode -> handle, value identity), and
the 3-stage plan: (1) array-backed Shape + ShapeNode handle + array-copy Clone,
(2) int FST offset + unmanaged Span/stackalloc traversal (the parallel unlock),
(3) migrate rule sites to indices.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: real Sena measurement obtained — reshapes flat-shape priorities

Generated sena-hc.xml from the Sena 3 FieldWorks backup (GenerateHCConfig.exe)
and extracted sena-words.txt (7,121 words) from the project's seh running text.
SenaQuick (400 words, MaxUnapp=5) now gives the clone-heavy numbers the spike
needs:

  clones/word=345 (estimate ~276 confirmed), KB/word=14,116
  Scaffold 42.2% (per-Transduce Register[,] arrays)  <- biggest bucket
  Word.Clone 21.9%, Other 24.8%

Key finding: Scaffold (managed Register<ShapeNode>[,] per Transduce) is ~2x
Word.Clone, so the flat int-index foundation's biggest payoff is Stage 2
(Register<int> -> stackalloc/Span, zero heap), bigger than the Word.Clone bucket
the goal named, and unlocked by the same change. Confirms flat over COW (COW
cannot touch the traversal scaffold). Benchmark assets untracked in samples/data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 1: array-backed Shape + ShapeNode handle (flat-shape foundation)

Re-represents the shape as a flat, int-indexed backing — the data-model
foundation the whole flat-shape plan (Phase 3b-impl) stands on, and the
prerequisite for Stage 2's Fst<Word,int> register-scaffold win (the measured
42% bucket) and Stage 3's Word.Clone cut (22%).

- Shape no longer inherits OrderedBidirList<ShapeNode>; it owns its nodes in
  flat arrays (_next/_prev int links = in-array doubly-linked list, per-node
  frozen flag, canonical handle) addressed by a stable ShapeNode.Index, and
  reimplements IOrderedBidirList/IOrderedBidirListNode over them.
- ShapeNode becomes a handle (Owner + Index); links/frozen delegate to the
  owner arrays. The added node IS retained as the canonical one-per-slot handle,
  so reference identity (==, dict keys, Range<ShapeNode> endpoints) is unchanged.
- Tag deliberately stays on the node so it survives a node moving between shapes
  (AddAfter sets the new tag before detaching from the old owner). The tag-relabel
  order maintenance, Freeze/Clone/CopyTo and annotation interactions are preserved.

Gate: 803 SIL.Machine + 63 HermitCrab tests green (incl. concurrent-determinism).
Measured neutral on en-hc toy (SenaQuick): KB/word 80.5 -> 80.5, clones/word 2,
gen0 2 — exactly the plan's "Stage 1 ~= 0" prediction; payoff is unlocked by, not
realized in, this increment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2 substrate: freeze-time NodeAt int<->node bridge + design blueprint

Adds Shape.NodeAt(int) backed by a dense _byPos[] table built in Freeze (content
nodes already get dense Tag 0..N-1 there), the int-offset -> ShapeNode bridge the
Fst<Word,int> binding will resolve against. Additive and behavior-preserving;
803 SIL.Machine tests green.

Records the resolved Stage 2 blueprint in RUSTIFY.md: offset = dense frozen tag
(HC always freezes before traversal), half-open [t,t+1) ranges reusing
IntegerRangeFactory (provably identical ordering/Overlaps/Contains to the
inclusive ShapeNode form for a one-unit-per-node model), Word/Shape become
IAnnotatedData<int> via a freeze-time AnnotationList<int> projection, rules resolve
int->node via NodeAt, and Register<int> goes unmanaged (the 42% Scaffold payoff).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2 blueprint: correct it after reading the rule-application flow

Reading IterativePhonologicalPatternRule.Apply + the rewrite SubruleSpecs +
the semantic-site catalog overturned two blueprint assumptions:

- Rewrite rules MUTATE match.Input.Shape in place while UNFROZEN and re-match
  repeatedly, so the traversed shape's tags are SPARSE, not dense 0..N-1.
  => offset must be the raw ordered Tag (the [Tag,Tag+1) half-open mapping is
  still provably correct for sparse tags: Tag+1 is always <= the next tag).
- NodeAt must therefore work on unfrozen shapes => a Tag->node map maintained
  incrementally (AddAfter/Remove/Relabel), not a freeze-only dense array.

Records the real hazards to design for: End.Tag==int.MaxValue overflows [t,t+1)
(anchors are in the annotation list and ordered on add); the int annotation
projection must stay in sync with the live mutated ShapeNode list (not build-once
at freeze); and ~30 offset-navigation sites (match.Range.End.Next etc.) must route
through shape.NodeAt(tag).Next?.Tag preserving null-at-boundary. Net: the flip is
larger/subtler than a mechanical generic swap — a multi-session spike, each
sub-piece behind the byte-identical gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: empirically validate the int-offset range-mapping thesis

Adds a parity test proving the assumption the whole TOffset=ShapeNode -> int flip
rests on: mapping each annotation [startNode,endNode] to the half-open int range
[startNode.Tag, endNode.Tag+1) preserves the range relationships the FST traversal
depends on — CompareTo ordering, Overlaps, Contains — for SPARSE tags (appended
unfrozen shape, as rewrite rules see it) and dense tags (frozen). Pairwise over
all annotations of a shape with a spanning (start!=end) annotation. Both cases green.
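
The shape of such a parity check can be sketched in Python. The overlap, containment and ordering definitions below are assumptions chosen for the illustration (closed-interval and half-open-interval tests on integer tags); they are not the library's Range<T> implementation, and `parity_holds` is a hypothetical helper.

```python
from itertools import product


def overlaps_inclusive(a, b):
    # a and b are inclusive (start_tag, end_tag) pairs
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a, b):
    # a and b are half-open (start, end) pairs
    return a[0] < b[1] and b[0] < a[1]


def contains(a, b):
    # same test for both forms: b lies within a
    return a[0] <= b[0] and b[1] <= a[1]


def to_half_open(r):
    return (r[0], r[1] + 1)


def parity_holds(tags):
    """Compare every pair of ranges over strictly increasing integer tags."""
    ranges = [(s, e) for s, e in product(tags, tags) if s <= e]
    for a, b in product(ranges, ranges):
        ha, hb = to_half_open(a), to_half_open(b)
        if overlaps_inclusive(a, b) != overlaps_half_open(ha, hb):
            return False
        if contains(a, b) != contains(ha, hb):
            return False
        if (a < b) != (ha < hb):  # lexicographic order on (start, end)
            return False
    return True
```

Under these definitions `parity_holds([0, 1, 2, 3])` (dense) and `parity_holds([3, 10, 11, 500])` (sparse) both pass, because for integers `x <= e` and `x < e + 1` are the same condition.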

This de-risks the riskiest design point before any code is built on it: now 805
SIL.Machine (+2) + 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 1: measure the flat Word on the REAL Sena grammar (the clone-heavy case)

The toy-grammar SenaQuick (2 clones/word) was too clone-light to judge an
API-breaking rewrite. Ran SenaQuick against the real Sena grammar (sena-hc.xml,
400 words, HC_MAX_UNAPP=5) where the ~345 clones/word payoff lives, vs the
pre-Stage-1 baseline recorded in 2fd1a2d3:

  clones/word 345 -> 345  (byte-identical AND behavior-identical at scale)
  KB/word     14116 -> 14583  (+3.3%)
  gen0        442 -> 457
  Scaffold    42.2% -> 42.7%   Word.Clone 21.9% -> 22.3%  (split reproduced)

Findings: (1) the flat data model produces an identical clone count on the
pathological grammar, confirming correctness beyond the toy tests; (2) Stage 1
in isolation costs +3.3% allocation -- the four per-Shape backing arrays
(_nodes/_next/_prev/_frozen) x 138k clones -- exactly the plan's "Stage 1 ~= 0 or
slightly negative" prediction, hidden by the toy grammar's 2 clones/word. The
cost is the investment: the 42.7% Scaffold (Stage 2 Register<int>) and 22.3%
Word.Clone (Stage 3) are what the same flat foundation unlocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: int-offset annotation projection on Shape (the Fst<Word,int> bridge)

Adds the linchpin infrastructure for the FST flip, additively (Shape still
IAnnotatedData<ShapeNode>; nothing flipped yet, full suite green):

- AnnotationList<T> gains an internal Version counter (bumped on Add/Remove/Clear)
  so derived views can detect staleness cheaply.
- Shape builds a lazy, version-gated int-offset projection: AnnotationList<int>
  IntAnnotations (each [s,e] -> half-open [s.Tag, e.Tag+1), End margin clamped to
  avoid +1 overflow), Range<int> IntRange, and Dictionary<int,ShapeNode> for
  NodeAt(offset) (now works frozen AND unfrozen, via Tag). FeatureStruct is shared
  by reference so in-place rule edits stay visible. The projection is rebuilt only
  when the annotation Version or frozen-state changes, so a stable/frozen shape
  builds it once and reuses it across thousands of Transduce calls per word.
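
A version-gated lazy projection of this kind can be sketched as below. This is a simplified Python illustration under stated assumptions: the class and attribute names are hypothetical, annotations are reduced to (start_tag, end_tag) pairs, and the frozen-state check and End-margin clamp are omitted.

```python
class AnnotationList:
    def __init__(self):
        self._items = []
        self.version = 0  # bumped on every structural change

    def add(self, start_tag, end_tag):
        self._items.append((start_tag, end_tag))
        self.version += 1

    def __iter__(self):
        return iter(self._items)


class Shape:
    def __init__(self):
        self.annotations = AnnotationList()
        self._projection = None
        self._projection_version = -1
        self.rebuilds = 0  # instrumentation for the example only

    @property
    def int_annotations(self):
        # Rebuild only when the annotation list changed since the last build.
        if self._projection_version != self.annotations.version:
            self._projection = [(s, e + 1) for s, e in self.annotations]
            self._projection_version = self.annotations.version
            self.rebuilds += 1
        return self._projection
```

Repeated reads of `int_annotations` on an unchanged shape return the same list object; one `add` triggers exactly one rebuild on the next read.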

Tests: IntAnnotationProjection_MirrorsShapeNodeAnnotations verifies the projection
mirrors the ShapeNode tree (ranges, FeatureStruct identity, optional, children),
NodeAt round-trips every node by Tag, and the cache invalidates on mutation.
805 (+2) SIL.Machine green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WIP RUSTIFY Stage 2: flip HC FST to Fst<Word,int> (57/63 HC green)

The full TOffset ShapeNode->int flip across HermitCrab (~71 files): Word is now
IAnnotatedData<int>; Shape exposes a lazy, version-gated int-offset annotation
projection with DENSE node positions (0..N+1) as offsets — dense (not sparse Tag)
to avoid the Range<int>.Null=-1 collision, +1 overflow at the End margin, and
empty anchors. NodeAt/OffsetOf/MatchStartOffset bridge int<->node; rule RHS code
resolves match/group int ranges back to nodes (half-open [off, off+1), so leftmost
= NodeAt(Start), rightmost = NodeAt(End-1)). MatchStartOffset(node,dir) handles the
inclusive->half-open asymmetry for right-to-left match-start offsets.

Down from 23 failures to 6 (all now logic, not crashes): metathesis SimpleRule/
ComplexRule, DeletionRules/MultipleDeletionRules, EpenthesisRules, ReduplicationRules
— node insertion/deletion/movement + group-capture rules still need correctness work.
Register<int> stackalloc (the payoff) comes after these are byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: fix analysis under-generation (63/63 HC green)

Two byte-identical fixes to the int-offset projection, restoring the 4
analysis-direction rewrite tests (Epenthesis/Deletion/MultipleDeletion/
Reduplication) that the Fst<Word,int> flip broke:

1. Annotation.Optional must invalidate the projection. The Shape int
   projection copies Optional by value and caches against the annotation
   list Version, but the Optional setter is a non-structural change that
   never bumped Version. So once analysis flipped Optional=true on existing
   nodes, the matcher kept reading the stale Optional=false projection and
   never forked the optional-skip instances. The setter now bumps the root
   list's version (new AnnotationList.IncrementVersion). Fixes Epenthesis.

2. IntRange must be the half-open image of the inclusive [Begin, End], i.e.
   [off(Begin), off(End)+1) — not [off(Begin), off(End)]. The only consumer
   is Matcher.GetStartAnnotation via Range.GetStart(dir); a RtL match starts
   at GetStart(RtL)==End. The End anchor's dense range is [off(End),
   off(End)+1), whose RtL start coordinate is off(End)+1, so without the +1
   a RtL match began at the last content node and skipped any edit adjacent
   to End (e.g. inserting a deleted segment after the final vowel). Fixes
   the deletion/reduplication cases.
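
The off-by-one in fix 2 can be reproduced with plain integers. The sketch below is an illustration only: `start_coordinate` and `half_open` are hypothetical helpers standing in for Range.GetStart(dir) and the projection, and the five-node shape (Begin at offset 0, content at 1..3, End at 4) is an assumed example.

```python
def start_coordinate(rng, right_to_left):
    """Start of a (start, end) range for the given direction."""
    return rng[1] if right_to_left else rng[0]


def half_open(first_offset, last_offset):
    """Half-open image of the inclusive offset range [first_offset, last_offset]."""
    return (first_offset, last_offset + 1)


BEGIN, LAST_CONTENT, END = 0, 3, 4

shape_range = half_open(BEGIN, END)          # (0, 5): the fixed IntRange
end_anchor = half_open(END, END)             # (4, 5): the End anchor's own range
last_content = half_open(LAST_CONTENT, LAST_CONTENT)  # (3, 4)
buggy_range = (BEGIN, END)                   # (0, 4): missing the +1
```

With the +1, a right-to-left match starts at the same coordinate as the End anchor; without it, the start coordinate coincides with the last content node's, so the anchor is skipped.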

Adds two regression tests guarding both invariants. Also keeps the prior
working-tree Stage-2 fixes in IterativePhonologicalPatternRule and
SynthesisMetathesisRuleSpec (resolve int offsets to ShapeNode refs before
mutating the shape).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: record generic-flip-green milestone + post-flip measurement

The <Word,ShapeNode> -> <Word,int> generic flip is byte-identical green (63/63
HC + 808 SIL.Machine, full Release solution builds clean). Document the two
int-model correctness bugs found bringing it from 57/63 to green (Optional
cache invalidation + IntRange half-open End-anchor mapping), why 59 tests
masked them, and the post-flip en-hc baseline (KB/word 78.8 -> 86.1, the
projection "investment" before the Register<int> payoff). Records the
remaining Stage 2 payoff target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: refute the register-payoff hypothesis with real-grammar data

Wire Indonesian (the classic HC nasalization demo, variable-light, ~150
clones/word) alongside Sena (~345 clones/word) as the two measurement
grammars, and use them to investigate the Stage-2 thesis that Register<int>
being unmanaged unlocks a stackalloc cut of the 42% Scaffold bucket.

Measurement refutes it:
- Registers.Clone (the escaping accept snapshots the redesign targets) = 0.2%
  on Sena. Not where the bytes are.
- Converting the per-push dedup-key Tuple<State,int,Register[,][,Output[]]> to
  an inline `readonly struct TraversalKey` in both nondeterministic traversal
  methods moved allocation ~0% (Sena KB/word 14588->14579; Indonesian flat).
  Kept anyway: zero-risk, byte-identical, removes a real per-push heap object,
  CPU-positive (single Add vs Contains+Add), consistent with Phase-4c micro-
  eliminations.
- The Scaffold 38.5% IS the clone explosion: it contains Word.Clone (22.4%, via
  the per-instance Output=Data.Clone() in InitializeStack) + the per-instance
  Mappings dictionary + Output graph. The int flip's allocation payoff is
  therefore Stage 3 (flat-shape clone), not a register trick.

Full suite green (808 SIL.Machine + 63 HC, incl. concurrent-determinism).
RUSTIFY.md records the finding + how to regenerate both grammars.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: localize the clone cost — inherent per-node materialization

Two-phase allocation probe on Shape.CopyTo (Sena): Word.Clone's 22% splits
into CopyTo node-phase 11.4% (node.Clone + per-node dest.Add) + annotation-
phase 4.1% + ~6.9% Word/Shape ctor. The node-phase prize is inherent per-node
object materialization (ShapeNode + Annotation + COW FS + AnnotationList skip-
list entry per node), not intermediate churn.

Two incremental attacks measured ~0/negative and were reverted:
- pre-size the backing arrays vs AddAfter doubling: 666->688 MB (worse;
  source Count over-sizes partial-range CopyTo, doubling was never the cost).
- the per-push dedup Tuple->struct (prior commit): ~0.

Conclusion: the flat-clone payoff requires the deep redesign (lazy ShapeNode
handles + bulk AnnotationList clone + index-addressed annotations), the Stage-1-
deferred "Clone = Array.Copy" end-state — a multi-session foundational rewrite
needing a go/no-go. No incremental win exists short of it. Recorded in RUSTIFY.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: design + sequencing doc for the flat-shape clone spike

Per the plan-then-proceed go/no-go: RUSTIFY-stage3-design.md lays out the
foundational Word.Clone rewrite before any code goes red.

- Goal: kill the inherent per-node materialization (node-phase 11.4% +
  anns-phase 4.1% of Word.Clone) by making Shape.Clone an Array.Copy.
- Entanglement: ShapeNode reference-identity + annotations-hold-handles +
  skip-list-tower-per-annotation must be undone together.
- Key resolution: materialize-on-touch two-state shape. A clone is a flat
  snapshot (no handles/Annotation objects); the int projection (Stage 2)
  reads it for the hot frozen-traverse path so nothing materializes; any
  ShapeNode/Annotation request or in-place mutation materializes lazily,
  one-per-slot, restoring exact reference identity. This resolves the
  dense-index-vs-mutation tension: frozen-read pays nothing, unfrozen-mutate
  pays the old price (far colder).
- Byte-identical risk register, I->V sub-increment order (I FeatureStruct
  flat array + II flat AnnotationList = gateable green and keepable alone;
  III lazy materialization + Array.Copy clone = the red phase; IV triage the
  189 HC ShapeNode refs frozen-read vs mutate; V re-validate + measure),
  and rollback to dbef327a.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: advisor review + measure II first — towers are 7.4% (resequence)

Advisor review of the design front-loaded a read-only verification gate before
the risky III: the linchpin (int projection must rebuild from flat records,
handle-free), the make-or-break premise (UpdateOutput touches O(few) per FST-
transduction clone), result-consumer audit, and the I detached-FS caveat. Folded
into the design doc.

Then executed the advisor's "measure II before III" with a temporary tower-
allocation probe on Sena:

  annotation skip-list towers = 7.4% of total alloc (~432 MB, 6.31M arrays) —
  a THIRD of Word.Clone (22.4%), two-thirds of node-phase(11.4%)+anns-phase(4.1%).

Resequences the spike: increment II (flatten the BidirList tower arrays into
list-owned flat backing) is now the headline — ~7.4%, byte-identical, gateable
GREEN, zero laziness risk, independently keepable. Increment III's lazy-handle
materialization is downgraded to optional/gated: it buys only the residual ~8%
(the ShapeNode/Annotation objects) and carries the reference-identity risk. The
towers were the cheap two-thirds hiding behind the "inherent objects" framing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-a: grow skip-list margins on demand (Word.Clone -123MB Sena)

First positive allocation increment of the flat-clone spike, byte-identical green.

Every BidirList ctor Init'd both Begin/End margin nodes at the 33-level skip-list
maximum (new TNode[33] x2) regardless of actual list height. Since lists almost
always stay shallow, that eager margin tower was a large slice of the per-
AnnotationList tower allocation that dominates Word.Clone. Now:
- margins start at level 0 (Init(1) + link level 0);
- GrowMargins ensures capacity + links Begin<->End at a new level only when a
  node first reaches it (EnsureLevelCapacity right-sizes; geometric growth was
  measured slightly worse - it over-allocates the shallow majority);
- Clear resets to level 0, higher levels relink lazily on regrowth.

Measured (SenaQuick, Release): Sena Word.Clone 1,306,476 -> 1,182,940 KB
(-123 MB, -9.5% of Word.Clone, stable across runs; total KB/word -~2% under GC
noise); Indonesian Word.Clone -~0.5pt similarly. Full SIL.Machine (808) + HC (63,
incl. concurrent-determinism) green.

Contained to BidirList/BidirListNode (used by AnnotationList x2, SkipList,
TreeBidirList); does not touch ShapeNode/Annotation reference identity, so it is
independently keepable regardless of the later II-b / III increments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-b: inline skip-list level 0 (Word.Clone -54MB more, Sena)

Second positive flat-clone increment, byte-identical green. Level 0 (the only
level ~50% of skip-list nodes have) moves from the per-node _next[0]/_prev[0]
arrays into inline _next0/_prev0 fields, so level-0 nodes allocate NO tower array
at all and every taller node's array is one slot shorter (levels 1.. in
_nextHigh/_prevHigh, null when Levels<=1).
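
The level-0 inlining can be sketched as a node that keeps its first link in a field and allocates an array only for taller levels. This is a forward-links-only Python illustration; `SkipNode` and its members are names invented for the example, not the BidirListNode code.

```python
class SkipNode:
    """Skip-list node: level-0 link stored inline, taller levels in a lazily sized list."""

    __slots__ = ("value", "levels", "next0", "next_high")

    def __init__(self, value, levels):
        self.value = value
        self.levels = levels
        self.next0 = None
        # Only nodes taller than one level allocate a list, and it is one slot shorter.
        self.next_high = [None] * (levels - 1) if levels > 1 else None

    def get_next(self, level):
        return self.next0 if level == 0 else self.next_high[level - 1]

    def set_next(self, level, node):
        if level == 0:
            self.next0 = node
        else:
            self.next_high[level - 1] = node
```

Every accessor pays one branch on `level == 0`, which is the field-or-array dispatch the test suites gate.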

Touches the hottest skip-list accessors (GetNext/SetNext/GetPrev/SetPrev/Next/
Prev/Init/Clear/EnsureLevelCapacity); gated on the full SIL.Machine (808) + HC
(63, incl. concurrent-determinism) suites - green, so the level<->field-or-array
dispatch is byte-identical.

Measured (SenaQuick, Release): Sena Word.Clone 1,182,940 -> 1,128,660 KB
(-54 MB on top of II-a); Indonesian 222,491 -> 212,911 KB (-9.6 MB).

Cumulative II-a + II-b vs pre-Stage-3: Sena Word.Clone -177 MB (-13.6%), total
allocation -4.2% (KB/word 14,556 -> 13,942); Indonesian total -4.1%. Pure
allocation reduction, no retention, independently keepable regardless of III.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: index the Stage 3 II-a/II-b green increments (-4.2% allocation)

Record in the main plan that two byte-identical flat-clone increments landed
(margin grow-on-demand + inline level 0), banking the cheap skip-list tower
wins: Sena Word.Clone -177 MB (-13.6%), total -4.2%; Indonesian -4.1%. Points
to RUSTIFY-stage3-design.md for the residual III go/no-go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: III feasibility measured (41% Sena clones never mutated) + choose copy-on-write Shape mechanism

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: copy-on-write Shape — Word.Clone -59.6% (Sena), byte-identical

The flat-clone payoff. A clone of a *frozen* shape now stores _cowSource and
copies nothing. The asymmetry that makes this cheap + safe: the FST matcher (the
hot read path) consumes a clone only through the int-offset projection
(IntAnnotations/IntRange), which is served from the frozen source; while every
path that could mutate first hands out a ShapeNode/Annotation handle. So:
- serve IntAnnotations/IntRange/Count/GetFrozenHashCode/Freeze from the source
  while copy-on-write;
- gate EnsureInflated() (= the real CopyTo, then re-freeze if frozen-by-sharing)
  on the flat-backing link accessors, First/Last/enumeration, NodeAt/OffsetOf/
  MatchStartOffset/Annotations/GetNodes/CopyTo/ValueEquals, and every mutator.
A clone that is only traversed (matcher carrier) never inflates -> costs a shell
instead of N nodes + N annotations + their skip-list towers.
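
The clone-as-shell idea can be sketched in a few lines. This Python version is an assumption-laden simplification: the `Shape` class here holds a plain list of nodes, `__len__` stands in for the read paths served from the source, and `append` stands in for every mutator; none of it is the actual C# code.

```python
class Shape:
    def __init__(self, nodes=(), frozen=False):
        self._nodes = list(nodes)
        self._frozen = frozen
        self._cow_source = None  # frozen shape this shell borrows from, if any
        self.inflations = 0      # instrumentation for the example only

    def freeze(self):
        self._frozen = True
        return self

    def clone(self):
        # A frozen shape, or a shell that is still borrowing, is shared rather than copied.
        source = self._cow_source
        if source is None and self._frozen:
            source = self
        if source is not None:
            shell = Shape()
            shell._cow_source = source
            return shell
        return Shape(self._nodes)

    def _ensure_inflated(self):
        if self._cow_source is not None:
            self._nodes = list(self._cow_source._nodes)  # the real copy, paid on demand
            self._cow_source = None
            self.inflations += 1

    def __len__(self):
        # Read-only query: answered from the source without inflating.
        src = self._cow_source
        return len(src._nodes) if src is not None else len(self._nodes)

    def append(self, node):
        if self._frozen:
            raise ValueError("shape is frozen")
        self._ensure_inflated()
        self._nodes.append(node)
```

A clone that is only read never calls `_ensure_inflated`, so its cost is the shell object alone; the first mutation pays for the copy and leaves the frozen source untouched.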

Thread-safety (the doc's non-negotiable): a frozen shape's int projection is now
built eagerly at Freeze() (single-threaded), so the new pattern of several parse
threads' COW clones delegating to one shared frozen grammar shape always hits a
complete cache rather than racing a lazy first build.

Measured (SenaQuick, Release): Sena Word.Clone 1,128,660 -> 528,071 KB (-53% on
top of II; 20.2% -> 9.9% of total); Indonesian 212,911 -> 85,566 KB (-60%).
Cumulative Stage 3 (II-a+II-b+III) vs pre-Stage-3: Sena Word.Clone -778 MB
(-59.6%), share 22.4% -> 9.9%; Indonesian -62%. Word.Clone is no longer a top
bucket. Full SIL.Machine (808) + HC (63, incl. concurrent-determinism) green;
full Release solution builds clean; SenaParallel scaling unregressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: verify byte-identical on real grammars + COW invariant tests

Validation the toy suite can't give (en-hc is ~2 clones/word; COW's never-
inflated path runs ~170x hotter on Sena at 345 clones/word):

- Added RustifyBenchmark.Signature ([Explicit], not CI): emits a deterministic
  per-word analysis signature (sorted set of Category|root|glosses per
  WordAnalysis) to HC_SIG_OUT. Diffed HEAD vs the pre-Stage-3 baseline (dbef327a,
  isolating II+III) on BOTH grammars via a worktree: Sena (400 words) and
  Indonesian (121 words, 100 non-empty) signatures are IDENTICAL. The COW change
  is byte-identical where it actually runs hot, not just on the toy grammar.

- Added 3 CI-running COW-invariant regression tests (AnnotationTests):
  never-inflated clone serves the source's projection/range/count; mutating a
  clone inflates it and leaves the frozen source uncorrupted; frozen-by-sharing
  hash equals the source and stays stable across forced inflation.

Full SIL.Machine (811) + HC (63) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 2: lazily allocate Word's morphological-rule bookkeeping maps

_mrulesUnapplied / _mrulesApplied / _disjunctiveAllomorphIndices stay empty
through the phonological-analysis cascade (where ~345 clones/word happen) but
were cloned eagerly per candidate. Now null = empty, created on first write,
copied only when the source is non-empty. Byte-identical (63 HC green).
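
The "null = empty" convention can be sketched as follows. This is a Python illustration with a single hypothetical map (`_rules_applied`) standing in for the three bookkeeping maps; the method names are assumptions.

```python
class Word:
    def __init__(self):
        self._rules_applied = None  # None stands for "empty"; no dict until first write

    def applied_count(self, rule):
        if self._rules_applied is None:
            return 0
        return self._rules_applied.get(rule, 0)

    def mark_applied(self, rule):
        if self._rules_applied is None:
            self._rules_applied = {}  # created on first write
        self._rules_applied[rule] = self._rules_applied.get(rule, 0) + 1

    def clone(self):
        copy = Word()
        if self._rules_applied:  # copy only when the source is non-empty
            copy._rules_applied = dict(self._rules_applied)
        return copy
```

Readers must treat `None` and an empty map identically, which is why every accessor checks for `None` first.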

Measured (SenaQuick): Word.Clone 527,987 -> 499,121 KB (-29 MB), Word.ctor
184,858 -> 177,387 KB; total 5,267 -> 5,216 MB (~-1%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: hoist the initial-register scaffold out of the Transduce loop

Fst.Transduce allocated a fresh Register<TOffset>[regCount,2] per outer (start-
position) iteration. Traverse only Array.Copy's it into the initial instances
and never retains it, so it can be allocated once and Array.Clear'd per start
position - byte-identical, and AllMatches (analysis) runs one iteration per
start, so this removes (starts-1) register-array allocations per matcher call.

Measured (SenaQuick): Scaffold 2,264,486 -> 2,241,441 KB (-23 MB). Full suite
(811 SIL.Machine + 63 HC) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record levers 1+2 (lean Word + hoisted register scaffold, ~-1%, byte-identical) and why the 42% Scaffold prize stays blocked

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: replace per-instance Visited HashSet with an inline value bitset

Profiling showed the 42% Scaffold is instance churn: ~2,927 traversal instances
created per Sena word (only ~20% reused — the pool is per-Transduce, thrown away
each call, and pooling across calls re-triggers the Phase-1b Gen2 parallel
regression). So the fix is leaner instances, not pooling.

Each nondeterministic instance carried a HashSet<State> to avoid epsilon loops.
States have a dense Index, so this is now a value-type VisitedStates bitset:
states 0-63 in an inline ulong field (zero heap — HC rule FSTs are tiny), a lazy
ulong[] overflow only for 64+ state FSTs. The set is now part of the instance
object, not a separate ~1.17M/word heap allocation. Byte-identical (same dedup
semantics over state identity == Index).
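
A bitset of this shape can be sketched in Python. The class below reuses the name VisitedStates for readability, but its layout and method names are assumptions for illustration; Python integers are unbounded, so the 64-bit split is modelled explicitly rather than enforced by the type.

```python
class VisitedStates:
    """Set of non-negative state indices: one 64-bit word inline, overflow words on demand."""

    WORD_BITS = 64

    def __init__(self):
        self._low = 0          # bits for states 0..63
        self._overflow = None  # list of words for states 64+, created lazily

    def add(self, index):
        """Mark a state as visited; return True if it was not already present."""
        word, bit = divmod(index, self.WORD_BITS)
        mask = 1 << bit
        if word == 0:
            if self._low & mask:
                return False
            self._low |= mask
            return True
        if self._overflow is None:
            self._overflow = []
        while len(self._overflow) < word:
            self._overflow.append(0)
        if self._overflow[word - 1] & mask:
            return False
        self._overflow[word - 1] |= mask
        return True

    def __contains__(self, index):
        word, bit = divmod(index, self.WORD_BITS)
        if word == 0:
            return bool(self._low & (1 << bit))
        if self._overflow is None or len(self._overflow) < word:
            return False
        return bool(self._overflow[word - 1] & (1 << bit))
```

For FSTs with fewer than 64 states the overflow list is never created, so membership costs one mask test on an inline field.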

Measured (SenaQuick): Scaffold 2,269,759 -> 2,169,001 KB (-100 MB), total
5,242 -> 5,145 MB (~-2%). Full suite (811 SIL.Machine + 63 HC) green.

The remaining per-instance allocation (the Register[,] array, ~1.17M/word) is the
bigger prize but is blocked here: the `traversed` dedup key holds each instance's
register array BY REFERENCE, so a shared register arena (slices reused across
instances) would corrupt dedup. Cutting it needs the deep de-iterator + snapshot-
dedup rewrite, not a drop-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1 (deep): de-iterator Advance/Initialize into a reusable buffer

The core of the scaffold rewrite. Advance was a yield-based iterator and
Initialize allocated a fresh List per call (both recursive), so each of the
~2,482 Transduce/word -> millions of Advance calls minted an iterator state
machine / List. Both now fill ONE reusable per-method result buffer instead.

Safety: the buffer is a per-method (per-Transduce) field, so it carries no
cross-word retention (the Phase-1b Gen2 parallel regression) and cannot be a
thread-static (CheckAccepting's Acceptable predicate can re-enter Transduce on
the same thread). Initialize fills it once at the start of Traverse and the
caller fully consumes it building the work stack before the main loop's first
Advance reuses it, so the two never overlap (one buffer serves both). Advance is
not re-entrant within a method. Byte-identical: same results, same order.
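
The buffer-reuse contract can be sketched as below. This Python illustration uses a hypothetical `TraversalMethod` with a plain successor map; the point is only the rule that each `advance` result is fully consumed before the next `advance` call overwrites it.

```python
class TraversalMethod:
    def __init__(self, arcs):
        self._arcs = arcs  # state -> list of successor states
        self._buffer = []  # one result buffer reused by every advance call

    def advance(self, state):
        """Fill the shared buffer with the successors of `state` and return it.

        The caller must finish with the result before calling advance again.
        """
        buf = self._buffer
        buf.clear()
        for target in self._arcs.get(state, ()):
            buf.append(target)
        return buf

    def reachable(self, start):
        seen = {start}
        stack = [start]
        while stack:
            state = stack.pop()
            # The loop body only touches seen/stack, so the buffer is stable here.
            for target in self.advance(state):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen
```

Holding on to an earlier result across a second `advance` call would observe the new contents, which is why the buffer is safe only when calls never overlap.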

Measured (SenaQuick): total 5,145 -> 5,029 MB (-116 MB, ~-2.3%); the per-call
iterator state machines (Scaffold -147 MB) replaced by one buffer List/method
(+~39 MB in TraversalMethod after merging the two buffers into one). Full suite
(811 SIL.Machine + 63 HC, incl. concurrent-determinism) green.

NOTE on the register stackalloc premise: it does NOT apply to the nondeterministic
matcher (the hot path). The `traversed` dedup retains a per-config register
snapshot during each Transduce, so the registers are not transient stack values -
they're the evolving, snapshotted match state. The achievable scaffold wins are
therefore the iterator garbage (this commit) + the Visited HashSet (prior), not
stackalloc'd registers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record lever-1 deep rewrite (Visited bitset + de-iterator, ~-6% Sena, byte-identical) + the register-stackalloc constraint

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
johnml1135 added a commit that referenced this pull request Jun 30, 2026
…ling

Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the
dead compile-time SINGLE_THREADED flag with a runtime knob across all three
within-word parallel sites (synthesis, Unordered analysis cascade, affix-template
unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize
across words without nested oversubscription.

Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count,
analysis/synthesis phase timing, parallel-section counter (proves the sequential
path runs under degree-1), and a corpus benchmark (Explicit) that reports
GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar.

Profiling the real Sena grammar showed ~8,793 Word.Clone and ~371 MB allocated
per word (the combinatorial unapplication search). First allocation win:
Shape.CopyTo builds the src->dest node map inline instead of
.Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0).

Tests: 62 HermitCrab + 790 SIL.Machine pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: memoize multiApp cascade re-expansion; measure GC under parallel load

CombinationRuleCascade: in multiApp mode a word's expansion depends only on the
word, so memoize already-expanded words and skip re-descending them (collapses the
combinatorial re-exploration to a DAG; output set unchanged). Output-identical:
62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come
from the phonological/synthesis layers, not morphological re-expansion) but it
bounds pathological re-expansion blow-up at no correctness cost.
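
The memoization can be sketched as a worklist that expands each distinct word once. This Python version is an illustration under assumptions: `expand_all` and the `expand` callback are hypothetical names, words are plain hashable values, and the memo set is seeded with the input so a cycle back to it is not re-expanded.

```python
def expand_all(word, expand):
    """Return every word derivable from `word`, calling expand once per distinct word."""
    expanded = {word}  # seeded with the input word
    outputs = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for derived in expand(current):
            outputs.add(derived)          # the output set is the same with or without the memo
            if derived not in expanded:   # skip words whose expansion is already scheduled
                expanded.add(derived)
                stack.append(derived)
    return outputs
```

On a graph where two paths reach the same word, that word is expanded once instead of once per path, which is what collapses the search tree to a DAG.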

Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words
load and report Server vs Workstation GC — this is where alloc/GC contention
actually bites, unlike a single-threaded run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC perf plan: record Sena optimization results + Server-GC dominance finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: COW design study — 3 scoped plans to cut the FeatureStruct clone firehose

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: add copy-on-write safety-net tests before the COW refactor

- FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every
  mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/
  Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and
  ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw").
- Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged.
- Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load).

All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Plan A: copy-on-write FeatureStruct.Clone for frozen structs

Clone() of a FROZEN feature struct now returns a shell that borrows the source's
immutable backing dictionary; the first mutation (EnsureWritable, replacing
CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the
mutation nor any recursion into children can touch shared frozen data. Clone() of an
unfrozen FS still deep-copies. Single-file change; no public API change.

Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on
the real Sena grammar: -11% managed allocation/word and ~-29% wall on the 16-way
parallel pass (less GC contention). 801 core + 63 HC tests pass, including the new
COW safety-net tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: out-of-process Server-GC parser (worker host + reusable client)

New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get
Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher
in a child process:

- HermitCrabServerHost: loads a compiled HC config, serves analyze requests over
  stdin/stdout (newline-delimited JSON), parses each word single-threaded with
  parallelism across the batch. Launched with DOTNET_gcServer=1.
- HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the
  worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the
  boundary as DTOs that implement IMorpheme, so the client needs no grammar load.
- Shared protocol DTOs guarantee the two ends agree.
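
A newline-delimited JSON request loop of this general shape can be sketched as below. The message fields (`words`, `results`), the `serve` function and the empty reply for a malformed line are assumptions made for the illustration; the real protocol DTOs are not shown in this message.

```python
import io
import json


def serve(stdin, stdout, analyze):
    """Read one JSON request per line and write one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
            words = request["words"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed request: reply with an empty result and keep serving.
            response = {"results": []}
        else:
            response = {"results": [analyze(w) for w in words]}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the client blocks on each line, so flush per response
```

One response line per request line keeps the two ends in step without any framing beyond the newline.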

Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed
at process startup — so a worker subprocess is the only way to scope Server GC to the
parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a
thin IParser adapter mapping morph Properties -> LCM.

End-to-end on the real Sena grammar: out-of-process results match in-process; worker
runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: apply CSharpier formatting + braces to satisfy CI (formatting + code-style)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: address Copilot review comments

- CombinationRuleCascade: seed the memoization set with the initial input so a cycle
  back to it (A->B->A) doesn't re-expand it.
- Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already
  materialized and Synthesize no longer drains it).
- Server host/client: handle null JsonSerializer.Deserialize results with a clear
  protocol error instead of an NRE.
- MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't
  throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove copy-on-write FeatureStruct; keep deep-clone

Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior.
The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first
write) measured ~-11% allocation but is being held back from this performance PR
to keep it scoped to the single-threaded option + instrumentation + out-of-process
Server-GC parser. COW can return as its own focused change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC: address Copilot review comments (round 2)

- Honor Morpher.MaxDegreeOfParallelism cap in the two within-word parallel
  sites that previously ran at the default scheduler degree:
  ParallelCombinationRuleCascade (new MaxDegreeOfParallelism property, wired
  from AnalysisStratumRule) and AnalysisAffixTemplateRule.ParallelApplySlots.
- Server host: catch JsonException on a malformed request line and reply with
  an empty response instead of terminating the worker.
- Server client: kill+dispose the worker process if it fails to report READY
  (no leaked process on the startup-failure path).
- Server client: validate the worker returns exactly one result per requested
  word; fail fast with a clear error instead of misaligning results or indexing
  out of range.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: plan for data-oriented C# perf work on HermitCrab

Capture Rust's memory-architecture wins (pooling, struct-of-arrays, Span,
indices-not-pointers) in C# to attack the measured allocation/GC bottleneck,
piece by piece with a measurement after each change. One engine, no native lib.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 2: copy-on-write FeatureStruct (-20% bytes/word on en-hc)

Re-apply the COW FeatureStruct (reverts 892816f2): Clone() of a frozen feature
struct borrows the immutable backing and inflates (deep-copies) only on first
mutation. Inflate only reads the shared frozen backing, so it is thread-safe;
guarded by AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic.

Measured (en-hc toy grammar, 439 forms): managed allocated 106.5 -> 84.8 KB/word
(-20%), Gen0 3 -> 2, single-thread 91 -> 79 ms. 63 HC tests green.
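
For illustration only, a minimal sketch of the borrow-then-inflate idea described
above. CowFeatures, SharesBackingWith and the string-keyed dictionary are
hypothetical stand-ins, not the real FeatureStruct API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a feature struct with a freezable backing store.
public sealed class CowFeatures
{
    private Dictionary<string, string> _values;
    private bool _shared; // true while _values is borrowed from a frozen source

    public CowFeatures()
    {
        _values = new Dictionary<string, string>();
    }

    private CowFeatures(Dictionary<string, string> borrowed)
    {
        _values = borrowed;
        _shared = true;
    }

    public bool IsFrozen { get; private set; }

    public void Freeze()
    {
        IsFrozen = true;
    }

    // Clone of a frozen instance borrows the backing; otherwise deep-copy.
    public CowFeatures Clone()
    {
        if (IsFrozen)
            return new CowFeatures(_values);
        var copy = new CowFeatures();
        foreach (KeyValuePair<string, string> kvp in _values)
            copy._values[kvp.Key] = kvp.Value;
        return copy;
    }

    public string Get(string feature)
    {
        return _values[feature];
    }

    public void Set(string feature, string value)
    {
        if (IsFrozen)
            throw new InvalidOperationException("Frozen instances are immutable.");
        if (_shared)
        {
            // Inflate: only READS the shared frozen backing, then detaches.
            _values = new Dictionary<string, string>(_values);
            _shared = false;
        }
        _values[feature] = value;
    }

    // Exposed only so the sketch can show whether two instances share storage.
    public bool SharesBackingWith(CowFeatures other)
    {
        return ReferenceEquals(_values, other._values);
    }
}
```

A clone that is only read never pays for the copy; the first Set on the clone
is the point where the deep copy happens.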

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record Sena-too-slow measurement finding + harness strategy

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: fast single-pass Sena allocation probe (SenaQuick)

Budget-bounded, Console-flushed, single-pass probe usable on the real Sena
grammar (2789 words/20s) where the multi-pass MorpherBenchmark is too slow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: COW confirmed on real Sena grammar (-14% bytes/word, +9.5% throughput)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1/4: pool the per-clone Shape.CopyTo mapping dictionary

Reuse a [ThreadStatic] src->dest node map across CopyTo calls instead of
allocating one per Word.Clone. The map is fully consumed before CopyTo returns
and CopyTo is not reentrant, so per-thread reuse is safe.

Measured (Sena, SenaQuick): 11,997 -> 11,943 KB/word, Gen0 2621 -> 2561.
Small (the per-clone ShapeNode/Annotation objects, not the map, are the bulk);
kept as a safe step. 63 HC tests green.
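
A rough sketch of the per-thread reuse pattern, assuming a non-reentrant caller
that finishes with the map before asking for it again. CloneMapPool and Rent are
hypothetical names, not the code in Shape.CopyTo:

```csharp
using System;
using System.Collections.Generic;

public static class CloneMapPool
{
    // One map per thread. A [ThreadStatic] field initializer would only run
    // on the first thread, so the map is created lazily in Rent instead.
    [ThreadStatic]
    private static Dictionary<object, object> t_map;

    // Returns this thread's map, emptied. The caller must be done with the
    // previous contents before calling Rent again on the same thread.
    public static Dictionary<object, object> Rent()
    {
        Dictionary<object, object> map = t_map;
        if (map == null)
        {
            map = new Dictionary<object, object>();
            t_map = map;
        }
        else
        {
            map.Clear();
        }
        return map;
    }
}
```

Only the empty buffer survives between calls; the entries themselves are
dropped by Clear on the next Rent.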

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: finish the plan — Phase 6 decision gate + status

Record the mapping-pool result, mark phase statuses (Phase 2 done; Phase 1
partial/scoped; Phase 5 deferred to FW integration), and write the Phase 6
decision: continue capturing Rust's memory architecture in C# (COW shipped at
-14% Sena / -20% en-hc; next chunk = per-thread pooling of Word/ShapeNode/FST
buffers, now measurable via SenaQuick) rather than adopt Rust's runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: measure 16-thread throughput (SenaParallel) — the parallel answer

Add SenaParallel: one shared serial-within-word morpher, same word set at
dop=1/4/8/16, wall-clock words/sec + scaling.

Measured (800 Sena words, 20-core box):
  Workstation GC: 3.4x @4, 3.55x @8 (peak), 3.14x @16 -> REGRESSES (GC ceiling,
    gen0 ~580 regardless of threads). Allocation is the parallel ceiling.
  Server GC:      5.7x @4, 8.1x @8, 10.3x @16 (gen0 ~88) -> ~11x vs 1-thread WS.

Confirms: the out-of-process Server-GC worker (PR #438) already delivers the
16-thread win; RUSTIFY pooling is what lifts the in-process Workstation curve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: pinpoint allocation split — 20% Word.Clone, 80% FST traversal

Add an opt-in per-thread AllocationProbe hook (set from the net10 test via
GC.GetAllocatedBytesForCurrentThread) to attribute Word.Clone's allocation.

Measured (Sena, SenaQuick): of ~11.8 MB/word, ~20% is Word.Clone (Shape deep
copy) and ~80% is the FST traversal/cascade (a fresh TraversalMethod + List +
Queue + register snapshots + FstResults per rule application). Redirects the
plan: the FST traversal (esp. reusing the instance cache across Transduce
calls) is the high-ROI lever, Word/Shape pooling the secondary one.

Probe is zero-overhead when disabled and behavior-identical when no probe is
set. 63 HC tests green.
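
A minimal sketch of such an opt-in hook, under the assumption that the probe is
a per-thread delegate the test assigns. AllocationProbe's members here are
illustrative names, not the shipped API; GC.GetAllocatedBytesForCurrentThread is
the real runtime call the test would plug in:

```csharp
using System;

public static class AllocationProbe
{
    // Hypothetical hook; null means the probe is disabled on this thread.
    [ThreadStatic]
    public static Func<long> GetAllocatedBytes;

    // Bytes attributed to measured work on this thread.
    [ThreadStatic]
    public static long AttributedBytes;

    public static T Measure<T>(Func<T> work)
    {
        Func<long> probe = GetAllocatedBytes;
        if (probe == null)
            return work(); // disabled: run the work with no bookkeeping

        long before = probe();
        T result = work();
        AttributedBytes += probe() - before;
        return result;
    }
}
```

A test on a runtime that has the API could enable it with
`AllocationProbe.GetAllocatedBytes = GC.GetAllocatedBytesForCurrentThread;`
and read AttributedBytes after the parse.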

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 1 (FST): pool traversal method per-thread under Server GC

Reuse one traversal method per thread per Fst (via a new Reset()) so its
instance free-list survives across the thousands of Transduce calls per parse,
instead of allocating + discarding a fresh traversal method + instance pool on
every rule application (measured: ~80% of parse allocation is the FST traversal).

Gated to Server GC (cached GCSettings.IsServerGC), because pooling trades
transient garbage for a larger LIVE working set: under Workstation GC that
triggers stop-the-world Gen2 pauses that serialize threads and REGRESS parallel
scaling (16T 3.1x -> 1.5x). Under Server GC it is a clear win.

Measured (Sena, 800 words, SenaParallel):
  Server GC 16T: 10.3x -> 11.2x; allocation -16% (7.0->5.9 GB); Gen0 88 -> 42.
  Workstation 16T: 3.16x unchanged (per-call path retained).
803 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: conclusion — GC no longer dominates at 16 threads (Server GC, 11.2x)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: back out FST traversal pooling (restore pre-pooling FST)

Removing the per-thread traversal-method pool: it only paid off under Server GC
and complicates the engine. Reverting to the original allocate-per-call FST
before restructuring to bit-packed feature vectors (the better lever).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3: bit-packed feature-vector unify fast path

Add a flat ulong-per-feature vector to FeatureStruct and a bitwise IsUnifiable
fast path in Input.Matches for the common phonological case (no defaults, no
negation, fully-symbolic arc input). Gated so the arc INPUT must be fully
bit-packable while the SEGMENT may carry ignorable non-symbolic features (FLEx
stamps a StringFeatureValue on every segment); FlatIndex is globally unique
across feature systems and assigned lazily. FeatureStruct.FlatUnifyEnabled
toggles it for A/B.

Correct: 63 HermitCrab + 806 SIL.Machine tests pass; parity assertion found zero
divergence on en/Sena/Indonesian.

Measured (single-thread, SenaQuick):
  Indonesian: 12,463 -> 11,268 KB/word (-9.7%), Gen0 44->40, 100% fast coverage.
  Sena:        9,053 ->  9,018 KB/word (neutral), 22% coverage (Bantu agreement
               uses variable arcs that fall back) -- no regression.

Next lever for variable-heavy grammars: bit-pack variable bindings too.
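
To make the bitwise check concrete, a sketch under stated assumptions: each
feature owns one ulong whose set bits are the symbols still allowed, and 0 means
"feature not specified". FlatUnify and this encoding are illustrative, not the
actual FeatureStruct layout:

```csharp
using System;

public static class FlatUnify
{
    // a[i] / b[i] are the allowed-symbol masks for flat feature index i.
    // Two vectors unify when no feature specified on both sides has
    // disjoint masks. Indices past the shorter array count as unspecified.
    public static bool IsUnifiable(ulong[] a, ulong[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            ulong x = a[i];
            ulong y = b[i];
            if (x != 0 && y != 0 && (x & y) == 0)
                return false;
        }
        return true;
    }
}
```

With bit 0 as "+" and bit 1 as "-" for a voicing feature, {1} against {3} is
unifiable (the second side allows either value) while {1} against {2} is not.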

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: remove the out-of-process Server-GC worker/client architecture

Delete SIL.Machine.Morphology.HermitCrab.Server (worker host, HermitCrabServerClient,
protocol, Program) + its tests, and the .sln / test-project references. It was a
workaround for the in-process Workstation-GC parallel ceiling (separate Server-GC
process: ~100 MB worker, .NET 10 runtime dependency, a richer protocol + FieldWorks
adapter still to build). The RUSTIFY direction supersedes it: drive allocation low
enough in-process (COW + bit-packed unify + arena work) that plain .NET needs no
Server GC. Server GC stays available as a runtimeconfig flag if ever wanted.

63 HermitCrab tests pass; solution builds without the project.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: per-word FST-traversal arena (off by default) + key parallel finding

Add a per-thread arena that reuses traversal methods + instance free-lists across
a word (FstThreadPool, reset per word from Morpher.ParseWord) via a Reset() on the
traversal methods. Gated by Fst.TraversalPoolEnabled, DEFAULT OFF.

Measured (Sena, A/B same load): single-thread allocation -13%, BUT 16-thread
scaling collapses 2.87x -> 1.29x. Confirmed across 4 pooling variants. Cause:
under Workstation GC, pooled objects live across the word -> survive Gen0 ->
promote -> stop-the-world Gen2 serializes the threads. Short-lived (Gen0-only)
allocation is actually BETTER for parallel. So object-pooling is the wrong tool
for the no-Server-GC-at-16-threads goal; the right arena is struct/Span/stackalloc
(no GC retention). Kept off-by-default as a single-thread/Server-GC opt-in.

63 HermitCrab + 806 SIL.Machine tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: struct/Span FST traversal is blocked on the data model (record path)

Verified blocker: the FST offset type for HermitCrab is ShapeNode (a class), so
Register<ShapeNode> and the traversal instances are managed -> cannot stackalloc
or hold in a stack Span, and pooling them recreates the Phase 1b Gen2 regression
(Advance is also an iterator, forbidding stackalloc). The struct/Span no-GC
traversal therefore requires the foundational change: represent the shape as a
flat array with int-index offsets so Register<int> is unmanaged -> value-type
register/instance buffers, zero GC-heap allocation in the traversal, Gen0
pressure drops, parallel scales without Server GC. Large, foundational rewrite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 3c: FstStatistics per-category allocation breakdown harness

Adds FstStatistics (SIL.Machine) to decompose the "80% FST scaffolding" into
four named buckets — VarBindings.Clone, Registers.Clone, per-Transduce Scaffold,
and TraversalMethod creation — so the flat-buffer investment can be gated on real
numbers from Sena (not theory).

Key findings from en-hc + WEB-PT run (439 words, 82.9 KB/word):
  Word.Clone         21%
  Pure scaffold       1%  (Register[], HashSet, List per Transduce)
  VarBindings         1%  (negligible on English; will be larger on Sena)
  Registers           0.1%
  TraversalMethod     0.7%
  Other (cascade)    55%  (MarkMorph/Annotation/stratum-rule overhead, NOT FST)

Flat-buffer addresses ~22% on the toy grammar; Sena breakdown needed to decide
whether to pursue the full int-offset Shape rewrite. See RUSTIFY.md § Phase 3c.

RustifyBenchmark now falls back to en-hc + WEB-PT when HC_GRAMMAR/HC_WORDS are
not set, so the breakdown harness is immediately runnable without a FLEx grammar.
63 HC tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HC: expand cascade breakdown harness (Segment, Word.ctor, MarkMorph, analysis window)

Adds four new allocation probes to fully decompose the 55% 'Other' bucket:
- MorpherStatistics.SegmentBytes: wraps Segment() (initial Shape/ShapeNode creation)
- MorpherStatistics.WordCtorBytes: wraps new Word(stratum, shape) construction
- MorpherStatistics.MarkMorphBytes: wraps Word.MarkMorph() annotation allocation
- MorpherStatistics.AnalysisCascadeBytes: wraps _analysisRule.Apply().ToList() (superset)

English toy grammar result (439 words, 35 MB total):
  Segment (initial Shape)    7.2%   Scaffold (pure FST) ≈ 0%
  Word.ctor(new)             9.6%   Rule-chain machinery ~40.7%
  Word.Clone                21.3%   Synthesis + other  ~18.8%
  Scaffold (incl. clones)   21.9%
→ analysis window superset  64.4%

Key finding: MarkMorph ≈ 0%; pure FST scaffold ≈ 0%; dominant costs are
Word.Clone (21%), Word.ctor+Segment (17%), and rule-chain LINQ/FstResult (~41%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

RUSTIFY: move Word.ctor allocation probe into the Word constructor

The Word.ctor probe lived in Morpher.AnalyzeWord, so it only measured the
single initial construction per word, not the cascade-created Words. Move it
into Word(Stratum, Shape) itself (gated on MorpherStatistics.Enabled, off in
production) and add WordCtorCount so the breakdown reports calls as well as
bytes. Harness-only; no production-path behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4a: hot-loop allocation eliminations (safe, no retention)

Four pure-elimination changes in the FST traversal + analysis cascade. Each
removes an allocation outright without extending any object lifetime, so none
can trigger the Phase-1b parallel regression (pooling promotes to Gen2 ->
serializes threads). All validated: 803 SIL.Machine + 63 HermitCrab tests
green; SenaParallel scaling unchanged.

1. ITraversalMethod.Traverse returns List<FstResult> (was IEnumerable) -> drop
   the redundant .ToList() in Fst.Transduce. All four concrete Traverse impls
   already return the curResults List; the interface was needlessly widened.
2. Remove redundant .Distinct(FreezableEqualityComparer<Word>.Default) x2 in
   AnalysisStratumRule.ApplyMorphologicalRules/ApplyTemplates. Both _mrulesRule
   and _templatesRule are built with that same comparer and return a HashSet
   already deduped by it, so the Distinct pass is a no-op DistinctIterator.
3. Skip the DistinctIterator for trivial result sets in Fst.Transduce:
   (allMatches && resultList.Count > 1) ? resultList.Distinct() : resultList.
   resultList is non-null with Count >= 1 there, and Distinct over a single
   element is the identity.
4. TraversalMethodBase.Reset: replace the per-Transduce GetNodesDepthFirst
   yield iterator (heap state machine per top annotation, thousands/word) with
   the allocation-free PreorderTraverse(action) form; delegate cached as a
   field (allocated once in ctor, not per call).

Measured (en-hc, SenaQuick, 439 words): Other 38.5% -> 36.2%
(14,145KB -> 12,884KB), KB/word 83.6 -> 81.1. Toy-grammar deltas are small;
real magnitude needs the Sena grammar. See RUSTIFY.md Phase 4a/4b (4b documents
the rejected scaffold-buffer ThreadStatic pooling: re-entrant via
acceptInfo.Acceptable, lifetime extension against the thesis, unmeasurable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Phase 4c: single-hash traversed.Add in nondet FST traversal

In NondeterministicFstTraversalMethod.Traverse (both epsilon and input-match
branches), the dedup check hashed the expensive structural key (state +
annotation index + register array + outputs array) twice: once for
Contains(key), then again for Add(key). HashSet.Add already returns false when
the element is present, so collapse to `if (traversed.Add(key)) Push(newInst);`
— a single hash/lookup in the innermost traversal loop. Byte-identical.

CPU-only cleanup (no allocation change), so KB/word is flat on the toy grammar;
the structural-hash cost it removes is not resolvable there. Gated:
803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.
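
The before/after shape, as a generic sketch (Dedup and its method names are
hypothetical; HashSet<T>.Add returning false for an existing element is
standard behavior):

```csharp
using System.Collections.Generic;

public static class Dedup
{
    // Before: a new key is looked up twice (Contains, then Add).
    public static bool PushIfNewTwoStep<T>(HashSet<T> traversed, Stack<T> stack, T key)
    {
        if (traversed.Contains(key))
            return false;
        traversed.Add(key);
        stack.Push(key);
        return true;
    }

    // After: Add reports whether the key was new, so one lookup is enough.
    public static bool PushIfNew<T>(HashSet<T> traversed, Stack<T> stack, T key)
    {
        if (!traversed.Add(key))
            return false;
        stack.Push(key);
        return true;
    }
}
```

Both forms leave the set and the stack in the same state; only the number of
hash/lookup passes per key differs.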

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: drop per-call List<int> in TraversalMethodBase.Advance

Advance collected the same-offset annotation window into `var anns = new
List<int>()` and then iterated it. That window is a contiguous index range
[nextIndex, annsEnd) (the build loop adds every consecutive i whose start
offset matches), so track the end bound and iterate the range directly,
eliminating one List allocation per arc match (one of the hottest paths in
the traversal). cloneOutputs/first flow is unchanged.

Measured (en-hc toy, SenaQuick, 439 words): KB/word 80.3 -> 79.0, totalMB
34 -> 33. Gated: 803 SIL.Machine + 63 HermitCrab tests green; toy under-
measures so treat the delta as directional, no-regression is the bar.
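
A sketch of the range form, assuming the annotations are sorted by start offset
so equal starts are adjacent. AnnotationWindow and the int[] of start offsets
are simplifications, not the Advance code itself:

```csharp
public static class AnnotationWindow
{
    // starts[i] is the start offset of annotation i, in ascending order.
    // Returns the exclusive end of the run that begins at nextIndex and
    // shares its start offset, so callers can iterate [nextIndex, end)
    // directly instead of collecting the indices into a List<int>.
    public static int FindWindowEnd(int[] starts, int nextIndex)
    {
        int end = nextIndex;
        while (end < starts.Length && starts[end] == starts[nextIndex])
            end++;
        return end;
    }
}
```

The caller then loops `for (int i = nextIndex; i < end; i++)` over the window
with no intermediate collection.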

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: collapse identity-map LINQ in TraversalInstance.CopyTo

Both Deterministic and Nondeterministic CopyTo built an `outputMappings`
dictionary by zipping this.Output's node sequence with ITSELF — a
deterministic Queue-based BFS enumeration paired element-for-element, i.e. the
identity map — then projected _mappings through it. Since outputMappings[v]==v,
the entire block reduces to copying _mappings unchanged. Replace with
`other.Mappings.AddRange(_mappings)`, removing a Dictionary + two
SelectMany(GetNodesBreadthFirst, each allocating a Queue + yield iterator) +
Zip + Select per instance copy. CopyTo runs on every branch of nondeterministic
traversal, so this is allocation-heavy at scale (Sena ~276 clones/word) though
the toy grammar (2 clones/word, few branches) can't resolve it.

Byte-identical (provable identity-map reduction); other.Mappings is empty pre-
AddRange (GetCachedInstance -> Clear). Removed now-unused usings (System.Linq,
SIL.Machine.DataStructures) to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: paired-walk clone mapping in InitializeStack (Det + Nondet)

Both InitializeStack methods built inst.Mappings (source annotation -> clone)
by zipping two BFS node sequences:
  Data.Annotations.SelectMany(GetNodesBreadthFirst)
    .Zip(inst.Output.Annotations.SelectMany(GetNodesBreadthFirst), KVP)
allocating, per Transduce, a Queue per top annotation (BFS), two SelectMany
state machines, a Zip state machine, and one KeyValuePair per node.

Data and inst.Output are isomorphic (inst.Output = Data.Clone()), and the
resulting dictionary is independent of traversal order, so replace with a new
allocation-light helper DataStructuresExtensions.PairedPreorderTraverse that
walks the two forests in lockstep (preorder) and writes pairs straight into the
dict via a static (closure-free) callback. Debug.Asserts guard the isomorphism
invariant (root/leaf/child-count) so any future violation fails loudly instead
of silently truncating like Zip.

Runs once per Transduce (thousands/word). Toy grammar has tiny annotation trees
so the delta is below its resolution; the win compounds on Sena's long words.
Removed now-unused usings (System.Linq, SIL.Extensions) in the deterministic
method to satisfy IDE0005-as-error.

Gated: 803 SIL.Machine + 63 HermitCrab tests green (incl. the concurrent-
determinism test); SenaQuick no regression.
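
A simplified sketch of the lockstep walk, assuming two isomorphic trees. TreeNode
and PairedWalk are illustrative types; the real helper works over annotation
forests and uses Debug.Assert rather than throwing:

```csharp
using System;
using System.Collections.Generic;

public sealed class TreeNode
{
    public readonly List<TreeNode> Children = new List<TreeNode>();
}

public static class PairedWalk
{
    // Visits corresponding nodes of two isomorphic trees in preorder.
    // State is passed explicitly so the callback needs no closure.
    public static void PairedPreorder<TState>(
        TreeNode a, TreeNode b, TState state, Action<TState, TreeNode, TreeNode> visit)
    {
        visit(state, a, b);
        if (a.Children.Count != b.Children.Count)
            throw new InvalidOperationException("Trees are not isomorphic.");
        for (int i = 0; i < a.Children.Count; i++)
            PairedPreorder(a.Children[i], b.Children[i], state, visit);
    }
}
```

Passing the target dictionary as state, e.g.
`PairedWalk.PairedPreorder(src, clone, map, (m, x, y) => m[x] = y);`,
fills the source-to-clone map without Queue, SelectMany or Zip allocations.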

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Phase 4c: precompute initializer partition at Fst.Freeze

Fst.Transduce rebuilt a List<TagMapCommand> on every call (and every outer
annIndex iteration), filtering _initializers into Dest!=0 (the per-call cmds
list) vs Dest==0 (which drive a per-annotation SetOffset). The partition is
identical every call for a frozen FST, and cmds is read-only downstream
(Initialize -> ExecuteCommands only iterate it).

Partition _initializers once in Freeze() into _zeroDestInitializers /
_nonZeroDestInitializers (built into locals, gating field published last so a
reader never sees a half-filled list). Transduce reuses the shared read-only
_nonZeroDestInitializers as cmds and walks _zeroDestInitializers for the
SetOffsets, eliminating the per-call list allocation + filter loop. When the FST
isn't frozen the fields are null and Transduce falls back to the exact inline
build, so unfrozen callers are unaffected. The frozen FST is shared read-only
across parsing threads, so concurrent reads of the shared cmds list are safe.

Measured (en-hc toy, SenaQuick, 439 words): Scaffold 22.4% -> 21.0%
(7897KB -> 7261KB), KB/word 79.0 -> 78.8. SenaParallel: scaling and allocation
unchanged (no parallel regression from sharing the list). Gated: 803
SIL.Machine + 63 HermitCrab tests green, incl. the concurrent-determinism test.
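
The publish-last pattern, sketched with plain ints standing in for
TagMapCommand.Dest. InitializerTable and its members are hypothetical; the
volatile gating field is one way to get the "never sees a half-filled list"
guarantee and is an assumption about how that could be done:

```csharp
using System.Collections.Generic;

public sealed class InitializerTable
{
    private readonly List<int> _initializers = new List<int>();
    private List<int> _zeroDest;
    private volatile List<int> _nonZeroDest; // gating field, written last

    public void Add(int dest)
    {
        _initializers.Add(dest);
    }

    public void Freeze()
    {
        // Build into locals so no reader can observe a partial list.
        var zero = new List<int>();
        var nonZero = new List<int>();
        foreach (int dest in _initializers)
            (dest == 0 ? zero : nonZero).Add(dest);
        _zeroDest = zero;
        _nonZeroDest = nonZero; // volatile write publishes both lists
    }

    public int ZeroDestCount
    {
        get { return _nonZeroDest != null ? _zeroDest.Count : 0; }
    }

    // Frozen: returns the shared precomputed list. Unfrozen: builds inline.
    public IReadOnlyList<int> GetCommands()
    {
        List<int> cached = _nonZeroDest;
        if (cached != null)
            return cached;
        var cmds = new List<int>();
        foreach (int dest in _initializers)
        {
            if (dest != 0)
                cmds.Add(dest);
        }
        return cmds;
    }
}
```

After Freeze every call gets the same list instance, which is safe to share
only because downstream code iterates it and never mutates it.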

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4c (five safe no-retention FST eliminations)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: strip unrelated USFM work + remove dead GC/pooling machinery

Audit of master..hc-rustify identified work to drop now that allocation is
driven down and the branch is being squashed:

1. Strip the unrelated USFM/versification change set (3 commits: #430, #432,
   normalization port) — restored src/SIL.Machine/Corpora,
   src/SIL.Machine/PunctuationAnalysis and their tests to master, removed the
   USFM-added files. None of it is perf work; it belongs in its own PR.

2. Remove the dead FST traversal pooling (measured to REGRESS parallel parsing,
   Phase 1b): the FstThreadPool class, Fst.TraversalPoolEnabled, the pooling
   branch in Fst.Transduce, FstThreadPool.Reset() in Morpher.ParseWord, and the
   HC_ARENA toggle in RustifyBenchmark. Transduce now always uses a fresh
   (die-in-Gen0) traversal method, which is the right tradeoff once allocation
   is low.

3. Remove the Shape.CopyTo [ThreadStatic] clone-map pool, keeping the
   value-added inline mapping build (no second GetNodes().Zip().ToDictionary()
   pass) — just a plain per-call Dictionary.

4. Revert Machine.sln to master (leftover x64/x86 platform configs) and remove
   three superseded planning docs (HERMITCRAB_ALLOCATION_STRATEGIES /
   COW_PLANS / PERF_PLAN), now consolidated in RUSTIFY.md.

Kept (per request): the allocation instrumentation (MorpherStatistics,
FstStatistics, probes) + both benchmarks for before/after measurement; the
MaxDegreeOfParallelism API + Synthesize refactor; COW FeatureStruct; bit-packed
unify; Phase 4a/4c eliminations.

Gated: 801 SIL.Machine + 63 HermitCrab tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: document Phase 4d cleanup (pooling/arena removed)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY cleanup: restore the safe Shape.CopyTo [ThreadStatic] clone-map pool

Re-examination showed this pool is NOT the regressive kind removed elsewhere.
The Phase-1b parallel regression came from objects retained ACROSS a word
(promoted to Gen2). Shape.CopyTo's CloneMapping is cleared and fully consumed
WITHIN each call (contents die immediately; only a small empty buffer persists),
so it cannot promote parse data to Gen2 — and it still buys a small allocation
win (~0.45% on Sena; on the toy grammar removing it had pushed Word.Clone
22.4% -> 23.6% and KB/word 78.8 -> 80.2). Restored, with RUSTIFY.md Phase 4d
note corrected to record it as KEPT (safe pool) rather than removed.

Gated: 801 SIL.Machine + 63 HermitCrab tests green; SenaQuick KB/word back to
78.8, Word.Clone 22.4%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: staged implementation plan for the flat int-index shape

Records the chosen direction (go flat) + corrected feasibility findings (no
TOffset constraints; int-offset engine already tested; ShapeNode contained to
~95 in-repo refs), the accepted cost (ShapeNode -> handle, value identity), and
the 3-stage plan: (1) array-backed Shape + ShapeNode handle + array-copy Clone,
(2) int FST offset + unmanaged Span/stackalloc traversal (the parallel unlock),
(3) migrate rule sites to indices.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: real Sena measurement obtained — reshapes flat-shape priorities

Generated sena-hc.xml from the Sena 3 FieldWorks backup (GenerateHCConfig.exe)
and extracted sena-words.txt (7,121 words) from the project's seh running text.
SenaQuick (400 words, MaxUnapp=5) now gives the clone-heavy numbers the spike
needs:

  clones/word=345 (estimate ~276 confirmed), KB/word=14,116
  Scaffold 42.2% (per-Transduce Register[,] arrays)  <- biggest bucket
  Word.Clone 21.9%, Other 24.8%

Key finding: Scaffold (managed Register<ShapeNode>[,] per Transduce) is ~2x
Word.Clone, so the flat int-index foundation's biggest payoff is Stage 2
(Register<int> -> stackalloc/Span, zero heap), bigger than the Word.Clone bucket
the goal named, and unlocked by the same change. Confirms flat over COW (COW
cannot touch the traversal scaffold). Benchmark assets untracked in samples/data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 1: array-backed Shape + ShapeNode handle (flat-shape foundation)

Re-represents the shape as a flat, int-indexed backing — the data-model
foundation the whole flat-shape plan (Phase 3b-impl) stands on, and the
prerequisite for Stage 2's Fst<Word,int> register-scaffold win (the measured
42% bucket) and Stage 3's Word.Clone cut (22%).

- Shape no longer inherits OrderedBidirList<ShapeNode>; it owns its nodes in
  flat arrays (_next/_prev int links = in-array doubly-linked list, per-node
  frozen flag, canonical handle) addressed by a stable ShapeNode.Index, and
  reimplements IOrderedBidirList/IOrderedBidirListNode over them.
- ShapeNode becomes a handle (Owner + Index); links/frozen delegate to the
  owner arrays. The added node IS retained as the canonical one-per-slot handle,
  so reference identity (==, dict keys, Range<ShapeNode> endpoints) is unchanged.
- Tag deliberately stays on the node so it survives a node moving between shapes
  (AddAfter sets the new tag before detaching from the old owner). The tag-relabel
  order maintenance, Freeze/Clone/CopyTo and annotation interactions are preserved.

Gate: 803 SIL.Machine + 63 HermitCrab tests green (incl. concurrent-determinism).
Measured neutral on en-hc toy (SenaQuick): KB/word 80.5 -> 80.5, clones/word 2,
gen0 2 — exactly the plan's "Stage 1 ~= 0" prediction; payoff is unlocked by, not
realized in, this increment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2 substrate: freeze-time NodeAt int<->node bridge + design blueprint

Adds Shape.NodeAt(int) backed by a dense _byPos[] table built in Freeze (content
nodes already get dense Tag 0..N-1 there), the int-offset -> ShapeNode bridge the
Fst<Word,int> binding will resolve against. Additive and behavior-preserving;
803 SIL.Machine tests green.

Records the resolved Stage 2 blueprint in RUSTIFY.md: offset = dense frozen tag
(HC always freezes before traversal), half-open [t,t+1) ranges reusing
IntegerRangeFactory (provably identical ordering/Overlaps/Contains to the
inclusive ShapeNode form for a one-unit-per-node model), Word/Shape become
IAnnotatedData<int> via a freeze-time AnnotationList<int> projection, rules resolve
int->node via NodeAt, and Register<int> goes unmanaged (the 42% Scaffold payoff).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2 blueprint: correct it after reading the rule-application flow

Reading IterativePhonologicalPatternRule.Apply + the rewrite SubruleSpecs +
the semantic-site catalog overturned two blueprint assumptions:

- Rewrite rules MUTATE match.Input.Shape in place while UNFROZEN and re-match
  repeatedly, so the traversed shape's tags are SPARSE, not dense 0..N-1.
  => offset must be the raw ordered Tag (the [Tag,Tag+1) half-open mapping is
  still provably correct for sparse tags: Tag+1 always lands at/<= the next tag).
- NodeAt must therefore work on unfrozen shapes => a Tag->node map maintained
  incrementally (AddAfter/Remove/Relabel), not a freeze-only dense array.

Records the real hazards to design for: End.Tag==int.MaxValue overflows [t,t+1)
(anchors are in the annotation list and ordered on add); the int annotation
projection must stay in sync with the live mutated ShapeNode list (not build-once
at freeze); and ~30 offset-navigation sites (match.Range.End.Next etc.) must route
through shape.NodeAt(tag).Next?.Tag preserving null-at-boundary. Net: the flip is
larger/subtler than a mechanical generic swap — a multi-session spike, each
sub-piece behind the byte-identical gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: empirically validate the int-offset range-mapping thesis

Adds a parity test proving the assumption the whole TOffset=ShapeNode -> int flip
rests on: mapping each annotation [startNode,endNode] to the half-open int range
[startNode.Tag, endNode.Tag+1) preserves the range relationships the FST traversal
depends on — CompareTo ordering, Overlaps, Contains — for SPARSE tags (appended
unfrozen shape, as rewrite rules see it) and dense tags (frozen). Pairwise over
all annotations of a shape with a spanning (start!=end) annotation. Both cases green.

This de-risks the riskiest design point before any code is built on it: now 805
SIL.Machine (+2) + 63 HC tests green.
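
The core of the argument in miniature, assuming the usual definitions of
Overlaps and Contains for inclusive and half-open integer ranges (RangeMapping
is an illustrative helper, not the parity test itself). For integers,
s1 <= e2 is the same condition as s1 < e2 + 1, which is why the mapping is
insensitive to gaps between tags:

```csharp
public static class RangeMapping
{
    // Inclusive ranges [s, e].
    public static bool OverlapsInclusive(int s1, int e1, int s2, int e2)
    {
        return s1 <= e2 && s2 <= e1;
    }

    public static bool ContainsInclusive(int s1, int e1, int s2, int e2)
    {
        return s1 <= s2 && e2 <= e1;
    }

    // Half-open ranges [s, e).
    public static bool OverlapsHalfOpen(int s1, int e1, int s2, int e2)
    {
        return s1 < e2 && s2 < e1;
    }

    public static bool ContainsHalfOpen(int s1, int e1, int s2, int e2)
    {
        return s1 <= s2 && e2 <= e1;
    }

    // Checks every pair of ranges drawn from the given ascending tags:
    // inclusive [s, e] must agree with its half-open image [s, e + 1).
    public static bool ParityHolds(int[] tags)
    {
        for (int i = 0; i < tags.Length; i++)
        for (int j = i; j < tags.Length; j++)
        for (int k = 0; k < tags.Length; k++)
        for (int l = k; l < tags.Length; l++)
        {
            int s1 = tags[i], e1 = tags[j], s2 = tags[k], e2 = tags[l];
            if (OverlapsInclusive(s1, e1, s2, e2) != OverlapsHalfOpen(s1, e1 + 1, s2, e2 + 1))
                return false;
            if (ContainsInclusive(s1, e1, s2, e2) != ContainsHalfOpen(s1, e1 + 1, s2, e2 + 1))
                return false;
        }
        return true;
    }
}
```

The sketch ignores the int.MaxValue end anchor, where e + 1 would overflow.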

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 1: measure the flat Word on the REAL Sena grammar (the clone-heavy case)

The toy-grammar SenaQuick (2 clones/word) was too clone-light to judge an
API-breaking rewrite. Ran SenaQuick against the real Sena grammar (sena-hc.xml,
400 words, HC_MAX_UNAPP=5) where the ~345 clones/word payoff lives, vs the
pre-Stage-1 baseline recorded in 2fd1a2d3:

  clones/word 345 -> 345  (byte-identical AND behavior-identical at scale)
  KB/word     14116 -> 14583  (+3.3%)
  gen0        442 -> 457
  Scaffold    42.2% -> 42.7%   Word.Clone 21.9% -> 22.3%  (split reproduced)

Findings: (1) the flat data model produces an identical clone count on the
pathological grammar, confirming correctness beyond the toy tests; (2) Stage 1
in isolation costs +3.3% allocation -- the four per-Shape backing arrays
(_nodes/_next/_prev/_frozen) x 138k clones -- exactly the plan's "Stage 1 ~= 0 or
slightly negative" prediction, hidden by the toy grammar's 2 clones/word. The
cost is the investment: the 42.7% Scaffold (Stage 2 Register<int>) and 22.3%
Word.Clone (Stage 3) are what the same flat foundation unlocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: int-offset annotation projection on Shape (the Fst<Word,int> bridge)

Adds the linchpin infrastructure for the FST flip, additively (Shape still
IAnnotatedData<ShapeNode>; nothing flipped yet, full suite green):

- AnnotationList<T> gains an internal Version counter (bumped on Add/Remove/Clear)
  so derived views can detect staleness cheaply.
- Shape builds a lazy, version-gated int-offset projection: AnnotationList<int>
  IntAnnotations (each [s,e] -> half-open [s.Tag, e.Tag+1), End margin clamped to
  avoid +1 overflow), Range<int> IntRange, and Dictionary<int,ShapeNode> for
  NodeAt(offset) (now works frozen AND unfrozen, via Tag). FeatureStruct is shared
  by reference so in-place rule edits stay visible. The projection is rebuilt only
  when the annotation Version or frozen-state changes, so a stable/frozen shape
  builds it once and reuses it across thousands of Transduce calls per word.

Tests: IntAnnotationProjection_MirrorsShapeNodeAnnotations verifies the projection
mirrors the ShapeNode tree (ranges, FeatureStruct identity, optional, children),
NodeAt round-trips every node by Tag, and the cache invalidates on mutation.
805 (+2) SIL.Machine green.
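
A stripped-down sketch of version gating, with hypothetical VersionedList and
CachedProjection types standing in for AnnotationList<T> and the Shape
projection (the frozen-state check is omitted here):

```csharp
using System;
using System.Collections.Generic;

public sealed class VersionedList<T>
{
    private readonly List<T> _items = new List<T>();

    // Bumped on every structural change so derived views can detect staleness.
    public int Version { get; private set; }

    public IReadOnlyList<T> Items
    {
        get { return _items; }
    }

    public void Add(T item)
    {
        _items.Add(item);
        Version++;
    }

    public void Clear()
    {
        _items.Clear();
        Version++;
    }
}

public sealed class CachedProjection<TSource, TResult>
{
    private readonly VersionedList<TSource> _source;
    private readonly Func<TSource, TResult> _project;
    private List<TResult> _cache;
    private int _cachedVersion;

    public CachedProjection(VersionedList<TSource> source, Func<TSource, TResult> project)
    {
        _source = source;
        _project = project;
    }

    // Counts rebuilds so the sketch can show when the cache is reused.
    public int BuildCount { get; private set; }

    public IReadOnlyList<TResult> Get()
    {
        if (_cache == null || _cachedVersion != _source.Version)
        {
            var built = new List<TResult>(_source.Items.Count);
            foreach (TSource item in _source.Items)
                built.Add(_project(item));
            _cache = built;
            _cachedVersion = _source.Version;
            BuildCount++;
        }
        return _cache;
    }
}
```

Any mutation that does not bump Version is invisible to the cache, so every
property the projection copies by value needs to participate in the counter.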

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WIP RUSTIFY Stage 2: flip HC FST to Fst<Word,int> (57/63 HC green)

The full TOffset ShapeNode->int flip across HermitCrab (~71 files): Word is now
IAnnotatedData<int>; Shape exposes a lazy, version-gated int-offset annotation
projection with DENSE node positions (0..N+1) as offsets — dense (not sparse Tag)
to avoid the Range<int>.Null=-1 collision, +1 overflow at the End margin, and
empty anchors. NodeAt/OffsetOf/MatchStartOffset bridge int<->node; rule RHS code
resolves match/group int ranges back to nodes (half-open [off, off+1), so leftmost
= NodeAt(Start), rightmost = NodeAt(End-1)). MatchStartOffset(node,dir) handles the
inclusive->half-open asymmetry for right-to-left match-start offsets.

Down from 23 failures to 6 (all now logic, not crashes): metathesis SimpleRule/
ComplexRule, DeletionRules/MultipleDeletionRules, EpenthesisRules, ReduplicationRules
— node insertion/deletion/movement + group-capture rules still need correctness work.
Register<int> stackalloc (the payoff) comes after these are byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: fix analysis under-generation (63/63 HC green)

Two byte-identical fixes to the int-offset projection, restoring the 4
analysis-direction rewrite tests (Epenthesis/Deletion/MultipleDeletion/
Reduplication) that the Fst<Word,int> flip broke:

1. Annotation.Optional must invalidate the projection. The Shape int
   projection copies Optional by value and caches against the annotation
   list Version, but the Optional setter is a non-structural change that
   never bumped Version. So once analysis flipped Optional=true on existing
   nodes, the matcher kept reading the stale Optional=false projection and
   never forked the optional-skip instances. The setter now bumps the root
   list's version (new AnnotationList.IncrementVersion). Fixes Epenthesis.

2. IntRange must be the half-open image of the inclusive [Begin, End], i.e.
   [off(Begin), off(End)+1) — not [off(Begin), off(End)]. The only consumer
   is Matcher.GetStartAnnotation via Range.GetStart(dir); a RtL match starts
   at GetStart(RtL)==End. The End anchor's dense range is [off(End),
   off(End)+1), whose RtL start coordinate is off(End)+1, so without the +1
   a RtL match began at the last content node and skipped any edit adjacent
   to End (e.g. inserting a deleted segment after the final vowel). Fixes
   the deletion/reduplication cases.

Adds two regression tests guarding both invariants. Also keeps the prior
working-tree Stage-2 fixes in IterativePhonologicalPatternRule and
SynthesisMetathesisRuleSpec (resolve int offsets to ShapeNode refs before
mutating the shape).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: record generic-flip-green milestone + post-flip measurement

The <Word,ShapeNode> -> <Word,int> generic flip is byte-identical green (63/63
HC + 808 SIL.Machine, full Release solution builds clean). Document the two
int-model correctness bugs found bringing it from 57/63 to green (Optional
cache invalidation + IntRange half-open End-anchor mapping), why 59 tests
masked them, and the post-flip en-hc baseline (KB/word 78.8 -> 86.1, the
projection "investment" before the Register<int> payoff). Records the
remaining Stage 2 payoff target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: refute the register-payoff hypothesis with real-grammar data

Wire Indonesian (the classic HC nasalization demo, variable-light, ~150
clones/word) alongside Sena (~345 clones/word) as the two measurement
grammars, and use them to investigate the Stage-2 thesis that Register<int>
being unmanaged unlocks a stackalloc cut of the 42% Scaffold bucket.

Measurement refutes it:
- Registers.Clone (the escaping accept snapshots the redesign targets) = 0.2%
  on Sena. Not where the bytes are.
- Converting the per-push dedup-key Tuple<State,int,Register[,][,Output[]]> to
  an inline `readonly struct TraversalKey` in both nondeterministic traversal
  methods moved allocation ~0% (Sena KB/word 14588->14579; Indonesian flat).
  Kept anyway: zero-risk, byte-identical, removes a real per-push heap object,
  CPU-positive (single Add vs Contains+Add), consistent with Phase-4c micro-
  eliminations.
- The Scaffold 38.5% IS the clone explosion: it contains Word.Clone (22.4%, via
  the per-instance Output=Data.Clone() in InitializeStack) + the per-instance
  Mappings dictionary + Output graph. The int flip's allocation payoff is
  therefore Stage 3 (flat-shape clone), not a register trick.
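
The Tuple-to-struct dedup key in the second bullet might look roughly like this. It is a simplified sketch: the field set is an assumption (the real key also carries the State and optional outputs), `Dedup.TryVisit` is a hypothetical helper, and the register array is compared by reference, as an array inside a Tuple would be.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical simplified key; field names are illustrative only.
public readonly struct TraversalKey : IEquatable<TraversalKey>
{
    private readonly int _stateIndex;
    private readonly int _offset;
    private readonly object _registers; // compared by reference

    public TraversalKey(int stateIndex, int offset, object registers)
    {
        _stateIndex = stateIndex;
        _offset = offset;
        _registers = registers;
    }

    public bool Equals(TraversalKey other)
        => _stateIndex == other._stateIndex
           && _offset == other._offset
           && ReferenceEquals(_registers, other._registers);

    public override bool Equals(object obj) => obj is TraversalKey k && Equals(k);

    public override int GetHashCode()
    {
        unchecked
        {
            int h = _stateIndex;
            h = h * 31 + _offset;
            // Identity-based hash, consistent with the reference comparison above.
            h = h * 31 + (_registers == null ? 0 : RuntimeHelpers.GetHashCode(_registers));
            return h;
        }
    }
}

public static class Dedup
{
    // HashSet<T>.Add returns false when the item is already present,
    // so one call replaces a Contains check followed by Add.
    public static bool TryVisit(HashSet<TraversalKey> traversed, TraversalKey key)
        => traversed.Add(key);
}
```

Because the key is a struct implementing `IEquatable<T>`, the set stores it inline without boxing, which is the per-push heap object the bullet says was removed.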

Full suite green (808 SIL.Machine + 63 HC, incl. concurrent-determinism).
RUSTIFY.md records the finding + how to regenerate both grammars.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: localize the clone cost — inherent per-node materialization

Two-phase allocation probe on Shape.CopyTo (Sena): Word.Clone's 22% splits
into CopyTo node-phase 11.4% (node.Clone + per-node dest.Add) + annotation-
phase 4.1% + ~6.9% Word/Shape ctor. The node-phase prize is inherent per-node
object materialization (ShapeNode + Annotation + COW FS + AnnotationList skip-
list entry per node), not intermediate churn.

Two incremental attacks measured ~0/negative and were reverted:
- pre-size the backing arrays vs AddAfter doubling: 666->688 MB (worse;
  source Count over-sizes partial-range CopyTo, doubling was never the cost).
- the per-push dedup Tuple->struct (prior commit): ~0.

Conclusion: the flat-clone payoff requires the deep redesign (lazy ShapeNode
handles + bulk AnnotationList clone + index-addressed annotations), the Stage-1-
deferred "Clone = Array.Copy" end-state — a multi-session foundational rewrite
needing a go/no-go. No incremental win exists short of it. Recorded in RUSTIFY.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: design + sequencing doc for the flat-shape clone spike

Per the plan-then-proceed go/no-go: RUSTIFY-stage3-design.md lays out the
foundational Word.Clone rewrite before any code goes red.

- Goal: kill the inherent per-node materialization (node-phase 11.4% +
  anns-phase 4.1% of Word.Clone) by making Shape.Clone an Array.Copy.
- Entanglement: ShapeNode reference-identity + annotations-hold-handles +
  skip-list-tower-per-annotation must be undone together.
- Key resolution: materialize-on-touch two-state shape. A clone is a flat
  snapshot (no handles/Annotation objects); the int projection (Stage 2)
  reads it for the hot frozen-traverse path so nothing materializes; any
  ShapeNode/Annotation request or in-place mutation materializes lazily,
  one-per-slot, restoring exact reference identity. This resolves the
  dense-index-vs-mutation tension: frozen-read pays nothing, unfrozen-mutate
  pays the old price (far colder).
- Byte-identical risk register, I->V sub-increment order (I FeatureStruct
  flat array + II flat AnnotationList = gateable green and keepable alone;
  III lazy materialization + Array.Copy clone = the red phase; IV triage the
  189 HC ShapeNode refs frozen-read vs mutate; V re-validate + measure),
  and rollback to dbef327a.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: advisor review + measure II first — towers are 7.4% (resequence)

Advisor review of the design front-loaded a read-only verification gate before
the risky III: the linchpin (int projection must rebuild from flat records,
handle-free), the make-or-break premise (UpdateOutput touches O(few) per FST-
transduction clone), result-consumer audit, and the I detached-FS caveat. Folded
into the design doc.

Then executed the advisor's "measure II before III" with a temporary tower-
allocation probe on Sena:

  annotation skip-list towers = 7.4% of total alloc (~432 MB, 6.31M arrays) —
  a THIRD of Word.Clone (22.4%), two-thirds of node-phase(11.4%)+anns-phase(4.1%).

Resequences the spike: increment II (flatten the BidirList tower arrays into
list-owned flat backing) is now the headline — ~7.4%, byte-identical, gateable
GREEN, zero laziness risk, independently keepable. Increment III's lazy-handle
materialization is downgraded to optional/gated: it buys only the residual ~8%
(the ShapeNode/Annotation objects) and carries the reference-identity risk. The
towers were the cheap two-thirds hiding behind the "inherent objects" framing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-a: grow skip-list margins on demand (Word.Clone -123MB Sena)

First positive allocation increment of the flat-clone spike, byte-identical green.

Every BidirList ctor Init'd both Begin/End margin nodes at the 33-level skip-list
maximum (new TNode[33] x2) regardless of actual list height. Since lists almost
always stay shallow, that eager margin tower was a large slice of the per-
AnnotationList tower allocation that dominates Word.Clone. Now:
- margins start at level 0 (Init(1) + link level 0);
- GrowMargins ensures capacity + links Begin<->End at a new level only when a
  node first reaches it (EnsureLevelCapacity right-sizes; geometric growth was
  measured slightly worse - it over-allocates the shallow majority);
- Clear resets to level 0, higher levels relink lazily on regrowth.

Measured (SenaQuick, Release): Sena Word.Clone 1,306,476 -> 1,182,940 KB
(-123 MB, -9.5% of Word.Clone, stable across runs; total KB/word -~2% under GC
noise); Indonesian Word.Clone -~0.5pt similarly. Full SIL.Machine (808) + HC (63,
incl. concurrent-determinism) green.

Contained to BidirList/BidirListNode (used by AnnotationList x2, SkipList,
TreeBidirList); does not touch ShapeNode/Annotation reference identity, so it is
independently keepable regardless of the later II-b / III increments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-b: inline skip-list level 0 (Word.Clone -54MB more, Sena)

Second positive flat-clone increment, byte-identical green. Level 0 (the only
level ~50% of skip-list nodes have) moves from the per-node _next[0]/_prev[0]
arrays into inline _next0/_prev0 fields, so level-0 nodes allocate NO tower array
at all and every taller node's array is one slot shorter (levels 1.. in
_nextHigh/_prevHigh, null when Levels<=1).
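
A minimal sketch of the level-0 inlining, showing only the forward links. `SkipNode` and `HasTowerArray` are hypothetical; the `_next0`/`_nextHigh` names follow the description above, but the real BidirListNode layout is not shown here and may differ.

```csharp
using System;

// Hypothetical skip-list node sketch (forward links only).
public sealed class SkipNode
{
    private SkipNode _next0;        // level 0 lives inline in the node
    private SkipNode[] _nextHigh;   // levels 1.., null when Levels <= 1

    public int Levels { get; }

    public SkipNode(int levels)
    {
        if (levels < 1) throw new ArgumentOutOfRangeException(nameof(levels));
        Levels = levels;
        // A level-0-only node allocates no array; taller nodes get one slot fewer.
        if (levels > 1) _nextHigh = new SkipNode[levels - 1];
    }

    public bool HasTowerArray => _nextHigh != null;

    public SkipNode GetNext(int level)
        => level == 0 ? _next0 : _nextHigh[level - 1];

    public void SetNext(int level, SkipNode node)
    {
        if (level == 0) _next0 = node;
        else _nextHigh[level - 1] = node;
    }
}
```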

Touches the hottest skip-list accessors (GetNext/SetNext/GetPrev/SetPrev/Next/
Prev/Init/Clear/EnsureLevelCapacity); gated on the full SIL.Machine (808) + HC
(63, incl. concurrent-determinism) suites - green, so the level<->field-or-array
dispatch is byte-identical.

Measured (SenaQuick, Release): Sena Word.Clone 1,182,940 -> 1,128,660 KB
(-54 MB on top of II-a); Indonesian 222,491 -> 212,911 KB (-9.6 MB).

Cumulative II-a + II-b vs pre-Stage-3: Sena Word.Clone -177 MB (-13.6%), total
allocation -4.2% (KB/word 14,556 -> 13,942); Indonesian total -4.1%. Pure
allocation reduction, no retention, independently keepable regardless of III.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: index the Stage 3 II-a/II-b green increments (-4.2% allocation)

Record in the main plan that two byte-identical flat-clone increments landed
(margin grow-on-demand + inline level 0), banking the cheap skip-list tower
wins: Sena Word.Clone -177 MB (-13.6%), total -4.2%; Indonesian -4.1%. Points
to RUSTIFY-stage3-design.md for the residual III go/no-go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: III feasibility measured (41% Sena clones never mutated) + choose copy-on-write Shape mechanism

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: copy-on-write Shape — Word.Clone -59.6% (Sena), byte-identical

The flat-clone payoff. A clone of a *frozen* shape now stores _cowSource and
copies nothing. The asymmetry that makes this cheap + safe: the FST matcher (the
hot read path) consumes a clone only through the int-offset projection
(IntAnnotations/IntRange), which is served from the frozen source; while every
path that could mutate first hands out a ShapeNode/Annotation handle. So:
- serve IntAnnotations/IntRange/Count/GetFrozenHashCode/Freeze from the source
  while copy-on-write;
- gate EnsureInflated() (= the real CopyTo, then re-freeze if frozen-by-sharing)
  on the flat-backing link accessors, First/Last/enumeration, NodeAt/OffsetOf/
  MatchStartOffset/Annotations/GetNodes/CopyTo/ValueEquals, and every mutator.
A clone that is only traversed (matcher carrier) never inflates -> costs a shell
instead of N nodes + N annotations + their skip-list towers.
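
The copy-on-write idea can be reduced to a toy example. `CowList` is a hypothetical miniature over a list of ints; it does not model ShapeNode handles, the int projection, or re-freezing, and only illustrates "read from the source, inflate before mutating".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical miniature of the idea; not the SIL.Machine Shape class.
public sealed class CowList
{
    private CowList _cowSource;   // non-null while this clone shares a frozen source
    private List<int> _items;     // null until inflated

    public bool IsFrozen { get; private set; }
    public bool IsInflated => _cowSource == null;

    public CowList(IEnumerable<int> items) { _items = new List<int>(items); }

    private CowList(CowList frozenSource) { _cowSource = frozenSource; }

    public void Freeze() { IsFrozen = true; }

    // Cloning a frozen (or still-sharing) instance copies nothing.
    public CowList Clone()
    {
        if (_cowSource != null) return new CowList(_cowSource);
        return IsFrozen ? new CowList(this) : new CowList(_items);
    }

    // Read-only queries are served from the source while copy-on-write.
    public int Count => _cowSource != null ? _cowSource.Count : _items.Count;

    // Every mutator inflates first, so the frozen source is never touched.
    public void Add(int value)
    {
        if (IsFrozen) throw new InvalidOperationException("instance is frozen");
        EnsureInflated();
        _items.Add(value);
    }

    private void EnsureInflated()
    {
        if (_cowSource == null) return;
        _items = new List<int>(_cowSource._items); // the real copy happens here
        _cowSource = null;
    }
}
```

A clone that is only read stays a shell holding one reference; the copy cost is paid only on the first mutation.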

Thread-safety (the doc's non-negotiable): a frozen shape's int projection is now
built eagerly at Freeze() (single-threaded), so the new pattern of several parse
threads' COW clones delegating to one shared frozen grammar shape always hits a
complete cache rather than racing a lazy first build.

Measured (SenaQuick, Release): Sena Word.Clone 1,128,660 -> 528,071 KB (-53% on
top of II; 20.2% -> 9.9% of total); Indonesian 212,911 -> 85,566 KB (-60%).
Cumulative Stage 3 (II-a+II-b+III) vs pre-Stage-3: Sena Word.Clone -778 MB
(-59.6%), share 22.4% -> 9.9%; Indonesian -62%. Word.Clone is no longer a top
bucket. Full SIL.Machine (808) + HC (63, incl. concurrent-determinism) green;
full Release solution builds clean; SenaParallel scaling unregressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: verify byte-identical on real grammars + COW invariant tests

Validation the toy suite can't give (en-hc is ~2 clones/word; COW's never-
inflated path runs ~170x hotter on Sena at 345 clones/word):

- Added RustifyBenchmark.Signature ([Explicit], not CI): emits a deterministic
  per-word analysis signature (sorted set of Category|root|glosses per
  WordAnalysis) to HC_SIG_OUT. Diffed HEAD vs the pre-Stage-3 baseline (dbef327a,
  isolating II+III) on BOTH grammars via a worktree: Sena (400 words) and
  Indonesian (121 words, 100 non-empty) signatures are IDENTICAL. The COW change
  is byte-identical where it actually runs hot, not just on the toy grammar.

- Added 3 CI-running COW-invariant regression tests (AnnotationTests):
  never-inflated clone serves the source's projection/range/count; mutating a
  clone inflates it and leaves the frozen source uncorrupted; frozen-by-sharing
  hash equals the source and stays stable across forced inflation.

Full SIL.Machine (811) + HC (63) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 2: lazily allocate Word's morphological-rule bookkeeping maps

_mrulesUnapplied / _mrulesApplied / _disjunctiveAllomorphIndices stay empty
through the phonological-analysis cascade (where ~345 clones/word happen) but
were cloned eagerly per candidate. Now null = empty, created on first write,
copied only when the source is non-empty. Byte-identical (63 HC green).
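
The "null = empty" convention can be sketched as follows. `RuleCounts` and its members are hypothetical; the commit applies the same pattern to the three bookkeeping maps on Word.

```csharp
using System.Collections.Generic;

// Hypothetical holder illustrating a lazily allocated map.
public sealed class RuleCounts
{
    private Dictionary<string, int> _applied; // null means empty

    public bool IsAllocated => _applied != null;

    public int GetCount(string rule)
    {
        if (_applied != null && _applied.TryGetValue(rule, out int n)) return n;
        return 0;
    }

    public void Increment(string rule)
    {
        // Allocate only on first write.
        if (_applied == null) _applied = new Dictionary<string, int>();
        _applied[rule] = GetCount(rule) + 1;
    }

    public RuleCounts Clone()
    {
        var clone = new RuleCounts();
        // Copy only when the source actually holds entries.
        if (_applied != null && _applied.Count > 0)
            clone._applied = new Dictionary<string, int>(_applied);
        return clone;
    }
}
```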

Measured (SenaQuick): Word.Clone 527,987 -> 499,121 KB (-29 MB), Word.ctor
184,858 -> 177,387 KB; total 5,267 -> 5,216 MB (~-1%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: hoist the initial-register scaffold out of the Transduce loop

Fst.Transduce allocated a fresh Register<TOffset>[regCount,2] per outer (start-
position) iteration. Traverse only Array.Copy's it into the initial instances
and never retains it, so it can be allocated once and Array.Clear'd per start
position - byte-identical, and AllMatches (analysis) runs one iteration per
start, so this removes (starts-1) register-array allocations per matcher call.

Measured (SenaQuick): Scaffold 2,264,486 -> 2,241,441 KB (-23 MB). Full suite
(811 SIL.Machine + 63 HC) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record levers 1+2 (lean Word + hoisted register scaffold, ~-1%, byte-identical) and why the 42% Scaffold prize stays blocked

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: replace per-instance Visited HashSet with an inline value bitset

Profiling showed the 42% Scaffold is instance churn: ~2,927 traversal instances
created per Sena word (only ~20% reused — the pool is per-Transduce, thrown away
each call, and pooling across calls re-triggers the Phase-1b Gen2 parallel
regression). So the fix is leaner instances, not pooling.

Each nondeterministic instance carried a HashSet<State> to avoid epsilon loops.
States have a dense Index, so this is now a value-type VisitedStates bitset:
states 0-63 in an inline ulong field (zero heap — HC rule FSTs are tiny), a lazy
ulong[] overflow only for 64+ state FSTs. The set is now part of the instance
object, not a separate ~1.17M/word heap allocation. Byte-identical (same dedup
semantics over state identity == Index).
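
A sketch of such a bitset, assuming states are identified by a dense non-negative index. The member names are hypothetical and the real VisitedStates may differ in layout and API.

```csharp
using System;

// Hypothetical value-type visited set keyed by a dense state index.
public struct VisitedStates
{
    private ulong _low;        // states 0-63, no heap allocation
    private ulong[] _overflow; // allocated lazily for indices >= 64

    // Returns true if the index was newly added, false if already present.
    public bool Add(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));
        if (index < 64)
        {
            ulong bit = 1UL << index;
            if ((_low & bit) != 0) return false;
            _low |= bit;
            return true;
        }
        int word = (index - 64) >> 6;
        ulong mask = 1UL << ((index - 64) & 63);
        if (_overflow == null) _overflow = new ulong[word + 1];
        else if (word >= _overflow.Length) Array.Resize(ref _overflow, word + 1);
        if ((_overflow[word] & mask) != 0) return false;
        _overflow[word] |= mask;
        return true;
    }

    public bool Contains(int index)
    {
        if (index < 0) return false;
        if (index < 64) return (_low & (1UL << index)) != 0;
        int word = (index - 64) >> 6;
        return _overflow != null && word < _overflow.Length
            && (_overflow[word] & (1UL << ((index - 64) & 63))) != 0;
    }
}
```

Since this is a mutable struct, it has to be stored in a field or local and mutated in place; a by-value copy duplicates the inline word but shares the overflow array.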

Measured (SenaQuick): Scaffold 2,269,759 -> 2,169,001 KB (-100 MB), total
5,242 -> 5,145 MB (~-2%). Full suite (811 SIL.Machine + 63 HC) green.

The remaining per-instance allocation (the Register[,] array, ~1.17M/word) is the
bigger prize but is blocked here: the `traversed` dedup key holds each instance's
register array BY REFERENCE, so a shared register arena (slices reused across
instances) would corrupt dedup. Cutting it needs the deep de-iterator + snapshot-
dedup rewrite, not a drop-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1 (deep): de-iterator Advance/Initialize into a reusable buffer

The core of the scaffold rewrite. Advance was a yield-based iterator and
Initialize allocated a fresh List per call (both recursive), so the ~2,482
Transduce calls per word fanned out into millions of Advance calls, each of which
minted an iterator state machine or a List. Both now fill ONE reusable per-method
result buffer instead.

Safety: the buffer is a per-method (per-Transduce) field, so it carries no
cross-word retention (the Phase-1b Gen2 parallel regression) and cannot be a
thread-static (CheckAccepting's Acceptable predicate can re-enter Transduce on
the same thread). Initialize fills it once at the start of Traverse and the
caller fully consumes it building the work stack before the main loop's first
Advance reuses it, so the two never overlap (one buffer serves both). Advance is
not re-entrant within a method. Byte-identical: same results, same order.
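
The buffer-reuse pattern, in a deliberately small form. `Expander` and its successor rule are hypothetical placeholders for the traversal method; the point is only the contract that results are valid until the next call.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: successors are written into one reusable list
// instead of being produced by a yield-based iterator.
public sealed class Expander
{
    // One buffer per instance (per traversal method), not static or thread-static.
    private readonly List<int> _buffer = new List<int>();

    // The returned list is only valid until the next call to Advance.
    public List<int> Advance(int state)
    {
        _buffer.Clear();
        _buffer.Add(state + 1);   // placeholder successor
        _buffer.Add(state * 2);   // placeholder successor
        return _buffer;
    }
}
```

Callers must fully consume the result (for example, push each entry onto the work stack) before calling `Advance` again, which is the same non-overlap argument made above for Initialize and Advance.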

Measured (SenaQuick): total 5,145 -> 5,029 MB (-116 MB, ~-2.3%); the per-call
iterator state machines (Scaffold -147 MB) replaced by one buffer List/method
(+~39 MB in TraversalMethod after merging the two buffers into one). Full suite
(811 SIL.Machine + 63 HC, incl. concurrent-determinism) green.

NOTE on the register stackalloc premise: it does NOT apply to the nondeterministic
matcher (the hot path). The `traversed` dedup retains a per-config register
snapshot during each Transduce, so the registers are not transient stack values -
they're the evolving, snapshotted match state. The achievable scaffold wins are
therefore the iterator garbage (this commit) + the Visited HashSet (prior), not
stackalloc'd registers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record lever-1 deep rewrite (Visited bitset + de-iterator, ~-6% Sena, byte-identical) + the register-stackalloc constraint

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Port 'Fix incorrect chapter numbers returned from usfm_structure_extractor'
