HermitCrab performance: single-threaded option, allocation instrumentation, and out-of-process Server-GC parser #438
johnml1135 wants to merge 4 commits into
Pull request overview
This PR targets HermitCrab parser throughput under real-world FLEx grammars by reducing allocation/GC pressure and avoiding nested parallelism, while adding diagnostics/benchmarks and an optional out-of-process Server-GC worker.
Changes:
- Adds a runtime `maxDegreeOfParallelism` knob to force fully single-threaded within-word parsing (enabling callers to parallelize across words safely).
- Introduces copy-on-write behavior for `FeatureStruct.Clone()` when the source is frozen, plus additional small allocation reductions and memoization.
- Adds opt-in `MorpherStatistics`, explicit benchmarks/tests, and a new `SIL.Machine.Morphology.HermitCrab.Server` worker + client implementing newline-delimited JSON over stdin/stdout.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/SIL.Machine.Tests/FeatureModel/FeatureStructTests.cs | Adds COW characterization tests to ensure frozen sources aren’t mutated via clone writes (incl. nested + re-entrant cases). |
| tests/SIL.Machine.Tests/Annotations/AnnotationTests.cs | Adds a safety-net test for cloning frozen Shape and mutating a cloned node’s FeatureStruct. |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/SIL.Machine.Morphology.HermitCrab.Tests.csproj | References the new HermitCrab.Server project for end-to-end tests. |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs | Adds tests asserting single-threaded mode avoids parallel sections and that concurrent parsing is deterministic. |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherBenchmark.cs | Adds an [Explicit] benchmark harness for throughput + allocation/GC measurement across modes. |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/HermitCrabServerTests.cs | Adds [Explicit] end-to-end validation and benchmark for the out-of-process Server-GC worker. |
| src/SIL.Machine/Rules/CombinationRuleCascade.cs | Adds multi-application memoization to avoid re-expanding already-seen words. |
| src/SIL.Machine/FeatureModel/FeatureStruct.cs | Implements copy-on-write FeatureStruct.Clone() for frozen sources; mutations inflate a deep copy on first write. |
| src/SIL.Machine/Annotations/Shape.cs | Optimizes Shape.CopyTo by building node mapping during cloning (avoids LINQ allocations). |
| src/SIL.Machine.Morphology.HermitCrab/Word.cs | Adds Word.Clone instrumentation hook (MorpherStatistics.CountWordClone()). |
| src/SIL.Machine.Morphology.HermitCrab/MorpherStatistics.cs | New opt-in process-wide counters/timers for parsing diagnostics. |
| src/SIL.Machine.Morphology.HermitCrab/Morpher.cs | Adds runtime MaxDegreeOfParallelism, single-threaded synthesis path, and instrumentation timing/counters. |
| src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs | Selects sequential vs parallel cascade based on MaxDegreeOfParallelism; instruments parallel entry. |
| src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs | Uses sequential vs parallel unapplication based on MaxDegreeOfParallelism; instruments parallel entry. |
| src/SIL.Machine.Morphology.HermitCrab.Server/SIL.Machine.Morphology.HermitCrab.Server.csproj | Adds new worker host project (exe) targeting net10.0 with nullable enabled. |
| src/SIL.Machine.Morphology.HermitCrab.Server/Program.cs | Worker entrypoint; --serve --config ... [--max-dop N]. |
| src/SIL.Machine.Morphology.HermitCrab.Server/HermitCrabServerProtocol.cs | Defines request/response DTOs for newline-delimited JSON protocol. |
| src/SIL.Machine.Morphology.HermitCrab.Server/HermitCrabServerHost.cs | Implements worker loop: load grammar, emit READY + GC mode, analyze batches with across-word parallelism. |
| src/SIL.Machine.Morphology.HermitCrab.Server/HermitCrabServerClient.cs | Client that launches worker with Server GC and exposes AnalyzeWord/AnalyzeWords. |
| Machine.sln | Adds the new server project and additional solution configurations/platform mappings. |
| HERMITCRAB_PERF_PLAN.md | Adds performance plan, measurements, and rationale for changes. |
| HERMITCRAB_COW_PLANS.md | Documents COW design options and rationale; reports measured results. |
| HERMITCRAB_ALLOCATION_STRATEGIES.md | Documents allocation/GC strategy analysis and ranking for this codebase. |
Comments suppressed due to low confidence (2)
src/SIL.Machine/FeatureModel/FeatureStruct.cs:207
`AddValue(IEnumerable<Feature> path, FeatureValue value)` currently always throws `ArgumentException` even when `FollowPath(...)` succeeds, making the path-based overload unusable.
if (FollowPath(path, out lastFeature, out lastFS))
lastFS._definite[lastFeature] = value;
throw new ArgumentException("The feature path is invalid.", "path");
src/SIL.Machine/FeatureModel/FeatureStruct.cs:230
`RemoveValue(IEnumerable<Feature> path)` currently always throws `ArgumentException` even when `FollowPath(...)` succeeds, so the path-based removal overload cannot be used.
if (FollowPath(path, out lastFeature, out lastFS))
lastFS._definite.Remove(lastFeature);
throw new ArgumentException("The feature path is invalid.", "path");
| HashSet<TData> expanded = MultipleApplication ? new HashSet<TData>(Comparer) : null; | ||
| ApplyRules(input, !MultipleApplication ? new HashSet<int>() : null, output, expanded); |
Good catch. Seeded the expanded memo set with the initial input so a cycle back to it (A→B→A) won't re-expand it. Fixed in 3bed5b3.
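The effect of seeding the memo set can be sketched with a toy expansion graph. This is a minimal illustration only: the `successors` map, the `Expand` helper, and the string items are hypothetical stand-ins, not the PR's `ApplyRules` or `CombinationRuleCascade` code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical expansion graph: A -> B, B -> A and C. A -> B -> A is a cycle.
var successors = new Dictionary<string, string[]>
{
    ["A"] = new[] { "B" },
    ["B"] = new[] { "A", "C" },
    ["C"] = Array.Empty<string>(),
};

int expansions = 0;

// Expand one item; `expanded` is the memo set of items already expanded.
void Expand(string item, HashSet<string> expanded, List<string> output)
{
    expansions++;
    foreach (string next in successors[item])
    {
        // Add returns false when `next` was already seen, so it is skipped.
        if (!expanded.Add(next))
            continue;
        output.Add(next);
        Expand(next, expanded, output);
    }
}

var output = new List<string>();
// Seeding the memo set with the initial input means the cycle back to "A"
// is recognized instead of triggering a second expansion of "A".
var expanded = new HashSet<string>(StringComparer.Ordinal) { "A" };
Expand("A", expanded, output);

Console.WriteLine($"outputs: {string.Join(",", output)}; expansions: {expansions}");
```

With the seed in place, each of the three items is expanded once. Starting from an empty set instead would let the cycle add "A" to the set on the way back and expand it a second time.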
Force-pushed from 4d6f266 to f85dcde.
Codecov Report: ❌ patch coverage
Coverage diff (master vs. #438):
| | master | #438 | +/- |
|---|---|---|---|
| Coverage | 73.20% | 72.94% | -0.27% |
| Files | 440 | 445 | +5 |
| Lines | 36931 | 37219 | +288 |
| Branches | 5077 | 5126 | +49 |
| Hits | 27037 | 27150 | +113 |
| Misses | 8781 | 8964 | +183 |
| Partials | 1113 | 1105 | -8 |
Scoping update: the copy-on-write The FST grammar-advisor groundwork that briefly rode on this branch locally has been split into its own independent PR (#441), off |
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs:84
`ParallelApplySlots` selects the parallel path based on `MaxDegreeOfParallelism`, but the `Parallel.ForEach(from, ...)` call doesn't pass `ParallelOptions`. That means any value > 1 will still use the default scheduler degree (effectively unbounded), so the runtime parallelism cap isn't actually honored for this within-word parallel site.
MorpherStatistics.EnterParallelSection();
var outStack = new ConcurrentStack<Word>();
var from = new ConcurrentStack<Tuple<Word, int>>();
from.Push(Tuple.Create(inWord, _rules.Count - 1));
var to = new ConcurrentStack<Tuple<Word, int>>();
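The remedy the comment implies is to pass a `ParallelOptions` carrying the cap into `Parallel.ForEach`. The sketch below shows that pattern on a stand-in workload and records the peak concurrency it observes; the variable names and the workload are assumptions for illustration, not the PR's `ParallelApplySlots` code.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int maxDegreeOfParallelism = 2; // hypothetical cap supplied by the caller
int current = 0;                // loop bodies running right now
int peak = 0;                   // highest concurrency observed

// Without this options object, Parallel.ForEach uses the scheduler default.
var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

Parallel.ForEach(Enumerable.Range(0, 64), options, _ =>
{
    int now = Interlocked.Increment(ref current);

    // Lock-free "peak = max(peak, now)".
    int seen;
    while (now > (seen = Volatile.Read(ref peak)))
        Interlocked.CompareExchange(ref peak, now, seen);

    Thread.Sleep(5); // stand-in for real work
    Interlocked.Decrement(ref current);
});

Console.WriteLine($"peak concurrency: {peak} (cap {maxDegreeOfParallelism})");
```

The observed peak should never exceed the configured cap, which is the property the review comment says was missing at this site.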
| EndProject | ||
| Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "SIL.Machine.Tokenization.SentencePiece.Tests", "tests\SIL.Machine.Tokenization.SentencePiece.Tests\SIL.Machine.Tokenization.SentencePiece.Tests.csproj", "{AB5F75C1-64B7-4E0F-A4B5-B14EB16E6DDC}" | ||
| EndProject | ||
| Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "SIL.Machine.Morphology.HermitCrab.Server", "src\SIL.Machine.Morphology.HermitCrab.Server\SIL.Machine.Morphology.HermitCrab.Server.csproj", "{7BE85F08-C119-4B64-B2B0-625482484C50}" |
| HermitCrabAnalyzeRequest? request = JsonSerializer.Deserialize<HermitCrabAnalyzeRequest>(line); | ||
| if (request == null) | ||
| { | ||
| output.WriteLine(JsonSerializer.Serialize(new HermitCrabAnalyzeResponse())); | ||
| output.Flush(); | ||
| continue; | ||
| } |
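One way to handle both failure modes of `JsonSerializer.Deserialize` (a `null` result for the JSON literal `null`, and a `JsonException` for malformed input) is sketched below. The request shape (a JSON array of words), the response fields, and the `HandleLine` helper are hypothetical; the PR's actual DTOs are `HermitCrabAnalyzeRequest` and `HermitCrabAnalyzeResponse`, whose members are not shown on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Handle one newline-delimited request and always return one response line.
string HandleLine(string line)
{
    List<string> words = null;
    string error = null;
    try
    {
        // Returns null (no exception) when the payload is the JSON literal null.
        words = JsonSerializer.Deserialize<List<string>>(line);
        if (words == null)
            error = "request was null";
    }
    catch (JsonException ex)
    {
        // Malformed JSON or a payload of the wrong shape lands here.
        error = "malformed request: " + ex.Message;
    }

    var results = new List<string>();
    if (error == null)
    {
        foreach (string word in words)
            results.Add(word?.ToUpperInvariant()); // stand-in for real analysis
    }

    // An empty result list plus an error string, rather than a crash.
    return JsonSerializer.Serialize(new { results, error });
}

Console.WriteLine(HandleLine("[\"walk\",\"walked\"]"));
Console.WriteLine(HandleLine("null"));
Console.WriteLine(HandleLine("{not json"));
```

Replying with a well-formed response in every case keeps the request/response pairing intact, so the client never blocks waiting for a line that the worker skipped.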
| string? ready = _process.StandardOutput.ReadLine(); | ||
| if (ready == null || !ready.StartsWith("READY", StringComparison.Ordinal)) | ||
| throw new InvalidOperationException("HermitCrab worker failed to start: " + (ready ?? "<no output>")); | ||
| WorkerGarbageCollectorMode = ready.Length > 6 ? ready.Substring(6) : "unknown"; | ||
| } |
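The handshake check in the hunk above can be isolated into a small pure function, which makes the edge cases easy to see. This is a sketch under the assumption that the worker's first line has the form `READY <gc-mode>`; the `ParseReady` helper and its tuple return type are hypothetical, not part of the PR.

```csharp
using System;

// Parse the worker's first output line. A null line means the process
// exited before printing anything.
(bool ok, string gcMode) ParseReady(string line)
{
    if (line == null || !line.StartsWith("READY", StringComparison.Ordinal))
        return (false, line ?? "<no output>");

    // "READY " is six characters; anything after it is the GC mode.
    string mode = line.Length > 6 ? line.Substring(6) : "unknown";
    return (true, mode);
}

Console.WriteLine(ParseReady("READY Server"));
Console.WriteLine(ParseReady("READY"));
Console.WriteLine(ParseReady(null));
```

Keeping the parsing separate from the process plumbing lets the "no output" and "READY without a mode" cases be tested without launching a worker.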
| var results = new IReadOnlyList<WordAnalysis>[response.Results.Count]; | ||
| for (int i = 0; i < response.Results.Count; i++) | ||
| { |
| // Single-threaded when the caller caps within-word parallelism (e.g. it | ||
| // parallelizes across words itself); parallel cascade otherwise. | ||
| _mrulesRule = | ||
| morpher.MaxDegreeOfParallelism == 1 | ||
| ? (IRule<Word, ShapeNode>) |
…ling
Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the dead compile-time SINGLE_THREADED flag with a runtime knob across all three within-word parallel sites (synthesis, Unordered analysis cascade, affix-template unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize across words without nested oversubscription.
Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count, analysis/synthesis phase timing, parallel-section counter (proves the sequential path runs under degree-1), and a corpus benchmark (Explicit) that reports GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar.
Profiling the real Sena grammar showed ~8,793 Word.Clone and ~371 MB allocated per word (the combinatorial unapplication search). First allocation win: Shape.CopyTo builds the src->dest node map inline instead of .Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0).
Tests: 62 HermitCrab + 790 SIL.Machine pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: memoize multiApp cascade re-expansion; measure GC under parallel load
CombinationRuleCascade: in multiApp mode a word's expansion depends only on the word, so memoize already-expanded words and skip re-descending them (collapses the combinatorial re-exploration to a DAG; output set unchanged). Output-identical: 62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come from the phonological/synthesis layers, not morphological re-expansion) but it bounds pathological re-expansion blow-up at no correctness cost.
Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words load and report Server vs Workstation GC; this is where alloc/GC contention actually bites, unlike a single-threaded run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC perf plan: record Sena optimization results + Server-GC dominance finding
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: COW design study: 3 scoped plans to cut the FeatureStruct clone firehose
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: add copy-on-write safety-net tests before the COW refactor
- FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw").
- Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged.
- Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load).
All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Plan A: copy-on-write FeatureStruct.Clone for frozen structs
Clone() of a FROZEN feature struct now returns a shell that borrows the source's immutable backing dictionary; the first mutation (EnsureWritable, replacing CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the mutation nor any recursion into children can touch shared frozen data. Clone() of an unfrozen FS still deep-copies. Single-file change; no public API change.
Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on the real Sena grammar: -11% managed allocation/word and ~-29% wall on the 16-way parallel pass (less GC contention). 801 core + 63 HC tests pass, including the new COW safety-net tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

HC: out-of-process Server-GC parser (worker host + reusable client)
New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher in a child process:
- HermitCrabServerHost: loads a compiled HC config, serves analyze requests over stdin/stdout (newline-delimited JSON), parses each word single-threaded with parallelism across the batch. Launched with DOTNET_gcServer=1.
- HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the boundary as DTOs that implement IMorpheme, so the client needs no grammar load.
- Shared protocol DTOs guarantee the two ends agree.
Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed at process startup, so a worker subprocess is the only way to scope Server GC to the parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a thin IParser adapter mapping morph Properties -> LCM.
End-to-end on the real Sena grammar: out-of-process results match in-process; worker runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ode-style)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- CombinationRuleCascade: seed the memoization set with the initial input so a cycle back to it (A->B->A) doesn't re-expand it.
- Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already materialized and Synthesize no longer drains it).
- Server host/client: handle null JsonSerializer.Deserialize results with a clear protocol error instead of an NRE.
- MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
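The clamp described for `MorpherBenchmark` matters because `ParallelOptions.MaxDegreeOfParallelism` rejects 0. A minimal sketch of such a clamp follows; the `ResolveAcrossWordDop` helper and its "leave one core free" default are assumptions for illustration, and only the `HC_ACROSS_DOP` variable name comes from the commit message.

```csharp
using System;

// Resolve the across-word degree of parallelism from an optional override
// (for example the HC_ACROSS_DOP environment variable) and the core count.
int ResolveAcrossWordDop(string overrideValue, int processorCount)
{
    // Assumed default: leave one core free. On a single-core machine this is 0.
    int requested = int.TryParse(overrideValue, out int parsed)
        ? parsed
        : processorCount - 1;

    // Clamp so the result is always a usable value (>= 1).
    return Math.Max(1, requested);
}

string fromEnv = Environment.GetEnvironmentVariable("HC_ACROSS_DOP");
Console.WriteLine(ResolveAcrossWordDop(fromEnv, Environment.ProcessorCount));
```

Note that this sketch also turns negative overrides into 1, whereas `ParallelOptions` itself treats -1 as "no limit"; whether the benchmark makes the same choice is not stated on this page.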

Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior. The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first write) measured ~-11% allocation but is being held back from this performance PR to keep it scoped to the single-threaded option + instrumentation + out-of-process Server-GC parser. COW can return as its own focused change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Force-pushed from 5e4678c to 892816f.
Summary
Profiling the HermitCrab parser against the real Sena FLEx grammar showed it is allocation/GC-bound: ~371 MB and ~8,800 `Word.Clone` calls per word, which caps parallel ("Parse All Words") scaling because the collector saturates well before the cores do. This PR lays the groundwork to fix that, in safe, measured increments. Further allocation work continues as its own focused follow-up (see `RUSTIFY.md`).
What's in this PR
1. Single-threaded runtime option: `new Morpher(tm, lang, maxDegreeOfParallelism: 1)` makes the morpher fully sequential within a word (replacing the dead compile-time `SINGLE_THREADED` flag with a runtime knob across all three within-word parallel sites). This lets a caller parallelize across words without nested oversubscription. The cap is honored at every within-word parallel site (analysis cascade, affix-template unapplication, synthesis).
2. Allocation/GC instrumentation: opt-in `MorpherStatistics` (zero overhead when off): `Word.Clone` count, analysis/synthesis phase timing, parallel-section counter, plus an `[Explicit]` corpus benchmark reporting `GC.GetTotalAllocatedBytes` + Gen0/1/2 against a real grammar.
3. Cascade memoization: skip re-expanding already-seen words in the unordered cascade (output-identical; bounds pathological re-expansion).
4. Out-of-process Server-GC parser: new `SIL.Machine.Morphology.HermitCrab.Server` project: a worker host serves analyze requests over stdin/stdout JSON and is launched with Server GC, plus a reusable `HermitCrabServerClient : IMorphologicalAnalyzer` that manages it. This gets Server-GC throughput (~2×) without the host application changing its own GC mode. Verified end-to-end on Sena: out-of-process results match in-process; worker runs Server GC while the host runs Workstation GC.
5. Design/analysis docs: `HERMITCRAB_PERF_PLAN.md`, `HERMITCRAB_ALLOCATION_STRATEGIES.md`, `HERMITCRAB_COW_PLANS.md` capture the profiling, the strategy comparison, and the Server-GC finding.
Measured (real Sena grammar, 16-way parallel)
Server GC cuts Gen0 collections from ~680 to ~50 and roughly doubles parallel throughput; the out-of-process worker delivers that without touching the host's GC.
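For readers who want to reproduce this kind of measurement, the counters involved are all available from `System.GC` and `System.Runtime.GCSettings`. The sketch below wraps a stand-in allocating workload with them; the workload, variable names, and output format are assumptions and are not taken from `MorpherBenchmark`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime;

// Snapshot the counters before the workload.
long bytesBefore = GC.GetTotalAllocatedBytes(true);
int gen0Before = GC.CollectionCount(0);
int gen2Before = GC.CollectionCount(2);

// Stand-in workload: 10,000 heap-allocated 256-byte buffers, kept alive in
// a list so the allocations cannot be optimized away.
var kept = new List<byte[]>();
for (int i = 0; i < 10_000; i++)
    kept.Add(new byte[256]);

// Deltas attributable (approximately) to the workload.
long allocated = GC.GetTotalAllocatedBytes(true) - bytesBefore;
int gen0 = GC.CollectionCount(0) - gen0Before;
int gen2 = GC.CollectionCount(2) - gen2Before;

// GC mode is fixed at process startup (e.g. via DOTNET_gcServer=1).
string gcMode = GCSettings.IsServerGC ? "Server" : "Workstation";

Console.WriteLine($"GC mode: {gcMode}");
Console.WriteLine($"allocated: {allocated} bytes; Gen0: {gen0}; Gen2: {gen2}");
```

Running the same program with and without `DOTNET_gcServer=1` set in the environment is one way to compare collection counts between the two modes on a given machine.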
Testing
- `AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic`) pins shared-morpher determinism under many threads.
- `[Explicit]` (not run in CI; require a grammar via `HC_GRAMMAR`/`HC_WORDS`).
Review follow-ups addressed
- `MaxDegreeOfParallelism` at the two within-word parallel sites that previously ran at the default scheduler degree (`ParallelCombinationRuleCascade`, `AnalysisAffixTemplateRule`).
- `JsonException` caught → empty response).
Follow-ups (later)
- `IParser` adapter for the out-of-process worker (needs a richer protocol carrying the stamped morph Properties; net48-consumable client split).
- `FeatureStruct`, struct-of-arrays; tracked in `RUSTIFY.md`.
🤖 Generated with Claude Code