HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) #441
johnml1135 wants to merge 19 commits into
Pull request overview
Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.
Changes:
- Introduces `GrammarFstAdvisor`, `GrammarFstReport`, and `GrammarAdvisory` to classify expensive/non-FST-able constructs across morphological and phonological rules.
- Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
- Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |
Four parallel audits (formal-language status × HC impl × FST impl) of every HC construct,
classified covered / partial / coverable / not-coverable, with architecture proposals and
appendices on closing the non-regular gap.
Headline findings:
- Almost all of HC is REGULAR (Kaplan-Kay) hence 1-way-FST-able; the only genuinely
non-regular core is unbounded full-stem reduplication ({ww}) + an unbounded self-feeding
rewrite cycle (HC caps at 256).
- Critical coverage ceiling: the proposer is only correct for 0-PHONOLOGY grammars (arcs are
underlying segments, walk is surface) — it silently under-generates (fails safe; parity gate
refuses to certify) for any grammar with phonological rules. Phonology-by-composition is the
biggest coverage win.
- Robustness bug: the proposer THROWS on infix/circumfix/reduplication/process slots, aborting
the whole build instead of degrading to the engine. Graceful degradation is the top this-PR fix.
- Other gaps: true zero-segment affix dropped; bounded compounding needs proposer + FstReplay
changes; MPR/co-occurrence/env/stemname correctly left to verify (sound).
Appendix A: length-cap fold / detect-and-peel (compile-replace) / 2-way FST (Dolatian-Heinz) /
engine backstop for the non-FST-able constructs. Appendix B: verify-by-re-analysis + escape-aware
codec + certified-skip interlock all HELP later non-regular work; only the 2-way reduplication
solution would need a new execution model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eview fixes

A. Graceful degradation: the proposer no longer THROWS on infix/circumfix/reduplication/process slots; it skips the unbuildable construct, builds the rest, and sets CoversAllConstructs=false so the grammar can't certify (those words fall to the engine/cache; the parity gate enforces it). Was a NotSupportedException that aborted the whole build, making the FST unusable on any grammar with such a slot.

B. True zero-segment affix (CopyFromInput only, no InsertSegments) now emits its morpheme token with no segment arcs instead of throwing / being silently dropped. SlotOp treats a zero-only slot as a position-less suffix so it still builds.

Certification guard: FromLanguage (Caching + CompleteHybrid) now requires proposer.CoversAllConstructs in addition to closed + parity, so a degraded build can't certify.

Copilot review fixes (advisor, still in PR):
- Examine RealizationalAffixProcessRule (it implements IMorphologicalRule + has Allomorphs; can encode reduplication/infix), which was previously silently skipped, undercounting escapes. AnalyzeAffix refactored to (name, allomorphs) and the switch handles both rule types.
- GrammarFstReport counts are now PER-RULE (group advisories by Rule/Stratum/Kind, worst severity) instead of per-advisory, so per-allomorph advisories don't overcount and the partitions are consistent (Probeable+Opaque = Escape, Regular+NonRegular = Escape).

Tests: Build_ReduplicationSlot_DegradesGracefully_DoesNotThrow, Analyze_ZeroSegmentSuffix_IsEmitted_NotDropped, Analyze_RealizationalReduplication_IsExamined. Unit suite 96 green; Sena unchanged (certifies, 0 parallel mismatches, 0 false positives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
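The per-rule counting described in the commit above can be illustrated with a small sketch. It is written in Python as a language-neutral toy, not the PR's C# code; the `Advisory` fields, the severity ordering, and the reading of "Kind" as the rule kind are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed ordering for this sketch: a higher number is a worse severity.
SEVERITY = {"Info": 0, "Cost": 1, "Escape": 2}


@dataclass(frozen=True)
class Advisory:
    rule: str       # hypothetical field names, not the PR's model
    stratum: str
    kind: str
    severity: str


def per_rule_counts(advisories):
    """Count one entry per (rule, stratum, kind), keeping the worst severity."""
    worst = {}
    for a in advisories:
        key = (a.rule, a.stratum, a.kind)
        if key not in worst or SEVERITY[a.severity] > SEVERITY[worst[key]]:
            worst[key] = a.severity
    return dict(Counter(worst.values()))


ADVISORIES = [
    Advisory("RED", "morph", "affix", "Escape"),   # allomorph 1
    Advisory("RED", "morph", "affix", "Escape"),   # allomorph 2 of the same rule
    Advisory("RED", "morph", "affix", "Cost"),
    Advisory("voicing", "phon", "rewrite", "Info"),
]
```

With per-advisory counting the `RED` rule would contribute two escapes and one cost; grouping first means it contributes a single escape.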
Thanks @copilot — both advisor comments addressed in
…lure)

CI runs `dotnet csharpier check .` and the new/edited FST files were not formatted. Ran `dotnet csharpier format .` (1.2.6); only the 11 FST/advisor/test files changed, no unrelated files touched. Unit suite 96 green; csharpier check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Let the FST proposer match phonologically-altered surfaces, the C-internal tier of Solution 1 (surface-allomorph precompile, docs/FST_FULL_COVERAGE_PLAN.md Appendix C). For each root the grammar allows to stand bare, build a proposer arc not just for the underlying shape but for every bare surface realization HC synthesizes (phonology applied), reusing the obligatoriness GenerateWords call, so zero extra build cost. The emitted token is always the underlying morpheme; verify re-runs HC with real phonology to confirm.
- FstTemplateAnalyzer: _bareRootValid -> _bareRootSurfaces; add BareRootSurfaces, UnderlyingForm, BuildRootChainFromSurface. Underlying arcs kept (union), so the 0-phonology path is unchanged.

Fix a latent verify bug this exposed: AnalysisRewriteRule/AnalysisMetathesisRule gate on Morpher.RuleSelector, and FstReplay pinned the selector to just the candidate's morphological rules, silently disabling ALL phonology during verify. The propose-and-verify spine could therefore never confirm any phonologically-altered candidate. Phonological rules are obligatory deterministic rewrites, not a fan-out choice, so FstReplay now always lets IPhonologicalRule through; the morphological fan-out is still collapsed by gating the leaf rules + root, and soundness is still enforced by the unchanged candidate-signature match.

Add Verified_CoversPhonologicallyAlteredBareRoot: an unconditional t->d rule makes bare root "dat" surface only as "dad"; a baseline assertion proves the underlying-only proposer misses "dad", the surface-precompile proposer covers it, verify confirms it as a genuine HC analysis, and a non-word still yields nothing. Full HermitCrab suite green (97 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
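The bare-root surface precompile and the "dat"/"dad" test in the commit above can be mimicked with a toy model. This is a Python sketch under stated assumptions, not the PR's C# implementation: `apply_phonology` stands in for HC synthesis, and `build_proposer`, `verify`, and `analyze` are hypothetical names.

```python
def apply_phonology(underlying: str) -> str:
    # Toy stand-in for HC synthesis: an unconditional t -> d rewrite.
    return underlying.replace("t", "d")


def build_proposer(roots, precompile_surfaces: bool):
    """Map each matchable shape to the underlying roots it may stand for."""
    table = {}
    for root in roots:
        table.setdefault(root, set()).add(root)          # underlying arc is kept
        if precompile_surfaces:
            surface = apply_phonology(root)              # add the surface realization
            table.setdefault(surface, set()).add(root)   # token is still the underlying root
    return table


def verify(surface: str, candidate_root: str) -> bool:
    # Verification re-runs the (toy) phonology and compares with the input.
    return apply_phonology(candidate_root) == surface


def analyze(table, surface: str):
    return sorted(r for r in table.get(surface, ()) if verify(surface, r))
```

In this model the underlying-only table has no entry for "dad", so it proposes nothing, while the precompiled table proposes "dat" and verification confirms it. The underlying shape "dat" itself is proposed but rejected by verification, because the rule is unconditional.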
FST_FULL_PLAN.md: implementation plan for the four expansion points. The propose-and-verify split means correctness lives in verify + certification, never in the proposer, so coverage expansion can only change the acceleration ratio, never produce a wrong answer. Architecture: a CompositeProposer unions candidate generators (FST + reduplication + infix scanners) into the one verify gate.
- Point 2 (infix) and Point 3 (reduplication): bounded candidate generators that strip/remove their material and RECURSE the residual through the FST proposer (so inflected reduplicants / infixed forms are covered), feeding the verify gate.
- Point 1 (all phonology): affix surface-precompile + C-boundary neighbor context, extending the shipped bare-root C-internal tier.
- Point 4 (C-exact composition): design recorded + deferred with rationale. It is a spine redesign (token side-table -> transducer outputs) whose only marginal gain over C-boundary is rare cross-boundary opacity that already falls back to the engine correctly. C-boundary subsumes its practical value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reduplication (copy the whole base, surface = base·base) is the one provably non-regular construct: an FST cannot represent it. Handle it BESIDE the FST: a bounded candidate generator feeds the same propose-and-verify gate, so it is sound without being regular.
- CompositeProposer: unions several proposers (FST + generators) into one IMorphologicalAnalyzer, deduping candidates by order-sensitive morpheme-identity signature before the verify gate. Aggregates coverage at the MorphOp level (CoversAllConstructs = FST's uncovered ops minus what generators cover) so a grammar can certify once a sibling generator covers the FST's skipped construct. New IConstructProposer interface lets a generator declare its covered ops.
- ReduplicationProposer (IConstructProposer): detects an adjacent doubling X·X, strips one copy, RECURSES the residual through the FST proposer (so an inflected reduplicant is covered, not just a bare root), and appends the reduplication morpheme in HC application order (root·…·RED). A coincidental doubling is pruned by verify (HC synthesis won't reproduce it).
- FstTemplateAnalyzer: replace the _hasUnbuiltConstructs bool with an _uncoveredOps set (records WHICH MorphOp was skipped: slot rules, in-slot affixes, and standalone morphological rules); expose UncoveredOps. CoversAllConstructs == (UncoveredOps empty).

Test: a full-reduplication grammar; the FST alone misses "sagsag" (and reports not-fully-covered), the composite covers it (and reports covered), verify confirms the genuine HC analysis, and a non-word still yields nothing. Full suite green (98).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
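The composite propose-and-verify flow with a reduplication generator can be sketched as follows. This is a toy in Python, not the PR's C# classes: the lexicon, the `PL` suffix, and the function names are hypothetical, and `synthesize` stands in for HC re-analysis as the verify step.

```python
ROOTS = {"sag", "bab"}
SUFFIXES = {"PL": "i"}  # hypothetical toy suffix, for illustration only


def fst_propose(surface):
    """Toy stand-in for the FST proposer: a root plus an optional suffix."""
    out = [(surface,)] if surface in ROOTS else []
    for name, seg in SUFFIXES.items():
        if surface.endswith(seg) and surface[: -len(seg)] in ROOTS:
            out.append((surface[: -len(seg)], name))
    return out


def reduplication_propose(surface):
    """Find each adjacent doubling X·X, strip one copy, recurse, append RED."""
    out = []
    n = len(surface)
    for i in range(n):
        for k in range(1, (n - i) // 2 + 1):
            if surface[i : i + k] == surface[i + k : i + 2 * k]:
                residual = surface[: i + k] + surface[i + 2 * k :]
                out.extend(c + ("RED",) for c in fst_propose(residual))
    return out


def synthesize(analysis):
    """Toy forward synthesis: suffixes concatenate, RED doubles the whole base."""
    form = analysis[0]
    for m in analysis[1:]:
        form = form + form if m == "RED" else form + SUFFIXES[m]
    return form


def analyze(surface, proposers):
    # Union the candidates, dedupe by order-sensitive signature, then verify.
    candidates = dict.fromkeys(c for p in proposers for c in p(surface))
    return [c for c in candidates if synthesize(c) == surface]
```

In this toy, "baab" contains the coincidental doubling "aa", so the generator proposes `("bab", "RED")`; the verify step discards it because synthesis yields "babbab", which mirrors the pruning the commit describes.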
Infixation (an affix inserted inside the stem, e.g. Tagalog -um-) is regular; the
FST proposer recognizes but does not build infix slots. Handle it as a sibling
generator feeding the same propose-and-verify gate.
- InfixProposer (IConstructProposer): for each infix and each interior position
where the infix's surface segments occur, remove them and RECURSE the residual
through the FST proposer (so an infixed form of an inflected stem is covered),
then append the infix morpheme in HC application order (root·…·INF).
Over-approximation — every interior occurrence is tried; verify prunes the wrong
splits. O(surface-length × infixes) candidates, bounded.
- First cut: the infix must be a single contiguous run of inserted segments,
matched against its underlying representation. Templatic multi-slot infixes and
phonologically-altered infix surfaces are left to the engine (parity gate keeps
results correct).
Test: an "a"-infix grammar ("sag" -> "saag"); the FST alone misses "saag" (and
reports not-fully-covered), the composite covers it (and reports covered), verify
confirms the genuine HC analysis, and a non-word still yields nothing. Full suite
green (99).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
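The infix generator described above follows the same strip-and-recurse pattern. Below is a minimal Python sketch, assuming a toy grammar in which the infix "a" is inserted after the first segment of the stem; the lexicon and all function names are hypothetical and this is not the PR's `InfixProposer`.

```python
ROOTS = {"sag", "sig"}
INFIXES = {"INF": "a"}  # infix name -> contiguous run of inserted segments


def fst_propose(surface):
    """Toy stand-in for the FST proposer: bare roots only."""
    return [(surface,)] if surface in ROOTS else []


def synthesize(analysis):
    """Toy synthesis: each infix is inserted after the first stem segment."""
    form = analysis[0]
    for m in analysis[1:]:
        form = form[:1] + INFIXES[m] + form[1:]
    return form


def infix_propose(surface):
    """Try every interior occurrence of each infix; verification prunes bad splits."""
    out = []
    for name, seg in INFIXES.items():
        # Interior only: the match may touch neither edge of the word.
        for i in range(1, len(surface) - len(seg)):
            if surface[i : i + len(seg)] == seg:
                residual = surface[:i] + surface[i + len(seg) :]
                out.extend(c + (name,) for c in fst_propose(residual))
    return out


def analyze(surface):
    candidates = dict.fromkeys(fst_propose(surface) + infix_propose(surface))
    return [c for c in candidates if synthesize(c) == surface]
```

For "saag" both interior positions of "a" leave the residual "sag", so deduplication keeps one candidate. For "siag" the generator proposes `("sig", "INF")`, but toy synthesis gives "saig", so the wrong split is pruned.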
Extend the surface-allomorph precompile from bare roots to AFFIXES: build each affix's segment arcs from its underlying form AND each phonologically-altered surface realization, so an affix whose surface differs from its underlying segments (e.g. a suffix devoiced/changed by a rule) is matched by the proposer.
- SurfacePhonology: a forward-phonology helper that compiles each stratum's synthesis phonological rules (reusing HC's CompileSynthesisRule, exactly what SynthesisStratumRule runs) and applies them to a segment string in isolation, returning the distinct surface variants (C-internal tier: catches edge- and morpheme-internal alternations; cross-boundary ones ride the engine).
- FstTemplateAnalyzer.BuildAffixArcs: shared by both affix-arc sites (derivational layers + template slots); builds the underlying path plus a path per altered surface variant. Default ctor passes an identity variant function, so the 0-phonology path is byte-identical; the morpher ctor wires SurfacePhonology.

Tests: Proposer_CoversPhonologicallyAlteredAffix (a "t" suffix that surfaces only as "d" via t->d: the underlying-only proposer misses "sagd", the surface-precompile proposer covers it, verify stays sound) and SurfacePhonology_AppliesRulesForward. Full suite green (101). FST_FULL_PLAN.md updated with the shipped/deferred matrix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Apply CSharpier to FstTemplateAnalyzer.cs (a reflowed method signature) so Check-formatting passes, and update the now-stale class summary: the proposer precompiles bounded phonology into its arcs and degrades gracefully on constructs it cannot model (recording the MorphOp in UncoveredOps for the composite), rather than throwing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reduplication/infix generators were only constructed in test code: both production factories built a bare FstTemplateAnalyzer, so a reduplicating/infixing grammar never certified and the generators never ran. Wire them in.
- CompositeProposer.ForLanguage(language, fst): the standard production proposer (FST + reduplication + infix generators). Inert for grammars without those constructs (generators hold no rules, yield nothing; CoversAllConstructs is vacuously true), so near-zero overhead and byte-identical behavior.
- CompleteHybridMorpher.FromLanguage and CachingMorphologicalAnalyzer.FromLanguage now build the composite and certify on its CoversAllConstructs.

Integration test CompleteHybrid_WiresGenerators_...: a reduplicating grammar certifies through the production factory and the fast path matches the engine on bare/reduplicated/homograph/non-word. This is the test whose absence let the feature be inert. Docs note the wiring + the extended empirical-certification caveat (a certified grammar skips the engine, so the certification corpus must exercise the reduplication/infix patterns). Full suite green (102).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ition

Complete the phonology story: all four enhancement points are now implemented and wired into production.

Point 4 (C-exact, the complete path): ComposedPhonologyProposer composes HC's phonology INVERSE with the morphotactic FST. It un-applies the grammar's phonological rules to the surface (reusing each stratum's CompileAnalysisRule, exactly what AnalysisStratumRule runs, strata surface->inner, rules reversed) to recover the underlying form, then walks the underlying-arc FST on it (FstTemplateAnalyzer.AnalyzeShape, newly exposed). Because the inverse is applied to the ASSEMBLED surface, this covers all bounded phonology including the cross-boundary, stem-conditioned alternations the per-morpheme precompile cannot see. Under-specified analysis nodes match via unification; verify prunes spurious candidates. Chosen over literal Fst.Compose because the proposer accumulates tokens in a side-table, not transducer outputs; composing HC's existing inverse reaches the same coverage while reusing the engine's real phonology.

Point 1b (C-boundary, the cheap fast-path): SurfacePhonology now also probes each surface-alphabet segment as a left/right neighbor and, when the rule is length-preserving, reads back the morpheme's own surface portion, catching an affix whose surface is conditioned by a neighbor across the seam. Bounded by alphabet size; length-changing contexts are skipped (sound superset).

Both wired into CompositeProposer.ForLanguage (inert when the grammar lacks phonology; short-circuits).

Tests: ComposedPhonology_CoversCrossBoundaryAlternation (g->k / _t across the boundary: precompile misses "sakt", composition recovers it) and SurfacePhonology_BoundaryTier (t->d / g_: isolation keeps "t", boundary recovers "d"). Full suite green (104); full solution builds; CSharpier clean. Plan updated: all four points shipped + wired.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ComposedPhonologyProposer runs HC's analysis phonology at analyze time on the concurrent path (both factories advertise parallel parsing). Harden + verify:
- Compile the inverse cascade against a PRIVATE Morpher with its own TraceManager (not the factory's shared one), mirroring how MorpherPool gives each rented morpher its own: the analysis rules read _morpher.TraceManager/selectors, so the proposer must not share them. Each AnalyzeWord applies the cascade to a fresh local Word (no per-call mutation of shared state). ForLanguage no longer threads a morpher through.
- Add Composite_WithPhonologyAndReduplication_ParallelMatchesSequential: drives the production CompleteHybridMorpher (phonology inverse + reduplication generator both live) over a corpus in parallel and asserts parallel == sequential, no exceptions.

Full suite green (105).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…honemics

Real-grammar validation (Indonesian meN- nasal substitution) exposed that the phonology INVERSE cannot be cleanly composed when rules are conditioned on the morpheme boundary: ComposedPhonologyProposer un-applies on the boundary-less surface, so meN- rules fire everywhere and over-generate (menulis -> ⁿmeⁿnⁿpuⁿlis), the mess HC only prunes via interleaved morphology + re-synthesis (the slow search). Inversion stays valid/sound for SEGMENT-conditioned phonology; it is just not the tool for boundary-conditioned morphophonemics.

Forward synthesis IS boundary-correct (GenerateWords applies rules with the boundary present). New ForwardSynthesisProposer precompiles, at build time, each root × every ORDERED affix combo (permutations, since order matters) up to maxAffixes, synthesizes the surface, and tabulates surface->analysis; analysis is a dictionary lookup and verify still confirms. Covers reduplication and infixation for free.
- ForwardSynthesisProposer (IConstructProposer): sound by construction (a tabulated entry is a real synthesized word), bounded by maxAffixes + a hard entry budget.
- Opt-in via CompositeProposer.ForLanguage(language, fst, forwardSynthesis: true): build cost grows with lexicon × permutations, so it is right for bounded-affixation grammars / fixed corpora, not heavily-inflecting templatic systems. Default behavior unchanged.
- CI test ForwardSynthesis_CoversAffixedForms_AndIsSound; reusable real-grammar harnesses added to FstSenaBenchmark (Benchmark_ForwardSynthVsSearch, etc.).

Indonesian result (depth 2): full coverage 42 -> 69 of 70 words, 0 unsound, build ~5s. The 1 holdout is a 3-affix realizational combo. Does not flip the grammar to certified (holdout breaks parity; grammar not FST-closed); the win is on the explicit verified-FST path, correct everywhere. FST_FULL_PLAN.md updated with the inversion-vs-synthesis finding and scope. Full suite green (106).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
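The tabulation step of the forward-synthesis approach can be sketched in Python. This is a toy under stated assumptions, not the PR's `ForwardSynthesisProposer`: `synthesize` is a hypothetical stand-in for GenerateWords with a two-case nasal-substitution rule, and `max_entries` models the hard entry budget.

```python
from itertools import permutations


def synthesize(root, combo):
    """Toy boundary-aware synthesis; returns None when the toy rules do not apply."""
    form = root
    for affix in combo:
        if affix == "meN-":
            # Nasal substitution fires only at the prefix boundary.
            if form[0] == "t":
                form = "men" + form[1:]
            elif form[0] == "p":
                form = "mem" + form[1:]
            else:
                return None
        elif affix == "-i":
            form = form + "i"
    return form


def build_table(roots, affixes, max_affixes=2, max_entries=10_000):
    """Tabulate surface -> analyses for each root and each ordered affix combination."""
    table, entries = {}, 0
    for root in roots:
        for n in range(max_affixes + 1):
            for combo in permutations(affixes, n):   # ordered: order matters
                surface = synthesize(root, combo)
                if surface is None:
                    continue
                if entries >= max_entries:
                    return table, False              # budget hit: table is partial
                table.setdefault(surface, []).append((root,) + combo)
                entries += 1
    return table, True


def analyze(table, surface):
    # Analysis is a dictionary lookup; every entry came from real (toy) synthesis.
    return table.get(surface, [])
```

Because every entry is produced by running synthesis forward, the substitution is applied only where the boundary actually is, which is the property the runtime inverse lacked. The cost is visible in the loop structure: the table grows with roots times affix permutations.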
Design + blocker analysis for the grammar-sized composition approach (compose morphotactics ⊗ phonology into one surface↔analysis transducer; build scales with the grammar, not the language). Three blockers:
- (1) tokens via side-table not output tape: keep state-based through composition;
- (2) HC phonology is match-then-mutate, not a transducer: build a compiler (probe-synthesis Mealy transducer reusing HC's phonology, substitution then deletion);
- (3) unification-arc composition: already solved by Fst.Compose.

Spike-first incremental plan, verify gates soundness throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…strained
Algorithm-level spike (symbol alphabet) proving the Lever 2 architecture before the
big build. Lazy on-the-fly composition of an inverse-phonology transducer (Pinv:
surface->underlying, with ε-input arcs that restore deleted segments) with a
morphotactic acceptor (Lex: underlying, tokens on states), walked as a product
automaton over configs (pinvState, lexState, tokens).
Targets DELETION specifically (t->∅ / _d, so sat+d = "satd" -> "sad") — the case
every prior approach died on; substitution would pass and lie. Three tests:
- recovers the deleted t: "sad" -> [sat, -d];
- restoration is LEXICON-CONSTRAINED: with a bare root "sad" too, exactly the two
valid analyses {sat+-d, sad}, no garbage — the property the runtime inverse
lacked (it restored everywhere -> ⁿmeⁿnⁿpuⁿlis);
- non-word yields nothing.
Resolves Blocker 1 (tokens stay state-based in the config — no output-tape hack)
and Blocker 3 (no Fst.Compose — the walk unifies Pinv output against Lex input
directly). Only Blocker 2 (building Pinv) remains. LEVER_2.md updated to the
lazy-composition design.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
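The lazy product walk described in the spike can be illustrated with a small self-contained sketch for the deletion rule t->∅ / _d. It is a Python toy, not the PR's code: the two-state inverse transducer, the three-stage lexicon acceptor, and every name below are assumptions chosen to keep the example short.

```python
def analyze(surface, roots, suffixes, alphabet="satd"):
    """Walk Pinv x Lex lazily over configs (index, pinv_state, lex_state, tokens)."""
    tables = {"root": roots, "suffix": suffixes}

    def pinv_arcs(state):
        # Arcs: (surface input, or None for an epsilon input; underlying output; next state).
        if state == 0:
            for c in alphabet:
                yield c, c, 0      # identity: the surface segment is the underlying one
            yield None, "t", 1     # epsilon input: restore a deleted t ...
        else:
            yield "d", "d", 0      # ... but only immediately before a surface d

    def lex_step(lex, sym):
        # Consume one underlying symbol if some morpheme in this stage continues with it.
        stage, partial = lex
        nxt = partial + sym
        if any(form.startswith(nxt) for form in tables.get(stage, {})):
            yield (stage, nxt)

    def lex_epsilon(lex):
        # Morpheme-final move: accrue the token and advance to the next stage.
        stage, partial = lex
        if stage == "root" and partial in roots:
            yield ("suffix", ""), roots[partial]
        elif stage == "suffix" and partial in suffixes:
            yield ("done", ""), suffixes[partial]

    results, seen = set(), set()
    stack = [(0, 0, ("root", ""), ())]
    while stack:
        cfg = stack.pop()
        if cfg in seen:
            continue
        seen.add(cfg)
        i, p, lex, toks = cfg
        if i == len(surface) and p == 0 and lex in (("suffix", ""), ("done", "")):
            results.add(toks)
        for lex2, tok in lex_epsilon(lex):
            stack.append((i, p, lex2, toks + (tok,)))
        for inp, out, p2 in pinv_arcs(p):
            if inp is None:
                i2 = i
            elif i < len(surface) and surface[i] == inp:
                i2 = i + 1
            else:
                continue
            for lex2 in lex_step(lex, out):   # Pinv output must match a lexicon arc
                stack.append((i2, p2, lex2, toks))
    return results
```

The restoration arc is tried at every position, but a restored "t" survives only where the lexicon can consume it, which is the lexicon-constrained behaviour the spike tests for. One simplification to note: the identity arcs also accept a surface "t" directly before "d", which the deletion rule could never produce, so this toy inverse is a superset that a verify step would still need to prune.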
…ry deletion

Build the consuming engine for forward FST∘FST composition (LEVER_2.md), proven end-to-end with real HC types.
- InversePhonology: a surface→underlying transducer (states + arcs carrying a surface-input FS, null = ε-input restoration, and an underlying-output FS).
- FstTemplateAnalyzer.AnalyzeComposed: lazy product walk of Pinv ⊗ the underlying morphotactic acceptor over configs (pinvState, lexState, tokens). Pinv consumes surface and emits underlying, which must unify a lexicon arc (advancing it and accruing its token); the closure handles both lexicon ε-arcs and Pinv ε-input restorations. Tokens stay state-based (Blocker 1 dissolved); no Fst.Compose needed (Blocker 3 moot), since the walk unifies Pinv output against Lex input directly.

Test LeverTwo_LazyComposition_RecoversBoundaryDeletion_RealTypes: a kd-suffix whose k deletes before d surfaces as "d"; "sagd" recovers [sag, KD] by restoring the deleted k, constrained by the lexicon (the over-restoration that broke the runtime inverse is pruned in lockstep), sound (⊆ engine), and a non-word yields nothing. This is the deletion case (not substitution, which would pass and lie).

All three blockers worked through: 1 & 3 resolved; Blocker 2's consuming engine built and proven incl. deletion. The remaining frontier is the general Pinv COMPILER (auto-build InversePhonology from grammar rules + cascades); the spikes use a hand-built Pinv. LEVER_2.md records proven-vs-frontier honestly. Suite 110 green; CSharpier clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…scade

The single-rule deletion spike would pass and lie about cascades: the real meN- case is assimilation + deletion interacting (what produced ⁿmeⁿnⁿpuⁿlis). So LazyComposition_RecoversOpaqueTwoRuleCascade hand-builds a Pinv for a feeding/opacity cascade: N→n/_t then t→∅/n_, underlying aN+t = "aNt" -> "ant" -> "an" (the t that triggered the assimilation then deletes; counterbleeding opacity).

Result: it works. A bounded-context Pinv that COUPLES un-assimilation (n→N) with deletion-restoration (ε→t) through a state recovers the opaque "aNt" from "an" -> [aN, -t], lexicon-constrained. A bounded transducer CAN represent the inverse of an opaque cascade, the case that defeated every prior approach, so the Lever 2 architecture is real for cascades.

Corollary recorded in LEVER_2.md: the Pinv COMPILER must be B-direct (compile each rule to a transducer, compose the cascade, invert), NOT naive context-probing, because the t-deletion is conditioned on the surface n that assimilation fed from N, which an underlying-context probe would misread.

Honest headline: architecture proven incl. cascades with HAND-BUILT inverses; the phonology→transducer compiler is UNSTARTED, so Lever 2 does not yet accelerate a real grammar. Lever 1 (42→69 on Indonesian) remains the only real-grammar accelerator. Suite 111 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
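The forward direction of the two-rule cascade above can be traced with ordered regex rewrites. This is a Python sketch for illustration only; the regex encoding of the two rules and the `derive` helper are assumptions, not how HC represents phonological rules.

```python
import re

# Ordered cascade from the commit message, encoded as (label, pattern, replacement).
CASCADE = [
    ("N -> n / _t", r"N(?=t)", "n"),   # assimilation before t
    ("t -> 0 / n_", r"(?<=n)t", ""),   # deletion after a surface n
]


def derive(underlying, rules=CASCADE):
    """Apply each rule once, in order, and return every intermediate form."""
    steps = [underlying]
    for _label, pattern, replacement in rules:
        steps.append(re.sub(pattern, replacement, steps[-1]))
    return steps
```

`derive("aNt")` walks "aNt" to "ant" to "an". Reversing the rule order leaves the form at "ant", because deletion finds no surface "n" to condition on until assimilation has run. That dependency on the intermediate surface "n" is why the commit concludes that a context probe over underlying forms would misread the deletion environment.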
All of Indonesian's closure escapes are reduplication (the meN- nasal substitution is regular; "Nasalization in reduplication" is a phonological rule, not a closure escape). Reduplication over a fixed lexicon with bounded copy is finite-hence-regular (compile-replace), so:
- GrammarFstClosure.Analyze(language, boundedReduplication: true): opt-in flag that treats reduplication/infix as FST-able feeders (not escapes) under the fixed-lexicon/bounded-copy assertion. A grammar whose only escapes are reduplication/infix then becomes FstClosed.
- CachingMorphologicalAnalyzer.FromLanguage gains forwardSynthesis + boundedReduplication params, threading the flag into the closure check and wiring the forward-synth precompile into the composite.
- ForwardSynthesisProposer.CoveredOps broadened to claim circumfix (CircumfixPrefix/Suffix) and process: synthesis already produces them, so a tabulated entry is genuine coverage; this was the missing piece for CoversAllConstructs.

Measured (Indonesian, forwardSynthesis+boundedReduplication): closed False→True, CoversAllConstructs True, parity 69/70 → CERTIFIES on the covered corpus → default path is FST-only (engine skipped). The 1 holdout (mengamat-amati, a 3-affix realizational combo) is a coverage-depth gap, not closure. Soundness unaffected (verify + parity gate; flags are explicit opt-in assertions). CI test GrammarFstClosure_BoundedReduplication_TreatsReduplicationAsRegular; suite 112 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Accelerates HermitCrab morphological analysis with a precompiled FST, behind a caching front end
that keeps the engine as the source of truth. No second morphology engine, no reimplemented constraints.
Entry point: `CachingMorphologicalAnalyzer` (fast + slow + cache):
- `AnalyzeWord` = guaranteed complete (backwards-compatible). On a certified grammar (FST-closed per the census and FST==engine set-parity over a corpus) the FST is proven complete, so this runs FST-only with no full search. Otherwise it returns the cached engine result, or runs the engine on a miss and caches it. Either way: complete.
- `AnalyzeWordFast` = opt-in immediate. Cached-complete if warm (or if the grammar is certified), else a sound but possibly under-generating verified-FST result, flagged `IsComplete=false`. Never runs the engine.
- `Warm(corpus)` fills the cache in parallel; `AnalysisCacheSerializer` persists it across sessions (fixed corpora), keyed by `MorphemeRegistry` and guarded by a grammar-version string (stale cache rejected → re-warm). Confirmed non-words are cached too.
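The contract of the two entry points can be summarized with a toy model. This is a Python sketch, not the C# `CachingMorphologicalAnalyzer`: the class, its constructor arguments, and the sequential `warm` are assumptions made for illustration (the PR warms the cache in parallel and also persists it).

```python
class CachingAnalyzer:
    """Toy model of the complete path and the opt-in fast path."""

    def __init__(self, engine, verified_fst, certified=False):
        self.engine = engine              # complete but slow: word -> set of analyses
        self.verified_fst = verified_fst  # sound, but may under-generate
        self.certified = certified        # True when the FST is proven complete
        self.cache = {}

    def analyze_word(self, word):
        """Always complete: FST-only when certified, else cache or engine."""
        if self.certified:
            return self.verified_fst(word)
        if word not in self.cache:
            # Non-words (empty result) are cached as well.
            self.cache[word] = self.engine(word)
        return self.cache[word]

    def analyze_word_fast(self, word):
        """Never runs the engine; returns (analyses, is_complete)."""
        if self.certified:
            return self.verified_fst(word), True
        if word in self.cache:
            return self.cache[word], True
        return self.verified_fst(word), False

    def warm(self, corpus):
        for word in corpus:   # sequential here for simplicity
            self.analyze_word(word)


def toy_engine(word):
    return {"sag": {("sag",)}, "sagsag": {("sag", "RED")}}.get(word, set())


def toy_fst(word):
    return {"sag": {("sag",)}}.get(word, set())
```

On a cold cache, `analyze_word_fast("sagsag")` returns an empty, incomplete result because the toy FST does not cover reduplication; after `analyze_word("sagsag")` or a `warm` call the same fast query returns the engine's answer flagged complete.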
The FST pipeline behind the fast path:
`FstTemplateAnalyzer` (proposer; immutable, shared; derivation depth tunable) → `VerifiedFstAnalyzer` (confirms each candidate by restricted re-analysis, `FstReplay`, against HC's own engine from a `MorpherPool`; emits the genuine HC analysis) → `CompleteHybridMorpher` (certified→FST / else→engine, with per-word `AnalyzeWord(word, useFst)`).
`GrammarFstAdvisor` + `GrammarFstClosure` are the grammar census/linter (this PR's original core).
Guarantees
always complete. The fast path is sound (0 false positives on 50 generated non-words), a yes-only
detector for "is this a word" (can under-generate, even to zero on un-built constructs), which the
cache/certification corrects.
corpus that certifies, the default complete path runs at ~18 ms/word (~11×).
engine/cache, no silent miss).
mismatches on 200 words (and a CI test).
Tests
CI unit tests on the in-repo toy grammar cover the proposer, the verify chain, the caching/persistence
layer (default==engine, provisional→complete after warm, certified-skip, round-trip + version guard),
soundness/negatives, the category fix, per-word opt-out, and thread-safety; plus advisor/closure. An
`[Explicit]` benchmark measures speed/parity/soundness/concurrency/certification on an external grammar.
Full unit suite green (93).
Honest limitations / out of scope
depth-3 derivation, one suffix-order case. They resolve via the engine/cache (no silent miss) and keep
the full corpus from certifying. Compounding is the highest-value next coverage build (an additive,
shared-root-chain design) and is scoped as a focused follow-on.
and abandoned; completeness is delivered by certification + cache + engine.
residual.
Design + research record:
`docs/HERMITCRAB_FST_PLAN.md` (§13 = caching front end); advisor: `docs/HERMITCRAB_FST_ADVISOR.md`.
🤖 Generated with Claude Code