Skip to content

Avoid construct unnecessary string when query bigram frequency.#133

Merged
wengxt merged 1 commit into
masterfrom
optimize
Jun 20, 2026
Merged

Avoid construct unnecessary string when query bigram frequency.#133
wengxt merged 1 commit into
masterfrom
optimize

Conversation

@wengxt

@wengxt wengxt commented Jun 20, 2026

Copy link
Copy Markdown
Member

Summary by CodeRabbit

Release Notes

  • New Features

    • Added variadic trie traverse/traverseRaw overloads for multi-argument traversal with early termination when no path is possible.
  • Bug Fixes

    • Improved history bigram key serialization and frequency matching for word/code combinations, including correct handling when either side has empty code.
  • Tests

    • Expanded trie traversal coverage for raw results, noValue, and noPath.
    • Added unit tests covering history bigram behavior with empty code fields.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 485a56d7-c509-4ba0-a38f-f5fb5176bfa6

📥 Commits

Reviewing files that changed from the base of the PR and between e84fea6 and 57ddfd2.

📒 Files selected for processing (4)
  • src/libime/core/datrie.h
  • src/libime/core/historybigram.cpp
  • test/testhistorybigram.cpp
  • test/testtrie.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/testtrie.cpp
  • src/libime/core/datrie.h

📝 Walkthrough

Walkthrough

Adds C++20-constrained variadic traverse/traverseRaw overloads to DATrie that iterate over multiple string segments and stop on NO_PATH. Rewrites historybigram.cpp to use new two-argument bigram frequency APIs backed by these overloads, replacing pre-built combined key strings throughout serialization, lookup, increment, and decrement paths. Adds test coverage for empty code fields.

Changes

DATrie variadic API and historybigram bigram key rework

Layer / File(s) Summary
DATrie variadic traverse/traverseRaw overloads
src/libime/core/datrie.h
Adds <concepts> include and two variadic template overloads (traverse, traverseRaw) constrained with std::convertible_to<Args, std::string_view>; each iterates over arguments calling single-string traverseRaw and stops when NO_PATH is reached.
DATrie traverseRaw tests
test/testtrie.cpp
Changes "aaab" stored value to 0xeadbeef, scopes the no-value erase block, and adds three scoped traverseRaw checks: stored value equality, no-value intermediate node, and no-path dead-end.
Bigram key serialization helpers
src/libime/core/historybigram.cpp
Renames wordAndCodeToStringwordWithCodeToString and replaces bigramWordWithCode with bigramWordWithCodeToString, appending code segments only when at least one side has non-empty code.
WeightedTrie frequency lookup rework
src/libime/core/historybigram.cpp
Reworks WeightedTrie::freq(WordWithCodeView) branching and introduces WeightedTrie::freq(prev, next) to traverse the word-pair trie segment then conditionally match code content; updates incFreq and decFreq overloads for both single-word-with-code and bigram forms.
HistoryBigramPool frequency and serialization
src/libime/core/historybigram.cpp
Updates save() to use wordWithCodeToString; changes bigramFreq to two-argument form forwarding to bigram_.freq(s1, s2); removes combined bigram key construction from incBigram/decBigram to call trie directly with individual word arguments.
HistoryBigram call site updates
src/libime/core/historybigram.cpp
Updates HistoryBigramPrivate::bigramFreq, containsBigram, and rawBigramFrequency to query pools using the new two-argument bigramFreq(prev, cur) API instead of pre-building combined bigram key strings.
Empty code variant tests
test/testhistorybigram.cpp
Adds testWithEmptyCode() and testWithEmptyAndNonEmptyCode() test functions to validate unigram and bigram frequencies when code fields are empty or mixed with non-empty codes; both are invoked in main() before existing tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 Hop, hop through the trie we go,
Variadic args in a row!
Bigrams split, no more combined,
Two-arg freq leaves old keys behind.
Empty codes now find their place—
The rabbit's work is crisp and clean! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main optimization in the PR: avoiding unnecessary string construction when querying bigram frequency. It accurately reflects the core changes across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch optimize

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/libime/core/historybigram.cpp (1)

66-67: 💤 Low value

Reserve underestimates by one byte when codes are present.

When codes are non-empty, the string uses three separators (lines 69, 73, 75) but the reserve only accounts for two. This may cause one reallocation.

🔧 Suggested fix
-    s.reserve(word1.size() + word2.size() + 1 + code1.size() + code2.size() +
-              1);
+    s.reserve(word1.size() + word2.size() + 1 + code1.size() + code2.size() +
+              2);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/libime/core/historybigram.cpp` around lines 66 - 67, The reserve() call
on the string variable does not account for all separators used when
constructing the string. The current calculation adds 1 twice for two
separators, but according to the code on lines 69, 73, and 75, three separators
are actually used. Update the reserve() calculation to add one additional byte
to account for the third separator that gets appended to the string.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/libime/core/datrie.h`:
- Around line 118-127: The `traverseRaw` template function with variadic Args
(lines 118-127) and the similar `traverse` function currently allow zero path
segments to be passed. When called with no arguments, the fold expression
executes zero times leaving result at its initial value, producing invalid
silent results. Add a compile-time constraint `sizeof...(Args) > 0 &&` to the
beginning of the requires clauses in both the `traverseRaw` variadic template
and the `traverse` variadic template to prevent empty variadic calls from being
instantiated.

In `@src/libime/core/historybigram.cpp`:
- Around line 156-171: The ends_with check for next.second in the lambda
function within trie_.foreach is causing false-positive matches because it only
verifies that codeInTrie ends with the queried code without checking the
separator boundary. To fix this, after computing codeInTrie using
trie().suffix(), verify not only that codeInTrie ends with next.second, but also
that there is a proper separator boundary (such as the expected delimiter
character) immediately before next.second in the codeInTrie string. This ensures
an exact match on the code2 field rather than matching any substring that
happens to end with the queried value.

---

Nitpick comments:
In `@src/libime/core/historybigram.cpp`:
- Around line 66-67: The reserve() call on the string variable does not account
for all separators used when constructing the string. The current calculation
adds 1 twice for two separators, but according to the code on lines 69, 73, and
75, three separators are actually used. Update the reserve() calculation to add
one additional byte to account for the third separator that gets appended to the
string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f61e8e55-a777-4350-be16-3c4d5fade16b

📥 Commits

Reviewing files that changed from the base of the PR and between f044cbe and d8d261e.

📒 Files selected for processing (3)
  • src/libime/core/datrie.h
  • src/libime/core/historybigram.cpp
  • test/testtrie.cpp

Comment thread src/libime/core/datrie.h
Comment thread src/libime/core/historybigram.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/libime/core/datrie.h`:
- Around line 124-127: In the lambda function `handle` (around line 124-127),
the return statement checks `isNoPath(result)` where `result` is a raw int32_t
value from traverseRaw. This check fails for non-int32_t value_type instances
(like DATrie<float>) due to implicit conversion issues. Replace the
`isNoPath(result)` check with a raw sentinel check that directly compares the
int32_t result against the CEDAR_NO_PATH sentinel value to correctly detect
dead-ends regardless of the DATrie's value_type.

In `@src/libime/core/historybigram.cpp`:
- Around line 173-183: The foreach callback in the else block contains a
condition that skips entries when len equals zero, but this incorrectly filters
out valid bigram entries that have an empty second code component (stored with
trailing separators like word1|word2\x02code1|). Remove or modify the `if (len
== 0) return true;` check in the callback lambda to allow these entries to be
included in the result. The check should not skip valid entries that
legitimately have zero length for their descendants, as these represent
intentional empty code2 values that should be returned by the freq() query.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e12038a4-1a6c-4fb6-bc21-947a861e872f

📥 Commits

Reviewing files that changed from the base of the PR and between d8d261e and e84fea6.

📒 Files selected for processing (3)
  • src/libime/core/datrie.h
  • src/libime/core/historybigram.cpp
  • test/testtrie.cpp

Comment thread src/libime/core/datrie.h
Comment thread src/libime/core/historybigram.cpp
@wengxt wengxt merged commit ff09f2d into master Jun 20, 2026
6 checks passed
@wengxt wengxt deleted the optimize branch June 20, 2026 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant