Fix LinkPreview dropping links by truncating before deduplication#2032
Open
jichaowang02-lang wants to merge 1 commit into
Open
Fix LinkPreview dropping links by truncating before deduplication#2032jichaowang02-lang wants to merge 1 commit into
jichaowang02-lang wants to merge 1 commit into
Conversation
LinkPreview._filter_links applied the max_links limit before removing duplicate URLs. Link extraction routinely yields the same href many times (repeated nav / footer / CTA links), so the [:max_links] slice spent the budget on duplicate copies and the subsequent dedup then collapsed them, returning far fewer than max_links unique URLs. With three duplicate copies of one URL at the head of the list and max_links=3, the result was a single URL instead of three distinct ones. Deduplicate first, then apply max_links so the limit counts distinct URLs. Adds a TestFilterLinksDeduplication regression: the dedup-before-limit case fails on the old code and passes with the fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
LinkPreview._filter_linksapplies themax_linkslimit beforededuplicating, so the limit is spent on duplicate copies of the same URL
instead of distinct URLs.
Link extraction routinely produces the same
hrefmany times (repeatednav / footer / CTA links), so the
[:max_links]slice keeps duplicate copiesand the later dedup collapses them — yielding far fewer than
max_linksunique URLs.
Example
Internal hrefs
["a", "a", "a", "b", "c", "d"]withmax_links=3:["a", "b", "c"](3 distinct)["a"](slice takes the 3as, dedup → 1)Fix
Deduplicate first, then apply
max_links, so the limit counts distinctURLs. Order is still preserved, and cases where the unique count is already
below
max_linksare unchanged.Testing
Adds
TestFilterLinksDeduplicationtotests/test_merge_head_data_scoring.py:The dedup-before-limit case fails on the current code and passes with the
fix; the "fewer uniques than the limit" and "cap distinct URLs" cases are
preserved.