Fix LinkPreview dropping links by truncating before deduplication#2032

Open
jichaowang02-lang wants to merge 1 commit into
unclecode:developfrom
jichaowang02-lang:fix/link-preview-dedup-before-maxlinks

@jichaowang02-lang

Summary

LinkPreview._filter_links applies the max_links limit before
deduplicating, so the limit is spent on duplicate copies of the same URL
instead of distinct URLs.

# Limit number of links
if max_links > 0 and len(filtered_urls) > max_links:
    filtered_urls = filtered_urls[:max_links]      # truncate first ...

# Remove duplicates while preserving order
seen = set(); unique_urls = []
for url in filtered_urls:                            # ... then dedup
    ...

Link extraction routinely produces the same href many times (repeated
nav / footer / CTA links), so the [:max_links] slice keeps duplicate copies
and the later dedup collapses them — yielding far fewer than max_links
unique URLs.

Example

Internal hrefs ["a", "a", "a", "b", "c", "d"] with max_links=3:

Result:
expected: ["a", "b", "c"] (3 distinct)
actual (before fix): ["a"] (the slice takes the three "a" copies, then dedup leaves 1)
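
The example above can be reproduced in isolation. The following is a minimal sketch, not the actual `_filter_links` body: `truncate_then_dedup` is a hypothetical standalone helper that only mirrors the order of operations quoted in the Summary (slice, then dedup).

```python
def truncate_then_dedup(urls, max_links):
    """Hypothetical helper mirroring the old order: slice first, then dedup."""
    # Limit number of links (applied to the raw list, duplicates included)
    if max_links > 0 and len(urls) > max_links:
        urls = urls[:max_links]

    # Remove duplicates while preserving order
    seen = set()
    unique_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique_urls.append(url)
    return unique_urls


hrefs = ["a", "a", "a", "b", "c", "d"]
print(truncate_then_dedup(hrefs, 3))  # ['a']
```

The slice keeps `["a", "a", "a"]`, so "b" and "c" are gone before deduplication ever runs.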

Fix

Deduplicate first, then apply max_links, so the limit counts distinct
URLs. Order is still preserved, and cases where the unique count is already
below max_links are unchanged.
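
A minimal sketch of the reordered logic, assuming a plain list of URL strings. `dedup_then_limit` is a hypothetical name, and `dict.fromkeys` is used here as a compact order-preserving dedup; the PR itself may well keep the existing `seen` set loop.

```python
def dedup_then_limit(urls, max_links):
    """Hypothetical helper: dedup first, then cap at max_links distinct URLs."""
    # dict keys keep insertion order (Python 3.7+), so this drops
    # later duplicates while preserving first-seen order.
    unique_urls = list(dict.fromkeys(urls))

    # Limit number of links; max_links <= 0 means "no limit",
    # matching the guard quoted in the Summary.
    if max_links > 0 and len(unique_urls) > max_links:
        unique_urls = unique_urls[:max_links]
    return unique_urls


hrefs = ["a", "a", "a", "b", "c", "d"]
print(dedup_then_limit(hrefs, 3))  # ['a', 'b', 'c']
```

With the same input as the Example, the limit is now spent on three distinct URLs rather than three copies of "a".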

Testing

Adds TestFilterLinksDeduplication to tests/test_merge_head_data_scoring.py:

$ pytest tests/test_merge_head_data_scoring.py -q
13 passed

The dedup-before-limit case fails on the current code and passes with the
fix; the "fewer uniques than the limit" and "cap distinct URLs" cases are
preserved.
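
The three cases named above could look roughly like the sketch below. It is not the test code from this PR: the class name, method names, and the `filter_urls` stand-in are all hypothetical, and the real tests presumably exercise `LinkPreview._filter_links` directly.

```python
def filter_urls(urls, max_links):
    """Hypothetical stand-in for the fixed logic: dedup, then limit."""
    seen = set()
    unique_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique_urls.append(url)
    if max_links > 0 and len(unique_urls) > max_links:
        unique_urls = unique_urls[:max_links]
    return unique_urls


class TestFilterLinksDeduplicationSketch:
    def test_dedup_before_limit(self):
        # Duplicates at the head of the list must not consume the budget.
        urls = ["a", "a", "a", "b", "c", "d"]
        assert filter_urls(urls, 3) == ["a", "b", "c"]

    def test_fewer_uniques_than_limit(self):
        # Unique count below max_links: nothing is truncated.
        assert filter_urls(["a", "a", "b"], 5) == ["a", "b"]

    def test_cap_distinct_urls(self):
        # Already-distinct input is still capped at max_links.
        assert filter_urls(["a", "b", "c", "d"], 2) == ["a", "b"]
```

Only the first case distinguishes the two orderings; the other two pass either way and guard against regressions in the unchanged behavior.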

LinkPreview._filter_links applied the max_links limit before removing
duplicate URLs. Link extraction routinely yields the same href many times
(repeated nav / footer / CTA links), so the [:max_links] slice spent the
budget on duplicate copies and the subsequent dedup then collapsed them,
returning far fewer than max_links unique URLs. With three duplicate copies
of one URL at the head of the list and max_links=3, the result was a single
URL instead of three distinct ones.

Deduplicate first, then apply max_links so the limit counts distinct URLs.

Adds a TestFilterLinksDeduplication regression: the dedup-before-limit case
fails on the old code and passes with the fix.
Copilot AI review requested due to automatic review settings June 21, 2026 18:00

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
