Skip to content

Fix fast_urljoin mangling root-absolute paths on pages with a path#2030

Open
jichaowang02-lang wants to merge 1 commit into
unclecode:developfrom
jichaowang02-lang:fix/fast-urljoin-root-absolute
Open

Fix fast_urljoin mangling root-absolute paths on pages with a path#2030
jichaowang02-lang wants to merge 1 commit into
unclecode:developfrom
jichaowang02-lang:fix/fast-urljoin-root-absolute

Conversation

@jichaowang02-lang

Copy link
Copy Markdown

Summary

fast_urljoin() mangles root-absolute links (/path) when the base URL has
a path of its own — which is the normal case for a crawled sub-page. A
root-absolute reference must replace the base's entire path (RFC 3986), but the
function appended it to the full base:

fast_urljoin("https://example.com/docs/guide.html", "/api/reference")
#   ->  "https://example.com/docs/guide.html/api/reference"   ❌ (broken)
urllib.parse.urljoin("https://example.com/docs/guide.html", "/api/reference")
#   ->  "https://example.com/api/reference"                   ✅

fast_urljoin documents urllib.parse.urljoin as its fallback, so this also
diverges from its own intended contract. Because
DefaultMarkdownGenerator.convert_links_to_citations() joins every link with
fast_urljoin(base_url, href) (and base_url is the actual crawled page URL,
which routinely carries a path), every root-absolute link in a crawled page
produced a broken citation/reference URL
in the generated markdown.

The fix resolves root-absolute paths against base's scheme://authority root,
keeping the fast path for the common case and deferring to urljoin for
unusual bases (e.g. a query/fragment but no path). Output now matches urljoin
for every case; links on a bare-host base (https://example.com) were already
correct and stay correct.

List of files changed and why

  • crawl4ai/markdown_generation_strategy.py — fix the root-absolute branch of
    fast_urljoin to join against the host root instead of the full base path.
  • tests/test_fast_urljoin.py — new regression tests.

How Has This Been Tested?

tests/test_fast_urljoin.py (new) asserts:

  • Parity with urllib.parse.urljoin across root-absolute hrefs on
    path-bearing bases (the bug), root-absolute hrefs on bare-host bases (already
    correct), and relative/../sibling forms.
  • The specific regression: /api/reference on …/docs/guide.html resolves to
    https://example.com/api/reference, not …/guide.html/api/reference.
  • Pre-existing fast paths (absolute URLs, mailto:) are unchanged.
  • End-to-end through the public convert_links_to_citations, the References
    section now contains the corrected URL.
$ pytest tests/test_fast_urljoin.py -q
13 passed

Confirmed the tests fail on the current code (7 failed: all root-absolute
cases + the end-to-end citation check) and pass with this fix (13 passed).

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

fast_urljoin() handled a root-absolute href ("/path") by appending it to
the full base URL (base + url), which is only correct when base has no path
beyond the host. Per RFC 3986 a root-absolute reference replaces the base's
entire path, so for a real crawled sub-page the join was wrong:

    fast_urljoin("https://example.com/docs/guide.html", "/api/reference")
      ->  "https://example.com/docs/guide.html/api/reference"   (broken)
    urllib.parse.urljoin(...)  ->  "https://example.com/api/reference"

The function documents urljoin as its fallback, so this diverges from its own
intended contract. DefaultMarkdownGenerator.convert_links_to_citations() joins
every link with fast_urljoin(base_url, href), so every root-absolute link in a
crawled page produced a broken citation/reference URL.

Resolve root-absolute paths against base's scheme://authority root, keeping
the fast path for the common case and deferring to urljoin for unusual bases.

Adds tests/test_fast_urljoin.py asserting parity with urllib.parse.urljoin for
root-absolute, already-correct, and relative cases, plus an end-to-end check
through convert_links_to_citations. The root-absolute cases fail on the old
code and pass with this fix.
Copilot AI review requested due to automatic review settings June 21, 2026 16:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants