Skip to content

Improve scrapers content extraction#19

Open
mikesiez wants to merge 6 commits into
mainfrom
improve-scrapers-content-extraction
Open

Improve scrapers content extraction#19
mikesiez wants to merge 6 commits into
mainfrom
improve-scrapers-content-extraction

Conversation

@mikesiez

Copy link
Copy Markdown

Implemented scraping with bs4 instead of trifilatura.
Reduced unnecessary elements being scraped, and fixed necessary elements being ignored / improperly parsed such as links & accordions.
Added offline test files in tests/ingestion/fixtures and a script to run assertions at tests/ingestion/offline_scraper.py
Updated dependency list to include bs4 via uv add beautifulsoup4

Linked to issue #6

mikesiez and others added 3 commits June 14, 2026 20:12
@mikesiez mikesiez linked an issue Jun 20, 2026 that may be closed by this pull request
@mikesiez mikesiez requested a review from AJaccP June 20, 2026 15:37
Comment thread pyproject.toml

FIXTURES = Path(__file__).parent / "fixtures"

# run with python -m tests.ingestion.scraper_offline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pytest's test file discovery expects filenames to start with test, e.g. test_repository.py, which is why running make test currently skips this file. Test functions also don't need to be called at the end of the file

  • Rename this file to test_scraper.py
  • Remove the function calls on lines 59-60, the functions just needs to be defined
  • In addition to the SCS page fixtures, add fixtures for 1-2 ccss resource pages too (e.g. one faq and one article) since they have a different structure


assert title == "Courses and Registration (B.Cyber.) - School of Computer Science"
assert "Electives and Prohibited Courses" in text
assert "Skip to Main Content" not in text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion could be a bit more broad (e.g. some pages say "Skip to Content" instead of "Skip to Main Content", so even if that line isn't skipped, the assertion passes)

A better approach may be having a list of "junk" phrases that should be asserted to be skipped, e.g. :

for junk in ["Skip to Content", "Skip to Main Content", "Search this website"... <any other commonly present phrases>]:
    assert junk not in text

Comment thread src/ingestion/scrapers/html_scraper.py
Comment thread src/ingestion/scrapers/html_scraper.py
html = BeautifulSoup(html, "html.parser")
# removing elements by class
for el in html.select(
".footer, .global-nav, .navigation, " ".topbar, .content__meta, .visuallyhidden"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting this and the tag list on line 18 to separate constant lists at the top can make it easier to maintain if the lists grow in the future, something like

_BOILERPLATE_TAGS = ["header", "footer", "nav", "script", "style", "noscript"]
_BOILERPLATE_SELECTORS = (
    ".navbar-container, .navbar, .footer, .global-nav, "
    ".navigation, .topbar, .content__meta, .visuallyhidden"
)

return text, title
import httpx
from bs4 import BeautifulSoup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewrite dropped the logger and logs for scrape starting/completing. For observability, please re add the logger and appropriate logs throughout the scraper code

else:
link.string = f"{link.string} [Link: {href}]"

text = html.get_text("\n", strip=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_text("\n", strip=True) seems to put every text node (including tags like <a>, <b>, <i> etc on their own new line, which can lead to sentences getting fragmented across lines. This could interact badly with the chunker's \n splitter and affect answer formatting.

I tried this and it normalizes the text content well:

Add a _BLOCK_TAGS list constant to the top for tags that can define new text blocks but skips common inline tags like <strong>, <i> etc. , along with the boilerplate list constants:

_BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
               "tr", "section", "article", "ul", "ol", "table", "br", "blockquote"]

(there may be more tags worth including)

Then:

for tag in root.find_all(_BLOCK_TAGS):                
        tag.append("\n")                                          # append a newline as each block's last child

    text = content_root.get_text(" ", strip=False)                # flatten: space within blocks, \n between blocks
    text = re.sub(r"[ \t]+", " ", text)                           # collapse runs of spaces/tabs (keep \n)
    text = re.sub(r" *\n *", "\n", text)                          # strip spaces around each newline
    text = re.sub(r"\n{3,}", "\n\n", text)                        # limit to 2 \n
    text = text.strip()

if href[-3:] == "pdf":
link.string = f"{link.string} [PDF: {href}]"
else:
link.string = f"{link.string} [Link: {href}]"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • href[-3:] == "pdf" can miss cases like ".PDF"
  • link.string seems to be None when an tag wraps nested html tags (e.g. FAQ page's dates and regulations link renders as None [Link: <the actual link>]
  • Using urllib can be useful here

This snippet should help:

from urllib.parse import urlsplit  # at the top with the rest of the imports

for link in root.find_all("a"):
    href = (link.get("href") or "").strip()
    if href and not href.startswith("#"):
        text = link.get_text(" ", strip=True)
        is_pdf = urlsplit(href).path.lower().endswith(".pdf")
        marker = "PDF" if is_pdf else "Link"
        link.replace_with(f"{text} [{marker}: {href}]")

Comment on lines +6 to +8
cut_marker = "<!-- close main-wrapper"
if cut_marker in html:
html = html.split(cut_marker)[0]

@AJaccP AJaccP Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cut_marker approach may not always work (e.g. some SCS pages or CCSS pages don't have the <!-- close main-wrapper string.

From looking at a few pages, carleton webpages seem to always have a <main>...</main> block or div with id main-content with the actual page content. Scoping extraction to those for most ...carleton.ca/... links would be more robust.

CCSS resource pages however don't have a <main> block or id, so we should have fallback behaviour for those and any other pages without one.

This snippet should help:

soup = BeautifulSoup(html, "html.parser")
root = soup.select_one("main, [role=main], #main-content")
if root is None:  # CCSS-style pages have no <main>
    for t in soup(_BOILERPLATE_TAGS):
        t.decompose()
    for el in soup.select(_BOILERPLATE_SELECTORS):
        el.decompose()
    root = soup.body or soup

This would let you drop the cut_marker string. Out of curiosity though did you try and find issues with scoping extraction to the <main> block? Just want to be sure I'm not missing any reasoning there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve scraper's content extraction

2 participants