cornshakes/scrollsurf

Scrollsurf

Visit this app on my raspberry and read the diary!

Scrollsurf lets you scroll through Wikipedia article abstracts, like or dislike them, and visit the full articles on Wikipedia itself. The articles it shows you are randomly selected from the datasets you download (see Getting Started).

Getting Started

npm install

Before you run the app for the first time, you have to download the datasets that you want using the provided package scripts. The downloads take a long time, but one dataset is enough to run the app:

npm run download-vital-50000
npm run download-unusual
npm run download-good-articles
npm run download-featured-articles
npm run download-featured-pictures
npm run download-commons-featured-pictures

Then, you can categorize the articles by running

npm run categorize

Currently, that's not very useful - it just builds a huge category tree that you can look at. After downloading at least one dataset, you can

npm run dev

and go to http://localhost:3000

Integration Testing

All e2e tests run against a small example database (e2e/.data/). That database is created from the downloaded datasets using the test:e2e:create-db script. It is committed so that you don't have to download all datasets before being able to run e2e tests.

npm run test:e2e:create_db  # creates e2e test db, you don't have to do this
npm run test:e2e:setup      # downloads chromium for playwright
npm run test:e2e            # run all integration tests (seeds DB automatically)
npm run test:e2e:ui         # same, but with Playwright's interactive UI

Clicks, Likes & Dislikes

The feed is random, but influenced by user activity. Three signals are tracked per topic (e.g. Vital → History):

  • Like counts +1
  • Dislike counts −1
  • Following a link counts +0.5

These are averaged over seen articles of that topic, so a topic needs a few signals before it starts to move — one stray like won't change much.

Unseen articles are then drawn with weights based on the average affinity of their topics, i.e. liked topics show up more often, disliked topics show up less.

Without any votes (or without the consent cookie) the feed is random.

The weighting strength can be adjusted using the FEED_AFFINITY_STRENGTH env var (0 = random).

Example

Say you've scrolled for a while and your history per topic looks like this:

| Topic | Seen | Likes | Dislikes | Clicks | Affinity = (likes + 0.5·clicks − dislikes) / (seen + 5) |
|---|---|---|---|---|---|
| Vital → History | 15 | 6 | 0 | 2 | (6 + 1 − 0) / 20 = 0.35 |
| Vital → Sports | 15 | 0 | 6 | 0 | (0 + 0 − 6) / 20 = −0.30 |
| Vital → Arts | 4 | 1 | 0 | 0 | (1 + 0 − 0) / 9 = 0.11 |
| anything you haven't voted on | | | | | 0 |

The + 5 in the denominator is the smoothing: the lone Arts like only gets a third of the affinity of the six History likes, even though it's a 100% like rate.
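The affinity column can be reproduced with a few lines of TypeScript. This is only a sketch of the formula from the table, not the code in src/lib/db/affinity.ts (there the calculation happens in SQL); the `TopicHistory` interface, its field names, and `topicAffinity` are made up for illustration.

```typescript
// Hypothetical shape of one row of the history table; names are assumptions.
interface TopicHistory {
  seen: number;
  likes: number;
  dislikes: number;
  clicks: number;
}

const CLICK_WEIGHT = 0.5; // following a link counts +0.5
const SMOOTHING = 5; // the "+ 5" in the denominator

// Smoothed affinity: net signal divided by (seen + 5).
function topicAffinity(h: TopicHistory): number {
  return (h.likes + CLICK_WEIGHT * h.clicks - h.dislikes) / (h.seen + SMOOTHING);
}

const historyTopic = topicAffinity({ seen: 15, likes: 6, dislikes: 0, clicks: 2 });
const sportsTopic = topicAffinity({ seen: 15, likes: 0, dislikes: 6, clicks: 0 });
const artsTopic = topicAffinity({ seen: 4, likes: 1, dislikes: 0, clicks: 0 });

console.log(historyTopic.toFixed(2), sportsTopic.toFixed(2), artsTopic.toFixed(2));
// prints "0.35 -0.30 0.11"
```

A topic with no seen articles comes out as 0 / 5 = 0, which matches the last row of the table.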

Each unseen article then gets a weight of exp(2 · affinity) (the 2 is FEED_AFFINITY_STRENGTH):

| Article tagged | Mean affinity | Weight |
|---|---|---|
| History | 0.35 | exp(0.70) ≈ 2.0 |
| Sports | −0.30 | exp(−0.60) ≈ 0.55 |
| Arts | 0.11 | exp(0.22) ≈ 1.25 |
| History and Sports | (0.35 − 0.30) / 2 = 0.025 | exp(0.05) ≈ 1.05 |
| no voted topics | 0 | exp(0) = 1.0 |

The weight is the article's relative chance per feed slot: a History article is about twice as likely to appear as a neutral one, and about 3.7× as likely as a Sports one — but even Sports articles keep showing up at roughly half the neutral rate. An article tagged with both a liked and a disliked topic lands back near neutral, because affinities are averaged across its topics.
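The weight column can be checked the same way. A minimal sketch, assuming the affinities of an article's topics are already known; `articleWeight` and the local `STRENGTH` constant are illustrative names, with `STRENGTH` standing in for the value of FEED_AFFINITY_STRENGTH used in the example.

```typescript
const STRENGTH = 2; // stands in for FEED_AFFINITY_STRENGTH from the example

// Weight of one article: exp(strength * mean affinity of its topics).
// Topics without votes are passed in as 0; no topics at all means weight 1.
function articleWeight(topicAffinities: number[], strength: number = STRENGTH): number {
  if (topicAffinities.length === 0) return 1;
  const sum = topicAffinities.reduce((acc, a) => acc + a, 0);
  return Math.exp(strength * (sum / topicAffinities.length));
}

const historyWeight = articleWeight([0.35]);
const sportsWeight = articleWeight([-0.3]);
const mixedWeight = articleWeight([0.35, -0.3]);

console.log(historyWeight.toFixed(2), sportsWeight.toFixed(2), mixedWeight.toFixed(2));
// prints "2.01 0.55 1.05"
console.log((historyWeight / sportsWeight).toFixed(2));
// prints "3.67", the "about 3.7×" from the text
```

With a strength of 0 every weight is exp(0) = 1, which is why 0 means a plain random feed.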

The SQL behind it

There are no per-topic queries and no mixing of result sets in TypeScript — the whole draw happens inside one SELECT per item type. The statement is assembled from shared SQL fragments in src/lib/db/affinity.ts (the constants from the example are baked into the string; only $user_id and $limit are bound at query time) and chains three CTEs before the actual selection:

WITH clicked AS (              -- distinct items you clicked links on
  SELECT DISTINCT item_id FROM user_clicks WHERE user_id = $user_id ...
),
topic_affinity AS (            -- the first table from the example:
  SELECT dataset, topic,       -- one GROUP BY over your seen items
         (likes + 0.5*clicks - dislikes) / (seen + 5) AS affinity
  FROM user_articles JOIN article_topics ... LEFT JOIN clicked ...
  WHERE user_id = $user_id
  GROUP BY dataset, topic
),
item_affinity AS (             -- the second table: AVG over each item's topics
  SELECT article_id AS item_id, AVG(COALESCE(affinity, 0)) AS affinity
  FROM article_topics LEFT JOIN topic_affinity ...
  GROUP BY article_id
)
SELECT a.* FROM articles a
LEFT JOIN item_affinity ia ON ia.item_id = a.id
WHERE <unseen, dataset enabled>
ORDER BY -ln(random_0_to_1) / exp(2 * ia.affinity)   -- the weighted draw
LIMIT $limit

The ORDER BY line is the whole sampling trick (Efraimidis–Spirakis): every candidate row draws its own uniform random number, the weight stretches it, and taking the smallest n keys is mathematically the same as drawing n items without replacement with probability proportional to weight. So the "randomness" and the "weighting" live in the same expression — there's no second pass, no shuffle in TS.
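For readers who would rather see the trick outside of SQL, here is the same idea in TypeScript. It is a sketch of the sampling technique only, not code from this repo; `Weighted`, `weightedSample` and the injectable `rng` parameter are assumptions made for the example.

```typescript
interface Weighted<T> {
  item: T;
  weight: number; // must be > 0, which exp(...) always is
}

// Draw n items without replacement, with probability proportional to weight.
// Each candidate gets the key -ln(u) / weight; the n smallest keys win.
function weightedSample<T>(
  candidates: Weighted<T>[],
  n: number,
  rng: () => number = Math.random,
): T[] {
  return candidates
    // 1 - rng() lies in (0, 1], so the logarithm never sees 0.
    .map(({ item, weight }) => ({ item, key: -Math.log(1 - rng()) / weight }))
    .sort((a, b) => a.key - b.key)
    .slice(0, n)
    .map(({ item }) => item);
}

// Usage with the weights from the example; the result varies per run.
const page = weightedSample(
  [
    { item: "History article", weight: 2.0 },
    { item: "Sports article", weight: 0.55 },
    { item: "Neutral article", weight: 1.0 },
  ],
  2,
);
console.log(page);
```

The sort makes this O(n log n) over all candidates, which is the same work the database does for the ORDER BY.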

For anonymous users $user_id is NULL, which matches nothing in the CTEs, so every article falls back to affinity 0 → weight 1 → plain uniform random, through the exact same query.

Pictures run the same query against their own tables (user_pictures, picture_topics). The only thing TypeScript does afterwards is interleave the two result lists at FEED_PICTURE_RATIO in src/lib/db/feed.ts — two queries per feed page, total.
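The interleaving step could look roughly like the sketch below. The exact meaning of FEED_PICTURE_RATIO is not spelled out here, so this assumes it is the fraction of feed slots that go to pictures (0.25 = every fourth slot); the `interleave` function is hypothetical and not taken from src/lib/db/feed.ts.

```typescript
// Merge two already-sampled lists, giving pictures roughly `pictureRatio`
// of the slots (ASSUMPTION: the ratio is a fraction between 0 and 1).
function interleave<A, P>(articles: A[], pictures: P[], pictureRatio: number): (A | P)[] {
  const feed: (A | P)[] = [];
  let a = 0;
  let p = 0;
  let credit = 0; // accumulated share of slots owed to pictures
  while (a < articles.length || p < pictures.length) {
    credit += pictureRatio;
    const wantPicture = credit >= 1;
    if ((wantPicture && p < pictures.length) || a >= articles.length) {
      feed.push(pictures[p++]);
      if (wantPicture) credit -= 1;
    } else {
      feed.push(articles[a++]);
    }
  }
  return feed;
}

console.log(interleave(["a1", "a2", "a3", "a4", "a5", "a6"], ["p1", "p2"], 0.25).join(","));
// prints "a1,a2,a3,p1,a4,a5,a6,p2"
```

When one list runs out, the rest of the other list is appended, so no sampled item is dropped.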

Future inspiration

These Main topic classifications are not what I have

Wikipedia:Contents

why not reddit

Wikipedia:Categorization
