Add cluster support to multiply outbound write throughput by stilliard · Pull Request #46 · stilliard/github-task-list-completed

stilliard · 2026-06-17T19:00:23Z

Summary

Under load, the final checks.create call lags by minutes while the preceding GET lookups stay fast. Octokit's throttling plugin serializes writes at maxConcurrent: 1, minTime: 1000 (~1 write/sec), keyed per installation and queued in-memory per process. A busy installation backs up behind that gate. (Confirmed via logs + an LB experiment: pausing one node drained the backlog; no GitHub secondary-rate 403s involved.)

Key Changes

Add a small cluster.js entrypoint. The primary forks WEB_CONCURRENCY workers (default 4), each running the existing Probot app unchanged. Each worker gets its own in-memory write queue, so aggregate write throughput scales with worker count. The queue is timer-gated (idle-waiting), so this helps even on single-core droplets — it's process count, not core count, that matters here.

start → node cluster.js; start:single keeps the prior single-process path.
dev / dev-debug pinned to start:single so local smee dev doesn't fan out events across workers.
Crashed workers respawn; a burst of fast (startup) crashes makes the primary exit so the service manager can restart with backoff.
SIGTERM/SIGINT shuts workers down cleanly without respawn.
Worker mirrors probot run ./index.js exactly, so the default app's / and /probot routes are preserved.

Redis is intentionally left off — wiring redisConfig would share one queue fleet-wide and re-impose 1 write/sec globally per installation (worse for throughput).

Testing

Ensure WEBHOOK_PROXY_URL is unset in production (smee would cause fan-out there too).
Set WEB_CONCURRENCY if you want something other than 4 (start conservative).
Confirm the systemd unit has Restart=on-failure + RestartSec so the fast-crash exit self-heals with backoff.

After deploy on one droplet:

ps -ef | grep cluster.js → 1 primary + N workers.
No duplicate processing: journalctl -u <svc> | grep "Request received" | grep -oE "Context: [a-f0-9-]+" | sort | uniq -c → all counts 1.
Watch the "Complete and sending back" → "Check response status 201" gap shrink vs. the unchanged droplet; watch for any secondary rate limit 403s (expected: none).

🤖 Generated with Claude Code

…roughput GitHub API writes (checks.create) are serialized by Octokit's throttling plugin at maxConcurrent:1, minTime:1000 (≈1 write/sec) keyed per installation, queued in-memory per process. Under load a busy installation backs up behind this gate while GET lookups stay fast, causing multi-minute delays on the final check creation. Add a cluster.js wrapper as the process entry point: the primary forks WEB_CONCURRENCY workers (default 4), each running the existing Probot app unchanged via probot.run(). Each worker has its own in-memory write queue, so aggregate write throughput scales with worker count. The queue is timer-gated (idle-waiting), so this helps even on single-core droplets. - start now runs the cluster; start:single keeps the prior single-process path - dev/dev-debug pinned to start:single so local smee dev doesn't fan out events across workers - crashed workers respawn; SIGTERM/SIGINT shuts workers down without respawn Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… routes Address review feedback on the cluster entrypoint: - P1: a worker that died at startup (bad config, bind failure, throw in run()) was respawned instantly in a tight loop. Track per-worker uptime; deaths under 5s count as fast-crashes, and after WEB_CONCURRENCY of them the primary exits(1) so the service manager can apply its restart/backoff policy. Healthy runs reset the streak and still respawn normally. - P3: the function form of run() skipped Probot's default app, dropping the `/` and `/probot` setup routes (the latter is documented in the README). Drive the same argv path as `probot run ./index.js` so the default app loads first, keeping behaviour identical to the prior entrypoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

⚠️ Not ready to approve

The worker startup uses probot.run() with an incorrect argv shape and WEB_CONCURRENCY parsing can result in starting zero workers, both of which can break production startup/processing.

Pull request overview

Adds a Node.js cluster-based entrypoint so the Probot app can run as multiple worker processes, increasing aggregate GitHub write throughput by giving each worker its own in-memory Octokit write queue.

Changes:

Switch npm start to launch a new cluster.js primary process (with start:single preserved for the old single-process behavior).
Add cluster.js to fork WEB_CONCURRENCY workers and respawn workers on crash (with “fast-crash” protection).
Pin local dev scripts to the single-process path to avoid webhook fan-out during development.

File summaries

File	Description
package.json	Updates scripts to use clustered start by default while keeping a single-process option for dev.
cluster.js	New clustered entrypoint that forks workers and runs the existing Probot app in each worker.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 2

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…RRENCY A negative WEB_CONCURRENCY forked zero workers (service up, processing nothing). Only accept positive integers, otherwise use the default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

⚠️ Not ready to approve

The worker bootstrap calls require('probot').run() with a CLI-style argv array, which is incompatible with Probot v12’s run() API and can prevent workers from starting correctly.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 1

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

stilliard and others added 2 commits June 17, 2026 19:33

stilliard requested a review from Copilot June 17, 2026 19:01

Copilot started reviewing on behalf of stilliard June 17, 2026 19:02 View session

stilliard marked this pull request as ready for review June 17, 2026 19:02

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread cluster.js Outdated

Comment thread cluster.js

fix(cluster): fall back to default workers for non-positive WEB_CONCU…

b96417c

…RRENCY A negative WEB_CONCURRENCY forked zero workers (service up, processing nothing). Only accept positive integers, otherwise use the default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

stilliard requested a review from Copilot June 17, 2026 19:16

Copilot started reviewing on behalf of stilliard June 17, 2026 19:16 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread cluster.js

stilliard merged commit 5f9188c into master Jun 17, 2026
2 checks passed

stilliard deleted the experimental/cluster-support branch June 17, 2026 19:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add cluster support to multiply outbound write throughput#46

Add cluster support to multiply outbound write throughput#46
stilliard merged 3 commits into
masterfrom
experimental/cluster-support

stilliard commented Jun 17, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

stilliard commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

⚠️ Not ready to approve

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

⚠️ Not ready to approve

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stilliard commented Jun 17, 2026 •

edited

Loading