Skip to content

Add cluster support to multiply outbound write throughput#46

Merged
stilliard merged 3 commits into
masterfrom
experimental/cluster-support
Jun 17, 2026
Merged

Add cluster support to multiply outbound write throughput#46
stilliard merged 3 commits into
masterfrom
experimental/cluster-support

Conversation

@stilliard

@stilliard stilliard commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

Under load, the final checks.create call lags by minutes while the preceding GET lookups stay fast. Octokit's throttling plugin serializes writes at maxConcurrent: 1, minTime: 1000 (~1 write/sec), keyed per installation and queued in-memory per process. A busy installation backs up behind that gate. (Confirmed via logs + an LB experiment: pausing one node drained the backlog; no GitHub secondary-rate 403s involved.)

Key Changes

Add a small cluster.js entrypoint. The primary forks WEB_CONCURRENCY workers (default 4), each running the existing Probot app unchanged. Each worker gets its own in-memory write queue, so aggregate write throughput scales with worker count. The queue is timer-gated (idle-waiting), so this helps even on single-core droplets — it's process count, not core count, that matters here.

  • startnode cluster.js; start:single keeps the prior single-process path.
  • dev / dev-debug pinned to start:single so local smee dev doesn't fan out events across workers.
  • Crashed workers respawn; a burst of fast (startup) crashes makes the primary exit so the service manager can restart with backoff.
  • SIGTERM/SIGINT shuts workers down cleanly without respawn.
  • Worker mirrors probot run ./index.js exactly, so the default app's / and /probot routes are preserved.

Redis is intentionally left off — wiring redisConfig would share one queue fleet-wide and re-impose 1 write/sec globally per installation (worse for throughput).

Testing

  1. Ensure WEBHOOK_PROXY_URL is unset in production (smee would cause fan-out there too).
  2. Set WEB_CONCURRENCY if you want something other than 4 (start conservative).
  3. Confirm the systemd unit has Restart=on-failure + RestartSec so the fast-crash exit self-heals with backoff.

After deploy on one droplet:

  • ps -ef | grep cluster.js → 1 primary + N workers.
  • No duplicate processing: journalctl -u <svc> | grep "Request received" | grep -oE "Context: [a-f0-9-]+" | sort | uniq -c → all counts 1.
  • Watch the "Complete and sending back" → "Check response status 201" gap shrink vs. the unchanged droplet; watch for any secondary rate limit 403s (expected: none).

🤖 Generated with Claude Code

stilliard and others added 2 commits June 17, 2026 19:33
…roughput

GitHub API writes (checks.create) are serialized by Octokit's throttling
plugin at maxConcurrent:1, minTime:1000 (≈1 write/sec) keyed per installation,
queued in-memory per process. Under load a busy installation backs up behind
this gate while GET lookups stay fast, causing multi-minute delays on the final
check creation.

Add a cluster.js wrapper as the process entry point: the primary forks
WEB_CONCURRENCY workers (default 4), each running the existing Probot app
unchanged via probot.run(). Each worker has its own in-memory write queue, so
aggregate write throughput scales with worker count. The queue is timer-gated
(idle-waiting), so this helps even on single-core droplets.

- start now runs the cluster; start:single keeps the prior single-process path
- dev/dev-debug pinned to start:single so local smee dev doesn't fan out events
  across workers
- crashed workers respawn; SIGTERM/SIGINT shuts workers down without respawn

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… routes

Address review feedback on the cluster entrypoint:

- P1: a worker that died at startup (bad config, bind failure, throw in run())
  was respawned instantly in a tight loop. Track per-worker uptime; deaths
  under 5s count as fast-crashes, and after WEB_CONCURRENCY of them the primary
  exits(1) so the service manager can apply its restart/backoff policy. Healthy
  runs reset the streak and still respawn normally.

- P3: the function form of run() skipped Probot's default app, dropping the
  `/` and `/probot` setup routes (the latter is documented in the README).
  Drive the same argv path as `probot run ./index.js` so the default app loads
  first, keeping behaviour identical to the prior entrypoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stilliard stilliard requested a review from Copilot June 17, 2026 19:01
@stilliard stilliard marked this pull request as ready for review June 17, 2026 19:02

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

The worker startup uses probot.run() with an incorrect argv shape and WEB_CONCURRENCY parsing can result in starting zero workers, both of which can break production startup/processing.

Pull request overview

Adds a Node.js cluster-based entrypoint so the Probot app can run as multiple worker processes, increasing aggregate GitHub write throughput by giving each worker its own in-memory Octokit write queue.

Changes:

  • Switch npm start to launch a new cluster.js primary process (with start:single preserved for the old single-process behavior).
  • Add cluster.js to fork WEB_CONCURRENCY workers and respawn workers on crash (with “fast-crash” protection).
  • Pin local dev scripts to the single-process path to avoid webhook fan-out during development.
File summaries
File Description
package.json Updates scripts to use clustered start by default while keeping a single-process option for dev.
cluster.js New clustered entrypoint that forks workers and runs the existing Probot app in each worker.

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cluster.js Outdated
Comment thread cluster.js
…RRENCY

A negative WEB_CONCURRENCY forked zero workers (service up, processing
nothing). Only accept positive integers, otherwise use the default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

The worker bootstrap calls require('probot').run() with a CLI-style argv array, which is incompatible with Probot v12’s run() API and can prevent workers from starting correctly.

Copilot's findings
  • Files reviewed: 2/2 changed files
  • Comments generated: 1

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

Comment thread cluster.js
@stilliard stilliard merged commit 5f9188c into master Jun 17, 2026
2 checks passed
@stilliard stilliard deleted the experimental/cluster-support branch June 17, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants