Add cluster support to multiply outbound write throughput#46
Conversation
…roughput GitHub API writes (checks.create) are serialized by Octokit's throttling plugin at maxConcurrent:1, minTime:1000 (≈1 write/sec) keyed per installation, queued in-memory per process. Under load a busy installation backs up behind this gate while GET lookups stay fast, causing multi-minute delays on the final check creation. Add a cluster.js wrapper as the process entry point: the primary forks WEB_CONCURRENCY workers (default 4), each running the existing Probot app unchanged via probot.run(). Each worker has its own in-memory write queue, so aggregate write throughput scales with worker count. The queue is timer-gated (idle-waiting), so this helps even on single-core droplets. - start now runs the cluster; start:single keeps the prior single-process path - dev/dev-debug pinned to start:single so local smee dev doesn't fan out events across workers - crashed workers respawn; SIGTERM/SIGINT shuts workers down without respawn Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… routes Address review feedback on the cluster entrypoint: - P1: a worker that died at startup (bad config, bind failure, throw in run()) was respawned instantly in a tight loop. Track per-worker uptime; deaths under 5s count as fast-crashes, and after WEB_CONCURRENCY of them the primary exits(1) so the service manager can apply its restart/backoff policy. Healthy runs reset the streak and still respawn normally. - P3: the function form of run() skipped Probot's default app, dropping the `/` and `/probot` setup routes (the latter is documented in the README). Drive the same argv path as `probot run ./index.js` so the default app loads first, keeping behaviour identical to the prior entrypoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
⚠️ Not ready to approve
The worker startup uses probot.run() with an incorrect argv shape and WEB_CONCURRENCY parsing can result in starting zero workers, both of which can break production startup/processing.
Pull request overview
Adds a Node.js cluster-based entrypoint so the Probot app can run as multiple worker processes, increasing aggregate GitHub write throughput by giving each worker its own in-memory Octokit write queue.
Changes:
- Switch
npm startto launch a newcluster.jsprimary process (withstart:singlepreserved for the old single-process behavior). - Add
cluster.jsto forkWEB_CONCURRENCYworkers and respawn workers on crash (with “fast-crash” protection). - Pin local dev scripts to the single-process path to avoid webhook fan-out during development.
File summaries
| File | Description |
|---|---|
| package.json | Updates scripts to use clustered start by default while keeping a single-process option for dev. |
| cluster.js | New clustered entrypoint that forks workers and runs the existing Probot app in each worker. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 2
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…RRENCY A negative WEB_CONCURRENCY forked zero workers (service up, processing nothing). Only accept positive integers, otherwise use the default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
⚠️ Not ready to approve
The worker bootstrap calls require('probot').run() with a CLI-style argv array, which is incompatible with Probot v12’s run() API and can prevent workers from starting correctly.
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 1
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
Summary
Under load, the final
checks.createcall lags by minutes while the preceding GET lookups stay fast. Octokit's throttling plugin serializes writes atmaxConcurrent: 1, minTime: 1000(~1 write/sec), keyed per installation and queued in-memory per process. A busy installation backs up behind that gate. (Confirmed via logs + an LB experiment: pausing one node drained the backlog; no GitHub secondary-rate 403s involved.)Key Changes
Add a small
cluster.jsentrypoint. The primary forksWEB_CONCURRENCYworkers (default 4), each running the existing Probot app unchanged. Each worker gets its own in-memory write queue, so aggregate write throughput scales with worker count. The queue is timer-gated (idle-waiting), so this helps even on single-core droplets — it's process count, not core count, that matters here.start→node cluster.js;start:singlekeeps the prior single-process path.dev/dev-debugpinned tostart:singleso local smee dev doesn't fan out events across workers.probot run ./index.jsexactly, so the default app's/and/probotroutes are preserved.Redis is intentionally left off — wiring
redisConfigwould share one queue fleet-wide and re-impose 1 write/sec globally per installation (worse for throughput).Testing
WEBHOOK_PROXY_URLis unset in production (smee would cause fan-out there too).WEB_CONCURRENCYif you want something other than 4 (start conservative).Restart=on-failure+RestartSecso the fast-crash exit self-heals with backoff.After deploy on one droplet:
ps -ef | grep cluster.js→ 1 primary + N workers.journalctl -u <svc> | grep "Request received" | grep -oE "Context: [a-f0-9-]+" | sort | uniq -c→ all counts1.secondary rate limit403s (expected: none).🤖 Generated with Claude Code