daemon: configurable max-age to recycle RPC connections#225
Open
DeviaVir wants to merge 4 commits into
Open
Conversation
Long-lived daemon RPC connections stay pinned to a single backend for their whole lifetime. When electrs connects through a load balancer such as a Kubernetes ClusterSetIP (`*.clusterset.local`), a connection established before a node rotation keeps routing to the original backend via the existing TCP/conntrack flow, even after healthier/closer backends become available. The connection is only re-established on error, so a still-working-but-stale endpoint is never rebalanced. Add a `--daemon-rpc-conn-max-age` option (seconds). When a connection exceeds the configured age it is proactively recycled before the next request, re-establishing the TCP connection so the load balancer can re-select a backend. Defaults to 0 = unlimited (never recycle), so behavior is unchanged unless explicitly enabled. The age check is also applied to the per-thread connections used for parallel RPC requests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bf22a45 to
cc0942f
Compare
There was a problem hiding this comment.
Pull request overview
Adds an opt-in mechanism to proactively recycle long-lived bitcoind RPC TCP connections, enabling better backend rebalancing when electrs connects through a load balancer (e.g., avoiding “pinned” connections across node rotations).
Changes:
- Introduces
--daemon-rpc-conn-max-age <seconds>(default0= unlimited) and stores it asOption<Duration>inConfig. - Tracks connection establishment time in
Connectionand reconnects incall_jsonrpc()when the configured max age is exceeded. - Threads the new configuration through binaries and test harness initialization.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/common.rs | Initializes the new Config::daemon_conn_max_age field and passes it into Daemon::new(). |
| src/daemon.rs | Adds connection age tracking + expiry check and proactive reconnect on JSON-RPC calls; threads config through Daemon. |
| src/config.rs | Adds CLI flag, parses to Option<Duration>, and stores it on Config. |
| src/bin/electrs.rs | Passes daemon_conn_max_age into Daemon::new(). |
| src/bin/tx-fingerprint-stats.rs | Passes daemon_conn_max_age into Daemon::new(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Proactive recycling previously called the infinite-retry reconnect path while holding the daemon connection mutex, before sending the request. A transient "new connections fail" event at the load balancer could therefore block all requests on that connection instead of continuing to use the existing, still-healthy socket -- turning an LB hiccup into an electrs outage when --daemon-rpc-conn-max-age is enabled. Split tcp_connect() into a single-attempt tcp_connect_once() (primary then fallback, no retry/backoff) and keep the looping tcp_connect() for startup and post-failure reconnects, where there is no usable socket to fall back to. Max-age recycling now uses try_reconnect_once(): on success the connection is swapped, on failure we log and keep the existing connection, retrying recycling on a later request. Real send/recv failures still go through the existing infinite reconnect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- config: parse --daemon-rpc-conn-max-age via value_t_or_exit! for consistent clap error handling instead of a manual parse + panic!. - daemon: store the actually-connected address (primary or fallback) on Connection and log it when recycling, so diagnostics aren't misleading when connected to the fallback daemon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address review feedback on proactive max-age recycling:
- Blocker: a failed recycle attempt kept the existing connection (good)
but did not update any timestamp, so is_expired() stayed true and every
subsequent request re-attempted the recycle first -- each failed
attempt blocking up to DAEMON_CONNECTION_TIMEOUT under the connection
mutex. During a sustained "new connections fail" event this turned
every fast RPC into a request paying a full connect timeout. Now a
failed attempt records last_recycle_attempt and a cooldown
(DAEMON_CONN_RECYCLE_COOLDOWN, default 30s) gates retries, so the old
socket keeps serving requests at full speed between attempts.
- Extract the recycle decision into a pure `recycle_due()` helper and
cover it with unit tests (max-age boundary, None, and cooldown).
- Add a daemon_rpc_conn_recycled{result="ok|failed"} counter so recycle
behavior is observable in prod.
- tcp_connect_once no longer warns per-attempt; it returns one
descriptive error that callers log, avoiding double log lines on the
recycle path. The startup/error loop logs that error + backoff.
- Document in --daemon-rpc-conn-max-age help that the reconnect is inline
on the request path, so the value should be generous (minutes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
|
lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes. |
Randy808
approved these changes
Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
electrs holds long-lived TCP connections to bitcoind for the whole process lifetime, only re-establishing them on error. When electrs reaches the daemon through a fronting load balancer a connection established before a rotation keeps routing to its original backend via the existing TCP/conntrack flow. Even after that backend is rotated out (or a closer/healthier one appears), a connection that is still "working" is never rebalanced, so it can stay pinned to a stale or sub-optimal endpoint indefinitely.
Note:
daemon_rpc_addris resolved to aSocketAddronce at startup (config::str_to_socketaddr). For a ClusterSetIP that address is a stable virtual IP, so the fix here is to recycle the TCP connection (forcing the LB to re-select a backend), not to re-resolve DNS. If true DNS re-resolution is ever needed for a different deployment, that's a follow-up.Change
Adds a
--daemon-rpc-conn-max-age <seconds>option:Connectionrecords when it was established. Before sending a request, if the connection is older than the configured max age it is proactively recycled (reconnect()), giving the load balancer a fresh backend selection.0= unlimited / never recycle, so behavior is unchanged unless an operator opts in.requests_iter), which are the ones that fan out across backends.Files
src/config.rs— new CLI flag +daemon_conn_max_age: Option<Duration>(None when 0).src/daemon.rs—Connectiontracksestablished/max_age, addsis_expired(), andcall_jsonrpcrecycles expired connections; threaded throughDaemon.src/bin/electrs.rs,src/bin/tx-fingerprint-stats.rs— pass the new value.Testing
cargo check --binsandcargo check --bins --features liquidboth pass (only pre-existing warnings).call_jsonrpcthe right spot vs. a background timer, and should the value also accept a human duration like30m).