perf(mpool): #2/#7 prototype -- measured that false-sharing isn't the cap #19
gburd wants to merge 1 commit into
Prototype the #2/#7 hypothesis that the buffer-header write-hot fields (pin ref count + LRU priority) false-share a cache line with the read-mostly identity/traversal fields every hash-chain walk reads. Isolate them on their own line behind MPOOL_HOTFIELDS_ISOLATED (off by default). Controlled interleaved A/B (packed vs isolated, medians, 12-core box): no effect (+/-0.6%). The read-path cap is TRUE sharing of the atomic counters (bhp->ref and the shared-latch share-counts), not false sharing -- relocating the words cannot help. Left off by default (only adds per-buffer memory); kept guarded to re-A/B on the 24-core Linux box. Refines the #2/#7 direction: the per-read shared-counter RMW must be removed (optimistic/versioned access needing epoch reclamation, or a sharded pin count), not relocated. Documented in docs/design/scaling-findings.md.
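The packed-versus-isolated layout being A/B tested can be pictured with a minimal sketch. Everything here is hypothetical: the `demo_` names, the `uint32_t` field types, and the 64-byte line size are assumptions for illustration, not the real `struct __bh` definition or the actual `MPOOL_HOTFIELDS_ISOLATED` patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is common but not stated in the PR. */
#define DEMO_CACHE_LINE 64

/* Packed: read-mostly identity fields and write-hot fields sit together. */
struct demo_bh_packed {
    uint32_t pgno;       /* read-mostly: page number */
    uint32_t mf_offset;  /* read-mostly: owning file (type is a stand-in) */
    uint32_t flags;      /* read-mostly */
    uint32_t ref;        /* write-hot: pin count */
    uint32_t priority;   /* write-hot: LRU priority */
};

/* Isolated: alignas pushes the write-hot group onto its own line. */
struct demo_bh_isolated {
    uint32_t pgno;
    uint32_t mf_offset;
    uint32_t flags;
    alignas(DEMO_CACHE_LINE) uint32_t ref;
    uint32_t priority;
};

/* Returns 1 when two byte offsets (relative to a line-aligned struct
 * start) fall on the same cache line. */
int demo_same_line(size_t a, size_t b)
{
    return a / DEMO_CACHE_LINE == b / DEMO_CACHE_LINE;
}

static_assert(offsetof(struct demo_bh_isolated, ref) % DEMO_CACHE_LINE == 0,
              "write-hot fields must start a cache line");
```

In this sketch the isolated variant grows from 20 to 128 bytes per header, which illustrates the kind of per-buffer memory cost mentioned above. Both variants still perform the same atomic updates on `ref`, so a layout change alone leaves the contended word contended.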
Closing as a superseded negative result. This prototype tested whether cache-line isolation / false-sharing of the per-buffer pin and latch fields was the multi-core read-scaling bottleneck. The controlled A/B showed it was not: isolating those fields changed throughput by <1%. The cost is true sharing of the words themselves (every thread does atomic RMWs on the same root/internal-page cache lines on every descent), which padding cannot fix. That finding directly motivated the chosen direction, write-free optimistic reads of the hot pages, now implemented as Stage 1b (lock-free root snapshot, #22) and the swip/optimistic-descent design (#20). The measured next bottleneck is the shared-handle cursor allocation mutex, tracked separately. Keeping the branch for the record; no longer a merge candidate.
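For readers unfamiliar with the idea of a write-free optimistic read, here is a minimal seqlock-style sketch in C11 atomics. It is not the Stage 1b (#22) or #20 implementation; the `demo_` names and the two payload fields are hypothetical, and a single writer is assumed. The point it illustrates is that the reader only loads shared memory and never stores to it, so readers do not bounce the cache line among themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_root_snapshot {
    atomic_uint version;          /* even = stable, odd = write in progress */
    _Atomic uint32_t root_pgno;   /* hypothetical payload */
    _Atomic uint32_t level;       /* hypothetical payload */
};

void demo_snapshot_init(struct demo_root_snapshot *s)
{
    atomic_init(&s->version, 0);
    atomic_init(&s->root_pgno, 0);
    atomic_init(&s->level, 0);
}

/* Writer side. Assumes exactly one writer at a time (externally locked). */
void demo_snapshot_publish(struct demo_root_snapshot *s,
                           uint32_t pgno, uint32_t level)
{
    unsigned v = atomic_load_explicit(&s->version, memory_order_relaxed);
    assert((v & 1u) == 0);  /* no other write may be in progress */
    atomic_store_explicit(&s->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->root_pgno, pgno, memory_order_relaxed);
    atomic_store_explicit(&s->level, level, memory_order_relaxed);
    atomic_store_explicit(&s->version, v + 2, memory_order_release);
}

/* Reader side: loads only. Returns false if a write overlapped, in which
 * case the caller retries or falls back to a pinned/latched read. */
bool demo_snapshot_try_read(struct demo_root_snapshot *s,
                            uint32_t *pgno, uint32_t *level)
{
    unsigned v0 = atomic_load_explicit(&s->version, memory_order_acquire);
    if (v0 & 1u)
        return false;
    uint32_t p = atomic_load_explicit(&s->root_pgno, memory_order_relaxed);
    uint32_t l = atomic_load_explicit(&s->level, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    unsigned v1 = atomic_load_explicit(&s->version, memory_order_relaxed);
    if (v1 != v0)
        return false;
    *pgno = p;
    *level = l;
    return true;
}
```

The sketch copies small values out by value, so it sidesteps the memory-reclamation question entirely; following pointers to pages that might be freed would need something like the epoch reclamation discussed in this PR.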
Prototypes and measures the first candidate fix for the read-scaling ceiling identified in docs/design/scaling-findings.md.

Hypothesis
struct __bh packs the write-hot fields (pin ref count, LRU priority; both written on every __memp_fget/__memp_fput) into the same cache line as the read-mostly identity fields (pgno/mf_offset/flags/hq) that every concurrent hash-chain walk of a hot (btree root) buffer reads. So each pin would invalidate the line all readers need just to traverse/match.

Change
Isolate the write-hot fields on their own cache line, behind MPOOL_HOTFIELDS_ISOLATED (one-line A/B). Off by default.

Measured (controlled interleaved A/B, medians, 12-core)
No effect (±0.6%). The cap is true sharing of the atomic counters (bhp->ref + the shared-latch share-counts), not false sharing; relocating the words can't help.

Why it's still useful
It rules out the cheap fix with data and refines #2/#7: the per-read shared-counter RMW must be removed (optimistic/versioned access — needs epoch reclamation BDB lacks — or a sharded pin count), not relocated. Kept guarded + off to re-A/B on the 24-core Linux box (currently unreachable) where the futex-dominated ceiling was characterized. Smoke-tested write + MVCC-freeze paths; full TCL regression required before any default-on change.
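As a rough illustration of the sharded pin count alternative, the sketch below spreads one logical counter over several cache-line-sized slots so that concurrent pins usually touch different lines. The `demo_` names, the shard count of 8, the 64-byte line size, and the caller-supplied `slot` (for example a thread index) are all assumptions, not part of the prototype.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>

#define DEMO_CACHE_LINE 64   /* assumed line size */
#define DEMO_PIN_SHARDS 8    /* assumed shard count */

/* Each shard occupies its own cache line. */
struct demo_pin_shard {
    alignas(DEMO_CACHE_LINE) atomic_long count;
};

struct demo_sharded_pin {
    struct demo_pin_shard shard[DEMO_PIN_SHARDS];
};

static_assert(sizeof(struct demo_pin_shard) == DEMO_CACHE_LINE,
              "one shard per cache line");

void demo_pin_init(struct demo_sharded_pin *p)
{
    for (int i = 0; i < DEMO_PIN_SHARDS; i++)
        atomic_init(&p->shard[i].count, 0);
}

/* Pin: increment only the caller's shard. */
void demo_pin(struct demo_sharded_pin *p, unsigned slot)
{
    atomic_fetch_add(&p->shard[slot % DEMO_PIN_SHARDS].count, 1);
}

/* Unpin may land on a different shard than the pin did, so a single
 * shard can go negative; only the sum is meaningful. */
void demo_unpin(struct demo_sharded_pin *p, unsigned slot)
{
    atomic_fetch_sub(&p->shard[slot % DEMO_PIN_SHARDS].count, 1);
}

/* Sum of all shards. This is not an atomic snapshot: pins can arrive
 * while the loop runs, so an evictor would need an extra protocol
 * before trusting a zero result. */
long demo_pin_total(struct demo_sharded_pin *p)
{
    long total = 0;
    for (int i = 0; i < DEMO_PIN_SHARDS; i++)
        total += atomic_load(&p->shard[i].count);
    return total;
}
```

The trade-off is visible in the sizes: pinning becomes cheaper under contention, while reading the total costs a pass over every shard and the counter takes `DEMO_PIN_SHARDS` cache lines per buffer instead of one word.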
Builds clean with the default (off). See docs/design/scaling-findings.md → Prototype 1.