perf(btree): Stage 1b — lock-free root snapshot for read descents #22
Merged
Foundation for the optimistic descent: B-tree internal/root pages are wired so the frame is never reclaimed, letting a later lock-free descent read them without a use-after-free hazard (BDB has no epoch reclamation).
- struct __bh gains a dedicated 'wired' byte (not a flags bit: it is set with a plain monotonic store while the caller holds only a shared buffer latch, so it must not share the non-atomic RMW of the flags word that __memp_pgwrite uses to clear BH_DIRTY). Reset to 0 at every buffer-header (re)init site.
- __memp_alloc skips wired buffers when choosing a victim.
- __memp_wire() sets it; guarded against memory-mapped pages (whose page pointer is not a buffer frame -- caught a SIGBUS in test001 with mmap'd files).
- __bam_search wires P_IBTREE/P_IRECNO pages on descent (bounded: internal levels only, never leaves).
No measurable perf change yet (internals were already hot-resident); this only guarantees residency for step 2 (LSN-validated optimistic descent).
Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003.
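A minimal sketch of the idea in this commit, assuming a toy buffer header rather than the real struct __bh layout: the wired mark lives in its own byte, so setting it is a plain store that never touches the flags word, and the victim scan skips wired frames. Every toy_ name is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DIRTY 0x01u            /* stands in for a flag such as BH_DIRTY */

/* Toy buffer header; NOT the real struct __bh layout. */
struct toy_bh {
    uint32_t flags;                /* updated with read-modify-write (|=, &= ~) */
    uint8_t  wired;                /* dedicated byte: only ever stored 0 or 1 */
    uint32_t ref;                  /* pin count; pinned buffers are not victims */
};

/* Wiring is a plain store to its own byte, so it never performs a
 * read-modify-write on the flags word. */
static void toy_wire(struct toy_bh *bhp)
{
    bhp->wired = 1;
}

/* Return the index of the first evictable buffer, or -1 if none. */
static int toy_pick_victim(const struct toy_bh *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].wired || bufs[i].ref != 0)
            continue;              /* skip wired and pinned frames */
        return (int)i;
    }
    return -1;
}
```

If the mark were a bit in flags, a concurrent non-atomic `flags &= ~TOY_DIRTY` could read the old word, lose the newly set bit, and write it back; a separate byte avoids that interaction.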
- __memp_unwire(): clears the wired mark so a freed frame is evictable again; called from __db_free (the single page-free chokepoint for all access methods) and from __memp_bhfree for the file/env-close discard path. The wired byte gates the counter so it is decremented exactly once.
- Per-region wired-page counter (MPOOL.wired_pages, atomic) with a cap of MPOOL_WIRED_MAX_PCT (25%) of the region's buffers: over the cap __memp_wire is a no-op and the descent uses a normal pin, so wiring can never starve the cache.
- db_stat -m reports 'Wired buffers (non-evictable)'.
- mmap guard on both wire and unwire (page ptr is not a buffer frame).
Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test003, test011 (cursor splits/merges -> page frees exercise __memp_unwire).
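The counter-and-cap behaviour described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the MPOOL code: the toy_ types and functions are hypothetical, the cap is computed as buffers * percent / 100, and the sketch assumes the caller already serializes wire/unwire on any single buffer (for example through a buffer latch).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_WIRED_MAX_PCT 25       /* mirrors the 25% cap named above */

struct toy_region {
    uint32_t    nbuffers;          /* buffers in this region */
    atomic_uint wired_pages;       /* how many are currently wired */
};

struct toy_buf {
    uint8_t wired;                 /* 1 while this buffer is counted */
};

/* Wire the buffer if the region is under its cap; otherwise do nothing. */
static void toy_wire(struct toy_region *r, struct toy_buf *b)
{
    unsigned cap = r->nbuffers * TOY_WIRED_MAX_PCT / 100;

    if (b->wired)
        return;                    /* already wired: do not count twice */
    if (atomic_fetch_add(&r->wired_pages, 1) >= cap) {
        atomic_fetch_sub(&r->wired_pages, 1);   /* over the cap: undo */
        return;                    /* caller falls back to a normal pin */
    }
    b->wired = 1;
}

/* Clear the mark; the wired byte gates the decrement so it happens once. */
static void toy_unwire(struct toy_region *r, struct toy_buf *b)
{
    if (!b->wired)
        return;
    b->wired = 0;
    atomic_fetch_sub(&r->wired_pages, 1);
}
```

With 8 buffers the cap is 2, so a third wire request leaves its buffer unwired, and calling toy_unwire twice on the same buffer lowers the counter only once.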
Per review: internal/subtree-root pages should stay in the normal evictable pool; only the single main tree root (BAM_ROOT_PGNO) -- fetched by every operation -- is wired, so it stays resident without churning eviction and the root snapshot can refresh cheaply. Move the __memp_wire call from the all-internals site in __bam_search to __bam_get_root, gated on h->pgno == BAM_ROOT_PGNO(dbc). Unwiring is already handled on page free (__db_free) and file close (__memp_bhfree). Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test011.
Read lookups of the main tree no longer fetch (pin/latch) the contended live root. Each handle keeps a private immutable copy of the root taken at a known root LSN; a plain read of the live root LSN (via the wired root buffer) confirms the copy is current, the copy yields the descent's first child, and __bam_search starts from that child -- never touching the live root. Correctness: after the child is fetched, the live root LSN is re-checked (seqlock); if it changed (a split added a level, or a merge freed the child) the child is released and the descent restarts from the real root. Gated to plain read finds of a logged, durable, non-multiversion btree (where the page LSN reliably advances on root modification) -- everything else uses the normal descent. Old copies are retired to a free list and released at handle close (root changes are rare; no reader/free race, no epoch reclamation). Child selection reuses __bam_cmp and the exact __bam_search binary-search rule so the chosen child is identical to a normal descent. Validated: NOSYNC integrity 50k/1MB (logged, fast path active, all verified); TCL test001 btree+hash, test003, test011 (dups), test026. The first cut mis-gated non-logged envs (LSN never advances -> stale copy -> wrong results); fixed with the LOGGING_ON + durable gate. Concurrent stress + scaling on meh next.
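The pre/post LSN check can be sketched as a seqlock-style read. This is a simplified model, not the __bam_rsnap_child code: the LSN is reduced to a single 64-bit counter, the child fetch is a stub, and all toy_ names (including the toy_split_during_fetch test hook that simulates a concurrent root change) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NO_FAST_PATH 0u        /* sentinel: caller must do the normal descent */

struct toy_live_root {
    _Atomic uint64_t lsn;          /* advances on every root modification */
};

struct toy_snapshot {
    uint64_t lsn;                  /* root LSN when the private copy was taken */
    uint32_t first_child;          /* child page chosen from the copy */
};

static int toy_fetch_count;                         /* children currently held */
static struct toy_live_root *toy_split_during_fetch; /* test hook, may be NULL */

/* Stand-in for pinning the child page. */
static void toy_fetch_child(uint32_t pgno)
{
    (void)pgno;
    toy_fetch_count++;
    if (toy_split_during_fetch != NULL)             /* simulate a racing split */
        atomic_fetch_add(&toy_split_during_fetch->lsn, 1);
}

static void toy_release_child(uint32_t pgno)
{
    (void)pgno;
    toy_fetch_count--;
}

/* Return the held child page, or TOY_NO_FAST_PATH if the copy is stale. */
static uint32_t toy_fast_descent(struct toy_live_root *live,
                                 const struct toy_snapshot *snap)
{
    uint32_t child;

    /* Pre-check: the copy must match the live root before it is used. */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn)
        return TOY_NO_FAST_PATH;
    child = snap->first_child;
    toy_fetch_child(child);
    /* Post-check: if the root changed meanwhile, drop the child and restart. */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn) {
        toy_release_child(child);
        return TOY_NO_FAST_PATH;
    }
    return child;
}
```

The result is used only when both reads of the live LSN equal the snapshot's LSN, which is why the gate on logged, durable trees matters: without an LSN that advances on root modification, neither check could detect a stale copy.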
rrand 200k/3s on meh (24t, tmpfs): snapshot beats master at every thread count (+22-29% at 4-8t) but both peak ~8t and negatively scale to 24t. The snapshot raises the read-scaling ceiling without removing it; at 24t the bottleneck has moved to the lock-manager locker region (lockers% 51-67%, lockpart% ~0.1). Real measured win worth landing; multicore scaling past ~8 cores now bounded by the lock manager (ROADMAP #4).
…neck perf on meh (24t, snapshot): 40.5% of time is futex wait under __db_pthread_mutex_lock, split between __db_cursor_int (cursor alloc) and __dbc_close (cursor free) -- the per-get transient cursor linked/unlinked on the ONE shared DB handle's active-cursor queue (dbp->mutex). Per-thread handles (sepdb) run +49% at 24t and scale near-linearly to 8t, proving it. Next bottleneck underneath: __memp_fget/fput hash-bucket latch + refcount atomics per descent page (root snapshot removed only the root fetch). Benchmark critique: aggregate metric is sound; it induces lock-manager traffic via DB_INIT_LOCK|TXN reads (should also measure READ_UNCOMMITTED); targ_t.ops false-shares (latent, not in profile); meh is 12c/24t so the 12->24 tail is HT + all-core turbo, and peak-at-8 is software contention. Conclusion: the next scaling fix is the per-get cursor-allocation mutex -- NOT Stage 1c (blocked, only partial) nor Stage 2 #5 (orthogonal: rrand is 100% cache hits, zero I/O).
Each thread opens its own handle on the SAME bench.db (removing the shared-handle cursor-queue mutex app-side) and reads under a selectable isolation level (none/read-committed/snapshot/uncommitted) to measure how far BDB scales with full transactional isolation -- not requiring uncommitted reads. Per-thread state is cache-line padded.
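Cache-line padding of per-thread state can be sketched as below. The struct and its fields are hypothetical (the benchmark's actual layout is not shown here), and the 64-byte line size is an assumption that holds on common x86-64 parts but varies by CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64          /* assumed line size; varies by CPU */

/* Per-thread benchmark state; each array element starts on its own line,
 * so one thread's counter updates do not invalidate a neighbour's line. */
struct toy_thread_state {
    alignas(TOY_CACHE_LINE) uint64_t ops;   /* operations completed */
    uint64_t errors;                        /* failed operations */
};

/* Aggregate the per-thread counters after the run. */
static uint64_t toy_total_ops(const struct toy_thread_state *t, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += t[i].ops;
    return sum;
}
```

Aligning the first member raises the alignment of the whole struct, and the compiler then rounds sizeof up to a multiple of that alignment, so consecutive array elements never share a line.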
scale_iso (per-thread handle on shared bench.db): full-isolation reads
('none', per-op page read locks) scale identically to uncommitted (668k
vs 656k @ 24t, 3x to 8t) -- isolation is NOT the scaling barrier, the
shared handle was. Per-op explicit txns ('rc') collapse past 8t (txn/
locker/log machinery = bottleneck #3); long-lived MVCC ('snap') avoids it.
Documents the cursor-allocation fix design (sharded queues recommended,
needs full run_std; ~+47% prize vs shared-handle path).
Scaling measurements, profiling, and design exploration are development notes, not user-facing documentation; they do not belong in ./docs (which should track the code). Moved to the agent notes area, which is never committed. (The same file remains on master from an earlier PR and should be removed there in a follow-up.)
The Stage 1b root-snapshot fast path cached the live-root buffer address (bt_rootpage) and read its LSN lock-free during descent. __memp_wire, however, silently no-ops when the per-region wired cap is reached (or on mmap'd pages) yet returned 0 in every case, so the caller could not tell whether the frame was actually wired. __bam_rsnap_refresh cached the frame unconditionally; an un-wired frame is evictable and its address can dangle, so __bam_rsnap_child could read LSN() from a freed/reused buffer.
- __memp_wire: add a wiredp out-param reporting whether the frame is wired on return (newly wired or already wired); document that callers caching the address for lock-free reads must check it.
- __bam_rsnap_refresh: only build the snapshot and cache bt_rootpage when wiring took; otherwise leave them NULL so the descent falls back to the normal pinned path.
- bt_compact: when compaction moves the root to a new page, disarm the cached snapshot (bt_rootpage = NULL) under the handle mutex so a stale frame pointer is never followed.
- Fix the cap arithmetic (pages * PCT / 100) so it does not truncate to zero for caches smaller than 100 buffers.
Validated: clean build (debug + release); TCL lock001, txn001, test001, ssi001, ssi002, test003, test011, test026, test111 (compaction, incl. -revsplitoff) all pass.
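Two of the fixes above lend themselves to a short sketch. The first pair of functions contrasts a divide-first cap (an assumed shape for the old arithmetic, which is not quoted in this commit) with the multiply-first form pages * PCT / 100. The second pair models the wiredp out-param: the call still returns 0, and the caller caches the frame address only when the out-param says the frame is wired. All toy_ names and signatures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the old computation: integer division first, so any
 * cache with fewer than 100 pages yields a cap of zero. */
static uint32_t toy_cap_truncating(uint32_t pages, uint32_t pct)
{
    return pages / 100 * pct;
}

/* Multiply first (pages * PCT / 100), widened to avoid overflow. */
static uint32_t toy_cap_fixed(uint32_t pages, uint32_t pct)
{
    return (uint32_t)((uint64_t)pages * pct / 100);
}

struct toy_frame {
    uint8_t wired;
};

/* Returns 0 in every case; *wiredp reports whether the frame is wired on
 * return (newly wired or already wired). */
static int toy_wire(struct toy_frame *f, uint32_t wired_now, uint32_t cap,
                    int *wiredp)
{
    if (!f->wired && wired_now < cap)
        f->wired = 1;
    *wiredp = f->wired;
    return 0;
}

/* Cache the frame address only when wiring took; NULL means "use the
 * normal pinned descent". */
static struct toy_frame *toy_cache_root(struct toy_frame *f,
                                        uint32_t wired_now, uint32_t cap)
{
    int wired = 0;

    if (toy_wire(f, wired_now, cap, &wired) != 0 || !wired)
        return NULL;
    return f;
}
```

For an 80-page cache at 25%, the divide-first form gives 0 while the multiply-first form gives 20, which is the difference between never wiring the root and wiring it.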
bb2e78e to
5edf15c
Stage 1b: lock-free root snapshot for read descents
Lift the multi-core read-scaling ceiling by removing the per-operation
contention on the B-tree root page. Every descent fetches the root via
__memp_fget, which does atomic read-modify-writes on shared cache lines (bucket-latch share count, buffer pin/refcount, buffer-latch share count).
Because every thread touches the root on every operation, that line is a true
serialization point: in-cache reads peak around 8 threads and then negatively
scale.
Approach
Each DB handle keeps a private, immutable copy of the root page taken at a
known root LSN, plus a wired (non-evictable) pointer to the live root buffer. A
plain read descent picks its first child out of the private copy — a pure read,
no pin, no latch, no shared write — and validates the copy against the live
root LSN before and after fetching the child. On the rare root change
(split/merge) it falls back to the normal pinned/latched descent and refreshes
the copy.
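Picking the first child out of the private copy is a binary search over the root's separator keys. The sketch below uses a generic rule (take the last entry whose separator is less than or equal to the search key, treating entry 0 as the minimum); it is an assumption for illustration and is not claimed to reproduce the exact __bam_search rule. Integer keys and the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
    int32_t  key;                  /* separator key (entry 0's key is ignored) */
    uint32_t child;                /* child page number */
};

/* Return the child covering 'key': the last entry whose separator is
 * <= key. Entry 0 acts as "minus infinity". Requires n >= 1.
 * This only reads the entry array: no pin, no latch, no shared write. */
static uint32_t toy_pick_child(const struct toy_entry *e, size_t n,
                               int32_t key)
{
    size_t lo = 1, hi = n;         /* search entries [1, n) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (e[mid].key <= key)
            lo = mid + 1;          /* separator fits: look further right */
        else
            hi = mid;              /* separator too large: look left */
    }
    return e[lo - 1].child;
}
```

Because the copy is immutable and private to the handle, this search needs no synchronization; only the LSN comparison against the live root touches shared memory.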
Wiring the root sidesteps the need for general epoch reclamation: the single
hot frame is simply never freed, so a lock-free reader can never dereference
reclaimed memory.
What's in the branch
- BH_WIRED/wired buffer mark + __memp_wire/__memp_unwire, evictor skips wired frames, capped at MPOOL_WIRED_MAX_PCT of the region so wiring cannot starve the cache. Only the one common tree root is wired.
- BAM_RSNAP per-handle root snapshot; __bam_rsnap_refresh/__bam_rsnap_child; superseded copies retired to a free list and freed at handle close (avoids a reader/free race without epoch reclamation).
- Gated to plain read finds: not del/min/max, not OPD, not recno/recnum, multiversion == 0, logging on, durable. A plain read descent acquires no root page lock on the normal path either (BAM_GET_ROOT locks the root only for WRITE/recno/recnum/downrev), so the fast path skips no lock it should have taken.
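The retire-to-free-list scheme in the list above can be sketched as follows, assuming a toy handle with one current copy and a singly linked list of superseded copies. The toy_ names are hypothetical; the real BAM_RSNAP structure is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_copy {
    uint64_t lsn;                  /* root LSN this copy was taken at */
    struct toy_copy *next_retired; /* link on the handle's retired list */
};

struct toy_handle {
    struct toy_copy *current;      /* copy used by the fast path */
    struct toy_copy *retired;      /* superseded copies, freed at close */
};

/* Install a fresh copy. The old one is retired rather than freed, so a
 * reader still holding its address never touches freed memory.
 * Returns 0 on success, -1 if allocation fails. */
static int toy_refresh(struct toy_handle *h, uint64_t lsn)
{
    struct toy_copy *c = malloc(sizeof(*c));

    if (c == NULL)
        return -1;
    c->lsn = lsn;
    c->next_retired = NULL;
    if (h->current != NULL) {
        h->current->next_retired = h->retired;
        h->retired = h->current;
    }
    h->current = c;
    return 0;
}

/* Handle close: free every copy; returns how many were freed. */
static int toy_close(struct toy_handle *h)
{
    struct toy_copy *c, *next;
    int freed = 0;

    for (c = h->retired; c != NULL; c = next) {
        next = c->next_retired;
        free(c);
        freed++;
    }
    if (h->current != NULL) {
        free(h->current);
        freed++;
    }
    h->current = h->retired = NULL;
    return freed;
}
```

The trade-off is memory held until close in exchange for not needing epoch reclamation; since root changes are rare, the retired list stays short.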
Correctness
the reader's pre/post LSN check brackets the child selection and never acts
on an unvalidated result.
refcount == 0 + exclusive-mtx_buf handshake is unchanged.
Compaction is the one runtime root move and now disarms the cached snapshot.
A use-after-free found in review is fixed in bb2e78e:
__memp_wire now reports whether the frame is actually wired, and the snapshot is cached only
when wiring took (otherwise the descent falls back); compaction invalidates the
cached root pointer; the wired-cap arithmetic no longer truncates to zero for
small caches.
Validation
(duplicates/OPD), test026 (cursor delete), test111 (compaction incl.
-revsplitoff) all pass.
scaling-findings); headline metric is whether the ~8-thread ceiling lifts and
the futex/atomic self-time drops at 8–24 threads.
Maps to ROADMAP #2 (latch-free buffer lookup). Stage 1c (lock-free descent of
deeper internal pages) is intentionally out of scope — profiling shows the next
bottleneck is elsewhere and a correct version needs epoch reclamation.