perf(btree): Stage 1b — lock-free root snapshot for read descents by gburd · Pull Request #22 · berkeleydb/libdb

gburd · 2026-06-17T20:09:57Z

Stage 1b: lock-free root snapshot for read descents

Lift the multi-core read-scaling ceiling by removing the per-operation
contention on the B-tree root page. Every descent fetches the root via
__memp_fget, which does atomic read-modify-writes on shared cache lines
(bucket-latch share count, buffer pin/refcount, buffer-latch share count).
Because every thread touches the root on every operation, those lines are a
true serialization point: in-cache reads peak around 8 threads and then scale
negatively.

Approach

Each DB handle keeps a private, immutable copy of the root page taken at a
known root LSN, plus a wired (non-evictable) pointer to the live root buffer. A
plain read descent picks its first child out of the private copy — a pure read,
no pin, no latch, no shared write — and validates the copy against the live
root LSN before and after fetching the child. On the rare root change
(split/merge) it falls back to the normal pinned/latched descent and refreshes
the copy.
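The validate-before-and-after pattern described here can be sketched roughly as follows. This is a minimal illustration, not the branch's code: `live_root_t`, `root_snap_t`, and `snap_child` are hypothetical stand-ins, the LSN is reduced to a single 64-bit counter, and the child fetch is only marked by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCHILD 4

/* Hypothetical stand-in for the wired live root: only its LSN is read. */
typedef struct {
    _Atomic uint64_t lsn;       /* advances on every structural change */
} live_root_t;

/* Hypothetical stand-in for the per-handle private copy. */
typedef struct {
    uint64_t lsn;               /* root LSN at which the copy was taken */
    uint32_t child[NCHILD];     /* immutable copy of the child page numbers */
} root_snap_t;

/*
 * Pick a child from the private copy. Returns 1 and sets *childp when the
 * copy matched the live LSN both before and after the selection; returns 0
 * when the caller should fall back to the normal pinned/latched descent.
 */
static int
snap_child(const root_snap_t *snap, live_root_t *live, unsigned idx,
    uint32_t *childp)
{
    uint32_t c;

    assert(idx < NCHILD);
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn)
        return 0;               /* copy is stale before we start */
    c = snap->child[idx];       /* pure read of private memory */
    /* A real descent would fetch (pin) the child page at this point. */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn)
        return 0;               /* root changed underneath us */
    *childp = c;
    return 1;
}
```

A caller treats a 0 return as "take the slow path and refresh the copy"; the result is used only when both checks pass.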

Wiring the root sidesteps the need for general epoch reclamation: the single
hot frame is simply never freed, so a lock-free reader can never dereference
reclaimed memory.

What's in the branch

- BH_WIRED/wired buffer mark + __memp_wire/__memp_unwire, evictor skips
  wired frames, capped at MPOOL_WIRED_MAX_PCT of the region so wiring cannot
  starve the cache. Only the one common tree root is wired.
- BAM_RSNAP per-handle root snapshot; __bam_rsnap_refresh /
  __bam_rsnap_child; superseded copies retired to a free list and freed at
  handle close (avoids a reader/free race without epoch reclamation).
- Fast path gated to plain btree read lookups: not write/stack/parent/next/
  del/min/max, not OPD, not recno/recnum, multiversion == 0, logging on,
  durable. A plain read descent acquires no root page lock on the normal path
  either (BAM_GET_ROOT locks the root only for WRITE/recno/recnum/downrev),
  so the fast path skips no lock it should have taken.
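The gating conditions listed for the fast path amount to one predicate. The sketch below restates them under assumed names: the `S_*` bits and `tree_props_t` are hypothetical and do not correspond to BDB's actual search flags or handle fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical search-mode bits; any of these forces the normal descent. */
enum {
    S_WRITE = 1 << 0, S_STACK = 1 << 1, S_PARENT = 1 << 2, S_NEXT = 1 << 3,
    S_DEL   = 1 << 4, S_MIN   = 1 << 5, S_MAX    = 1 << 6
};
#define S_SLOW_MASK \
    (S_WRITE | S_STACK | S_PARENT | S_NEXT | S_DEL | S_MIN | S_MAX)

/* Hypothetical summary of the tree and environment properties. */
typedef struct {
    bool is_btree;      /* plain btree, not recno */
    bool is_opd;        /* off-page duplicate tree */
    bool recnum;        /* record numbers maintained */
    bool multiversion;  /* MVCC enabled */
    bool logging_on;    /* page LSNs advance only when logging */
    bool durable;       /* not a non-durable database */
} tree_props_t;

/* True only for a plain read lookup on a logged, durable, non-MVCC btree. */
static bool
fast_path_ok(unsigned sflags, const tree_props_t *t)
{
    assert(t != NULL);
    return (sflags & S_SLOW_MASK) == 0 &&
        t->is_btree && !t->is_opd && !t->recnum &&
        !t->multiversion && t->logging_on && t->durable;
}
```

Keeping every condition in one predicate makes the fallback rule easy to audit: anything that fails it takes the unchanged descent.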

Correctness

- Every structural change bumps the page LSN under the page's exclusive latch;
  the reader's pre/post LSN check brackets the child selection and never acts
  on an unvalidated result.
- Eviction's refcount == 0 + exclusive-mtx_buf handshake is unchanged.
- Root page number is stable in BDB (splits push content down, root pgno fixed);
  compaction is the one runtime root move and now disarms the cached snapshot.

A use-after-free found in review is fixed in bb2e78e: __memp_wire now
reports whether the frame is actually wired, and the snapshot is cached only
when wiring took (otherwise the descent falls back); compaction invalidates the
cached root pointer; the wired-cap arithmetic no longer truncates to zero for
small caches.
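The shape of that fix, a wire call that reports the outcome and a caller that caches the frame address only when the frame really is wired, might look like the sketch below. All names (`frame_t`, `region_t`, `handle_t`, `frame_wire`, `snap_refresh`) are hypothetical, and the code is single-threaded for brevity, whereas the real counter is described as atomic.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned char wired; int is_mmap; } frame_t;    /* hypothetical */
typedef struct { unsigned wired_pages, wired_cap; } region_t;    /* hypothetical */
typedef struct { frame_t *rootpage; } handle_t;                  /* hypothetical */

/*
 * Try to wire a frame. Returns 0 (no error) even when the request is
 * declined, so callers must consult *wiredp to learn the actual state.
 */
static int
frame_wire(region_t *r, frame_t *f, int *wiredp)
{
    assert(wiredp != NULL);
    if (!f->wired && !f->is_mmap && r->wired_pages < r->wired_cap) {
        f->wired = 1;
        r->wired_pages++;
    }
    *wiredp = f->wired;         /* newly wired or already wired */
    return 0;
}

/* Cache the root frame address only if it can never be evicted. */
static void
snap_refresh(handle_t *h, region_t *r, frame_t *root)
{
    int wired = 0;

    h->rootpage = NULL;         /* default: descent uses the pinned path */
    if (frame_wire(r, root, &wired) == 0 && wired)
        h->rootpage = root;
}
```

The point of the out-parameter is that "no error" and "wired" are different answers; conflating them is what allowed an evictable frame address to be cached.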

Validation

- Clean debug + release builds, no new warnings.
- TCL: lock001, txn001, test001 (btree), ssi001, ssi002, test003, test011
  (duplicates/OPD), test026 (cursor delete), test111 (compaction incl.
  -revsplitoff) all pass.
- Multi-core scaling measurement on the 24-core box is tracked separately (see
  scaling-findings); headline metric is whether the ~8-thread ceiling lifts and
  the futex/atomic self-time drops at 8–24 threads.

Maps to ROADMAP #2 (latch-free buffer lookup). Stage 1c (lock-free descent of
deeper internal pages) is intentionally out of scope — profiling shows the next
bottleneck is elsewhere and a correct version needs epoch reclamation.

Foundation for the optimistic descent: B-tree internal/root pages are wired so the frame is never reclaimed, letting a later lock-free descent read them without a use-after-free hazard (BDB has no epoch reclamation).

- struct __bh gains a dedicated 'wired' byte (not a flags bit: it is set with a plain monotonic store while the caller holds only a shared buffer latch, so it must not share the non-atomic RMW of the flags word that __memp_pgwrite uses to clear BH_DIRTY). Reset to 0 at every buffer-header (re)init site.
- __memp_alloc skips wired buffers when choosing a victim.
- __memp_wire() sets it; guarded against memory-mapped pages (whose page pointer is not a buffer frame -- caught a SIGBUS in test001 with mmap'd files).
- __bam_search wires P_IBTREE/P_IRECNO pages on descent (bounded: internal levels only, never leaves).

No measurable perf change yet (internals were already hot-resident); this only guarantees residency for step 2 (LSN-validated optimistic descent).

Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003.

- __memp_unwire(): clears the wired mark so a freed frame is evictable again; called from __db_free (the single page-free chokepoint for all access methods) and from __memp_bhfree for the file/env-close discard path. The wired byte gates the counter so it is decremented exactly once.
- Per-region wired-page counter (MPOOL.wired_pages, atomic) with a cap of MPOOL_WIRED_MAX_PCT (25%) of the region's buffers: over the cap __memp_wire is a no-op and the descent uses a normal pin, so wiring can never starve the cache.
- db_stat -m reports 'Wired buffers (non-evictable)'.
- mmap guard on both wire and unwire (page ptr is not a buffer frame).

Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test003, test011 (cursor splits/merges -> page frees exercise __memp_unwire).
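How a per-buffer byte can gate a shared counter so that repeated wire or unwire calls adjust it exactly once can be sketched as below. `buf_t`, `pool_t`, `buf_wire`, and `buf_unwire` are hypothetical names, and the sketch assumes calls for any one buffer are serialized by the caller; only the counter itself is atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { unsigned char wired; } buf_t;          /* hypothetical */
typedef struct { atomic_uint wired_pages; } pool_t;     /* hypothetical */

/* Mark the buffer wired; count it only on the 0 -> 1 transition. */
static void
buf_wire(pool_t *p, buf_t *b)
{
    assert(p != NULL && b != NULL);
    if (!b->wired) {
        b->wired = 1;
        atomic_fetch_add(&p->wired_pages, 1);
    }
}

/*
 * Clear the mark; uncount only on the 1 -> 0 transition. Safe to call from
 * more than one discard path (page free, file close) for the same buffer.
 */
static void
buf_unwire(pool_t *p, buf_t *b)
{
    assert(p != NULL && b != NULL);
    if (b->wired) {
        b->wired = 0;
        atomic_fetch_sub(&p->wired_pages, 1);
    }
}
```

Because the byte records whether this buffer has already been counted, a second unwire from a different teardown path is a no-op rather than an underflow.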

Per review: internal/subtree-root pages should stay in the normal evictable pool; only the single main tree root (BAM_ROOT_PGNO) -- fetched by every operation -- is wired, so it stays resident without churning eviction and the root snapshot can refresh cheaply. Move the __memp_wire call from the all-internals site in __bam_search to __bam_get_root, gated on h->pgno == BAM_ROOT_PGNO(dbc). Unwiring is already handled on page free (__db_free) and file close (__memp_bhfree). Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test011.

Read lookups of the main tree no longer fetch (pin/latch) the contended live root. Each handle keeps a private immutable copy of the root taken at a known root LSN; a plain read of the live root LSN (via the wired root buffer) confirms the copy is current, the copy yields the descent's first child, and __bam_search starts from that child -- never touching the live root.

Correctness: after the child is fetched, the live root LSN is re-checked (seqlock); if it changed (a split added a level, or a merge freed the child) the child is released and the descent restarts from the real root. Gated to plain read finds of a logged, durable, non-multiversion btree (where the page LSN reliably advances on root modification) -- everything else uses the normal descent. Old copies are retired to a free list and released at handle close (root changes are rare; no reader/free race, no epoch reclamation). Child selection reuses __bam_cmp and the exact __bam_search binary-search rule so the chosen child is identical to a normal descent.

Validated: NOSYNC integrity 50k/1MB (logged, fast path active, all verified); TCL test001 btree+hash, test003, test011 (dups), test026. The first cut mis-gated non-logged envs (LSN never advances -> stale copy -> wrong results); fixed with the LOGGING_ON + durable gate. Concurrent stress + scaling on meh next.
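For readers unfamiliar with internal-page child selection, a generic version of the rule is sketched below: pick the rightmost entry whose separator key is less than or equal to the search key, treating the first entry as lower than everything. This is a textbook formulation with integer keys, offered as an assumption about the general shape; it is not the actual `__bam_search` code, which compares keys through `__bam_cmp`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * keys[0..n-1] are the separator keys of an internal page in ascending
 * order. keys[0] is never compared: it stands for "minus infinity".
 * Returns the index of the child to descend into.
 */
static size_t
pick_child(const int *keys, size_t n, int key)
{
    size_t lo = 1, hi = n;      /* search the half-open range [1, n) */

    assert(keys != NULL && n > 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] <= key)
            lo = mid + 1;       /* mid still qualifies; look further right */
        else
            hi = mid;           /* mid is too large; look left */
    }
    return lo - 1;              /* rightmost entry <= key, or entry 0 */
}
```

Whatever the exact rule, the snapshot path and the normal descent have to share it; otherwise the two paths could choose different children for the same key.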

rrand 200k/3s on meh (24t, tmpfs): snapshot beats master at every thread count (+22-29% at 4-8t) but both peak ~8t and negatively scale to 24t. The snapshot raises the read-scaling ceiling without removing it; at 24t the bottleneck has moved to the lock-manager locker region (lockers% 51-67%, lockpart% ~0.1). Real measured win worth landing; multicore scaling past ~8 cores now bounded by the lock manager (ROADMAP #4).

…neck perf on meh (24t, snapshot): 40.5% of time is futex wait under __db_pthread_mutex_lock, split between __db_cursor_int (cursor alloc) and __dbc_close (cursor free) -- the per-get transient cursor linked/unlinked on the ONE shared DB handle's active-cursor queue (dbp->mutex). Per-thread handles (sepdb) run +49% at 24t and scale near-linearly to 8t, proving it. Next bottleneck underneath: __memp_fget/fput hash-bucket latch + refcount atomics per descent page (root snapshot removed only the root fetch).

Benchmark critique: aggregate metric is sound; it induces lock-manager traffic via DB_INIT_LOCK|TXN reads (should also measure READ_UNCOMMITTED); targ_t.ops false-shares (latent, not in profile); meh is 12c/24t so the 12->24 tail is HT + all-core turbo, and peak-at-8 is software contention.

Conclusion: the next scaling fix is the per-get cursor-allocation mutex -- NOT Stage 1c (blocked, only partial) nor Stage 2 #5 (orthogonal: rrand is 100% cache hits, zero I/O).

Each thread opens its own handle on the SAME bench.db (removing the shared-handle cursor-queue mutex app-side) and reads under a selectable isolation level (none/read-committed/snapshot/uncommitted) to measure how far BDB scales with full transactional isolation -- not requiring uncommitted reads. Per-thread state is cache-line padded.
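Cache-line padding of per-thread state can be sketched as follows. The 64-byte line size and the `thread_stat_t` layout are assumptions for illustration; the benchmark's real `targ_t` is not shown in this PR text.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; typical for x86-64 */

/*
 * Hypothetical per-thread counters. Aligning the first member to the
 * cache-line size raises the struct's alignment, and sizeof is rounded up
 * to a multiple of it, so adjacent array elements never share a line.
 */
typedef struct {
    alignas(CACHE_LINE) uint64_t ops;
    uint64_t errors;
} thread_stat_t;

static_assert(sizeof(thread_stat_t) % CACHE_LINE == 0,
    "each element must occupy whole cache lines");

/* Sum per-thread counters after the worker threads have finished. */
static uint64_t
total_ops(const thread_stat_t *s, size_t n)
{
    uint64_t sum = 0;

    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        sum += s[i].ops;
    return sum;
}
```

With this layout each thread increments a counter on its own line, so the increments do not invalidate other threads' cached copies.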

scale_iso (per-thread handle on shared bench.db): full-isolation reads ('none', per-op page read locks) scale identically to uncommitted (668k vs 656k @ 24t, 3x to 8t) -- isolation is NOT the scaling barrier, the shared handle was. Per-op explicit txns ('rc') collapse past 8t (txn/ locker/log machinery = bottleneck #3); long-lived MVCC ('snap') avoids it. Documents the cursor-allocation fix design (sharded queues recommended, needs full run_std; ~+47% prize vs shared-handle path).

Scaling measurements, profiling, and design exploration are development notes, not user-facing documentation; they do not belong in ./docs (which should track the code). Moved to the agent notes area, which is never committed. (The same file remains on master from an earlier PR and should be removed there in a follow-up.)

The Stage 1b root-snapshot fast path cached the live-root buffer address (bt_rootpage) and read its LSN lock-free during descent. __memp_wire, however, silently no-ops when the per-region wired cap is reached (or on mmap'd pages) yet returned 0 in every case, so the caller could not tell whether the frame was actually wired. __bam_rsnap_refresh cached the frame unconditionally; an un-wired frame is evictable and its address can dangle, so __bam_rsnap_child could read LSN() from a freed/reused buffer.

- __memp_wire: add a wiredp out-param reporting whether the frame is wired on return (newly wired or already wired); document that callers caching the address for lock-free reads must check it.
- __bam_rsnap_refresh: only build the snapshot and cache bt_rootpage when wiring took; otherwise leave them NULL so the descent falls back to the normal pinned path.
- bt_compact: when compaction moves the root to a new page, disarm the cached snapshot (bt_rootpage = NULL) under the handle mutex so a stale frame pointer is never followed.
- Fix the cap arithmetic (pages * PCT / 100) so it does not truncate to zero for caches smaller than 100 buffers.

Validated: clean build (debug + release); TCL lock001, txn001, test001, ssi001, ssi002, test003, test011, test026, test111 (compaction, incl. -revsplitoff) all pass.
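The cap truncation is ordinary integer-division ordering. The sketch below contrasts the two orderings; that the earlier code divided first is an inference from the "smaller than 100 buffers" symptom, not something the commit text states, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divide first: any pool with fewer than 100 pages yields a cap of 0. */
static uint32_t
cap_divide_first(uint32_t pages, uint32_t pct)
{
    assert(pct <= 100);
    return pages / 100 * pct;
}

/*
 * Multiply first, widening to 64 bits so pages * pct cannot overflow.
 * Still rounds down, so very small pools can get a cap of 0
 * (for example 3 pages at 25%).
 */
static uint32_t
cap_multiply_first(uint32_t pages, uint32_t pct)
{
    assert(pct <= 100);
    return (uint32_t)((uint64_t)pages * pct / 100);
}
```

At 25%, a 99-page pool gets a cap of 0 from the first form and 24 from the second; from 100 pages up the two agree whenever the page count is a multiple of 100.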

gburd changed the title ~~perf(mpool): Stage 1 — BH_WIRED + optimistic descent (WIP)~~ perf(btree): Stage 1b — lock-free root snapshot for read descents Jun 19, 2026

gburd marked this pull request as ready for review June 19, 2026 16:43

gburd mentioned this pull request Jun 19, 2026

perf(mpool): #2/#7 prototype — measured that false-sharing isn't the cap #19

Closed

gburd added 12 commits June 19, 2026 16:44

test(bench): fix targ_t cache-line alignment in scale_iso

49625df

test(bench): pad targ_t to one cache line in scale_iso

e30ec88

gburd force-pushed the perf/swip-stage1-descent branch from bb2e78e to 5edf15c Compare June 19, 2026 20:48

gburd merged commit e8e77dc into master Jun 20, 2026
36 of 39 checks passed

gburd deleted the perf/swip-stage1-descent branch June 20, 2026 01:00
