perf(btree): Stage 1b — lock-free root snapshot for read descents #22

Merged: gburd merged 12 commits into master from perf/swip-stage1-descent on Jun 20, 2026
@gburd (Collaborator) commented Jun 17, 2026
Stage 1b: lock-free root snapshot for read descents

Lift the multi-core read-scaling ceiling by removing the per-operation
contention on the B-tree root page. Every descent fetches the root via
__memp_fget, which does atomic read-modify-writes on shared cache lines
(bucket-latch share count, buffer pin/refcount, buffer-latch share count).
Because every thread touches the root on every operation, those cache lines
become a true serialization point: in-cache read throughput peaks around 8
threads and then scales negatively.

Approach

Each DB handle keeps a private, immutable copy of the root page taken at a
known root LSN, plus a wired (non-evictable) pointer to the live root buffer. A
plain read descent picks its first child out of the private copy — a pure read,
no pin, no latch, no shared write — and validates the copy against the live
root LSN before and after fetching the child. On the rare root change
(split/merge) it falls back to the normal pinned/latched descent and refreshes
the copy.
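The read protocol above can be sketched roughly as follows. This is a minimal illustration, not the code in the branch: `root_frame`, `root_snap`, `snap_pick`, and `snap_child` are hypothetical names, the LSN is simplified to a single 64-bit atomic counter, and keys are plain integers compared with a linear scan.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the live, wired root buffer. */
struct root_frame {
    _Atomic uint64_t lsn;   /* assumed to advance on every structural change */
};

/* Hypothetical per-handle private copy of the root. */
struct root_snap {
    uint64_t lsn;           /* LSN at which the copy was taken */
    int      nkeys;
    int      keys[8];       /* separator keys; slot 0 is the catch-all */
    uint32_t child[8];      /* child page numbers */
};

/* Pure read of the private copy: no pin, no latch, no shared write. */
static uint32_t snap_pick(const struct root_snap *s, int key)
{
    int i = 0;

    assert(s->nkeys > 0);
    while (i + 1 < s->nkeys && s->keys[i + 1] <= key)
        i++;
    return s->child[i];
}

/*
 * Validate the copy against the live LSN before and after the child is
 * fetched. Returns false when the caller should fall back to the normal
 * pinned/latched descent and refresh its copy.
 */
static bool snap_child(struct root_frame *live, const struct root_snap *s,
                       int key, uint32_t *childp, void (*fetch)(uint32_t))
{
    uint32_t c;

    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != s->lsn)
        return false;                  /* copy is already stale */
    c = snap_pick(s, key);
    if (fetch != NULL)
        fetch(c);                      /* stands in for pinning the child */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != s->lsn)
        return false;                  /* root changed while fetching */
    *childp = c;
    return true;
}
```

In a real descent the failure path would also release the fetched child before restarting from the live root; the sketch leaves that to the caller.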

Wiring the root sidesteps the need for general epoch reclamation: the single
hot frame is simply never freed, so a lock-free reader can never dereference
reclaimed memory.

What's in the branch

  • BH_WIRED/wired buffer mark + __memp_wire/__memp_unwire, evictor skips
    wired frames, capped at MPOOL_WIRED_MAX_PCT of the region so wiring cannot
    starve the cache. Only the one common tree root is wired.
  • BAM_RSNAP per-handle root snapshot; __bam_rsnap_refresh /
    __bam_rsnap_child; superseded copies retired to a free list and freed at
    handle close (avoids a reader/free race without epoch reclamation).
  • Fast path gated to plain btree read lookups: not write/stack/parent/next/
    del/min/max, not OPD, not recno/recnum, multiversion == 0, logging on,
    durable. A plain read descent acquires no root page lock on the normal path
    either (BAM_GET_ROOT locks the root only for WRITE/recno/recnum/downrev),
    so the fast path skips no lock it should have taken.
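The gating conditions in the last bullet amount to a single predicate. A minimal sketch, assuming a simplified lookup descriptor; the `RQ_*` flags and `struct lookup` are hypothetical and are not BDB's actual flag names or structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request flags (illustrative only). */
#define RQ_WRITE   (1u << 0)
#define RQ_STACK   (1u << 1)
#define RQ_PARENT  (1u << 2)
#define RQ_NEXT    (1u << 3)
#define RQ_DEL     (1u << 4)
#define RQ_MIN     (1u << 5)
#define RQ_MAX     (1u << 6)
#define RQ_SLOW    (RQ_WRITE | RQ_STACK | RQ_PARENT | RQ_NEXT | \
                    RQ_DEL | RQ_MIN | RQ_MAX)

/* Hypothetical description of one lookup and its database/environment. */
struct lookup {
    unsigned flags;
    bool is_btree;      /* plain btree, not recno */
    bool is_opd;        /* off-page duplicate tree */
    bool recnum;        /* record numbers maintained */
    bool multiversion;
    bool logging_on;
    bool durable;
};

/* True only for plain btree read lookups where the page LSN is trustworthy. */
static bool fast_path_ok(const struct lookup *l)
{
    assert(l != NULL);
    return (l->flags & RQ_SLOW) == 0 &&
        l->is_btree && !l->is_opd && !l->recnum &&
        !l->multiversion && l->logging_on && l->durable;
}
```

Anything that fails the predicate simply takes the normal descent, so the gate can afford to be conservative.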

Correctness

  • Every structural change bumps the page LSN under the page's exclusive latch;
    the reader's pre/post LSN check brackets the child selection and never acts
    on an unvalidated result.
  • Eviction's refcount == 0 + exclusive-mtx_buf handshake is unchanged.
  • Root page number is stable in BDB (splits push content down, root pgno fixed);
    compaction is the one runtime root move and now disarms the cached snapshot.

A use-after-free found in review is fixed in bb2e78e: __memp_wire now
reports whether the frame is actually wired, and the snapshot is cached only
when wiring took (otherwise the descent falls back); compaction invalidates the
cached root pointer; the wired-cap arithmetic no longer truncates to zero for
small caches.
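The cap truncation is ordinary integer-division ordering. The sketch below assumes the earlier form divided before multiplying (the PR does not show the old expression); `cap_divide_first` and `cap_multiply_first` are hypothetical names, and 25 is the MPOOL_WIRED_MAX_PCT value quoted in the commit notes.

```c
#include <assert.h>
#include <stdint.h>

#define WIRED_MAX_PCT 25u   /* value quoted for MPOOL_WIRED_MAX_PCT */

/* Assumed earlier form: dividing first yields 0 for fewer than 100 pages. */
static uint32_t cap_divide_first(uint32_t pages)
{
    return pages / 100u * WIRED_MAX_PCT;
}

/* Multiply first, widened to 64 bits so large page counts cannot overflow. */
static uint32_t cap_multiply_first(uint32_t pages)
{
    uint64_t cap = (uint64_t)pages * WIRED_MAX_PCT / 100u;

    assert(cap <= pages);   /* a percentage <= 100 never exceeds the total */
    return (uint32_t)cap;
}
```

For a 50-buffer region the first form gives a cap of 0, so nothing could ever be wired, while the second gives 12.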

Validation

  • Clean debug + release builds, no new warnings.
  • TCL: lock001, txn001, test001 (btree), ssi001, ssi002, test003, test011
    (duplicates/OPD), test026 (cursor delete), test111 (compaction incl.
    -revsplitoff) all pass.
  • Multi-core scaling measurement on the 24-core box is tracked separately (see
    scaling-findings); headline metric is whether the ~8-thread ceiling lifts and
    the futex/atomic self-time drops at 8–24 threads.

Maps to ROADMAP #2 (latch-free buffer lookup). Stage 1c (lock-free descent of
deeper internal pages) is intentionally out of scope — profiling shows the next
bottleneck is elsewhere and a correct version needs epoch reclamation.

@gburd changed the title from "perf(mpool): Stage 1 — BH_WIRED + optimistic descent (WIP)" to "perf(btree): Stage 1b — lock-free root snapshot for read descents" on Jun 19, 2026
@gburd marked this pull request as ready for review on June 19, 2026 16:43
gburd added 12 commits June 19, 2026 16:44

Foundation for the optimistic descent: B-tree internal/root pages are wired so
the frame is never reclaimed, letting a later lock-free descent read them
without a use-after-free hazard (BDB has no epoch reclamation).

- struct __bh gains a dedicated 'wired' byte (not a flags bit: it is set with a
  plain monotonic store while the caller holds only a shared buffer latch, so it
  must not share the non-atomic RMW of the flags word that __memp_pgwrite uses
  to clear BH_DIRTY).  Reset to 0 at every buffer-header (re)init site.
- __memp_alloc skips wired buffers when choosing a victim.
- __memp_wire() sets it; guarded against memory-mapped pages (whose page
  pointer is not a buffer frame -- caught a SIGBUS in test001 with mmap'd files).
- __bam_search wires P_IBTREE/P_IRECNO pages on descent (bounded: internal
  levels only, never leaves).

No measurable perf change yet (internals were already hot-resident); this only
guarantees residency for step 2 (LSN-validated optimistic descent).

Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash,
test003.

- __memp_unwire(): clears the wired mark so a freed frame is evictable again;
  called from __db_free (the single page-free chokepoint for all access
  methods) and from __memp_bhfree for the file/env-close discard path. The
  wired byte gates the counter so it is decremented exactly once.
- Per-region wired-page counter (MPOOL.wired_pages, atomic) with a cap of
  MPOOL_WIRED_MAX_PCT (25%) of the region's buffers: over the cap __memp_wire is
  a no-op and the descent uses a normal pin, so wiring can never starve the
  cache.
- db_stat -m reports 'Wired buffers (non-evictable)'.
- mmap guard on both wire and unwire (page ptr is not a buffer frame).
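The counter and cap described in this commit can be sketched as below. This is an illustration under assumptions, not the branch's code: `struct pool`, `struct buf`, `buf_wire`, and `buf_unwire` are hypothetical, and the sketch uses an atomic exchange on the wired byte where the commit notes describe a plain store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pool {
    _Atomic uint32_t wired_pages;   /* per-region count of wired buffers */
    uint32_t wired_cap;             /* e.g. 25% of the region's buffers */
};

struct buf {
    _Atomic unsigned char wired;    /* dedicated byte, not a flags bit */
};

/* Returns true if the buffer is wired on return, false if over the cap. */
static bool buf_wire(struct pool *p, struct buf *b)
{
    uint32_t n;

    assert(p != NULL && b != NULL);
    if (atomic_load(&b->wired))
        return true;                    /* already wired */
    n = atomic_load(&p->wired_pages);
    do {
        if (n >= p->wired_cap)
            return false;               /* over the cap: caller pins normally */
    } while (!atomic_compare_exchange_weak(&p->wired_pages, &n, n + 1));
    if (atomic_exchange(&b->wired, 1) != 0)
        atomic_fetch_sub(&p->wired_pages, 1);   /* lost a race: return the slot */
    return true;
}

/* The wired byte gates the decrement, so it happens exactly once. */
static void buf_unwire(struct pool *p, struct buf *b)
{
    assert(p != NULL && b != NULL);
    if (atomic_exchange(&b->wired, 0) != 0)
        atomic_fetch_sub(&p->wired_pages, 1);
}
```

Calling `buf_unwire` twice on the same buffer, as could happen if both the page-free and close-discard paths ran, leaves the counter unchanged the second time.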

Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test003,
test011 (cursor splits/merges -> page frees exercise __memp_unwire).

Per review: internal/subtree-root pages should stay in the normal evictable
pool; only the single main tree root (BAM_ROOT_PGNO) -- fetched by every
operation -- is wired, so it stays resident without churning eviction and the
root snapshot can refresh cheaply.  Move the __memp_wire call from the
all-internals site in __bam_search to __bam_get_root, gated on
h->pgno == BAM_ROOT_PGNO(dbc).  Unwiring is already handled on page free
(__db_free) and file close (__memp_bhfree).

Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test011.

Read lookups of the main tree no longer fetch (pin/latch) the contended live
root.  Each handle keeps a private immutable copy of the root taken at a known
root LSN; a plain read of the live root LSN (via the wired root buffer)
confirms the copy is current, the copy yields the descent's first child, and
__bam_search starts from that child -- never touching the live root.

Correctness: after the child is fetched, the live root LSN is re-checked
(seqlock); if it changed (a split added a level, or a merge freed the child)
the child is released and the descent restarts from the real root.  Gated to
plain read finds of a logged, durable, non-multiversion btree (where the page
LSN reliably advances on root modification) -- everything else uses the normal
descent.  Old copies are retired to a free list and released at handle close
(root changes are rare; no reader/free race, no epoch reclamation).
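Deferring frees to handle close can be sketched with a simple retire list. The names here (`struct snap`, `struct handle`, `handle_refresh`, `handle_close`) are hypothetical and the snapshot payload is reduced to an LSN; the point is only that a superseded copy stays allocated for the life of the handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct snap {
    struct snap  *next_retired;
    unsigned long lsn;          /* stands in for the copied root page */
};

struct handle {
    struct snap *current;       /* copy used by new descents */
    struct snap *retired;       /* superseded copies, freed only at close */
};

/* Install a new copy; the old one is retired, never freed here. */
static int handle_refresh(struct handle *h, unsigned long lsn)
{
    struct snap *s;

    assert(h != NULL);
    if ((s = calloc(1, sizeof(*s))) == NULL)
        return -1;
    s->lsn = lsn;
    if (h->current != NULL) {
        h->current->next_retired = h->retired;
        h->retired = h->current;
    }
    h->current = s;
    return 0;
}

/* Free every copy, returning how many were released. */
static size_t handle_close(struct handle *h)
{
    struct snap *s, *next;
    size_t freed = 0;

    assert(h != NULL);
    for (s = h->retired; s != NULL; s = next) {
        next = s->next_retired;
        free(s);
        freed++;
    }
    if (h->current != NULL) {
        free(h->current);
        freed++;
    }
    h->retired = h->current = NULL;
    return freed;
}
```

The memory cost is one retired copy per root change per handle, which is acceptable only because root changes are rare.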

Child selection reuses __bam_cmp and the exact __bam_search binary-search rule
so the chosen child is identical to a normal descent.
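For illustration, child selection on an internal page generally means finding the last separator key that is less than or equal to the search key, with slot 0 acting as a catch-all. The sketch below is a generic version over integer keys; `pick_child` is a hypothetical name and this is not the `__bam_search` code itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the child to descend into: the largest i such that
 * keys[i] <= key, where keys[0] is treated as smaller than every key.
 * keys[1..n-1] must be sorted ascending.
 */
static size_t pick_child(const int *keys, size_t n, int key)
{
    size_t lo = 1, hi = n;

    assert(keys != NULL && n >= 1);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] <= key)
            lo = mid + 1;       /* keys[mid] still qualifies; look right */
        else
            hi = mid;           /* keys[mid] is too large; look left */
    }
    return lo - 1;              /* last slot whose key is <= the search key */
}
```

Whatever rule the snapshot path uses has to match the normal descent exactly, since a different tie-break on equal keys would route a lookup to the wrong subtree.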

Validated: NOSYNC integrity 50k/1MB (logged, fast path active, all verified);
TCL test001 btree+hash, test003, test011 (dups), test026.  The first cut
mis-gated non-logged envs (LSN never advances -> stale copy -> wrong results);
fixed with the LOGGING_ON + durable gate.  Concurrent stress + scaling on meh
next.

rrand 200k/3s on meh (24t, tmpfs): snapshot beats master at every thread
count (+22-29% at 4-8t) but both peak ~8t and negatively scale to 24t.
The snapshot raises the read-scaling ceiling without removing it; at 24t
the bottleneck has moved to the lock-manager locker region (lockers%
51-67%, lockpart% ~0.1).  Real measured win worth landing; multicore
scaling past ~8 cores now bounded by the lock manager (ROADMAP #4).

…neck

perf on meh (24t, snapshot): 40.5% of time is futex wait under
__db_pthread_mutex_lock, split between __db_cursor_int (cursor alloc) and
__dbc_close (cursor free) -- the per-get transient cursor linked/unlinked
on the ONE shared DB handle's active-cursor queue (dbp->mutex).  Per-thread
handles (sepdb) run +49% at 24t and scale near-linearly to 8t, proving it.

Next bottleneck underneath: __memp_fget/fput hash-bucket latch + refcount
atomics per descent page (root snapshot removed only the root fetch).

Benchmark critique: aggregate metric is sound; it induces lock-manager
traffic via DB_INIT_LOCK|TXN reads (should also measure READ_UNCOMMITTED);
targ_t.ops false-shares (latent, not in profile); meh is 12c/24t so the
12->24 tail is HT + all-core turbo, and peak-at-8 is software contention.

Conclusion: the next scaling fix is the per-get cursor-allocation mutex --
NOT Stage 1c (blocked, only partial) nor Stage 2 #5 (orthogonal: rrand is
100% cache hits, zero I/O).

Each thread opens its own handle on the SAME bench.db (removing the
shared-handle cursor-queue mutex app-side) and reads under a selectable
isolation level (none/read-committed/snapshot/uncommitted) to measure how
far BDB scales with full transactional isolation -- not requiring
uncommitted reads.  Per-thread state is cache-line padded.
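Cache-line padding of per-thread state can be done with C11 alignment. A minimal sketch, assuming a 64-byte line; `struct thread_stats` and `stats_add` are hypothetical and are not the benchmark's actual `targ_t`:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; typical for x86-64 */

/* Aligning the first member aligns, and therefore pads, the whole struct. */
struct thread_stats {
    alignas(CACHE_LINE) uint64_t ops;
    uint64_t errors;
};

/* Each array element must occupy whole cache lines to avoid false sharing. */
static_assert(sizeof(struct thread_stats) % CACHE_LINE == 0,
    "per-thread state must fill whole cache lines");

static void stats_add(struct thread_stats *s, uint64_t n)
{
    assert(s != NULL);
    s->ops += n;        /* touches only this thread's line */
}
```

With this layout, adjacent elements of a `struct thread_stats` array never share a line, so one thread's counter updates do not invalidate another thread's cached copy.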

scale_iso (per-thread handle on shared bench.db): full-isolation reads
('none', per-op page read locks) scale identically to uncommitted (668k
vs 656k @ 24t, 3x to 8t) -- isolation is NOT the scaling barrier, the
shared handle was.  Per-op explicit txns ('rc') collapse past 8t (txn/
locker/log machinery = bottleneck #3); long-lived MVCC ('snap') avoids it.
Documents the cursor-allocation fix design (sharded queues recommended,
needs full run_std; ~+47% prize vs shared-handle path).

Scaling measurements, profiling, and design exploration are development
notes, not user-facing documentation; they do not belong in ./docs (which
should track the code).  Moved to the agent notes area, which is never
committed.  (The same file remains on master from an earlier PR and should
be removed there in a follow-up.)

The Stage 1b root-snapshot fast path cached the live-root buffer address
(bt_rootpage) and read its LSN lock-free during descent.  __memp_wire,
however, silently no-ops when the per-region wired cap is reached (or on
mmap'd pages) yet returned 0 in every case, so the caller could not tell
whether the frame was actually wired.  __bam_rsnap_refresh cached the
frame unconditionally; an un-wired frame is evictable and its address can
dangle, so __bam_rsnap_child could read LSN() from a freed/reused buffer.

- __memp_wire: add a wiredp out-param reporting whether the frame is wired
  on return (newly wired or already wired); document that callers caching
  the address for lock-free reads must check it.
- __bam_rsnap_refresh: only build the snapshot and cache bt_rootpage when
  wiring took; otherwise leave them NULL so the descent falls back to the
  normal pinned path.
- bt_compact: when compaction moves the root to a new page, disarm the
  cached snapshot (bt_rootpage = NULL) under the handle mutex so a stale
  frame pointer is never followed.
- Fix the cap arithmetic (pages * PCT / 100) so it does not truncate to
  zero for caches smaller than 100 buffers.
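The shape of the fix, where the caller caches the frame address only if wiring actually took, might look like the sketch below. All names (`struct frame`, `struct handle`, `frame_wire`, `snap_refresh`, `snap_disarm`) are hypothetical, the cap check is reduced to a boolean argument, and the handle mutex mentioned above is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct frame {
    bool wired;
    bool mmapped;       /* mmap'd pages are not buffer frames */
};

struct handle {
    struct frame *rootpage;     /* cached live-root address, or NULL */
};

/*
 * Returns 0 in every case, as described above, but reports through
 * wiredp whether the frame is wired on return (newly or already).
 */
static int frame_wire(struct frame *f, bool cap_reached, bool *wiredp)
{
    assert(f != NULL && wiredp != NULL);
    if (!f->wired && !f->mmapped && !cap_reached)
        f->wired = true;
    *wiredp = f->wired;
    return 0;
}

/* Cache the root address only when it cannot be evicted. */
static void snap_refresh(struct handle *h, struct frame *root, bool cap_reached)
{
    bool wired = false;

    assert(h != NULL);
    h->rootpage = NULL;         /* default: fall back to the pinned path */
    if (frame_wire(root, cap_reached, &wired) != 0 || !wired)
        return;
    h->rootpage = root;
}

/* Compaction moved the root: never follow the stale pointer again. */
static void snap_disarm(struct handle *h)
{
    assert(h != NULL);
    h->rootpage = NULL;
}
```

A NULL `rootpage` is the single signal the descent needs to take the normal path, which covers the over-cap, mmap, and post-compaction cases.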

Validated: clean build (debug + release); TCL lock001, txn001, test001,
ssi001, ssi002, test003, test011, test026, test111 (compaction, incl.
-revsplitoff) all pass.
@gburd force-pushed the perf/swip-stage1-descent branch from bb2e78e to 5edf15c on June 19, 2026 20:48
@gburd merged commit e8e77dc into master on Jun 20, 2026
36 of 39 checks passed
@gburd deleted the perf/swip-stage1-descent branch on June 20, 2026 01:00