perf(lock): shared locker latch on lock-get hot path (2.7x at 24t) #28
Merged
Every DB_ENV->lock_get / lock_put resolves its locker through __lock_getlocker_int under the region-global locker mutex (mtx_lockers). On the lock-get path the lookup is create=0 -- a read-only walk of the locker hash bucket -- yet it was held *exclusive*, serializing every lock acquisition across all cores even when objects are fully partitioned and there is no lock conflict.

Make mtx_lockers a DB_MUTEX_SHARED latch and take it in shared mode for the read-only locker lookup on the hot path (__lock_get_api). Locker create, free, the deadlock detector's locker-list walk, failchk, and stat continue to hold it exclusive, so they never run concurrently with a reader.

Measured with lab/bench/lock_bench (distinct mode, no lock conflict, on a 24-thread box): master plateaus and then declines past 8 threads (~3.0M ops/s peak, 2.6M at 24t); the shared latch scales to 7.0M at 24t -- 2.1x at 8 threads, 2.7x at 24. It captures roughly half the upper bound of removing the mutex entirely; the remainder is the shared latch's own reference-count cache line, which would require partitioning the locker hash to recover (left for later -- this is the low-risk 80/20). A small single-thread regression (~8%) reflects the shared latch's slightly higher uncontended cost and is dwarfed by the multi-core gain.

Verified: TCL lock001/002/003 (incl. multi-process), txn001/002, test001, ssi001/002 pass; concurrent shared read-lock acquisition (lock_bench shared) runs clean.
perf(lock): take the locker mutex shared on the lock-get hot path
Every DB_ENV->lock_get / lock_put resolves its locker through __lock_getlocker_int under the region-global locker mutex mtx_lockers. On the lock-get path that lookup is create=0, a read-only walk of a locker hash bucket, yet the mutex was held exclusive, serializing every lock acquisition across all cores even with objects fully partitioned (240-way) and zero lock conflict.
Fix
Make mtx_lockers a DB_MUTEX_SHARED latch and take it shared for the read-only locker lookup on the hot path (__lock_get_api). Locker create/free, the deadlock detector's locker-list walk, failchk, and stat keep it exclusive, so a reader never runs concurrently with a writer.
Measured (lab/bench/lock_bench distinct, no conflict, 24-thread box)
Master plateaus and declines past 8 threads; the shared latch scales to 24 threads. Upper bound = removing the mutex entirely (unsafe diagnostic); the shared latch captures ~half, the rest needs partitioning the locker hash (deferred, more invasive). Single-thread cost rises ~8% (shared vs plain mutex, uncontended), dwarfed by the multi-core gain.
No regression on real workloads: rrand unchanged (btree-bound), tproc_b flat (deadlock/disk-bound); it helps where the bottleneck is and costs nothing elsewhere.
Verified
TCL lock001/002/003 (incl. the multi-process test), txn001/002, test001, ssi001/002 pass; concurrent shared read-lock acquisition runs clean; clean build (gcc via Nix, Apple clang).
The probe and benchmark fixes used to find this are in #27.