Skip to content

test(jepsen): add ZSet safety workload with model-based checker#550

Open
bootjp wants to merge 42 commits into
mainfrom
feat/jepsen-zset-safety
Open

test(jepsen): add ZSet safety workload with model-based checker#550
bootjp wants to merge 42 commits into
mainfrom
feat/jepsen-zset-safety

Conversation

@bootjp

@bootjp bootjp commented Apr 19, 2026

Copy link
Copy Markdown
Owner

Summary

Adds a Jepsen workload (elastickv-zset-safety-test) that verifies ZSet-specific safety properties under faults (network partitions, node kills), using a custom model-based checker. Goes beyond the simple add->read visibility check.

Properties verified:

  • Score correctness: a ZRANGE result's score for a member must equal the model's latest committed score, OR equal a score written by an operation concurrent with the read.
  • Order preservation: ZRANGE 0 -1 must be sorted by (score asc, member lex asc).
  • ZRANGEBYSCORE correctness: bounded range queries return exactly the members whose score falls in the bound, modulo concurrent mutations.
  • No phantom members: every read member must have been introduced by some successful or in-flight ZADD/ZINCRBY.
  • Atomicity: the checker treats every :ok operation as atomic; any visible inconsistency is reported.

Concurrent-ZADD handling uses an invoke/complete windowing approach. A mutation is "committed before" a read iff its :complete index is strictly less than the read's :invoke index. Mutations whose intervals overlap are "concurrent" and contribute to a per-member allowed-score set. Indeterminate (:info) mutations are treated as possibly-concurrent. ZINCRBY whose response is unknown sets :unknown-score? so the checker skips the strict score check for concurrent reads. ZREM carries the actual removed? boolean from the server reply so a no-op ZREM does not falsely mark the member as deleted.

Files

  • New: jepsen/src/elastickv/redis_zset_safety_workload.clj
  • New: jepsen/test/elastickv/redis_zset_safety_workload_test.clj (test-spec construction + checker edge cases: no-op ZREM, :info ZINCRBY, deterministic score mismatch).
  • Modified: jepsen/src/elastickv/jepsen_test.clj (entry point added).
  • Modified: .github/workflows/jepsen-test.yml (5s smoke run on every push).
  • Modified: .github/workflows/jepsen-test-scheduled.yml (150s default run every 6h).

Running locally

The workload is invoked directly via its own -main, not through jepsen-test/-main:

cd jepsen
lein run -m elastickv.redis-zset-safety-workload \
  --time-limit 60 --rate 10 --concurrency 5 \
  --ports 63791,63792,63793 --host 127.0.0.1

(elastickv.jepsen-test exposes elastickv-zset-safety-test only as a Clojure function for REPL use; CI and ad-hoc runs use the namespace's own -main.

Test plan

  • passes (Java 21).
  • Run with partition + kill faults: confirm checker emits a clear failure when a stale-leader read returns a divergent score.
    EOF
    )

Summary by CodeRabbit

  • New Features
    • Added a Redis ZSet “safety” Jepsen workload with a dedicated CLI test entrypoint and safety checker.
  • CI/CD
    • Updated both scheduled and on-demand Jepsen workflows to run the new ZSet safety workload.
  • Bug Fixes
    • Improved the wide-column Redis range-by-score fast path for truncation and equal-score tie handling.
  • Tests
    • Added Jepsen unit coverage for workload/checker semantics and added Go tests for ZSet removal and fast-path edge cases.
  • Chores
    • Minor Jepsen CLI dispatch improvement, updated one stream latency test’s ID handling, and refreshed an explanatory comment.

Adds a Jepsen workload that goes beyond add->read visibility and verifies
ZSet-specific safety properties under faults (network partitions, node
kills):

- score correctness: a ZRANGE result's score for a member must equal the
  model's latest committed score for that member, OR equal a score
  written by an operation that is concurrent with the read (since the
  client cannot linearise concurrent writes to the same member).
- order preservation: ZRANGE 0 -1 must be sorted by (score asc, member
  lex asc).
- ZRANGEBYSCORE correctness: bounded range queries return exactly the
  members whose score falls in the bound, modulo concurrent mutations.
- no phantom members: every read member must have been introduced by some
  successful or in-flight ZADD/ZINCRBY.

Concurrent-ZADD handling uses an invoke/complete windowing approach:
mutations whose complete index < read's invoke index are committed
before the read; mutations whose intervals overlap are concurrent and
contribute to the per-member allowed-score set. Indeterminate (:info)
mutations are treated as possibly-concurrent.

Workload entry point added to jepsen_test.clj as elastickv-zset-safety-test.
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This pull request adds a Redis ZSet safety Jepsen workload, wires it into CLI and workflow entrypoints, expands checker and client tests, and updates Redis adapter ZSET range, persistence, and related support behavior.

Changes

Redis ZSet Safety Workload

Layer / File(s) Summary
CI workflow integration
.github/workflows/jepsen-test-scheduled.yml, .github/workflows/jepsen-test.yml
Added GitHub Actions steps to run the Redis ZSet safety workload in scheduled and on-demand Jepsen workflows with explicit timeouts, ports, host, and workload parameters.
Test dispatcher and entrypoints
jepsen/src/elastickv/jepsen_test.clj, jepsen/src/elastickv/redis_zset_safety_workload.clj
Registered the new Jepsen test entrypoint, added the workload constructor, and wired CLI dispatch so the new ZSet safety test can be selected from the command line.
Checker semantics
jepsen/src/elastickv/redis_zset_safety_workload.clj
Implements the model-based checker that reconstructs histories, derives mutation windows, validates read ordering and completeness, and classifies malformed or vacuous runs.
Workload test coverage
jepsen/test/elastickv/redis_zset_safety_workload_test.clj
Adds unit tests for workload construction, checker edge cases, linearization rules, malformed read handling, client response parsing, setup failure paths, and WITHSCORES normalization.

Redis Adapter and Support Updates

Layer / File(s) Summary
ZSET fast path and persistence
adapter/redis_lua_context.go, adapter/redis_zset_cmds.go
Updates the wide-column ZSET range-by-score fast path to adjust scan limits for ties, separate decoding from pagination, detect truncation and tie fallback cases, and preserve bounded LIMIT semantics. It also changes wide-column persistence to update member and score-index rows directly when wide-column storage is present.
ZSET range and removal tests
adapter/redis_zset_cmds_test.go, adapter/redis_retry_test.go
Adds tests for deterministic ordering of equal-score members, bounded tie fallback, bounded windows with unique scores, window-cap truncation, and wide-column ZREM storage cleanup.
Supporting adapter adjustments
adapter/redis_compat_commands_stream_test.go, cmd/elastickv-admin/main.go, adapter/redis_lua_context.go
Refines the ZSET scan-bounds comment, updates the stream latency test to use generated IDs, and adds a membership-cache freshness check in the admin refresh path.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Suggested labels

codex

Poem

🐰 I hopped through scores and scans tonight,
With zsets all sorted just right.
The checker nibbled every trail,
While fast paths danced without fail—
A carrot-bright PR in the moonlight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: adding a Jepsen ZSet safety workload and model-based checker.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/jepsen-zset-safety

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

- jepsen-test.yml: 5s smoke run on every push, mirroring the other
  workloads.
- jepsen-test-scheduled.yml: 150s default run (overridable via
  workflow_dispatch inputs) every 6 hours.

Workload entry: elastickv.redis-zset-safety-workload (added in the
previous commit via -main).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Jepsen workload to validate elastickv’s Redis ZSet safety properties (scores, ordering, range correctness, and phantom detection) under faults using a custom model-based checker, and wires a helper entrypoint into the Jepsen test namespace.

Changes:

  • Introduces redis_zset_safety_workload.clj with a Carmine-based client, randomized op generator, and a custom checker for ZSet safety properties.
  • Adds a new wrapper function in jepsen_test.clj intended to expose the new workload.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
jepsen/src/elastickv/redis_zset_safety_workload.clj New ZSet safety Jepsen workload with custom model/checker and CLI entrypoint.
jepsen/src/elastickv/jepsen_test.clj Adds a wrapper function and require for the new ZSet safety test.

Comment on lines +476 to +526
(defn elastickv-zset-safety-test
"Builds a Jepsen test map that drives elastickv's Redis ZSet safety
workload."
([] (elastickv-zset-safety-test {}))
([opts]
(let [nodes (or (:nodes opts) default-nodes)
redis-ports (or (:redis-ports opts)
(repeat (count nodes) (or (:redis-port opts) 6379)))
node->port (or (:node->port opts)
(cli/ports->node-map redis-ports nodes))
local? (:local opts)
db (if local?
jdb/noop
(ekdb/db {:grpc-port (or (:grpc-port opts) 50051)
:redis-port node->port
:raft-groups (:raft-groups opts)
:shard-ranges (:shard-ranges opts)}))
rate (double (or (:rate opts) 10))
time-limit (or (:time-limit opts) 60)
faults (if local?
[]
(cli/normalize-faults (or (:faults opts) [:partition :kill])))
nemesis-p (when-not local?
(combined/nemesis-package {:db db
:faults faults
:interval (or (:fault-interval opts) 40)}))
nemesis-gen (if nemesis-p
(:generator nemesis-p)
(gen/once {:type :info :f :noop}))
workload (elastickv-zset-safety-workload
(assoc opts :node->port node->port))]
(merge workload
{:name (or (:name opts) "elastickv-redis-zset-safety")
:nodes nodes
:db db
:redis-host (:redis-host opts)
:os (if local? os/noop debian/os)
:net (if local? net/noop net/iptables)
:ssh (merge {:username "vagrant"
:private-key-path "/home/vagrant/.ssh/id_rsa"
:strict-host-key-checking false}
(when local? {:dummy true})
(:ssh opts))
:remote control/ssh
:nemesis (if nemesis-p (:nemesis nemesis-p) nemesis/noop)
:final-generator nil
:concurrency (or (:concurrency opts) 5)
:generator (->> (:generator workload)
(gen/nemesis nemesis-gen)
(gen/stagger (/ rate))
(gen/time-limit time-limit))}))))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. Unit tests for this workload were added in commit ecb3983 under jepsen/test/elastickv/redis_zset_safety_workload_test.clj — they cover test-spec construction (builds-test-spec, custom-options-override-defaults) plus checker edge cases: no-op ZREM, :info ZINCRBY, phantoms, stale reads, superseded scores, concurrent ZADD/ZREM, :info ZREM absence, :info ZINCRBY completeness, infinity parsing, duplicate-member flagging. Latest commit 0c0efc4 adds two more for the single-ok / multi-ZINCRBY score check.

Comment on lines +251 to +254
:zrem
(let [m (:value invoke)]
{:f :zrem :member m :score nil :zrem? true
:type t :invoke-idx inv-idx :complete-idx cmp-idx})))))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. ZREM now carries :removed? from the server reply into the mutation record (see completed-mutation-window lines 283-295 and apply-mutation-to-state/existence-evidence?). A no-op ZREM (returns 0) does NOT mutate model state, and :info/:pending ZREMs are treated as possibly-concurrent deletions via the concurrent window. Regression test: noop-zrem-does-not-flag-correct-read and no-op-zrem-alone-does-not-false-positive. Landed in 2a194a4 / ecb3983.

Comment on lines +244 to +249
:zincrby
(let [[m _delta] (:value invoke)
s (when (and (= :ok t) (vector? (:value complete)))
(second (:value complete)))]
{:f :zincrby :member m :score (some-> s double)
:type t :invoke-idx inv-idx :complete-idx cmp-idx})

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. completed-mutation-window sets :unknown-score? on each :info/:pending ZINCRBY (line 277-279), and allowed-scores-for-member propagates that to the caller so check-zrange-all/check-zrangebyscore skip strict score validation when the resulting score is unknown. Regression test: info-zincrby-skips-strict-score-check, info-before-read-is-considered-uncertain. Landed in 2a194a4 / ecb3983, refined in 0c0efc4 per Codex P1.

Comment on lines +271 to +274
strictly precede `read-inv-idx`. Model maps member -> {:score s} or
marks member as :deleted. Returns {:members map :ok-members set}.
Only considers :ok mutations for the authoritative model; :info
mutations are treated as uncertain (neither strictly applied nor not)."

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. Docstring rewritten on line 323-327: "Construct authoritative per-member state from mutations whose completions strictly precede read-inv-idx. Returns {member -> {:present? bool :score s}}". The {:members map :ok-members set} wording no longer appears in the file.

Comment thread jepsen/src/elastickv/jepsen_test.clj Outdated
Comment on lines 18 to 23
(defn elastickv-zset-safety-test []
(zset-safety-workload/elastickv-zset-safety-test {}))

(defn -main
[& args]
(cli/run! (cli/single-test-cmd {:test-fn elastickv-test}) args))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. jepsen_test.clj now dispatches on the first positional arg via the test-fns map (lines 21-29), auto-prepending test if the user omitted the jepsen subcommand. elastickv-zset-safety-test is reachable via lein run -m elastickv.jepsen-test elastickv-zset-safety-test .... The workload namespace's own -main also works for direct invocation.

Three Copilot findings on PR #550:

1. :zincrby indeterminate handling. Pending or :info ZINCRBY left the
   resulting score unknown, but the checker still required the
   observed read score to be in the finite allowed-scores set. A read
   that legitimately observed an in-flight increment was flagged as a
   score mismatch (false positive).

   completed-mutation-window now sets :unknown-score? on a ZINCRBY when
   the completion is :info or pending. allowed-scores-for-member
   returns :unknown-score? when any concurrent ZINCRBY carries the
   flag, and check-zrange-all / check-zrangebyscore skip the strict
   score-membership check in that case.

2. :zrem no-op handling. ZREM of a never-added member returns 0
   server-side (no-op). The previous model treated every ZREM as a
   deletion, producing missing-member false positives and
   score-mismatch false negatives.

   invoke! already exposes the actual removed? boolean as the second
   element of the completion value. completed-mutation-window now
   threads :removed? through, and the new apply-mutation-to-state
   helper leaves state unchanged when :removed? is false.

3. model-before docstring claimed it returned {:members map :ok-members
   set}, but it returned the model map directly. Docstring rewritten to
   match the actual return value.

Adds jepsen/test/elastickv/redis_zset_safety_workload_test.clj
covering test-spec construction, the no-op ZREM edge case, the :info
ZINCRBY skip, and a positive-control score-mismatch detection. The
checker tests bypass the timeline.html sub-checker (which writes to
the test store) by invoking zset-safety-checker directly.
@bootjp

bootjp commented Apr 19, 2026

Copy link
Copy Markdown
Owner Author

94be1bd で Copilot 指摘を fix:

  • :zincrby indeterminate false positive: pending/:info ZINCRBY の resulting score 不明分を :unknown-score? フラグで伝播。allowed-scores-for-member が concurrent に unknown ZINCRBY を含む場合、strict score check を skip。
  • :zrem no-op false positive: invoke! が既に返している removed? boolean を mutation record に thread。:removed? false の ZREM は state を変更しない (apply-mutation-to-state ヘルパーで分岐)。
  • model-before docstring: 実装に合わせて修正。
  • PR description: lein run -m elastickv.redis-zset-safety-workload 直接呼び出し方式を明記 (CI と整合)。
  • Workload unit test: jepsen/test/elastickv/redis_zset_safety_workload_test.clj 追加。test-spec 構築、no-op ZREM, :info ZINCRBY, positive-control score mismatch を deterministic に検証。timeline.html サブチェッカーは store を要求するので bypass し、zset-safety-checker を直接呼び出し。

lein test elastickv.redis-zset-safety-workload-test 全 5 件 PASS (Java 21 環境で確認)。

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…re-mismatch

CI failure on 94be1bd surfaced a remaining checker false positive: a
read that observes a member whose only prior ops are no-op ZREMs was
classified as :score-mismatch with :allowed #{} rather than treated as
a never-existed member. allowed-scores-for-member returned
any-known? = true because (seq concurrent) was truthy even when every
concurrent op was a :zrem that didn't actually remove anything.

Fix: any-known? (renamed internally to existence-evidence?) now only
counts concurrent mutations that provide evidence the member ever
existed -- :zadd, :zincrby, or a :zrem whose :removed? boolean is true.
A concurrent no-op ZREM contributes nothing.

Adds no-op-zrem-alone-does-not-false-positive as a regression test.
All 6 workload unit tests pass under Java 21.
@bootjp

bootjp commented Apr 20, 2026

Copy link
Copy Markdown
Owner Author

9bfcc13 で CI 失敗原因の checker false positive を追加 fix:

  • no-op ZREM のみで score-mismatch 誤検出: CI 結果 :allowed #{} の score-mismatch が発生。any-known?(seq concurrent) truthy を返していたため、concurrent が全部 no-op ZREM でも score 判定パスに入っていた。
  • 修正: existence-evidence? に名前を改め、concurrent mutation のうち そのメンバが存在したことの証拠 だけカウント (:zadd / :zincrby / :removed? true:zrem)。no-op ZREM は存在証拠にならない。
  • regression test 追加 (no-op-zrem-alone-does-not-false-positive)。unit test 計 6 件 PASS。

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
jepsen/test/elastickv/redis_zset_safety_workload_test.clj (1)

42-89: Add a direct :zrangebyscore checker regression.

The checker has a separate bounded-range path, but these edge-case tests only exercise :zrange-all. A small missing-member or out-of-range regression would protect the advertised ZRANGEBYSCORE property.

Example test to cover bounded-range completeness
+(deftest zrangebyscore-missing-member-is-detected
+  (let [history [{:type :invoke :process 0 :f :zadd :value ["m1" 5] :index 0}
+                 {:type :ok     :process 0 :f :zadd :value ["m1" 5] :index 1}
+                 {:type :invoke :process 0 :f :zrangebyscore :value [0 10] :index 2}
+                 {:type :ok     :process 0 :f :zrangebyscore
+                  :value {:bounds [0 10] :members []}
+                  :index 3}]
+        result  (run-checker history)]
+    (is (not (:valid? result)) (str "expected range mismatch, got: " result))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj` around lines 42 -
89, Add a new test that exercises the bounded-range code path by invoking
:zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE logic is
covered; create a test (e.g. noop-zrem-does-not-flag-bounded-zrangebyscore or
similar) that mirrors one of the existing edge-case histories but uses
:zrangebyscore (with appropriate score bounds/values) and asserts run-checker
returns :valid? or not as expected, ensuring you reference the same run-checker
invocation and operation symbols (:zadd, :zrem, :zincrby, :zrangebyscore) so the
new test hits the bounded-range branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jepsen/src/elastickv/redis_zset_safety_workload.clj`:
- Around line 386-416: The read result must be checked for duplicate members to
prevent a ZSet read returning the same member twice; update check-zrange-all
(and the similar function handling ranges around lines 430-464) to detect
duplicate member entries before per-entry validation by scanning entries and
maintaining a seen-members set, and if a member is already seen swap! the errors
atom with a {:kind :duplicate :index cmp-idx :member member :entries entries}
(or similar) error entry so duplicate-member reads are reported and rejected.
- Around line 313-318: The current logic builds applied as (->> muts (filter ...
(< (:complete-idx %) read-inv-idx)) (sort-by :complete-idx)) and then reduces
via apply-mutation-to-state, which incorrectly linearizes overlapping :ok
mutations by completion time; change this to detect overlapping committed
mutations (using their :invoke-idx and :complete-idx intervals from muts) and do
not sort/serialize ambiguous pairs by :complete-idx. Instead compute either the
set of all possible latest states for the read by merging non-overlapping
mutations deterministically and treating overlapping/conflicting writes
conservatively (e.g., allow values from any write whose interval is not ordered
before the read), and update the applied construction and reduction via
apply-mutation-to-state to use that conservative/possible-states approach; apply
the same fix at the other occurrence around lines 347-352.

---

Nitpick comments:
In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj`:
- Around line 42-89: Add a new test that exercises the bounded-range code path
by invoking :zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE
logic is covered; create a test (e.g.
noop-zrem-does-not-flag-bounded-zrangebyscore or similar) that mirrors one of
the existing edge-case histories but uses :zrangebyscore (with appropriate score
bounds/values) and asserts run-checker returns :valid? or not as expected,
ensuring you reference the same run-checker invocation and operation symbols
(:zadd, :zrem, :zincrby, :zrangebyscore) so the new test hits the bounded-range
branch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e12c0d39-43a2-4d57-a8a3-f49ade75baba

📥 Commits

Reviewing files that changed from the base of the PR and between 173fbbc and 9bfcc13.

📒 Files selected for processing (5)
  • .github/workflows/jepsen-test-scheduled.yml
  • .github/workflows/jepsen-test.yml
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj

Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj Outdated
Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj Outdated
Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj Outdated
Three Major-severity CodeRabbit findings on PR #550:

1. Duplicate-member detection (line 416): a ZSet read must return each
   member at most once. Previously, if ZRANGE returned the same member
   twice with an allowed score, the checker accepted it because sort
   and score-membership checks passed independently per entry.
   duplicate-members helper now flags :duplicate-members (and
   :duplicate-members-range for ZRANGEBYSCORE) before the per-entry
   loop.

2. Overlapping committed writes (line 318): two :ok mutations whose
   invoke/complete windows overlap have ambiguous serialization order.
   Pinning allowed-scores to a single last-wins linearization by
   :complete-idx was unsound. allowed-scores-for-member now unions all
   :zadd/:ok-:zincrby scores from committed mutations (over-approx
   that stays sound), and must-be-present? is relaxed when any pair
   of committed writes for the same member overlaps in time.

3. Pre-read :info mutations (line 328): a mutation recorded as :info
   whose completion precedes a later read's invoke may have taken
   effect server-side. Previously it was ignored by both model-before
   (:ok only) and the concurrent window (complete-idx >= read-inv-idx
   required). Now collected as pre-read-info, contributing to allowed
   scores and flipping unknown-score? for :zincrby with unknown
   resulting score.

3 new regression tests (duplicate-members-are-flagged, overlapping-
committed-zadds-allow-either-score, info-before-read-is-considered-
uncertain). Workload unit test count now 9, all PASS under Java 21.
@bootjp

bootjp commented Apr 20, 2026

Copy link
Copy Markdown
Owner Author

6d0b4c3 で CodeRabbit Major 3 件を fix:

  • 重複メンバー: duplicate-members ヘルパーで ZRANGE / ZRANGEBYSCORE 結果の重複を :duplicate-members / :duplicate-members-range として明示検知。sort / score-membership の個別チェックでは検知できなかったケースを閉じた。
  • 重複する committed writes の線形化: :complete-idx 順の単一直列化は 2 つの :ok mutation の window が重なる場合に unsafe。allowed-scores-for-member を「committed の全 :zadd / :ok :zincrby score の union」に広げ (sound な over-approx)、must-be-present? は committed 間に重なりがあるときリラックス。
  • read 前完了の :info: pre-read-info として抽出し、allowed-scores と unknown-score? の判定に含めた。以前は model-before (:ok のみ) にも concurrent 窓 (complete-idx >= read-inv-idx 必須) にも入らず無視されていた。

regression test 3 件追加 (duplicate-members-are-flagged / overlapping-committed-zadds-allow-either-score / info-before-read-is-considered-uncertain)。unit test 計 9 件 PASS。

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@bootjp

bootjp commented Apr 20, 2026

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Jepsen workload and model-based checker to verify Redis ZSet safety properties in elastickv. It includes a client for Redis operations, a generator for high-contention workloads, and a checker that validates score correctness, ordering, and atomicity. Feedback focuses on handling infinite scores in Redis to prevent parsing exceptions, improving the soundness of the linearizability model in allowed-scores-for-member to prevent stale reads, and optimizing the efficiency of mutation checks.

Comment on lines +71 to +80
(defn- parse-withscores
"Carmine returns a flat [member score member score ...] vector for
ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)]
preserving server-returned order (score ascending, then member)."
[flat]
(->> flat
(partition 2)
(mapv (fn [[m s]]
[(if (bytes? m) (String. ^bytes m) (str m))
(Double/parseDouble (str s))]))))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of Double/parseDouble directly on Redis score strings will fail if the score is infinite. Redis returns "inf", "+inf", or "-inf" for infinite values, but Java's Double/parseDouble expects "Infinity" or "-Infinity". This will cause a NumberFormatException during the test if infinite scores are encountered.

(defn- parse-double-safe
  "Parses a Redis score string into a Double, handling 'inf' representations."
  [s]
  (let [s-str (str/lower-case (str s))]
    (cond
      (str/includes? s-str "inf") (if (str/starts-with? s-str "-")
                                     Double/NEGATIVE_INFINITY
                                     Double/POSITIVE_INFINITY)
      :else (Double/parseDouble (str s)))))

(defn- parse-withscores
  "Carmine returns a flat [member score member score ...] vector for
  ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)]
  preserving server-returned order (score ascending, then member)."
  [flat]
  (->> flat
       (partition 2)
       (mapv (fn [[m s]]
               [(if (bytes? m) (String. ^bytes m) (str m))
                (parse-double-safe s)]))))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. parse-double-safe (lines 71-87) normalizes inf / +inf / -inf (and JVM-style Infinity) before falling through to Double/parseDouble. parse-withscores calls it on every returned score. Regression test: parse-withscores-handles-inf-strings. Landed in ecb3983.

(let [[member delta] (:value op)
new-score (car/wcar cs (car/zincrby zset-key (double delta) member))]
(assoc op :type :ok
:value [member (Double/parseDouble (str new-score))]))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As noted in the parse-withscores feedback, Double/parseDouble should be replaced with a safe parser that handles Redis's "inf" strings.

                   :value [member (parse-double-safe new-score)]]))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. The :zincrby branch of invoke! already calls parse-double-safe on the server-returned score (line 134). Covered by parse-withscores-handles-inf-strings plus the infinity branch in parse-double-safe. Landed in ecb3983.

Comment on lines +330 to +418
(defn- allowed-scores-for-member
"Compute the set of scores considered valid for `member` by a read
whose window is [read-inv-idx, read-cmp-idx], based on committed state
and any concurrent mutations.

Returns:
:scores - set of acceptable scores (committed + concurrent
:zadd / :ok :zincrby).
:unknown-score? - true iff any concurrent ZINCRBY's resulting score
is unknown (in-flight or :info). When set, the
caller MUST skip the strict score-membership
check to stay sound.
:must-be-present? - committed state says present and no concurrent
mutation could have removed/changed it.
:any-known? - some op claims to have touched this member."
[mutations-by-m member read-inv-idx read-cmp-idx]
(let [muts (get mutations-by-m member [])
;; :ok mutations that completed strictly before the read. They
;; may have overlapped with each other in wall-clock time, so
;; the serialization order among them is ambiguous.
committed (->> muts
(filter #(and (= :ok (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; :info mutations that completed before the read: they may or
;; may not have taken effect server-side. We must account for
;; their possible scores just like concurrent ones.
pre-read-info (->> muts
(filter #(and (= :info (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Concurrent mutations: windows overlap the read. Include both
;; :ok and :info since either may have taken effect.
concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
;; A conservative last-wins linearization for the must-be-present?
;; check only. Ambiguous when committed writes overlap each other.
committed-sorted (sort-by :complete-idx committed)
committed-state (reduce apply-mutation-to-state nil committed-sorted)
committed-overlap? (boolean
(some (fn [[a b]]
(and (not (identical? a b))
(<= (:invoke-idx a) (:complete-idx b))
(<= (:invoke-idx b) (:complete-idx a))))
(for [a committed, b committed] [a b])))
;; Union of every score that any committed / pre-read :info /
;; concurrent op could have produced. This over-approximates the
;; legitimate post-state set when writes overlap, keeping the
;; checker sound at the cost of being slightly less strict on
;; overlapping concurrent writers.
add-scores (fn [acc m]
(case (:f m)
:zadd (conj acc (:score m))
:zincrby (cond-> acc (some? (:score m)) (conj (:score m)))
:zrem acc))
scores (as-> #{} s
(reduce add-scores s committed)
(reduce add-scores s pre-read-info)
(reduce add-scores s concurrent))
unknown-score? (or
(some #(and (= :zincrby (:f %)) (:unknown-score? %))
concurrent)
(some #(and (= :zincrby (:f %)) (:unknown-score? %))
pre-read-info))
;; any-known? must only be true when something provides evidence
;; the member actually existed at some point. A no-op ZREM
;; (:removed? false) does NOT prove existence.
existence-evidence? (or (some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
committed)
(some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
pre-read-info)
(some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
concurrent))]
{:scores scores
:unknown-score? (boolean unknown-score?)
;; must-be-present? is relaxed when committed writes overlap
;; among themselves or when any :info / concurrent mutation could
;; have removed the member before the read.
:must-be-present? (boolean (and committed-state
(:present? committed-state)
(not committed-overlap?)
(empty? pre-read-info)
(empty? concurrent)))
:any-known? (boolean existence-evidence?)}))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of allowed-scores-for-member is too permissive and potentially unsound for linearizability checks.

  1. Stale Reads: It includes all successful committed scores in the scores set (line 384), even if they were strictly followed by a later committed mutation. In a linearizable system, a read should only observe scores from the "potentially latest" committed writes (those not strictly followed by another) or concurrent ones.
  2. Stale Presence: The checker does not verify if a member is allowed to be present. If a member was added and then deleted (both committed before the read), any-known? remains true, and the read will accept the member even though it should be absent.
  3. Efficiency: The committed-overlap? check (line 368) is $O(N^2)$ relative to the number of mutations for a member. While the member universe is small, this can be optimized to $O(N)$ by tracking the maximum invoke-idx of preceding mutations.
(defn- allowed-scores-for-member
  "Compute the set of scores considered valid for `member` by a read
  whose window is [read-inv-idx, read-cmp-idx], based on committed state
  and any concurrent mutations."
  [mutations-by-m member read-inv-idx read-cmp-idx]
  (let [muts (get mutations-by-m member [])
        ;; Mutations that completed strictly before the read started.
        preceding (filter #(and (= :ok (:type %))
                                (some? (:complete-idx %))
                                (< (:complete-idx %) read-inv-idx))
                          muts)
        ;; A mutation is a candidate for the 'latest' state if no other
        ;; committed mutation strictly follows it (m2.invoke > m.complete).
        max-inv (reduce max -1 (map :invoke-idx preceding))
        candidates (filter #(>= (:complete-idx %) max-inv) preceding)

        concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
        pre-read-info (filter #(and (= :info (:type %))
                                    (some? (:complete-idx %))
                                    (< (:complete-idx %) read-inv-idx))
                              muts)

        ;; Scores allowed by committed state or uncertain ops.
        scores (into (set (keep :score candidates))
                     (concat (keep :score concurrent)
                             (keep :score pre-read-info)))

        unknown-score? (some #(and (= :zincrby (:f %)) (:unknown-score? %))
                             (concat concurrent pre-read-info))

        ;; Allowed to be present if any candidate or concurrent/info op is a write.
        can-be-present? (or (some #(#{:zadd :zincrby} (:f %)) candidates)
                            (some #(#{:zadd :zincrby} (:f %)) concurrent)
                            (some #(#{:zadd :zincrby} (:f %)) pre-read-info))

        ;; Must be present if all candidates are writes and no concurrent/info op can delete.
        must-be-present? (and (seq candidates)
                              (every? #(#{:zadd :zincrby} (:f %)) candidates)
                              (not (some #(= :zrem (:f %)) concurrent))
                              (not (some #(= :zrem (:f %)) pre-read-info)))]
    {:scores scores
     :unknown-score? (boolean unknown-score?)
     :must-be-present? must-be-present?
     :can-be-present? can-be-present?}))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. allowed-scores-for-member now applies a real-time last-wins max-inv filter to committed mutations (lines 408-411): a preceding mutation is admissible only when no other preceding mutation's invoke strictly follows it. Superseded scores are excluded from :scores. Presence-under-stale-reads is caught by can-be-present? (lines 466-469), which is checked in both check-zrange-all (line 529) and check-zrangebyscore (line 590). Regression tests: superseded-committed-score-is-not-allowed, stale-read-after-committed-zrem-is-flagged, phantom-member-is-flagged. Landed in 2a194a4.

Comment on lines +450 to +457
(let [{:keys [scores any-known? unknown-score?]}
(allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
(cond
(not any-known?)
(swap! errors conj {:kind :phantom
:index cmp-idx
:member member
:score score})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The phantom check should be replaced with a more robust can-be-present? check to catch both phantoms (never existed) and stale reads (present after a committed deletion).

    (doseq [[member score] entries]
      (let [{:keys [scores can-be-present? unknown-score?]}
            (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
        (cond
          (not can-be-present?)
          (swap! errors conj {:kind :unexpected-presence
                              :index cmp-idx
                              :member member
                              :score score})

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. check-zrange-all (line 529) and check-zrangebyscore (line 590) both use (not can-be-present?) to emit :unexpected-presence / :unexpected-presence-range. can-be-present? (returned from allowed-scores-for-member) covers both phantoms (no existence evidence anywhere) and stale reads (committed-removed before the read). Regression tests: phantom-member-is-flagged, stale-read-after-committed-zrem-is-flagged. Landed in 2a194a4.

bootjp added 2 commits April 22, 2026 00:24
…o ops

Address correctness issues in the Redis ZSet safety workload raised in
PR #550 review:

- parse-double-safe: Redis emits "inf"/"+inf"/"-inf" for infinite ZSET
  scores, which Double/parseDouble rejects. Route parse-withscores and
  the ZINCRBY reply through a tolerant parser that maps those to
  +/- Double/POSITIVE_INFINITY.
- allowed-scores-for-member: tighten committed-score admissibility to
  real-time "candidates" (preceding :ok mutations not strictly followed
  in real time by another preceding :ok mutation). Superseded committed
  scores are no longer admissible, closing a stale-read soundness gap.
- can-be-present?: replace the phantom-only check with a presence
  check that rejects both phantoms (never existed) and stale reads
  (member committed-removed before the read with no concurrent re-add).
- CLI dispatch: -main now selects a workload by the first positional
  arg (elastickv-test, elastickv-zset-safety-test, elastickv-dynamodb-
  test, elastickv-s3-test) and auto-prepends the jepsen "test"
  subcommand when missing, matching the invocation documented in the
  PR description.
- Tests: cover phantom, stale-read-after-committed-zrem, superseded
  committed score, and parse-withscores infinity handling.
@bootjp

bootjp commented Apr 22, 2026

Copy link
Copy Markdown
Owner Author

Addressed all 9 review items in commit ecb39831e5af88c24a8f6830668e905e7eee5cc8.

gemini HIGH (4)

  1. parse-withscores infinity handling — Added parse-double-safe that normalizes Redis's "inf"/"+inf"/"-inf" (case-insensitive) to Double/POSITIVE_INFINITY/NEGATIVE_INFINITY, and falls through to Double/parseDouble otherwise. Replaced both call sites (parse-withscores and the ZINCRBY reply parser).
  2. allowed-scores-for-member stale-read permissiveness — Restricted the admissible committed-score set to real-time "candidates": preceding :ok mutations whose :complete-idx is >= the max :invoke-idx of other preceding mutations (i.e. not strictly superseded in real time). Scores from superseded committed mutations are no longer accepted.
  3. Phantom / stale-read check robustness — Replaced the phantom-only any-known? gate with can-be-present?, which is true iff some candidate / concurrent / pre-read :info op establishes legitimate presence. A member committed-removed before a read (no concurrent re-add) now trips :unexpected-presence instead of slipping through.

Copilot (5)

  1. :zrem unconditional deletioncompleted-mutation-window already carries :removed? from invoke! (extracts it from the [member removed?] reply for :ok); apply-mutation-to-state leaves state unchanged on a :removed? false ZREM. The refactored existence-evidence? helper now excludes no-op ZREMs, so they no longer count as proof the member existed.
  2. :zincrby indeterminate score:unknown-score? is already emitted for :info/:pending ZINCRBY. allowed-scores-for-member now surfaces it whenever any concurrent OR pre-read :info op has it set, and the read checker skips the strict score-membership check in that case (effectively "accept any score" for the overlapping window).
  3. model-before docstring — Verified current docstring already matches the implementation ({member -> {:present? bool :score s}}); the mismatch referenced in the review is no longer present at HEAD. No change required.
  4. -main CLI dispatch-main now reads the first positional arg to select a workload: elastickv-test (default, backward-compatible), elastickv-zset-safety-test, elastickv-dynamodb-test, elastickv-s3-test. When the next arg isn't test/analyze, we auto-prepend test so lein run -m elastickv.jepsen-test elastickv-zset-safety-test --nodes n1,... works as documented.
  5. Unit tests — Added: phantom-member-is-flagged, stale-read-after-committed-zrem-is-flagged, superseded-committed-score-is-not-allowed, parse-withscores-handles-inf-strings. Pre-existing tests cover no-op ZREM and :info ZINCRBY.

Validation

  • lein test elastickv.redis-zset-safety-workload-test elastickv.cli-test → 16 tests, 23 assertions, 0 failures.
  • lein check: the pre-existing jepsen/redis/client namespace error (empty redis/src submodule in this environment) is unrelated and reproduces on the base branch without my changes.

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Jepsen workload, elastickv-zset-safety-test, designed to verify the safety properties of Redis ZSet implementations under faults, including score correctness, ordering, and atomicity. It adds a custom model-based checker and corresponding unit tests. Feedback focuses on refining the allowed-scores-for-member logic to better handle concurrent operations and uncertain states, specifically addressing potential false positives in presence detection and improving the completeness checks for ZRANGE and ZRANGEBYSCORE operations.

Comment on lines +393 to +462
(let [muts (get mutations-by-m member [])
;; :ok mutations that completed strictly before the read.
preceding (->> muts
(filter #(and (= :ok (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Real-time "last-wins" candidate filter: a preceding mutation
;; m is admissible iff no OTHER preceding mutation m' has
;; m'.invoke-idx > m.complete-idx (i.e. m' strictly follows m).
;; Equivalent: m.complete-idx >= max(invoke-idx) over preceding.
max-inv (reduce max -1 (map :invoke-idx preceding))
candidates (filterv #(>= (:complete-idx %) max-inv) preceding)
;; :info mutations that completed before the read: they may or
;; may not have taken effect server-side.
pre-read-info (->> muts
(filter #(and (= :info (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Concurrent mutations: windows overlap the read. Include both
;; :ok and :info since either may have taken effect.
concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)

add-scores (fn [acc m]
(case (:f m)
:zadd (conj acc (:score m))
:zincrby (cond-> acc (some? (:score m)) (conj (:score m)))
:zrem acc))
;; Admissible scores: candidate committed + pre-read :info +
;; concurrent writes (with a known score).
scores (as-> #{} s
(reduce add-scores s candidates)
(reduce add-scores s pre-read-info)
(reduce add-scores s concurrent))

has-unknown-incr? (fn [coll]
(some #(and (= :zincrby (:f %))
(:unknown-score? %))
coll))
unknown-score? (or (has-unknown-incr? concurrent)
(has-unknown-incr? pre-read-info))

;; Did any candidate commit establish presence (write, or
;; ZREM with :removed? -- either way the member existed)?
candidate-state (reduce apply-mutation-to-state nil
(sort-by :complete-idx candidates))
candidate-present? (boolean (:present? candidate-state))

any-concurrent-could-write? (or (some write-op? concurrent)
(some write-op? pre-read-info))
any-concurrent-could-remove? (or (some #(= :zrem (:f %)) concurrent)
(some #(= :zrem (:f %)) pre-read-info))

can-be-present? (or candidate-present?
any-concurrent-could-write?
;; A :zrem with :removed? true still proves
;; existence; if a concurrent ZREM raced
;; with an earlier write whose window is
;; not captured as a candidate, presence is
;; uncertain rather than forbidden.
(and (some existence-evidence? (concat concurrent
pre-read-info))
any-concurrent-could-remove?))

must-be-present? (boolean (and candidate-present?
(empty? pre-read-info)
(empty? concurrent)))]
{:scores scores
:unknown-score? (boolean unknown-score?)
:can-be-present? (boolean can-be-present?)
:must-be-present? must-be-present?}))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic in allowed-scores-for-member has several correctness issues that will lead to false positives and false negatives in safety checking:

  1. False Positives (Strictness): candidate-present? relies on complete-idx order to determine presence. In a linearizable system, concurrent committed operations can be serialized in any order. If a ZADD and ZREM are concurrent but both finish before the read, the read may legitimately observe either state. The current logic flags one of these as an error.
  2. False Negatives (Completeness): must-be-present? is too strict. It returns false if any uncertain operation exists, even if that operation is a ZADD which cannot remove the member. It should only be false if an uncertain or concurrent removal exists.
  3. ZINCRBY Score Set: For multiple concurrent ZINCRBY operations, the set of allowed scores should include all possible prefix sums of the increments. The current logic only includes the final score of each increment, which is too restrictive.

The suggested fix relaxes can-be-present? to allow any valid serialization of candidates and uncertain operations, and refines must-be-present? to correctly identify when a member is required to appear.

  (let [muts (get mutations-by-m member [])
        ;; :ok mutations that completed strictly before the read.
        preceding (->> muts
                       (filter #(and (= :ok (:type %))
                                     (some? (:complete-idx %))
                                     (< (:complete-idx %) read-inv-idx))))
        ;; Real-time "last-wins" candidate filter: a preceding mutation
        ;; m is admissible iff no OTHER preceding mutation m' has
        ;; m'.invoke-idx > m.complete-idx (i.e. m' strictly follows m).
        max-inv (reduce max -1 (map :invoke-idx preceding))
        candidates (filterv #(>= (:complete-idx %) max-inv) preceding)
        ;; :info mutations that completed before the read: they may or
        ;; may not have taken effect server-side.
        pre-read-info (->> muts
                           (filter #(and (= :info (:type %))
                                         (some? (:complete-idx %))
                                         (< (:complete-idx %) read-inv-idx))))
        ;; Concurrent mutations: windows overlap the read. Include both
        ;; :ok and :info since either may have taken effect.
        concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
        all-uncertain (concat pre-read-info concurrent)

        add-scores (fn [acc m]
                     (case (:f m)
                       :zadd    (conj acc (:score m))
                       :zincrby (cond-> acc (some? (:score m)) (conj (:score m)))
                       :zrem    acc))
        ;; Admissible scores: candidate committed + pre-read :info +
        ;; concurrent writes (with a known score).
        scores (as-> #{} s
                 (reduce add-scores s candidates)
                 (reduce add-scores s all-uncertain))

        ;; Relax score check if any ZINCRBY is concurrent or uncertain.
        ;; Multiple concurrent increments can produce intermediate sums not in `scores`.
        unknown-score? (some #(= :zincrby (:f %)) all-uncertain)

        ;; can-be-present?: true if any candidate or uncertain op results in presence.
        can-be-present? (or (some write-op? candidates)
                            (some write-op? all-uncertain)
                            (some existence-evidence? (concat candidates all-uncertain)))

        ;; must-be-present?: true if ALL valid serializations result in presence.
        must-be-present? (and (some write-op? candidates)
                              (not (some #(= :zrem (:f %)) candidates))
                              (not (some #(= :zrem (:f %)) all-uncertain)))]
    {:scores           scores
     :unknown-score?   (boolean unknown-score?)
     :can-be-present?  (boolean can-be-present?)
     :must-be-present? must-be-present?})

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially stale, partially rebutted.

False Positives (Strictness) — addressed in 2a194a4. Overlapping-window committed ops are both admissible: max-inv = max of preceding :invoke-idx; candidates = preceding ops with :complete-idx >= max-inv. When multiple candidates remain their windows overlap, so any can serialize last, and candidate-can-be-present? / candidate-can-be-absent? both consider them. Regression tests: concurrent-zadd-zrem-both-completed-accepts-either-outcome (both [["m1" 1.0]] and [] accepted), overlapping-committed-zadds-allow-either-score.

False Negatives (Completeness on must-be-present?) — I don't see this in the code. must-be-present? (lines 475-477) guards on any-candidate-write?, (not candidate-can-be-absent?), and (not any-uncertain-zrem?). Uncertain ZADDs do NOT force it to false; only uncertain ZREMs do, which is correct (a ZREM could reorder last and remove the member). Could you point to the exact path that triggers the false negative?

Comment on lines +516 to +526
(let [model (model-before mutations-by-m inv-idx)
observed-members (into #{} (map first) entries)]
(doseq [[member {:keys [present?]}] model]
(when (and present? (not (contains? observed-members member)))
(let [muts (get mutations-by-m member [])
concurrent (concurrent-mutations-for-member muts inv-idx cmp-idx)]
(when (empty? concurrent)
(swap! errors conj {:kind :missing-member
:index cmp-idx
:member member}))))))
@errors))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The completeness check for ZRANGE is currently too simple and doesn't account for :info operations that might have removed a member before the read. It should use the must-be-present? logic from allowed-scores-for-member which correctly handles uncertain removals.

    (let [model (model-before mutations-by-m inv-idx)
          observed-members (into #{} (map first) entries)]
      (doseq [[member _] model]
        (let [{:keys [must-be-present?]} (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
          (when (and must-be-present? (not (contains? observed-members member)))
            (swap! errors conj {:kind :missing-member
                                :index cmp-idx
                                :member member}))))
      @errors))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. check-zrange-all completeness (lines 549-558) already uses must-be-present? from allowed-scores-for-member, which correctly skips members that an :info or concurrent ZREM might have removed. Regression test: info-zrem-concurrent-with-read-allows-missing-member. Landed in 2a194a4.

Comment on lines +569 to +585
;; Completeness within bounds: any model member whose committed score
;; is in [lo,hi] with no concurrent mutation must appear.
(let [model (model-before mutations-by-m inv-idx)
observed-members (into #{} (map first) members)]
(doseq [[member {:keys [present? score]}] model]
(when (and present?
(<= lo score hi)
(not (contains? observed-members member)))
(let [muts (get mutations-by-m member [])
concurrent (concurrent-mutations-for-member muts inv-idx cmp-idx)]
(when (empty? concurrent)
(swap! errors conj {:kind :missing-member-range
:index cmp-idx
:bounds bounds
:member member
:expected-score score}))))))
@errors))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The completeness check for ZRANGEBYSCORE has the same issue as ZRANGE: it doesn't account for uncertain removals. Additionally, it should only enforce completeness if the score is definitively known to be within the requested range (i.e., no uncertain increments).

    ;; Completeness within bounds: any model member whose committed score
    ;; is in [lo,hi] with no concurrent mutation must appear.
    (let [model (model-before mutations-by-m inv-idx)
          observed-members (into #{} (map first) members)]
      (doseq [[member _] model]
        (let [{:keys [must-be-present? scores unknown-score?]} (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
          (when (and must-be-present?
                     (not unknown-score?)
                     (every? #(<= lo % hi) scores)
                     (not (contains? observed-members member)))
            (swap! errors conj {:kind :missing-member-range
                                :index cmp-idx
                                :bounds bounds
                                :member member
                                :expected-score (first scores)}))))
      @errors))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale. check-zrangebyscore completeness (lines 610-623) uses both must-be-present? AND score-definitely-in-range? (which requires (not unknown-score?) and every candidate score inside [lo, hi]). Uncertain ZINCRBYs → unknown-score? → no completeness failure. Regression test: info-zincrby-does-not-flag-zrangebyscore-completeness. Landed in 2a194a4.

…ations in checker

- allowed-scores-for-member: replace complete-idx-ordered candidate-present?
  with linearization-aware semantics. When multiple candidates share
  overlapping windows, presence is allowed iff some linearization keeps
  the member present; presence is required only when every linearization
  does. Uncertain ZREMs (pre-read :info / concurrent) now correctly
  relax must-be-present?.
- check-zrange-all / check-zrangebyscore: delegate completeness to
  must-be-present? so :info or concurrent ZREMs don't false-positive
  missing-member.
- check-zrangebyscore: gate completeness on a new
  score-definitely-in-range? helper so uncertain ZINCRBY (unknown
  resulting score) can't trigger false :missing-member-range.
- Tests: concurrent ZADD+ZREM accepts either outcome; :info ZREM
  allows absent read; :info ZINCRBY with out-of-range guess does
  not flag completeness; sanity tests still flag truly-missing members.
@bootjp

bootjp commented Apr 22, 2026

Copy link
Copy Markdown
Owner Author

Addressed 3 new gemini HIGH items (commit 2a194a4):

1. allowed-scores-for-member linearization (line 462) — Replaced complete-idx-ordered candidate-present? with linearization-aware semantics. When multiple :ok candidates share overlapping windows they may serialize in any real-time-consistent order, so:

  • can-be-present? = some candidate is a write, or some uncertain op can flip state to present.
  • must-be-present? = some candidate is a write AND no candidate is a ZREM AND no uncertain ZREM exists.

This correctly accepts reads observing EITHER outcome when a ZADD and ZREM both commit concurrently, and correctly flags superseded-score / stale reads / phantom members.

2. check-zrange-all completeness (line 526) — Replaced model-before + concurrent-check with a call to must-be-present?. A member required-present only when all admissible linearizations leave it present, so :info / concurrent ZREMs no longer false-positive.

3. check-zrangebyscore completeness (line 585) — Same must-be-present? delegation plus a new score-definitely-in-range? helper. Completeness fires only when unknown-score? is false and every admissible score is inside [lo, hi], so an uncertain (:info/concurrent) ZINCRBY whose resulting score could fall outside the range no longer causes a false :missing-member-range.

New test coverage (redis_zset_safety_workload_test.clj, additive — earlier ecb39831 tests still pass):

  • concurrent-zadd-zrem-both-completed-accepts-either-outcome (present + absent)
  • info-zrem-concurrent-with-read-allows-missing-member
  • info-zincrby-does-not-flag-zrangebyscore-completeness
  • zrange-completeness-still-detects-truly-missing-member (sanity)
  • zrangebyscore-completeness-still-detects-truly-missing-member (sanity)

lein test elastickv.redis-zset-safety-workload-test elastickv.cli-test → 21 tests, 31 assertions, 0 failures / 0 errors.

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@bootjp

bootjp commented Apr 22, 2026

Copy link
Copy Markdown
Owner Author

@codex code review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49c5e0cea1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +444 to +445
unknown-score? (or (has-unknown-incr? uncertain)
(some #(= :zincrby (:f %)) uncertain))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep score validation enabled for known concurrent ZINCRBY

allowed-scores-for-member marks :unknown-score? true whenever any uncertain mutation is :zincrby, even when that op completed :ok with a known resulting score. Both check-zrange-all and check-zrangebyscore then skip score-membership validation when :unknown-score? is true, so a read concurrent with a known ZINCRBY can return an impossible score (e.g., 999.0) without being flagged. This creates false negatives in the new safety checker and weakens the core score-correctness guarantee under concurrency.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0c0efc4. unknown-score? is now set only when an individual ZINCRBY's own score is unknown (:info/:pending — the :unknown-score? flag set in completed-mutation-window), or when there are >=2 concurrent/uncertain ZINCRBYs where prefix-sum intermediates can occur. A single :ok concurrent ZINCRBY no longer waves through impossible scores. Regression test: single-ok-concurrent-zincrby-still-validates-scores (read returning 999.0 alongside one :ok ZINCRBY is flagged :score-mismatch).

… known

Codex P1: `allowed-scores-for-member` was marking `:unknown-score?`
true whenever ANY uncertain mutation was `:zincrby` — including
completed `:ok` ops whose resulting score is fully recovered from the
server reply. That over-relaxed the check in `check-zrange-all` /
`check-zrangebyscore`, letting a read concurrent with a single known
ZINCRBY return an arbitrary impossible score without being flagged.

Refine the relaxation rule so it only fires when the score is
genuinely unrecoverable:

  * `:unknown-score?` is set on an individual ZINCRBY only when its
    own result is unknown (:info/:pending), OR
  * there are >=2 concurrent/uncertain ZINCRBYs whose relative
    serialization produces prefix-sum intermediates not present in
    `:scores`.

With a single :ok concurrent ZINCRBY the read can observe either the
pre-op score (in `:scores` via candidate committed ops) or the post-op
score (also in `:scores`), so the strict score-membership check stays
sound.

Add two regression tests:
  * `single-ok-concurrent-zincrby-still-validates-scores` flags a
    read observing 999.0 while a single :ok ZINCRBY is concurrent.
  * `two-concurrent-zincrbys-relax-score-check` accepts the
    intermediate prefix-sum score under multi-ZINCRBY overlap.
@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

Addressed the current-head ZSET safety checker findings in d31d8b9.

What changed:

  • Replaced standalone score-candidate pruning with real-time-consistent per-member state enumeration for ZADD/ZINCRBY/ZREM.
  • Enforced ZINCRBY return-value constraints (prior score + delta) and ZREM true ordering evidence.
  • Kept no-op ZREM as a no-op so it does not supersede prior writes.
  • Kept :info/pending ZINCRBY bounded by its delta instead of admitting arbitrary scores.
  • Surfaced malformed successful ZRANGE/ZRANGEBYSCORE payloads as checker failures.
  • Fixed the zset CLI wrapper arity so single-test-cmd can pass opts.

Verification:

  • cd jepsen && HOME=$(pwd)/tmp-home LEIN_HOME=$(pwd)/.lein LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" lein test elastickv.redis-zset-safety-workload-test
  • git diff --check

Note: full cd jepsen && lein test still fails before running tests because elastickv.redis-workload cannot locate jepsen.redis.client on the classpath; this appears unrelated to the ZSET workload changes.

@codex review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/jepsen-test-scheduled.yml:
- Line 186: Replace the uses declaration for the upload-artifact action from
using a version tag (v7) to using a full commit SHA. Identify the current commit
SHA associated with upload-artifact v7 and replace `actions/upload-artifact@v7`
with `actions/upload-artifact@<full-commit-sha>` to pin the workflow to a
specific immutable commit rather than a mutable version tag, which strengthens
supply-chain security.
- Around line 156-159: The direct expansion of ${{ inputs.rate || '5' }} inside
the shell command creates a command injection vulnerability. Instead, bind the
input value to an environment variable in the GitHub Actions workflow (in the
env section or step env), then reference that variable in the shell command with
proper quoting like "${RATE_VAR}" to safely pass the value at runtime rather
than during workflow expansion.

In @.github/workflows/jepsen-test.yml:
- Line 140: In the jepsen-test.yml workflow file, replace the mutable version
tag `v7` in the `actions/upload-artifact` action with a specific full commit
SHA. Instead of using `actions/upload-artifact@v7`, pin it to a complete commit
digest (e.g., `actions/upload-artifact@abc123...`) to ensure the action version
is immutable and cannot change unexpectedly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0e42d95a-85ef-448e-9fb4-cdd32cc75e30

📥 Commits

Reviewing files that changed from the base of the PR and between 9bfcc13 and d31d8b9.

📒 Files selected for processing (5)
  • .github/workflows/jepsen-test-scheduled.yml
  • .github/workflows/jepsen-test.yml
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
💤 Files with no reviewable changes (3)
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/jepsen-test-scheduled.yml:
- Line 186: Replace the uses declaration for the upload-artifact action from
using a version tag (v7) to using a full commit SHA. Identify the current commit
SHA associated with upload-artifact v7 and replace `actions/upload-artifact@v7`
with `actions/upload-artifact@<full-commit-sha>` to pin the workflow to a
specific immutable commit rather than a mutable version tag, which strengthens
supply-chain security.
- Around line 156-159: The direct expansion of ${{ inputs.rate || '5' }} inside
the shell command creates a command injection vulnerability. Instead, bind the
input value to an environment variable in the GitHub Actions workflow (in the
env section or step env), then reference that variable in the shell command with
proper quoting like "${RATE_VAR}" to safely pass the value at runtime rather
than during workflow expansion.

In @.github/workflows/jepsen-test.yml:
- Line 140: In the jepsen-test.yml workflow file, replace the mutable version
tag `v7` in the `actions/upload-artifact` action with a specific full commit
SHA. Instead of using `actions/upload-artifact@v7`, pin it to a complete commit
digest (e.g., `actions/upload-artifact@abc123...`) to ensure the action version
is immutable and cannot change unexpectedly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0e42d95a-85ef-448e-9fb4-cdd32cc75e30

📥 Commits

Reviewing files that changed from the base of the PR and between 9bfcc13 and d31d8b9.

📒 Files selected for processing (5)
  • .github/workflows/jepsen-test-scheduled.yml
  • .github/workflows/jepsen-test.yml
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
💤 Files with no reviewable changes (3)
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
🛑 Comments failed to post (3)
.github/workflows/jepsen-test-scheduled.yml (2)

156-159: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid direct ${{ inputs.* }} expansion inside shell commands.

--rate ${{ inputs.rate || '5' }} is rendered into shell source before execution, which can become command injection if a crafted workflow input is passed. Bind input to env and validate/quote at runtime instead.

Suggested hardening
       - name: Run DynamoDB per-type Jepsen workloads against elastickv
         working-directory: jepsen
         timeout-minutes: 30
         env:
+          INPUT_RATE: ${{ inputs.rate || '5' }}
           TYPE_TL: "60"
           TYPE_CONCURRENCY: "4"
           TYPE_KEY_COUNT: "8"
           TYPE_MAX_WRITES: "80"
         run: |
+          [[ "$INPUT_RATE" =~ ^[0-9]+([.][0-9]+)?$ ]] || { echo "Invalid rate: $INPUT_RATE"; exit 2; }
           PER_TYPE_TIMEOUT=$((TYPE_TL + 180))
           declare -A RESULT
           FAILED=()
           for t in string number binary bool null string-set number-set binary-set list map; do
             echo "::group::value-type=${t}"
             set +e
             timeout "${PER_TYPE_TIMEOUT}" ~/lein run -m elastickv.dynamodb-types-workload --local \
               --time-limit "${TYPE_TL}" \
-              --rate ${{ inputs.rate || '5' }} \
+              --rate "${INPUT_RATE}" \
               --concurrency "${TYPE_CONCURRENCY}" \
               --key-count "${TYPE_KEY_COUNT}" \
               --max-writes-per-key "${TYPE_MAX_WRITES}" \
               --value-type "${t}" \
               --dynamo-ports 63801,63802,63803 \
               --host 127.0.0.1
🧰 Tools
🪛 zizmor (1.26.1)

[error] 158-158: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/jepsen-test-scheduled.yml around lines 156 - 159, The
direct expansion of ${{ inputs.rate || '5' }} inside the shell command creates a
command injection vulnerability. Instead, bind the input value to an environment
variable in the GitHub Actions workflow (in the env section or step env), then
reference that variable in the shell command with proper quoting like
"${RATE_VAR}" to safely pass the value at runtime rather than during workflow
expansion.

Source: Linters/SAST tools


186-186: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Pin upload-artifact to a full commit SHA.

Using actions/upload-artifact@v7 without SHA pinning weakens workflow supply-chain integrity.

🧰 Tools
🪛 zizmor (1.26.1)

[error] 186-186: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/jepsen-test-scheduled.yml at line 186, Replace the uses
declaration for the upload-artifact action from using a version tag (v7) to
using a full commit SHA. Identify the current commit SHA associated with
upload-artifact v7 and replace `actions/upload-artifact@v7` with
`actions/upload-artifact@<full-commit-sha>` to pin the workflow to a specific
immutable commit rather than a mutable version tag, which strengthens
supply-chain security.

Source: Linters/SAST tools

.github/workflows/jepsen-test.yml (1)

140-140: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Pin upload-artifact to a full commit SHA.

actions/upload-artifact@v7 is mutable and should be pinned to a specific commit digest per policy.

🧰 Tools
🪛 zizmor (1.26.1)

[error] 140-140: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/jepsen-test.yml at line 140, In the jepsen-test.yml
workflow file, replace the mutable version tag `v7` in the
`actions/upload-artifact` action with a specific full commit SHA. Instead of
using `actions/upload-artifact@v7`, pin it to a complete commit digest (e.g.,
`actions/upload-artifact@abc123...`) to ensure the action version is immutable
and cannot change unexpectedly.

Source: Linters/SAST tools

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d31d8b9d50

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +494 to +495
(and (= :ok (:type m)) (false? (:removed? m)))
#{st}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Constrain no-op ZREM to an absent prior state

When a member was definitely present before ZREM—for example, a :zadd completes before the :zrem is invoked—and ZREM returns 0 while the member remains visible, this branch keeps the present state and the checker accepts the history. Redis's 0 count is only compatible with the member being absent at the ZREM linearization point, so a false ZREM should return #{st} only for absent states (and #{} for present states); otherwise failed removals can pass the safety workload.

Useful? React with 👍 / 👎.

"True iff the mutation adds/updates the member's score (i.e. would
make the member present). :zrem is NOT a write-op here."
[m]
(#{:zadd :zincrby} (:f m)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep uncertain increments before later ZINCRBYs

With an :info ZINCRBY that completes before a later :ok ZINCRBY starts, the earlier increment may be required to make the later reply consistent (e.g. absent -> ZINCRBY +5 returns :info, then ZINCRBY +1 returns 6). Because write-op? includes :zincrby, superseded-by-preceding-state-change? drops the pre-read :info, so the checker rejects the valid read of 6. Only absolute overwrites/deletions should supersede prior uncertain increments; relative increments need to remain in the enumerated prefix.

Useful? React with 👍 / 👎.

Comment on lines +677 to +679
base-preceding (->> preceding
(remove (set candidates))
(sort-by :complete-idx))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enumerate overlapping base mutations before tail candidates

These base mutations can still overlap each other even though they all complete before the read and before the tail candidate's invoke. Sorting them by :complete-idx explores only one legal serialization; e.g. two overlapping ZADDs to scores 1 and 2 followed by a non-concurrent ZINCRBY +1 returning 3 is valid if the score-2 ZADD serialized last, but this code fixes the base at score 1 and rejects the history. The base prefix needs the same real-time-consistent enumeration as the tail candidates.

Useful? React with 👍 / 👎.

Comment on lines +534 to +538
states (apply-sequential-mutations #{absent-state} applied)
state (first states)]
(if state
(assoc model member state)
model)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail impossible mutation chains instead of dropping them

apply-sequential-mutations uses an empty state set to mean a successful mutation reply is impossible, but this branch just omits the member from the model. In a run such as ZADD m1 1, ZINCRBY m1 5 returning 999, followed by an empty ZRANGE, the bad successful ZINCRBY reply is silently discarded and no completeness check runs for m1, so the checker can pass histories with impossible write responses unless a later read happens to expose the bad score.

Useful? React with 👍 / 👎.

Comment on lines +166 to +170
(when (odd? (count flat))
(throw (ex-info "WITHSCORES reply has odd element count"
{:count (count flat)
:payload flat})))
(->> flat

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject nil WITHSCORES replies as malformed

count on nil is 0 in Clojure, so a nil Carmine reply reaches partition and becomes an empty successful read instead of a malformed-read failure. If ZRANGE/ZRANGEBYSCORE starts returning nil against an empty key, this checker records a successful empty read and can return :valid? true, which defeats the malformed-payload and no-successful-read guards.

Useful? React with 👍 / 👎.

Comment on lines +32 to +35
{"elastickv-test" elastickv-test
"elastickv-zset-safety-test" elastickv-zset-safety-test
"elastickv-dynamodb-test" elastickv-dynamodb-test
"elastickv-s3-test" elastickv-s3-test})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Let every selected workload accept the option map

The new dispatcher exposes these zero-arity wrappers through single-test-cmd, but Jepsen's single-test-cmd documentation says the :test-fn receives the parsed option map. Selecting elastickv-test, elastickv-dynamodb-test, or elastickv-s3-test now calls a zero-arity function with one argument, so those newly advertised named workloads fail before running; they need the same ([] ...) ([opts] ...) shape as the zset wrapper.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the current-head ZSET safety review findings in d094a85:

  • Treat nil WITHSCORES replies as malformed instead of parsing them as empty reads.
  • Constrain ZREM returning 0 to states where the member is absent at the ZREM linearization point; impossible successful chains now surface as checker failures instead of disappearing from the model.
  • Keep pre-read :info relative mutations in the enumerated prefix when later ZINCRBY replies may depend on them; only absolute ZADD / effective ZREM supersede earlier uncertainty.
  • Enumerate overlapping base mutations with the same real-time-consistent state search before applying tail candidates.
  • Let the selected Jepsen workload wrappers accept the option map passed by single-test-cmd.

Caller audit: grep checked parse-withscores, model-before, superseded-by-preceding-state-change?, and the workload wrapper functions; affected callers are the zset checker/read paths and the CLI dispatcher tests.

Validation:

  • cd jepsen && HOME=/Users/bootjp/.codex/worktrees/cfdb/elastickv/tmp-home LEIN_HOME=/Users/bootjp/.codex/worktrees/cfdb/elastickv/.lein LEIN_JVM_OPTS="-Duser.home=/Users/bootjp/.codex/worktrees/cfdb/elastickv/tmp-home" lein test elastickv.redis-zset-safety-workload-test
  • git diff --check

Note: elastickv.jepsen-test-test still cannot be run in this local classpath because loading elastickv.redis-workload fails on missing jepsen.redis.client, which is the same local full-suite blocker seen previously.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d094a85aec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +692 to +693
base-states (linearized-read-states #{absent-state} base-preceding [])
possible-states (linearized-read-states base-states candidates uncertain)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep overlapping base mutations in the candidate search

When a pre-read mutation falls into base-preceding but overlaps one of the tail candidates, this applies it before every candidate, which can reject valid linearizations. For example, ZADD m 1 invoked at 0/completing at 5, ZINCRBY m 1 invoked at 1/completing at 2 returning 2, and another ZINCRBY m 1 invoked at 3/completing at 4 returning 3 is valid as ZADD -> first increment -> second increment; however the first increment is put in base-preceding and applied to the absent state before the overlapping ZADD candidate, making the checker report a false failure for a later read of 3. This can make the new Jepsen workload fail correct executions with slow overlapping writes.

Useful? React with 👍 / 👎.

;; consistent order and any of their return values is a valid
;; final state.
max-inv (reduce max -1 (map :invoke-idx preceding))
candidates (filterv #(>= (:complete-idx %) max-inv) preceding)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve one serialization across later reads

When overlapping writes have all completed before multiple later reads, this candidate set is recomputed independently for each read, so each read may choose a different order for the same completed writes. For example, two overlapping ZADDs can both complete, then a read can return score 2 and a later read return score 1 with no intervening mutation; both reads are accepted because each score is in its own candidate set, even though no single linearization can satisfy both. This makes the new Jepsen checker miss stale/non-monotonic reads after ambiguous writes.

Useful? React with 👍 / 👎.

Comment on lines +750 to +751
(let [{:keys [scores can-be-present?]}
(allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use one read prefix across all members

Each entry is validated with a fresh per-member state search, so a single read can combine visibility choices that no one linearization point permits. For instance, if ZADD m2 overlaps the read but completes before ZADD m1 is invoked, and the read returns m1 while omitting m2, the checker can accept m1 as a visible concurrent write and independently skip m2; however any read linearized late enough to see m1 must also include the earlier-completed m2. This misses fractured ZRANGE/ZRANGEBYSCORE snapshots across members.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

対応しました。

  • CI failure: wide-column ZSETで ZREM / ZREMRANGEBYRANK がlegacy blobだけ更新してmember/score indexを削除しない経路を修正しました。persistZSetEntriesTxn はwide-column既存キーを検出した場合、wide-column elem差分とmeta deltaをtxnで書きます。caller audit: 直接callerは zremzremrangebyrank の2箇所です。
  • Review comments: Jepsen ZSET checkerで、完了済みpre-read mutationを固定baseに分離せず同じreal-time enumerationで扱うようにしました。
  • Review comments: full read間で、間に当該memberへ作用し得るmutationがないのに状態が変わるケースを :unstable-read-without-mutation として検出します。
  • Review comments: ZRANGE / ZRANGEBYSCORE で、可視化したconcurrent mutationのreal-time predecessorを同じread prefixへ含める検査を追加し、fractured snapshotを検出します。

検証:

  • cd jepsen && HOME=$(pwd)/tmp-home LEIN_HOME=$(pwd)/.lein LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" lein test elastickv.redis-zset-safety-workload-test
  • GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./adapter -run 'TestRedisZRemWideColumnRemovesMemberAndScoreIndex|TestRedisExecLuaCompatRetriesWriteConflict'\n- git diff --check\n\n補足: GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... は先頭パッケージ通過後、追加出力なしで長時間実行が続いたためローカルでは中断しました。

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5147914b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

(let [observed (into {} entries)
all-mutations (mapcat second mutations-by-m)
kind (if bounds :fractured-read-prefix-range :fractured-read-prefix)]
(->> (for [[member score] entries

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor prefix checks on visible ZREM omissions

When the read's only evidence of a concurrent mutation is an omission caused by ZREM, this loop never considers that removed member because it only iterates over returned entries. For example, if m1 is definitely present before a full read, m2's ZADD completes before a concurrent ZREM m1 is invoked, and the overlapping read returns [], the checker accepts it: m1 may be absent because of the ZREM and m2 may be absent because its ZADD is concurrent, but no single prefix can omit both once the ZREM is visible. This lets fractured snapshots involving deletes pass; infer missing members made optional by a concurrent ZREM and run the same predecessor-prefix check for those anchors.

Useful? React with 👍 / 👎.

Comment on lines +738 to +740
:zincrby
(and (= :ok (:type m))
(score-eq? (:score m) score))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat visible info ZINCRBY as a prefix anchor

For :info ZINCRBYs, this predicate returns false even when the observed score is only reachable by applying that in-flight increment. A history where m2's ZADD completes during a read before an :info ZINCRBY m1 is invoked, and the read returns m1 at the incremented score while omitting m2, currently passes: allowed-scores-for-member admits the derived score, but read-prefix-errors never anchors on the visible ZINCRBY, so it does not force the real-time predecessor m2 into the same read prefix. This misses fractured snapshots whenever the visible concurrent mutation is an unknown-response increment.

Useful? React with 👍 / 👎.

:when (and (not= member (:member predecessor))
(= :ok (:type predecessor))
(real-time-before? predecessor anchor)
(not (contains? observed (:member predecessor)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate forced predecessors when already present

This condition only reports a forced real-time predecessor when its member is omitted, so the checker misses fractured reads that include that member at an older state. For example, if a read starts before ZADD m2 2, that update completes, then a visible concurrent ZADD m1 1 starts, a result containing m1=1 and stale m2=1 passes because allowed-scores-for-member permits either concurrent score for m2 and read-prefix-errors skips it once m2 is present. When a visible anchor forces a predecessor into the same prefix, compare the observed predecessor state against the forced-prefix states instead of only checking membership.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

対応しました。

  • read-prefix-errors の anchor を returned entries だけでなく、visible omission を作った concurrent/effective ZREM にも広げました。
  • :info / pending ZINCRBY でも、observed score がその delta 適用でしか到達できない場合は prefix anchor として扱うようにしました。
  • forced real-time predecessor は「omitted かどうか」だけでなく、既に返ってきている場合も forced prefix states と score を照合するようにしました。
  • 上記 3 ケースの regression tests を追加しました。

Validation:

  • cd jepsen && HOME=$(pwd)/tmp-home LEIN_HOME=$(pwd)/.lein LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" lein test elastickv.redis-zset-safety-workload-test (69 tests, 122 assertions)
  • git diff --check HEAD~1..HEAD
  • caller audit: rg で変更した private helpers と read-prefix-errors の呼び出しを確認し、影響範囲は同一 checker 内の full/range read paths に閉じています。

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

Reviewed commit: a2b57b8880

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Conflicts:
#	adapter/redis_compat_commands.go
@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 31f156cf3f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

対応しました。

  • wide-column ZSET の ZRANGEBYSCORE fast path で、MVCC timestamp suffix により同点 score の member 順が物理 scan 順と Redis 論理順でずれる問題を修正しました。
  • decode 後に (score, member) 順へ正規化し、LIMIT 付きで同点 score が絡む場合は slow path へ fallback して境界 semantics を保持します。
  • 同点 score の順序、bounded tie fallback、unique score の bounded window を regression test で追加しました。

Validation:

  • GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./adapter -run "TestZSetRangeByScoreFast|TestLua_ZRANGEBYSCORE"
  • cd jepsen && HOME=$(pwd)/tmp-home LEIN_HOME=$(pwd)/.lein LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" lein test elastickv.redis-zset-safety-workload-test
  • local 3-node elastickv.redis-zset-safety-workload --time-limit 5 (:valid? true)
  • git diff --check

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Pushed 72e4c53 to address the failing Test workflow on the latest run.

CI-failure fixes:

  • TestRedis_StreamXReadLatencyIsConstant now seeds the stream with XADD * and reads after the returned tail ID. The latency test only needs a populated stream and a stable tail marker; avoiding fixed explicit IDs removes retry/top-item collisions during heavy -race CI runs.
  • fanout.currentTargets now rechecks the fresh membership cache inside the singleflight function before starting a discovery RPC. This prevents a goroutine that missed the first cache check from starting a second refresh after another singleflight call already populated the cache.

Validation:

  • go test -race ./cmd/elastickv-admin
  • go test -race ./cmd/elastickv-admin -run TestFanoutCurrentTargetsSingleflight -count=100
  • go test -race ./adapter -run TestRedis_StreamXReadLatencyIsConstant -count=1 -v
  • go test ./adapter -run TestZSetRangeByScoreFast|TestLua_ZRANGEBYSCORE
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72e4c5373e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/redis_zset_cmds.go Outdated
Comment on lines +234 to +236
needed := offset + limit
if needed < offset || needed > maxWideScanLimit {
needed = maxWideScanLimit

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fall back when the requested range exceeds the scan cap

When offset+limit is clamped to maxWideScanLimit, a saturated scan only proves that at least the cap worth of rows matched; it does not prove the requested LIMIT window was satisfied. For an oversized ZSET (which the slow full-load path would reject via ErrCollectionTooLarge) and a bounded request such as LIMIT 99990 20, this path can return the truncated fast-path slice instead of falling back, because matchingEntries == maxWideScanLimit makes the guard pass. Preserve the uncapped requested count (or fall back on cap/overflow) so saturated capped scans don't silently truncate.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest Codex P2 finding in c3edbfb.

Fix:

  • zsetFastPathTruncated now falls back when a saturated fast scan cannot prove that the requested LIMIT window was satisfied because offset+limit exceeds maxWideScanLimit or overflows.
  • Added a regression test for the saturated scan-cap case, while preserving the non-saturated case.

Caller audit:

  • zsetFastPathTruncated is only used by finalizeZSetFastRange.
  • zsetRangeByScoreFast callers already route hit=false through the existing slow-path fallback.

Validation:

  • go test ./adapter -run TestZSetRangeByScoreFast|TestZSetFastPathTruncated|TestLua_ZRANGEBYSCORE
  • git diff --check

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🚀

Reviewed commit: c3edbfb6d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Jepsen Scheduled Stress Test run 28185053938 の Redis ZSet safety 失敗を修正しました。

原因:

  • forced-prefix-states が bounded ZRANGEBYSCORE で predecessor の全候補 state が範囲外の場合に boolean false を返していました。
  • 呼び出し側 read-prefix-errors は戻り値を sequence として seq するため、checker が IllegalArgumentException: Don't know how to create ISeq from: java.lang.Boolean で落ち、valid? :unknown になっていました。

修正:

  • prefix 強制対象がないケースは nil を返すようにし、範囲外 predecessor を range result に要求しないようにしました。
  • bounded range の回帰テストを追加しました。
  • forced-prefix-states の caller は read-prefix-errors の 1 箇所のみで、nil は既存の seq 判定で prefix 要求なしとして扱われることを確認済みです。

検証:

  • lein test elastickv.redis-zset-safety-workload-test -> 70 tests / 123 assertions / 0 failures / 0 errors
  • run 28185053938 の artifact history.edn を修正後 checker で再解析 -> {:valid? true, :first-errors ()}

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jepsen/src/elastickv/redis_zset_safety_workload.clj`:
- Around line 787-789: The forced-prefix-states logic is dropping forced
predecessors too early, which prevents the stale-score check in the read path
from running. Update forced-prefix-states and the related stale-range handling
in the zset workload so forced states are always preserved and evaluated
separately from visibility: in the read/validation flow, treat missing data as
an error only when all forced states are visible, and treat a returned
predecessor as stale unless its score matches one of the visible forced states.
Use the existing forced-prefix-states and stale-range branches to keep the
visibility rule and the score-matching rule distinct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a2599cad-668c-4dbd-a353-2f9366dca7c1

📥 Commits

Reviewing files that changed from the base of the PR and between d74cfed and a03c2cf.

📒 Files selected for processing (6)
  • adapter/redis_compat_commands_stream_test.go
  • adapter/redis_zset_cmds.go
  • adapter/redis_zset_cmds_test.go
  • cmd/elastickv-admin/main.go
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
🚧 Files skipped from review as they are similar to previous changes (2)
  • adapter/redis_zset_cmds_test.go
  • adapter/redis_zset_cmds.go

Comment on lines +787 to +789
(when (and (seq states)
(every? #(visible-state? % bounds) states))
states)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t drop forced states before checking stale range entries.

Line 787 makes forced-prefix-states return nil whenever a forced predecessor can be outside ZRANGEBYSCORE bounds. That allows omission, but it also disables the stale-score branch at Lines 852-854. A read can then return a predecessor at an older in-range score even though the visible anchor forces a prefix where that predecessor was updated out of range.

Keep the forced states and separate the rules: missing is only an error when all forced states are visible; present is an error unless the observed score matches one of the visible forced states.

Proposed fix
 (defn- forced-prefix-states
   [mutations-by-m member read-inv-idx read-cmp-idx anchor bounds]
   (let [states (possible-states-for-member
                  mutations-by-m
                  member
                  read-inv-idx
                  read-cmp-idx
                  #(real-time-before? % anchor))]
-    (when (and (seq states)
-               (every? #(visible-state? % bounds) states))
+    (when (seq states)
       states)))
                :let [forced-states (forced-prefix-states mutations-by-m
                                                          (:member predecessor)
                                                          read-inv-idx
                                                          read-cmp-idx
                                                          anchor
                                                          bounds)
+                     visible-forced-states (filterv #(visible-state? % bounds)
+                                                    forced-states)
                      observed-score (get observed
                                          (:member predecessor)
                                          missing-observed)]
                :when (and (not= member (:member predecessor))
                           (= :ok (:type predecessor))
                           (real-time-before? predecessor anchor)
                           (seq forced-states)
-                          (or (= missing-observed observed-score)
-                              (not (some #(score-eq? (:score %) observed-score)
-                                         forced-states))))]
+                          (if (= missing-observed observed-score)
+                            (every? #(visible-state? % bounds) forced-states)
+                            (not (some #(score-eq? (:score %) observed-score)
+                                       visible-forced-states))))]

Also applies to: 839-854

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jepsen/src/elastickv/redis_zset_safety_workload.clj` around lines 787 - 789,
The forced-prefix-states logic is dropping forced predecessors too early, which
prevents the stale-score check in the read path from running. Update
forced-prefix-states and the related stale-range handling in the zset workload
so forced states are always preserved and evaluated separately from visibility:
in the read/validation flow, treat missing data as an error only when all forced
states are visible, and treat a returned predecessor as stale unless its score
matches one of the visible forced states. Use the existing forced-prefix-states
and stale-range branches to keep the visibility rule and the score-matching rule
distinct.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a03c2cf953

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +787 to +789
(when (and (seq states)
(every? #(visible-state? % bounds) states))
states)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle forced deletions in prefix checks

When a visible concurrent anchor real-time follows a successful ZREM of another member, the read prefix is forced to include that deletion too. However, forcing that predecessor produces an absent state, and this guard drops any state set that is not visible-state?, so read-prefix-errors sees nil and skips the check. A fractured read like “observe anchor m1, but also return stale m2 after its forced ZREM” will therefore pass the checker even though the prefix is inconsistent; absent forced states should still be returned so stale presences can be rejected, while only omitted out-of-range bounded results are exempt.

Useful? React with 👍 / 👎.

Comment on lines +138 to +139
(nil? response)
0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat nil ZREM replies as unknown

If Carmine returns nil for ZREM because the connection/protocol failed after the command may have reached Redis, this maps the operation to :ok [member false] instead of :info. That tells the checker the remove definitely observed an absent member, so a successful remove with a lost reply can make later correct reads look invalid (or otherwise over-constrain the model); nil is not a valid Redis integer reply for ZREM, so it should remain an unknown outcome like the ZINCRBY nil path.

Useful? React with 👍 / 👎.

[mutations-by-m m member score read-inv-idx read-cmp-idx]
(and (not= :fail (:type m))
(= member (:member m))
(concurrent? m read-inv-idx read-cmp-idx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor prefixes on pre-read info effects

When an :info ZADD/ZINCRBY completes before a read but the observed score is only reachable if that unknown operation actually took effect, this concurrent? guard prevents it from anchoring the read prefix. For example, if m2 completes before the :info write to m1, and a later read observes m1 from that :info write while omitting m2, the checker can pass even though any linearization that includes m1 must place the read after m2; the ZREM absence path has the same pre-read :info hole.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a03c2cf953

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +198 to +199
if zsetFastPathNeedsTieFallback(entries, limit) {
return nil, false, monitoring.LuaFastPathFallbackTruncated, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid routing duplicate +inf ranges through the slow filter

When a bounded ZRANGEBYSCORE key +inf +inf LIMIT ... scan sees at least two +inf members, this new tie fallback sends the command to cmdZRangeByScoreSlow; that slow path uses scoreInRange, which does not enforce zBoundPosInf/zBoundNegInf bounds, so a zset that also has finite scores can return the lowest finite member instead of a +inf member. This regresses the wide-column fast path specifically for duplicate infinite scores; either keep this case on the fast path or make the slow score filter handle infinite bounds before falling back.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants