fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble) by d-laub · Pull Request #243 · mcvickerlab/GenVarLoader

d-laub · 2026-06-25T01:54:21Z

Root cause

SpliceIndexer.parse_idx in python/genvarloader/_dataset/_indexing.py (around line 451 on main) applied the sample-subset map twice, silently corrupting spliced output whenever a Dataset had been sample-subsetted with a non-identity sample_subset_idxs.

The double-application happened as follows:

First application (line 412/420/427): s_idx = self.dsi._s_idx[s_idx] converts output-space sample positions [0..n_subset_samples) → on-disk sample positions via full_sample_idxs[sample_subset_idxs[...]].
The ravel/unravel over self.full_shape produces s_idx values already in on-disk space.
After r_idx = self.map.full_splice_map[r_idx] (exon expansion) and s_idx = s_idx.repeat(lengths), the code called self.dsi.parse_idx((r_idx, s_idx)).
Second application (line 245 of DatasetIndexer.parse_idx): s_idx = self._s_idx[s_idx] applied full_sample_idxs[sample_subset_idxs[...]] again to what were already on-disk positions — mapping wrong samples entirely, or raising IndexError when the double-mapped value fell out of bounds.

Symptom

Only triggered with splice_info enabled and a non-identity sample_subset (i.e. dataset.subset_to(samples=...)). Non-spliced paths were unaffected because they don't call dsi.parse_idx with already-mapped indices.

Concrete MMRF repro: With the MMRF consensus dataset, sample MMRF_2702 (svar position 54, GVL sorted 625) received MMRF_1395's NRAS G12D mutation because sample_subset_idxs[625] mapped to MMRF_1395.

Fix

python/genvarloader/_dataset/_indexing.py, line ~451

Replace the dsi.parse_idx call with a direct ravel_multi_index:

# Before (buggy):
ds_idx, *_ = self.dsi.parse_idx((r_idx, s_idx))

# After (fixed):
r_flat = self.dsi.full_region_idxs[r_idx]
ds_idx = np.ravel_multi_index((r_flat, s_idx), self.dsi.full_shape).ravel()

After the unravel/exon-expand step, r_idx are region positions (indices into full_region_idxs) and s_idx are already-mapped on-disk sample indices. Applying full_region_idxs[r_idx] once and ravelling is all that's needed — no second _s_idx pass.

Regression test

tests/unit/dataset/test_splice_indexer_subset.py — three cases:

test_parse_idx_sample_subset_applied_once: constructs a SpliceIndexer with sample_subset_idxs=[2, 4] over 5 on-disk samples, requests all transcripts × all subset samples, and asserts every returned on-disk sample index is 2 or 4 (not a double-mapped value). Failed on main with IndexError: index 4 is out of bounds for axis 0 with size 2. Passes after the fix.
test_parse_idx_single_sample_from_subset: requests one specific subset sample by output position and verifies the correct on-disk index. Failed on main with the same IndexError. Passes after the fix.
test_parse_idx_no_subset_unaffected: no sample subset; confirms the non-subset path is unchanged. Passed on main and after the fix.

Test execution

Ran locally in the repo's dev pixi environment (Python 3.10, maturin Rust build):

pixi r -e dev pytest tests/unit/dataset/test_splice_indexer_subset.py tests/unit/dataset/test_indexing.py tests/dataset/test_flat_splice.py tests/unit/ tests/dataset/ -v

Results:

3/3 new regression tests: FAIL → PASS
460 unit tests: 460 passed, 5 skipped, 2 xfailed (no regressions)
167 dataset integration tests: 167 passed (no regressions)

🤖 Generated with Claude Code

When a Dataset has been sample-subsetted (non-identity sample_subset_idxs) AND output is spliced, SpliceIndexer.parse_idx applied the sample-subset map twice: 1. Via self.dsi._s_idx[s_idx] within parse_idx, which maps output-space sample positions to on-disk positions. 2. Again via self.dsi.parse_idx((r_idx, s_idx)) at the end, which applies _s_idx a second time to what are already on-disk positions. With the MMRF consensus dataset this caused sample MMRF_2702 (svar pos 54, GVL sorted 625) to receive MMRF_1395's NRAS G12D mutation because sample_subset_idxs[625] mapped to MMRF_1395. Fix: after the unravel/exon-expand step, r_idx and s_idx already hold full-dataset positions. Compute the flat storage index directly via ravel_multi_index using full_region_idxs, bypassing the second _s_idx application that parse_idx would otherwise inject. Adds regression test test_splice_indexer_subset.py that constructs a SpliceIndexer with sample_subset_idxs=[2,4] over 5 on-disk samples, verifies all three index paths (slice×slice, scalar×scalar, no-subset) return the correct on-disk sample positions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pro 0.20 Design for: merge origin/main (#242/#244 clip fix + #243 splice-subset fix) into the branch, lift the now-obsolete #242 xfails, port Reference.fetch to rust, fuse the annotated/splice haps paths, bump seqpro 0.18->0.20 with to_numpy(validate=False) adoption, and reconcile the roadmap honestly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

7 tasks: merge origin/main (#242/#243), lift obsolete #242 xfails, reroute Reference.fetch through rust get_reference, fuse annotated + spliced haps kernels, bump seqpro 0.20 + validate=False, roadmap honesty pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

d-laub force-pushed the fix/spliced-subset-sample-scramble branch from 4454de8 to d814965 Compare June 25, 2026 03:27

d-laub merged commit 2229127 into main Jun 25, 2026
8 checks passed

d-laub deleted the fix/spliced-subset-sample-scramble branch June 25, 2026 03:45

d-laub mentioned this pull request Jun 25, 2026

Phase 3 close-out: main merge, missing-kernel ports, fused annot/splice haps, seqpro 0.20 #246

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble)#243

fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble)#243
d-laub merged 1 commit into
mainfrom
fix/spliced-subset-sample-scramble

d-laub commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

d-laub commented Jun 25, 2026

Root cause

Symptom

Fix

Regression test

Test execution

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant