Skip to content

fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble)#243

Merged
d-laub merged 1 commit into
mainfrom
fix/spliced-subset-sample-scramble
Jun 25, 2026
Merged

fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble)#243
d-laub merged 1 commit into
mainfrom
fix/spliced-subset-sample-scramble

Conversation

@d-laub

@d-laub d-laub commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Root cause

SpliceIndexer.parse_idx in python/genvarloader/_dataset/_indexing.py (around line 451 on main) applied the sample-subset map twice, silently corrupting spliced output whenever a Dataset had been sample-subsetted with a non-identity sample_subset_idxs.

The double-application happened as follows:

  1. First application (line 412/420/427): s_idx = self.dsi._s_idx[s_idx] converts output-space sample positions [0..n_subset_samples) → on-disk sample positions via full_sample_idxs[sample_subset_idxs[...]].
  2. The ravel/unravel over self.full_shape produces s_idx values already in on-disk space.
  3. After r_idx = self.map.full_splice_map[r_idx] (exon expansion) and s_idx = s_idx.repeat(lengths), the code called self.dsi.parse_idx((r_idx, s_idx)).
  4. Second application (line 245 of DatasetIndexer.parse_idx): s_idx = self._s_idx[s_idx] applied full_sample_idxs[sample_subset_idxs[...]] again to what were already on-disk positions — mapping wrong samples entirely, or raising IndexError when the double-mapped value fell out of bounds.

Symptom

Only triggered with splice_info enabled and a non-identity sample_subset (i.e. dataset.subset_to(samples=...)). Non-spliced paths were unaffected because they don't call dsi.parse_idx with already-mapped indices.

Concrete MMRF repro: With the MMRF consensus dataset, sample MMRF_2702 (svar position 54, GVL sorted 625) received MMRF_1395's NRAS G12D mutation because sample_subset_idxs[625] mapped to MMRF_1395.

Fix

python/genvarloader/_dataset/_indexing.py, line ~451

Replace the dsi.parse_idx call with a direct ravel_multi_index:

# Before (buggy):
ds_idx, *_ = self.dsi.parse_idx((r_idx, s_idx))

# After (fixed):
r_flat = self.dsi.full_region_idxs[r_idx]
ds_idx = np.ravel_multi_index((r_flat, s_idx), self.dsi.full_shape).ravel()

After the unravel/exon-expand step, r_idx are region positions (indices into full_region_idxs) and s_idx are already-mapped on-disk sample indices. Applying full_region_idxs[r_idx] once and ravelling is all that's needed — no second _s_idx pass.

Regression test

tests/unit/dataset/test_splice_indexer_subset.py — three cases:

  1. test_parse_idx_sample_subset_applied_once: constructs a SpliceIndexer with sample_subset_idxs=[2, 4] over 5 on-disk samples, requests all transcripts × all subset samples, and asserts every returned on-disk sample index is 2 or 4 (not a double-mapped value). Failed on main with IndexError: index 4 is out of bounds for axis 0 with size 2. Passes after the fix.
  2. test_parse_idx_single_sample_from_subset: requests one specific subset sample by output position and verifies the correct on-disk index. Failed on main with the same IndexError. Passes after the fix.
  3. test_parse_idx_no_subset_unaffected: no sample subset; confirms the non-subset path is unchanged. Passed on main and after the fix.

Test execution

Ran locally in the repo's dev pixi environment (Python 3.10, maturin Rust build):

pixi r -e dev pytest tests/unit/dataset/test_splice_indexer_subset.py tests/unit/dataset/test_indexing.py tests/dataset/test_flat_splice.py tests/unit/ tests/dataset/ -v

Results:

  • 3/3 new regression tests: FAIL → PASS
  • 460 unit tests: 460 passed, 5 skipped, 2 xfailed (no regressions)
  • 167 dataset integration tests: 167 passed (no regressions)

🤖 Generated with Claude Code

When a Dataset has been sample-subsetted (non-identity sample_subset_idxs)
AND output is spliced, SpliceIndexer.parse_idx applied the sample-subset
map twice:

1. Via self.dsi._s_idx[s_idx] within parse_idx, which maps output-space
   sample positions to on-disk positions.
2. Again via self.dsi.parse_idx((r_idx, s_idx)) at the end, which applies
   _s_idx a second time to what are already on-disk positions.

With the MMRF consensus dataset this caused sample MMRF_2702 (svar pos 54,
GVL sorted 625) to receive MMRF_1395's NRAS G12D mutation because
sample_subset_idxs[625] mapped to MMRF_1395.

Fix: after the unravel/exon-expand step, r_idx and s_idx already hold
full-dataset positions. Compute the flat storage index directly via
ravel_multi_index using full_region_idxs, bypassing the second _s_idx
application that parse_idx would otherwise inject.

Adds regression test test_splice_indexer_subset.py that constructs a
SpliceIndexer with sample_subset_idxs=[2,4] over 5 on-disk samples,
verifies all three index paths (slice×slice, scalar×scalar, no-subset)
return the correct on-disk sample positions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@d-laub d-laub force-pushed the fix/spliced-subset-sample-scramble branch from 4454de8 to d814965 Compare June 25, 2026 03:27
@d-laub d-laub merged commit 2229127 into main Jun 25, 2026
8 checks passed
@d-laub d-laub deleted the fix/spliced-subset-sample-scramble branch June 25, 2026 03:45
d-laub added a commit that referenced this pull request Jun 25, 2026
…pro 0.20

Design for: merge origin/main (#242/#244 clip fix + #243 splice-subset fix)
into the branch, lift the now-obsolete #242 xfails, port Reference.fetch to
rust, fuse the annotated/splice haps paths, bump seqpro 0.18->0.20 with
to_numpy(validate=False) adoption, and reconcile the roadmap honestly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d-laub added a commit that referenced this pull request Jun 25, 2026
7 tasks: merge origin/main (#242/#243), lift obsolete #242 xfails, reroute
Reference.fetch through rust get_reference, fuse annotated + spliced haps
kernels, bump seqpro 0.20 + validate=False, roadmap honesty pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant