fix: SpliceIndexer double-applies sample-subset map (spliced+subset sample scramble)#243
Merged
Merged
Conversation
When a Dataset has been sample-subsetted (non-identity sample_subset_idxs) AND output is spliced, SpliceIndexer.parse_idx applied the sample-subset map twice: 1. Via self.dsi._s_idx[s_idx] within parse_idx, which maps output-space sample positions to on-disk positions. 2. Again via self.dsi.parse_idx((r_idx, s_idx)) at the end, which applies _s_idx a second time to what are already on-disk positions. With the MMRF consensus dataset this caused sample MMRF_2702 (svar pos 54, GVL sorted 625) to receive MMRF_1395's NRAS G12D mutation because sample_subset_idxs[625] mapped to MMRF_1395. Fix: after the unravel/exon-expand step, r_idx and s_idx already hold full-dataset positions. Compute the flat storage index directly via ravel_multi_index using full_region_idxs, bypassing the second _s_idx application that parse_idx would otherwise inject. Adds regression test test_splice_indexer_subset.py that constructs a SpliceIndexer with sample_subset_idxs=[2,4] over 5 on-disk samples, verifies all three index paths (slice×slice, scalar×scalar, no-subset) return the correct on-disk sample positions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4454de8 to
d814965
Compare
d-laub
added a commit
that referenced
this pull request
Jun 25, 2026
…pro 0.20 Design for: merge origin/main (#242/#244 clip fix + #243 splice-subset fix) into the branch, lift the now-obsolete #242 xfails, port Reference.fetch to rust, fuse the annotated/splice haps paths, bump seqpro 0.18->0.20 with to_numpy(validate=False) adoption, and reconcile the roadmap honestly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
SpliceIndexer.parse_idxinpython/genvarloader/_dataset/_indexing.py(around line 451 on main) applied the sample-subset map twice, silently corrupting spliced output whenever a Dataset had been sample-subsetted with a non-identitysample_subset_idxs.The double-application happened as follows:
s_idx = self.dsi._s_idx[s_idx]converts output-space sample positions [0..n_subset_samples) → on-disk sample positions viafull_sample_idxs[sample_subset_idxs[...]].self.full_shapeproducess_idxvalues already in on-disk space.r_idx = self.map.full_splice_map[r_idx](exon expansion) ands_idx = s_idx.repeat(lengths), the code calledself.dsi.parse_idx((r_idx, s_idx)).DatasetIndexer.parse_idx):s_idx = self._s_idx[s_idx]appliedfull_sample_idxs[sample_subset_idxs[...]]again to what were already on-disk positions — mapping wrong samples entirely, or raisingIndexErrorwhen the double-mapped value fell out of bounds.Symptom
Only triggered with
splice_infoenabled and a non-identitysample_subset(i.e.dataset.subset_to(samples=...)). Non-spliced paths were unaffected because they don't calldsi.parse_idxwith already-mapped indices.Concrete MMRF repro: With the MMRF consensus dataset, sample MMRF_2702 (svar position 54, GVL sorted 625) received MMRF_1395's NRAS G12D mutation because
sample_subset_idxs[625]mapped to MMRF_1395.Fix
python/genvarloader/_dataset/_indexing.py, line ~451Replace the
dsi.parse_idxcall with a directravel_multi_index:After the unravel/exon-expand step,
r_idxare region positions (indices intofull_region_idxs) ands_idxare already-mapped on-disk sample indices. Applyingfull_region_idxs[r_idx]once and ravelling is all that's needed — no second_s_idxpass.Regression test
tests/unit/dataset/test_splice_indexer_subset.py— three cases:test_parse_idx_sample_subset_applied_once: constructs aSpliceIndexerwithsample_subset_idxs=[2, 4]over 5 on-disk samples, requests all transcripts × all subset samples, and asserts every returned on-disk sample index is 2 or 4 (not a double-mapped value). Failed on main withIndexError: index 4 is out of bounds for axis 0 with size 2. Passes after the fix.test_parse_idx_single_sample_from_subset: requests one specific subset sample by output position and verifies the correct on-disk index. Failed on main with the sameIndexError. Passes after the fix.test_parse_idx_no_subset_unaffected: no sample subset; confirms the non-subset path is unchanged. Passed on main and after the fix.Test execution
Ran locally in the repo's
devpixi environment (Python 3.10, maturin Rust build):Results:
🤖 Generated with Claude Code