[SYSTEMDS-3949] Add native Delta Lake frame read/write via Delta Kernel #2515
Open
Baunsgaard wants to merge 3 commits into
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #2515 +/- ##
============================================
+ Coverage 71.56% 71.70% +0.14%
- Complexity 49110 49463 +353
============================================
Files 1575 1583 +8
Lines 189793 190943 +1150
Branches 37235 37451 +216
============================================
+ Hits 135816 136908 +1092
- Misses 43480 43482 +2
- Partials 10497 10553 +56
27fe2ff to d269ee7
Extend the native Delta Lake support from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.

- Add FrameReaderDelta, FrameReaderDeltaParallel and FrameWriterDelta
- Wire DELTA into the frame reader and writer factories
- Refresh cached frame metadata and schema after a Delta read
- Broaden Delta frame component IO coverage

Stacked on the matrix Delta support; append/overwrite semantics, distributed execution, and time travel remain out of scope.
d269ee7 to 8feb495
The native Delta read decode is CPU-bound and parallelizes per data file, so a table written as one large file cannot use more than one reader thread. Size data files toward roughly one file per expected parallel reader, capped by the configured target and floored to avoid tiny-file proliferation. This materially improves parallel-read throughput for both matrix and frame tables.

- Add the sysds.io.delta.writer.adaptivefilesize config (default true) plus adaptiveWriterTargetFileSize/createWriteEngine helpers in DeltaKernelUtils, and document the target file size as an upper bound
- Wire FrameWriterDelta and WriterDelta to size files from the block's estimated bytes (dense double footprint for matrices)
- Use the configurable DELTA_WRITER_BATCH_SIZE in FrameWriterDelta instead of a hardcoded batch size, matching the matrix writer
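The sizing rule described in this commit can be sketched roughly as below. This is only an illustration of "one file per expected reader, capped and floored", not the code in DeltaKernelUtils: the class name, method signatures, and the 8 MiB floor are assumptions made for the example, and the real floor and reader-count estimate may differ.

```java
public final class AdaptiveFileSizeSketch {
    // Hypothetical floor that keeps the writer from producing many tiny files.
    static final long MIN_FILE_BYTES = 8L * 1024 * 1024;

    /**
     * Returns a target data-file size in bytes.
     * With the adaptive flag off (or unusable inputs) the configured target is used as-is.
     * Otherwise aim at one file per expected reader, raise to the floor,
     * and finally cap at the configured target, which acts as an upper bound.
     */
    static long adaptiveTargetFileSize(long estimatedBytes, int expectedReaders,
            long configuredTarget, boolean adaptive) {
        if (!adaptive || estimatedBytes <= 0 || expectedReaders <= 0)
            return configuredTarget;
        // Ceiling division so the file count does not exceed the reader count.
        long perReader = (estimatedBytes + expectedReaders - 1) / expectedReaders;
        long floored = Math.max(perReader, MIN_FILE_BYTES);
        return Math.min(floored, configuredTarget);
    }

    /** Dense double footprint of a rows x cols matrix: 8 bytes per cell. */
    static long denseDoubleBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), 8L);
    }

    public static void main(String[] args) {
        long cap = 256L * 1024 * 1024;
        long bytes = denseDoubleBytes(1_000_000, 100);
        System.out.println(adaptiveTargetFileSize(bytes, 8, cap, true));
        System.out.println(adaptiveTargetFileSize(bytes, 8, cap, false));
    }
}
```

Applying the cap last means a configured target below the floor still wins, which is consistent with treating the target file size as an upper bound.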
The parallel frame reader's metadata-direct path wrote each data file's rows into shared per-column arrays at a fixed offset without bounding the row count, so a table whose per-file numRecords statistic under-counts the actual rows (possible for externally written Delta tables) could overrun its slice into the next file's region under concurrent writes.

- Add the per-file row-count overflow guard in FrameReaderDeltaParallel.readDirect, matching the matrix reader: fail fast with a clear message instead of risking overlapping concurrent writes or an array overrun
- Reuse DeltaKernelUtils.typeCode/T_* in FrameReaderDelta instead of a forked R_* table and instanceof cascade, keeping the frame and matrix type dispatch in lockstep; drop the now-unused type imports
- Extract awaitFileTasks in FrameReaderDeltaParallel to share the pool lifecycle across both read paths and restore the interrupt flag when a parallel read is cancelled
- Add a unit test covering the adaptive target-file-size flag on/off and the floor/cap clamp boundaries
- Clarify the adaptive-size javadoc floor wording, the createWriteEngine batch-size comment, and rename opaque locals (names2, bcs/bss)
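The overflow guard from the first bullet can be illustrated with a simplified sketch. It assumes a single shared double[] column and decoded batches delivered as arrays; the class and method names are hypothetical, and the actual readDirect works on typed per-column frame arrays through Delta Kernel batches.

```java
import java.util.List;

public final class RowCountGuardSketch {
    /**
     * Copies one data file's decoded batches into its slice of a shared column.
     * The slice starts at 'offset' and is 'numRecords' rows long, as claimed by
     * the file's statistics. If the file yields more rows than claimed, fail
     * before writing so the neighbouring file's slice is never touched.
     */
    static int copyFileBatches(double[] column, int offset, long numRecords,
            Iterable<double[]> batches, String fileName) {
        int written = 0;
        for (double[] batch : batches) {
            if ((long) written + batch.length > numRecords)
                throw new IllegalStateException("Delta file " + fileName
                    + " holds more rows than its numRecords statistic (" + numRecords + ")");
            System.arraycopy(batch, 0, column, offset + written, batch.length);
            written += batch.length;
        }
        return written;
    }

    public static void main(String[] args) {
        double[] column = new double[5];
        // Two files whose statistics claim 3 and 2 rows respectively.
        copyFileBatches(column, 0, 3, List.of(new double[] {1, 2}, new double[] {3}), "a.parquet");
        copyFileBatches(column, 3, 2, List.of(new double[] {4, 5}), "b.parquet");
        System.out.println(java.util.Arrays.toString(column));
    }
}
```

Because the check runs before each copy, a file that under-reports its row count fails with a clear message rather than silently overwriting rows owned by another reader thread.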
Extend the native Delta Lake support (#2511) from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.
Stacked on #2511 and should merge after it. Append/overwrite semantics, distributed execution, and time travel remain out of scope.