Skip to content

[SYSTEMDS-3929] Speed up Parquet frame reader/writer#2528

Open
Jakob-al28 wants to merge 2 commits into
apache:mainfrom
Jakob-al28:SYSTEMDS-3929-parquet
Open

[SYSTEMDS-3929] Speed up Parquet frame reader/writer#2528
Jakob-al28 wants to merge 2 commits into
apache:mainfrom
Jakob-al28:SYSTEMDS-3929-parquet

Conversation

@Jakob-al28

Copy link
Copy Markdown

[SYSTEMDS-3929] Speed up Parquet frame reader/writer

The old reader created a Group object per row, boxing every decoded value into a wrapper object inside it, even though SystemDS frames are always flat. The parallel reader also converted every boxed value to a string and re-parsed it back to its target type in FrameBlock.set. This PR reads columns via parquet's internal column API (ColumnReadStoreImpl/ColumnReader) instead, avoiding the Group allocation. On TPC-H lineitem (sf=5, ~30M rows, median of 7 runs on a cloud VM), read time went from 88.9s to 34.5s (2.6x faster).

tpch_read

Also rewrites the writer to use a custom WriteSupport with tuned defaults, adds INT96 timestamp read support, and fixes row offsets for the parallel reader.

The old writer had a tunable batch of 1000 rows. The batch size was benchmarked looking for a better default, but ParquetWriter already batches internally by buffered byte size, making the manual row-count batch on top of it redundant, so the new writer drops it and instead exposes that internal buffer size as the tunable row-group size. Compression codec and dictionary-encoding were benchmarked as well: ZSTD and per-column dictionary encoding (STRING_ONLY) are used as defaults. Row group size was benchmarked too and kept at parquet's default of 128MB. Note the batch-size comparison runs uncompressed to match the legacy writer's codec, while the row-group benchmark uses the ZSTD default, so absolute times differ between the two plots.

tpch_batch_sizes tpch_row_group_sizes tpch_compression tpch_encoding

TPC-H lineitem's comment column is ~84% unique, so dictionary encoding on it doesn't pay off, which is why ALL_OFF wins on this benchmark. Typical frame workloads are assumed to have low-cardinality string columns where dictionaries pay off, so STRING_ONLY is kept as the default.

Also adds public Parquet test files under src/test/resources/datasets/parquet/ (duckdb, apache/parquet-testing, Titanic from HuggingFace) as a check against real-world files, and tests covering INT96 decoding, null handling and parallel/sequential equivalence.

Possible next step: row-group-level parallel reads within a single file.

Limitations: INT96 decoding truncates nanosecond precision to milliseconds, and there's no logical-type mapping for dates/timestamps since FrameBlock has no such types.

Rewrites the Parquet frame reader to read columns via
parquet's column API (ColumnReadStoreImpl, ColumnReader) instead of
creating a Group object per row. On TPC-H lineitem (~30M rows), read time went from
88.9s to 34.5s (2.6x faster).

Also rewrites the writer around a custom WriteSupport, removing the
per-row Group allocation, adding INT96 timestamp decoding, and fixing
the parallel reader to use the sequential implementation
instead of reimplementing row iteration per thread. ParquetWriter
batches by buffered size internally, so the old writer's manual batch
buffer was redundant and has been removed. Compression, dictionary
encoding, and row-group size are benchmarked and configurable.

The old Group-based implementations are kept under test scope as a benchmark baseline.
Adds tests covering null handling, INT96 decoding, and round trips against public Parquet
files.
@Baunsgaard

Copy link
Copy Markdown
Contributor

Im curious how this compares to the Delta IO i have in #2515 .

There the main bottleneck is decoding the Parquet files.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible avoid adding parquet files to the system,

instead rely on some known good Parquet writer to generate your initial inputs. e.g. Spark has a writer you can use for making inputs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet files have been deleted and instead, Spark is used to write the test files. Will take a look at how this compares to the Delta IO solution in #2515.

@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.50617% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.66%. Comparing base (c099a2f) to head (e0916e1).

Files with missing lines Patch % Lines
...rg/apache/sysds/runtime/io/FrameReaderParquet.java 87.69% 5 Missing and 3 partials ⚠️
...rg/apache/sysds/runtime/io/FrameWriterParquet.java 91.89% 3 Missing and 3 partials ⚠️
...rg/apache/sysds/runtime/io/FrameReaderFactory.java 0.00% 1 Missing ⚠️
...rg/apache/sysds/runtime/io/FrameWriterFactory.java 0.00% 1 Missing ⚠️
...e/sysds/runtime/io/FrameWriterParquetParallel.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2528   +/-   ##
=========================================
  Coverage     71.66%   71.66%           
+ Complexity    49338    49328   -10     
=========================================
  Files          1580     1580           
  Lines        190516   190579   +63     
  Branches      37364    37368    +4     
=========================================
+ Hits         136525   136582   +57     
- Misses        43464    43472    +8     
+ Partials      10527    10525    -2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants