[SYSTEMDS-3929] Speed up Parquet frame reader/writer by Jakob-al28 · Pull Request #2528 · apache/systemds

Jakob-al28 · 2026-07-02T07:27:56Z

[SYSTEMDS-3929] Speed up Parquet frame reader/writer

The old reader created a Group object per row, boxing every decoded value into a wrapper object inside it, even though SystemDS frames are always flat. The parallel reader also converted every boxed value to a string and re-parsed it back to its target type in FrameBlock.set. This PR reads columns via parquet's internal column API (ColumnReadStoreImpl/ColumnReader) instead, avoiding the Group allocation. On TPC-H lineitem (sf=5, ~30M rows, median of 7 runs on a cloud VM), read time went from 88.9s to 34.5s (2.6x faster).

Also rewrites the writer to use a custom WriteSupport with tuned defaults, adds INT96 timestamp read support, and fixes row offsets for the parallel reader.

The old writer had a tunable batch of 1000 rows. The batch size was benchmarked looking for a better default, but ParquetWriter already batches internally by buffered byte size, making the manual row-count batch on top of it redundant, so the new writer drops it and instead exposes that internal buffer size as the tunable row-group size. Compression codec and dictionary-encoding were benchmarked as well: ZSTD and per-column dictionary encoding (STRING_ONLY) are used as defaults. Row group size was benchmarked too and kept at parquet's default of 128MB. Note the batch-size comparison runs uncompressed to match the legacy writer's codec, while the row-group benchmark uses the ZSTD default, so absolute times differ between the two plots.

TPC-H lineitem's comment column is ~84% unique, so dictionary encoding on it doesn't pay off, which is why ALL_OFF wins on this benchmark. Typical frame workloads are assumed to have low-cardinality string columns where dictionaries pay off, so STRING_ONLY is kept as the default.

Also adds public Parquet test files under src/test/resources/datasets/parquet/ (duckdb, apache/parquet-testing, Titanic from HuggingFace) as a check against real-world files, and tests covering INT96 decoding, null handling and parallel/sequential equivalence.

Possible next step: row-group-level parallel reads within a single file.

Limitations: INT96 decoding truncates nanosecond precision to milliseconds, and there's no logical-type mapping for dates/timestamps since FrameBlock has no such types.

Rewrites the Parquet frame reader to read columns via parquet's column API (ColumnReadStoreImpl, ColumnReader) instead of creating a Group object per row. On TPC-H lineitem (~30M rows), read time went from 88.9s to 34.5s (2.6x faster). Also rewrites the writer around a custom WriteSupport, removing the per-row Group allocation, adding INT96 timestamp decoding, and fixing the parallel reader to use the sequential implementation instead of reimplementing row iteration per thread. ParquetWriter batches by buffered size internally, so the old writer's manual batch buffer was redundant and has been removed. Compression, dictionary encoding, and row-group size are benchmarked and configurable. The old Group-based implementations are kept under test scope as a benchmark baseline. Adds tests covering null handling, INT96 decoding, and round trips against public Parquet files.

Baunsgaard · 2026-07-02T08:26:09Z

Im curious how this compares to the Delta IO i have in #2515 .

There the main bottleneck is decoding the Parquet files.

Baunsgaard · 2026-07-02T08:27:38Z

If possible avoid adding parquet files to the system,

instead rely on some known good Parquet writer to generate your initial inputs. e.g. Spark has a writer you can use for making inputs.

Parquet files have been deleted and instead, Spark is used to write the test files. Will take a look at how this compares to the Delta IO solution in #2515.

codecov · 2026-07-02T08:51:47Z

Codecov Report

❌ Patch coverage is 89.50617% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.66%. Comparing base (c099a2f) to head (e0916e1).

Files with missing lines	Patch %	Lines
...rg/apache/sysds/runtime/io/FrameReaderParquet.java	87.69%	5 Missing and 3 partials ⚠️
...rg/apache/sysds/runtime/io/FrameWriterParquet.java	91.89%	3 Missing and 3 partials ⚠️
...rg/apache/sysds/runtime/io/FrameReaderFactory.java	0.00%	1 Missing ⚠️
...rg/apache/sysds/runtime/io/FrameWriterFactory.java	0.00%	1 Missing ⚠️
...e/sysds/runtime/io/FrameWriterParquetParallel.java	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2528   +/-   ##
=========================================
  Coverage     71.66%   71.66%           
+ Complexity    49338    49328   -10     
=========================================
  Files          1580     1580           
  Lines        190516   190579   +63     
  Branches      37364    37368    +4     
=========================================
+ Hits         136525   136582   +57     
- Misses        43464    43472    +8     
+ Partials      10527    10525    -2

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-project-automation Bot added this to SystemDS PR Queue Jul 2, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jul 2, 2026

Baunsgaard reviewed Jul 2, 2026

View reviewed changes

Generate parquet test files with Spark

e0916e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYSTEMDS-3929] Speed up Parquet frame reader/writer#2528

[SYSTEMDS-3929] Speed up Parquet frame reader/writer#2528
Jakob-al28 wants to merge 2 commits into
apache:mainfrom
Jakob-al28:SYSTEMDS-3929-parquet

Jakob-al28 commented Jul 2, 2026

Uh oh!

Baunsgaard commented Jul 2, 2026

Uh oh!

Baunsgaard Jul 2, 2026

Uh oh!

Jakob-al28 Jul 2, 2026

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Jakob-al28 commented Jul 2, 2026

Uh oh!

Baunsgaard commented Jul 2, 2026

Uh oh!

Baunsgaard Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Jakob-al28 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jul 2, 2026 •

edited

Loading