feat: Allow parquet column access by field_id #6156

devinrsmith · 2024-09-30T22:28:41Z

This allows a parquet column to be resolved by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, it has not been plumbed through to Python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.

Writing support has also been added.

Fixes #6128
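To illustrate the difference between the two lookup styles, here is a minimal sketch in plain Java. It does not use the Deephaven or parquet-mr APIs; `ParquetColumn`, `byPath`, and `byFieldId` are hypothetical names made up for this example. It assumes the usual Iceberg motivation, namely that a field id stays stable when a column is renamed, so a lookup by id can still succeed where a lookup by path would miss.

```java
import java.util.List;
import java.util.Optional;
import java.util.OptionalInt;

public class FieldIdLookup {
    /** Hypothetical stand-in for a Parquet leaf column: its path plus an optional field_id. */
    public static final class ParquetColumn {
        public final String path;
        public final OptionalInt fieldId;

        public ParquetColumn(String path, OptionalInt fieldId) {
            this.path = path;
            this.fieldId = fieldId;
        }
    }

    /** Resolve a column by its path (name). */
    public static Optional<ParquetColumn> byPath(List<ParquetColumn> schema, String path) {
        return schema.stream().filter(c -> c.path.equals(path)).findFirst();
    }

    /** Resolve a column by its field_id; columns without a field_id never match. */
    public static Optional<ParquetColumn> byFieldId(List<ParquetColumn> schema, int fieldId) {
        return schema.stream()
                .filter(c -> c.fieldId.isPresent() && c.fieldId.getAsInt() == fieldId)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed scenario: the column was once called "price" and is now "unit_price";
        // its field_id (2) did not change.
        List<ParquetColumn> schema = List.of(
                new ParquetColumn("symbol", OptionalInt.of(1)),
                new ParquetColumn("unit_price", OptionalInt.of(2)));

        System.out.println(byPath(schema, "price").isPresent()); // prints false
        System.out.println(byFieldId(schema, 2).map(c -> c.path).orElse("<missing>")); // prints unit_price
    }
}
```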


malhotrashivam

First level of review, can do a more detailed review tomorrow.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

malhotrashivam · 2024-09-30T22:52:11Z

Do verify the nightlies pass before merging.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci, which would make the KeyDef blow up.

Given statefulness we maintain around parquetColumnName, we should not skip the logic when a user explicitly sets the parquet column name the same as the column name

devinrsmith · 2024-10-01T14:40:15Z

Do verify the nightlies pass before merging.

Verified.

malhotrashivam

I really like the change, minor comments.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

malhotrashivam · 2024-10-01T15:28:18Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                        // TODO: how should we handle this? Ignore?
+                        // throw new IllegalArgumentException();


I feel that this should be an error in ParquetInstructions, right when the user sets it, instead of here.

This whole code path is very smelly; I want to go through a larger refactoring that would alleviate the need to make these types of calls in the first place. This code path is only hit when inferring the TableDefinition, so I don't think it should be an error to set the same field id multiple times in general. We have set it up this way with parquet column names, but we shouldn't technically need to do that either - every little modelling mismatch we present is a small papercut that can lead to larger modelling problems at higher layers IMO.

I would be okay throwing an error here or silently ignoring wrt inference. Ideally, the user would be able to choose the behavior they desire. The structure of ParquetInstructions / builder makes that tedious (I wish we could redo it w/ Immutables and saner structures).

I'll change this to throw an error here, with a note that we could think about exposing an option to silently ignore if that's what the user wants.
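For reference, the "throw an error" option can be sketched as follows. This is a standalone illustration, not the code in ParquetSchemaReader: the class and method names are hypothetical, and the choice of `IllegalArgumentException` is an assumption taken from the commented-out TODO in the diff.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldIdInference {
    /**
     * Inverts a (column name -> field_id) mapping for use during definition inference.
     * Fails when two column names claim the same field_id, since the inverse would be ambiguous.
     */
    public static Map<Integer, String> invert(Map<String, Integer> columnNameToFieldId) {
        Map<Integer, String> fieldIdToColumnName = new HashMap<>();
        for (Map.Entry<String, Integer> e : columnNameToFieldId.entrySet()) {
            // putIfAbsent returns the existing value when the field_id was already claimed
            String previous = fieldIdToColumnName.putIfAbsent(e.getValue(), e.getKey());
            if (previous != null) {
                throw new IllegalArgumentException(String.format(
                        "Field id %d is mapped to multiple columns: %s, %s",
                        e.getValue(), previous, e.getKey()));
            }
        }
        return fieldIdToColumnName;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        mapping.put("Foo", 1);
        mapping.put("Bar", 2);
        System.out.println(invert(mapping).get(2)); // prints Bar

        mapping.put("Baz", 1); // same field_id as Foo
        try {
            invert(mapping);
        } catch (IllegalArgumentException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```

An option to silently ignore duplicates would amount to dropping the `throw` and keeping the first (or no) mapping for the conflicting id; which of those is right is exactly the open question in this thread.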

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

devinrsmith · 2024-10-01T16:53:45Z

I couldn't find any resources to confirm, but having two columns with the same field ID does feel incorrect to me. For example, if we get a field ID from Iceberg, it would expect a single column, right?

Iceberg probably mandates the uniqueness of field-ids.

Parquet doesn't have any mandates wrt that. And even the column names aren't guaranteed to be unique. I need to find the reference I found earlier that the parquet format "strongly recommends" unique column names, but it's not even a guarantee.
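Since the file itself may carry repeated field ids, one possible read-side policy is to tolerate duplicates in the schema and only fail when the id actually being requested is ambiguous. The sketch below shows that policy in plain Java; it is an assumption about a reasonable behavior, not a description of what RowGroupReaderImpl does, and `resolve` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class AmbiguousFieldId {
    /**
     * Finds the single column path carrying the requested field_id.
     * Returns empty when no column has it, and fails when more than one does.
     */
    public static Optional<String> resolve(Map<String, Integer> pathToFieldId, int fieldId) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pathToFieldId.entrySet()) {
            if (e.getValue() == fieldId) {
                matches.add(e.getKey());
            }
        }
        if (matches.size() > 1) {
            throw new IllegalStateException(
                    "Field id " + fieldId + " is ambiguous, found on columns " + matches);
        }
        return matches.stream().findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical file schema where two columns share field_id 1
        Map<String, Integer> pathToFieldId = new LinkedHashMap<>();
        pathToFieldId.put("a", 1);
        pathToFieldId.put("b", 1);
        pathToFieldId.put("c", 2);

        System.out.println(resolve(pathToFieldId, 2)); // prints Optional[c]
        System.out.println(resolve(pathToFieldId, 3)); // prints Optional.empty
        try {
            resolve(pathToFieldId, 1);
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```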

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java

malhotrashivam · 2024-10-01T17:30:09Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/TypeInfos.java

@@ -474,6 +474,7 @@ default Type createSchemaType(
                builder = getBuilder(isRequired(columnDefinition), false, dataType);
                isRepeating = false;
            }
+            instructions.getFieldId(columnDefinition.getName()).ifPresent(builder::id);


You can skip it here, I am making the change and testing it as part of my PR here.
Or if you have already added the tests, you can copy the logic from my PR. The main difference is how we handle nested columns like lists.

The ability to write Parquet field ids doesn't necessarily need to be tied into Iceberg's usage of it. Given how simple it was here, I think we can leave it in?
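The one-line write-side change in the diff relies on a small idiom: an optional field id is applied to the builder only when present. A self-contained sketch of that idiom is below. `ColumnTypeBuilder` is a hypothetical stand-in for the schema type builder, and the assumption that `getFieldId` returns an `OptionalInt` is inferred from the `ifPresent(builder::id)` call, not confirmed by the source.

```java
import java.util.Map;
import java.util.OptionalInt;

public class FieldIdWriting {
    /** Hypothetical stand-in for a schema type builder that exposes id(int). */
    public static final class ColumnTypeBuilder {
        private final String name;
        private Integer id; // null when no field_id was set

        public ColumnTypeBuilder(String name) {
            this.name = name;
        }

        public ColumnTypeBuilder id(int id) {
            this.id = id;
            return this;
        }

        public String build() {
            return id == null ? name : name + " = " + id;
        }
    }

    /** Looks up the configured field_id for a column, if any. */
    public static OptionalInt getFieldId(Map<String, Integer> fieldIds, String columnName) {
        Integer id = fieldIds.get(columnName);
        return id == null ? OptionalInt.empty() : OptionalInt.of(id);
    }

    /** Builds the column's type description, attaching the field_id only when one is configured. */
    public static String describe(Map<String, Integer> fieldIds, String columnName) {
        ColumnTypeBuilder builder = new ColumnTypeBuilder(columnName);
        // Same shape as the diff: a method reference that returns a value
        // is still usable where a void IntConsumer is expected.
        getFieldId(fieldIds, columnName).ifPresent(builder::id);
        return builder.build();
    }

    public static void main(String[] args) {
        Map<String, Integer> fieldIds = Map.of("Foo", 42);
        System.out.println(describe(fieldIds, "Foo")); // prints Foo = 42
        System.out.println(describe(fieldIds, "Bar")); // prints Bar
    }
}
```

Columns without a configured id are written exactly as before, which is why the change can stay independent of any Iceberg usage.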

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

devinrsmith added the parquet, NoDocumentationNeeded, and ReleaseNotesNeeded labels Sep 30, 2024

devinrsmith added this to the 0.37.0 milestone Sep 30, 2024

devinrsmith requested a review from malhotrashivam September 30, 2024 22:28

devinrsmith self-assigned this Sep 30, 2024

devinrsmith requested a review from rcaudy September 30, 2024 22:28

malhotrashivam reviewed Sep 30, 2024

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java (outdated)

malhotrashivam reviewed Sep 30, 2024

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

Review response

65453c4

devinrsmith requested a review from malhotrashivam October 1, 2024 00:00

devinrsmith added 3 commits September 30, 2024 17:52

Cleanup ParquetInstructions.addColumnNameMapping

9fa979e

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci, which would make the KeyDef blow up.

Given statefulness we maintain around parquetColumnName, we should not skip the logic when a user explicitly sets the parquet column name the same as the column name

6b58468

Add ParquetInstructions test

3e34bfa

malhotrashivam reviewed Oct 1, 2024


Add writing support

19a6490

Review response

ce2f2b8

devinrsmith requested a review from malhotrashivam October 1, 2024 16:56

malhotrashivam reviewed Oct 1, 2024


devinrsmith added 2 commits October 1, 2024 12:18

Handle case where a parquet field has non-unique field ids

a6ed292

Ensure LIST support for field_id

35e2983

devinrsmith requested a review from malhotrashivam October 1, 2024 19:59

malhotrashivam reviewed Oct 2, 2024

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java (outdated)

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java (outdated)

review response

1a2aa69

devinrsmith requested a review from malhotrashivam October 2, 2024 18:42

malhotrashivam approved these changes Oct 2, 2024


malhotrashivam mentioned this pull request Oct 2, 2024

feat: Added support to write iceberg tables #5989

Open
