-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Allow parquet column access by field_id #6156
base: main
Are you sure you want to change the base?
Conversation
This allows the the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings. Fixes deephaven#6128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First level of review, can do a more detailed review tomorrow.
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java
Outdated
Show resolved
Hide resolved
Do verify the nightlies pass before merging. |
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Show resolved
Hide resolved
This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parqute column name on ci and the KeyDef would blow up.
…t skip the logic when a user explicitly sets the parquet column name the same as the column name
Verified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the change, minor comments.
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Show resolved
Hide resolved
// TODO: how should we handle this? Ignore? | ||
// throw new IllegalArgumentException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that this should be an error in ParquetInstructions, right when the user sets it instead of here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole code path is very smelly; I want to go through a larger refactoring that would alleviate the need to make these types of calls in the first place. This code path is only hit when inferring the TableDefinition, so I don't think it should be an error to set the same field id multiple times in general. We have set it up this way with parquet column names, but we shouldn't technically need to do that either - every little modelling mismatch we present is a small papercut that can lead to larger modelling problems at higher layers IMO.
I would be okay throwing an error here or silently ignoring wrt inferrence. Ideally, the user would be able to choose the behavior they desire. The structure of ParquetInstructions / builder makes that tedious (I wish we could redo it w/ Immutables and saner structures).
I'll change this to throw an error here, with a note we could think about exposing option to silently ignore if that's what the user wants.
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Show resolved
Hide resolved
Iceberg probably mandates the uniqueness of field-ids. Parquet doesn't have any mandates wrt that. And even the column names aren't guaranteed to be unique. I need to find the reference I found earlier that the parquet format "strongly recommends" unique column names, but it's not even a guarantee. |
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java
Show resolved
Hide resolved
@@ -474,6 +474,7 @@ default Type createSchemaType( | |||
builder = getBuilder(isRequired(columnDefinition), false, dataType); | |||
isRepeating = false; | |||
} | |||
instructions.getFieldId(columnDefinition.getName()).ifPresent(builder::id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can skip it here, I am making the change and testing it as part of my PR here.
Or if you have already added the tests, you can copy the logic from my PR. The main difference is how we nested columns like handle lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ability to write Parquet field ids doesn't necessarily need to be tied into Iceberg's usage of it. Given how simple it was here, I think we can leave it in?
This allows the the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.
Writing support has also been added.
Fixes #6128