Iceberg reading with explicit Schema support #6124

devinrsmith · 2024-09-25T13:37:22Z

I believe we need to offer Iceberg reading support based on a user-specified Schema (likely a Schema sourced from the Table's schema history, or some subset). When a user passes in String column renames, the keys of that map are tied to a specific Schema; given that Schema, the map can be converted internally into a field-id rename, which remains valid across column renames. It may also be important for an enterprise to establish and record the specific Schema a table was initially ingested with (for example, they may want to enforce that db.t("MyNamespace", "MyTable") produces exactly the same output today as it did yesterday). It is not enough to record the latest, or even a specific, Snapshot, because the Table's schema may change without a new Snapshot: when a column is renamed, the Table's Schema is updated, but no new Snapshot is created.
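To illustrate the name-keyed to field-id-keyed conversion described above, here is a minimal sketch. It does not use the Iceberg API: the `nameToFieldId` map is a hypothetical stand-in for the chosen Schema (with a real `org.apache.iceberg.Schema` the lookup would presumably go through `schema.findField(name).fieldId()`), and `FieldIdRenames` / `toFieldIdRenames` are illustrative names, not existing Deephaven code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldIdRenames {

    /**
     * Converts name-keyed renames into field-id-keyed renames.
     *
     * @param nameToFieldId hypothetical stand-in for the chosen Schema: column name -> field id
     * @param renames user-supplied renames: column name in that Schema -> desired column name
     * @return the same renames keyed by field id, which do not depend on the column's current name
     */
    static Map<Integer, String> toFieldIdRenames(
            Map<String, Integer> nameToFieldId, Map<String, String> renames) {
        Map<Integer, String> byFieldId = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : renames.entrySet()) {
            Integer fieldId = nameToFieldId.get(e.getKey());
            if (fieldId == null) {
                // The rename key only has meaning relative to the Schema it was written against.
                throw new IllegalArgumentException(
                        "Column '" + e.getKey() + "' is not in the provided schema");
            }
            byFieldId.put(fieldId, e.getValue());
        }
        return byFieldId;
    }
}
```

Because the result is keyed by field id, it can be applied to any later Schema of the same Table, even one where the source column has since been renamed.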

As a point of convention, it probably makes sense to assume these defaults (in order):

1. If a schema is provided, use that schema.
2. If a snapshot is provided, use schema table.schemas().get(snapshot.schemaId()).
3. Otherwise, use schema table.schema(). Note that this is not equivalent to table.schemas().get(table.currentSnapshot().schemaId()), for the reasons mentioned above.
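The precedence above can be sketched as follows. The `Schema`, `Snapshot`, and `Table` records are hypothetical stand-ins that mirror only the accessors named in this issue (`schema()`, `schemas()`, `schemaId()`), so the example runs without the Iceberg library; `SchemaResolution.resolve` is an illustrative name, not an existing API.

```java
import java.util.Map;

final class SchemaResolution {

    /** Hypothetical stand-in for org.apache.iceberg.Schema. */
    record Schema(int schemaId) {}

    /** Hypothetical stand-in for org.apache.iceberg.Snapshot; only the schema id matters here. */
    record Snapshot(long snapshotId, int schemaId) {}

    /** Hypothetical stand-in for org.apache.iceberg.Table. */
    record Table(Schema schema, Map<Integer, Schema> schemas) {}

    /**
     * Resolves the schema to read with. Either optional argument may be null.
     * Precedence: explicit schema, then the snapshot's schema, then the table's current schema.
     */
    static Schema resolve(Table table, Schema explicitSchema, Snapshot snapshot) {
        if (explicitSchema != null) {
            return explicitSchema;
        }
        if (snapshot != null) {
            Schema snapshotSchema = table.schemas().get(snapshot.schemaId());
            if (snapshotSchema == null) {
                throw new IllegalStateException(
                        "No schema with id " + snapshot.schemaId() + " in table metadata");
            }
            return snapshotSchema;
        }
        // Not the same as the current snapshot's schema: a column rename
        // updates table.schema() without creating a new snapshot.
        return table.schema();
    }
}
```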


devinrsmith · 2024-09-25T16:06:22Z

I can also make an argument that our default should be:

1. If a schema is provided, use that schema.
2. Otherwise, use schema table.schema().

I.e., if a specific snapshot is provided (and not a schema), the result would be the snapshot's data projected into the Table's latest schema.

This has the advantage that tableAdapter.read() would semantically be the same as tableAdapter.read(table.currentSnapshot()).

Regardless, in either regime, the user can specify the schema, and so achieve the behavior they desire.
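A minimal sketch of this second regime, where the snapshot selects only the data and the schema defaults to the Table's latest. The records are hypothetical stand-ins for the Iceberg types, and `ReadPlan` / `plan` are illustrative names rather than anything in tableAdapter today.

```java
final class LatestSchemaDefault {

    /** Hypothetical stand-in for org.apache.iceberg.Schema. */
    record Schema(int schemaId) {}

    /** Hypothetical stand-in for org.apache.iceberg.Snapshot. */
    record Snapshot(long snapshotId, int schemaId) {}

    /** Hypothetical stand-in for org.apache.iceberg.Table. */
    record Table(Schema schema, Snapshot currentSnapshot) {}

    /** Which snapshot supplies the data, and which schema that data is projected into. */
    record ReadPlan(Snapshot snapshot, Schema schema) {}

    /** Either optional argument may be null; the snapshot never influences the schema choice. */
    static ReadPlan plan(Table table, Schema explicitSchema, Snapshot snapshot) {
        Snapshot data = snapshot != null ? snapshot : table.currentSnapshot();
        Schema projection = explicitSchema != null ? explicitSchema : table.schema();
        return new ReadPlan(data, projection);
    }
}
```

With this shape, `plan(table, null, null)` and `plan(table, null, table.currentSnapshot())` produce equal plans even when the Table's schema was changed (for example by a rename) after the current snapshot was written.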

devinrsmith · 2024-09-25T16:50:41Z

Defaulting to the table.schema() regime also allows the current constructors of io.deephaven.iceberg.layout.IcebergBaseLayout (and derived classes) to remain valid; while we should be okay breaking these, we eventually need to get into a state where we aren't breaking them.

devinrsmith · 2024-09-30T21:46:45Z

Q: Does PartitionSpec come into play, and if so, is it always the case that the schema is equal to PartitionSpec#schema?

devinrsmith added the feature request, triage, and iceberg labels Sep 25, 2024

devinrsmith added this to the Triage milestone Sep 25, 2024

devinrsmith assigned lbooker42 Sep 25, 2024

rcaudy added the core label and removed the triage label Sep 25, 2024

rcaudy modified the milestones: Triage, 0.38.0 Sep 25, 2024

devinrsmith mentioned this issue Sep 30, 2024

Iceberg column rename handling #6118

Open
