feat: Re-enable eager parquet reading #1273
Draft
WIP!
Eager parquet reading had a mismatch between the uproot-like shims and the parquet mappings, so `allow_missing` is now threaded into the appropriate methods, along with a copy of the uproot `IndexedOptionArray` creation (I hope I transplanted it appropriately... let me know if not, Q1). Running only `test_nanoevents.py` locally, the tests succeed, but I'm not sure where the dask/delayed version gets (or would get) tested, Q2 (that path should still fail due to the lack of `form_mapping` in dask-awkward).

The test for whether a column is available dives through the file, `arrow_schema`, etc., which may be more direct than desired, Q3.
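For concreteness, this is a minimal sketch of the kind of `IndexedOptionArray` placeholder I mean for a branch that a file is missing, assuming awkward 2.x; the helper name and dtype are illustrative, not what the diff actually contains:

```python
import awkward as ak
import numpy as np

def missing_branch_placeholder(length, dtype=np.float64):
    """All-None IndexedOptionArray standing in for an absent branch
    (hypothetical helper; the PR copies uproot's version of this logic)."""
    index = ak.index.Index64(np.full(length, -1, dtype=np.int64))  # -1 marks "missing"
    content = ak.contents.NumpyArray(np.empty(0, dtype=dtype))
    return ak.contents.IndexedOptionArray(index, content)

print(ak.Array(missing_branch_placeholder(3)).to_list())  # [None, None, None]
```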
I don't expect that `allow_missing` is currently tested for parquet, nor do I know whether suitable test files already exist... Q4. If needed, I could probably make some... both HLT and GenModel branches, potentially.

The old `nano_dy.parquet` file triggers an error related to the awkward extension type, c.f. scikit-hep/awkward#3393. However, newer parquet files made with hepconvert pass, even with `extensionarray=True`. It might be possible to access the `storage_type` to bypass the error, but I didn't have a complete implementation before I swapped to the newer parquet files. Q5

The new parquet files use hepconvert's default compression, which is zstd. Is that okay, or would another codec be better? Q6
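For Q5, roughly what I had in mind (a sketch only, assuming pyarrow and that the problem is the extension metadata rather than the storage itself) is dropping each extension-typed column down to its `storage_type` before handing the table to awkward:

```python
import pyarrow as pa

def strip_extension_types(table: pa.Table) -> pa.Table:
    """Replace extension-typed columns with their storage arrays so the
    awkward extension metadata is never consulted (illustrative only)."""
    fields, columns = [], []
    for field, column in zip(table.schema, table.columns):
        if isinstance(field.type, pa.ExtensionType):
            # each chunk of an extension column exposes its storage array
            chunks = [chunk.storage for chunk in column.chunks]
            column = pa.chunked_array(chunks, type=field.type.storage_type)
            field = field.with_type(field.type.storage_type)
        fields.append(field)
        columns.append(column)
    return pa.Table.from_arrays(columns, schema=pa.schema(fields))
```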
The equivalent of preprocess for parquet files would be a nice addition, and for reasons of consistency (plus related work on experimental column-joining, which benefits from similar interfaces and from building the union form pre-emptively), I've inserted dict-like filespec handling à la the uproot5 interface. Improvements can certainly be made.
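To illustrate the dict-like filespec, something along these lines is the shape I'm aiming for, by analogy with uproot5; the option keys, file paths, and the `from_parquet` call shown here are assumptions about usage, not the final interface:

```python
from coffea.nanoevents import NanoEventsFactory

# Hypothetical filespec: file path -> per-file options (keys and paths here
# are illustrative and may not match what the PR finally accepts).
filespec = {
    "tests/samples/nano_dy.parquet": {},
    "some_other_sample.parquet": {},
}

# from_parquet already exists in coffea; accepting a dict like this is the
# new part, so treat this call as a usage sketch.
events = NanoEventsFactory.from_parquet(filespec).events()
```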
@lgray this will need quite some work still, but if you could approve/veto some of the choices made (or give thoughts on open questions above) that would be greatly appreciated!