Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Re-enable eager parquet reading #1273

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

NJManganelli
Copy link
Collaborator

WIP!

Eager parquet reading had a mismatch between the uproot-like shims and the parquet mappings, so allow_missing is inserted into the appropriate methods, with a copy of the uproot IndexedOptionArray creation added (I hope appropriately... let me know if not Q1). Running only the test_nanoevents.py locally the tests succeed, but I'm not sure where the dask/delayed version gets (would get) tested Q2 (that should still fail due to lack of form_mapping in dask-awkward).

The test for a column being available dives through the file, arrow_schema, etc. which may be more direct than desired Q3

I'm not expecting allow_missing is tested for parquet, nor if there are files already available for this... Q4 if needed, I could probably make some... both HLT and GenModel branches potentially

The old nano_dy.parquet file triggers an error related to awkward extension type:

src/coffea/nanoevents/factory.py:557: in from_parquet
    base_form = mapping._extract_base_form(table_file.schema_arrow)
src/coffea/nanoevents/mapping/parquet.py:155: in _extract_base_form
    form = arrow_schema_to_awkward_form(schema)
src/coffea/nanoevents/mapping/parquet.py:84: in arrow_schema_to_awkward_form
    dtype = schema.to_pandas_dtype()()
pyarrow/types.pxi:406: in pyarrow.lib.DataType.to_pandas_dtype
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   NotImplementedError: ak:uint32

c.f. scikit-hep/awkward#3393

However... newer parquet files made with hepconvert, even with extensionarray=True, pass. Might be able to make use of accessing the storage_type to bypass the error, but I didn't have a complete implementation before I swapped to using newer parquet files. Q5

The new parquet files use hepconvert's default compression, which is zstd. Okay? Better to use another? Q6

The equivalent of preprocess for parquet files would be a nice addition, and for reasons related to consistency (and related work on experimental column-joining, which benefits from similar interfaces + building unionform pre-emptively), I've inserted a dict-like filespec handling ala uproot5 interface. Improvements can certainly be made.

@lgray this will need quite some work still, but if you could approve/veto some of the choices made (or give thoughts on open questions above) that would be greatly appreciated!

@NJManganelli NJManganelli changed the title feat: Re-enable eaguer parquet reading feat: Re-enable eager parquet reading Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant