feat: Re-enable eager parquet reading #1273

NJManganelli · 2025-02-11T00:10:23Z

WIP!

Eager parquet reading had a mismatch between the uproot-like shims and the parquet mappings, so allow_missing is now passed through the appropriate methods, and a copy of the uproot IndexedOptionArray creation is added (I hope appropriately... let me know if not, Q1). Running only test_nanoevents.py locally, the tests succeed, but I'm not sure where the dask/delayed version gets (or would get) tested, Q2 (that should still fail due to the lack of form_mapping in dask-awkward).
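For readers less familiar with IndexedOptionArray, here is a plain-Python model of its indexing semantics: an index entry of -1 marks a missing value, and any other entry points into the content. It deliberately avoids awkward itself, and `project_option` and `all_missing_index` are hypothetical helpers for illustration, not part of awkward, uproot, or coffea.

```python
def project_option(index, content):
    """Model option-type indexing: a negative index entry reads as None."""
    return [None if i < 0 else content[i] for i in index]


def all_missing_index(length):
    """Index for a column that is absent from the file: every entry is -1."""
    return [-1] * length


# A branch that exists: non-negative entries point into the content buffer.
present = project_option([0, 1, -1, 2], [True, False, True])

# A branch that is absent: no content is needed, every event reads as None.
absent = project_option(all_missing_index(3), [])
```

The point of the all -1 index is that a missing branch still yields an array of the right length, so downstream code sees None per event rather than a missing field.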

The test for whether a column is available digs through the file, arrow_schema, etc., which may be more direct than desired. Q3
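A minimal sketch of how an availability check could combine with allow_missing, written against a plain list of column names so that it does not depend on pyarrow (with pyarrow, such a list would typically come from a schema's `names` attribute). `resolve_columns` and the branch names are hypothetical and only illustrate the idea; this is not the code in this PR.

```python
def resolve_columns(available, requested, allow_missing=False):
    """Split requested columns into (present, missing) given the file's column names."""
    available = set(available)
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    if missing and not allow_missing:
        # Without allow_missing, an absent column is an error.
        raise KeyError(f"columns not found in file: {missing}")
    return present, missing


# Illustrative column names only.
names = ["nMuon", "Muon_pt", "HLT_IsoMu24"]
present, missing = resolve_columns(names, ["Muon_pt", "HLT_Mu50"], allow_missing=True)
```

Keeping the check on names alone would let the caller decide how deep to reach into the file or schema objects.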

I don't expect that allow_missing is tested for parquet, nor that suitable files are already available... Q4: if needed, I could probably make some, potentially with both HLT and GenModel branches.

The old nano_dy.parquet file triggers an error related to awkward extension type:

src/coffea/nanoevents/factory.py:557: in from_parquet
    base_form = mapping._extract_base_form(table_file.schema_arrow)
src/coffea/nanoevents/mapping/parquet.py:155: in _extract_base_form
    form = arrow_schema_to_awkward_form(schema)
src/coffea/nanoevents/mapping/parquet.py:84: in arrow_schema_to_awkward_form
    dtype = schema.to_pandas_dtype()()
pyarrow/types.pxi:406: in pyarrow.lib.DataType.to_pandas_dtype
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   NotImplementedError: ak:uint32

cf. scikit-hep/awkward#3393

However, newer parquet files made with hepconvert, even with extensionarray=True, pass. It might be possible to bypass the error by accessing the storage_type, but I didn't have a complete implementation before I switched to using the newer parquet files. Q5
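One way the storage_type idea might look, as a sketch only: peel extension wrappers off a type before asking it for a dtype. The function is duck-typed on a `storage_type` attribute (which pyarrow extension types provide) so the example runs without pyarrow. `unwrap_storage_type`, `FakePlain`, and `FakeExtension` are names invented for the demonstration, and whether this is sufficient for the old nano_dy.parquet is untested.

```python
def unwrap_storage_type(arrow_type):
    """Follow .storage_type until reaching a type that is not an extension wrapper."""
    while hasattr(arrow_type, "storage_type"):
        arrow_type = arrow_type.storage_type
    return arrow_type


class FakePlain:
    """Stand-in for a plain Arrow type such as uint32."""

    def __init__(self, name):
        self.name = name

    def to_pandas_dtype(self):
        return self.name


class FakeExtension:
    """Stand-in for an extension type whose dtype conversion is not implemented."""

    def __init__(self, storage_type):
        self.storage_type = storage_type

    def to_pandas_dtype(self):
        raise NotImplementedError(f"ak:{self.storage_type.name}")


wrapped = FakeExtension(FakePlain("uint32"))
# Calling wrapped.to_pandas_dtype() directly would raise NotImplementedError.
dtype_name = unwrap_storage_type(wrapped).to_pandas_dtype()
```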

The new parquet files use hepconvert's default compression, which is zstd. Okay? Better to use another? Q6

An equivalent of preprocess for parquet files would be a nice addition. For consistency (and for related work on experimental column-joining, which benefits from similar interfaces and from building the unionform pre-emptively), I've added dict-like filespec handling à la the uproot5 interface. Improvements can certainly be made.
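To make the filespec idea concrete, here is a rough sketch of normalizing the accepted input shapes (a single path, a list of paths, or a dict keyed by path) into one dict form. The function name `normalize_filespec`, the per-file options dict, and the `steps` key in the example are assumptions for illustration, not the handling implemented in this PR.

```python
from collections.abc import Mapping


def normalize_filespec(files):
    """Return {path: options_dict} for a str, an iterable of str, or a mapping."""
    if isinstance(files, str):
        return {files: {}}
    if isinstance(files, Mapping):
        normalized = {}
        for path, opts in files.items():
            if opts is None:
                normalized[str(path)] = {}
            elif isinstance(opts, Mapping):
                normalized[str(path)] = dict(opts)
            else:
                raise TypeError(f"unsupported options for {path!r}: {type(opts).__name__}")
        return normalized
    # Any other iterable is treated as a sequence of paths.
    return {str(path): {} for path in files}


single = normalize_filespec("nano_dy.parquet")
several = normalize_filespec(["a.parquet", "b.parquet"])
with_opts = normalize_filespec({"a.parquet": {"steps": [[0, 10]]}, "b.parquet": None})
```

A single normalized shape means the eager and delayed code paths only need to handle one input type.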

@lgray this will need quite some work still, but if you could approve/veto some of the choices made (or give thoughts on open questions above) that would be greatly appreciated!

Nick Manganelli and others added 6 commits February 10, 2025 17:28

Match eager parquet's eager shims to be more like uproot's

91c1a99

Reenable parquet eager testing

7f9c383

Handle dict input for parquet, for more uniform filespec-like handling

b6e811a

Rename SemVer test nano parquet files

bf89d34

New converted parquet of nano_dy and nano_dimuon using hepconvert def…

0b1b313

…ault settings

[pre-commit.ci] auto fixes from pre-commit.com hooks

c0e24dc

for more information, see https://pre-commit.ci

NJManganelli changed the title ~~feat: Re-enable eaguer parquet reading~~ feat: Re-enable eager parquet reading Feb 11, 2025
