Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zarr-python v3 compatibility #516

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from
Draft

zarr-python v3 compatibility #516

wants to merge 16 commits into from

Conversation

mpiannucci
Copy link
Contributor

@mpiannucci mpiannucci commented Oct 10, 2024

So far the only tested file type is HDF5. Thats the only module that currently works with new zarr python in some way

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's much less change here than I might have thought.

kerchunk/tests/test_hdf.py Outdated Show resolved Hide resolved
kerchunk/tests/test_hdf.py Outdated Show resolved Hide resolved
kerchunk/tests/test_hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
@@ -496,6 +516,8 @@ def _translator(
if h5obj.fletcher32:
logging.info("Discarding fletcher32 checksum")
v["size"] -= 4
key = str.removeprefix(h5obj.name, "/") + "/" + ".".join(map(str, k))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as what _chunk_key did? Maybe make it a function with a comment saying it's a copy/reimplementation.

By the way, is h5obj.name not actually a string, so you could have done h5obj.name.removeprefix()?

shape=h5obj.shape,
dtype=dt or h5obj.dtype,
chunks=h5obj.chunks or False,
fill_value=fill,
compression=None,
compressor=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here, you could reintroduce the compressor

filters = filters[:-1]
compressor = filters[-1]

but obviously it depends on whether there are indeed any filters at all.

It would still need back compat, since filters-only datasts definitely exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the big issue is that v3 cares about what type of operation it is, and v2w doesnt so moving them around doesnt necessarily fix that bug

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there needs to be a change upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kerchunk/hdf.py Outdated
Comment on lines 167 to 173
for k, v in self.store_dict.items():
if isinstance(v, zarr.core.buffer.cpu.Buffer):
key = str.removeprefix(k, "/")
new_keys[key] = v.to_bytes()
keys_to_remove.append(k)
for k in keys_to_remove:
del self.store_dict[k]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the hacky bit and could use some explanations. Even when requesting "v2", zarr makes Buffer objects, and the keys are also wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so two issues here:

  1. the keys we get from hdf are for example /depth/.zarray when then need to be depth/.zarray
  2. we cant jsonify buffers, which is how the internal MemoryStore in v3 stores its data. So we need to convert the buffers to bytes to be serialized

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - would appreciate comments on the code saying this.

kerchunk/hdf.py Outdated Show resolved Hide resolved
@mpiannucci
Copy link
Contributor Author

mpiannucci commented Oct 15, 2024

Also worth noting... zarr 3 doesn't support numcodecs codecs out of the box. There is a pr to help this zarr-developers/numcodecs#524 (see also here for updated version zarr-developers/numcodecs#597) but it would mean a change to codec names which causes an incompatibility. For the initial icechunk examples we handle this in virtualizarr but long term it probably belongs here to work standalone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants