Create an additional interface for defining chunks for asynchronous IO #182

Open
meggart opened this issue Aug 1, 2024 · 3 comments
Comments

@meggart
Collaborator

meggart commented Aug 1, 2024

I recently started thinking about implementing v3 of the Zarr spec, which includes sharding as an extension: several chunks are stored in a single storage unit. One problem that arises immediately is that, when shards are present, chunks in the same shard cannot be written from different threads/tasks without risking data corruption, so ideally downstream applications should be able to query the sharding structure of an AbstractDiskArray. Zarr files are not the only case in which chunks do not align with storage units. For example, it is quite common to concatenate a list of NetCDF/TIFF files using ConcatDiskArray. When writing to such an array, parallel writes are allowed as long as they end up in different files, but these storage units differ from the actual chunks of the array, which instead represent compression units.

So my suggestion would be to add something like a hasioblocks and eachioblock function to DiskArrays, which DiskArray implementers can optionally extend. The function would return an iterator over blocks of chunk indices, where all chunks in a block belong to the same storage unit and chunks in different blocks can be safely mutated in parallel.
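A minimal sketch of what such an iterator could yield, written in Python purely as a language-agnostic illustration (this is not DiskArrays.jl API). The names `each_io_block` and `chunks_per_shard` are hypothetical, and the sketch assumes a regular chunk grid in which every shard holds a fixed rectangular group of chunks:

```python
from collections import defaultdict
from itertools import product


def each_io_block(chunk_grid, chunks_per_shard):
    """Group chunk indices by the storage unit (shard) that holds them.

    chunk_grid: number of chunks along each dimension, e.g. (4, 6)
    chunks_per_shard: chunks per shard along each dimension, e.g. (2, 3)
    Returns a list of lists of chunk indices, one inner list per shard.
    """
    blocks = defaultdict(list)
    for chunk in product(*(range(n) for n in chunk_grid)):
        # Integer division maps a chunk index to the shard containing it.
        shard = tuple(c // s for c, s in zip(chunk, chunks_per_shard))
        blocks[shard].append(chunk)
    return [blocks[key] for key in sorted(blocks)]


# A 4x6 chunk grid with 2x3 chunks per shard gives 4 blocks of 6 chunks each.
blocks = each_io_block((4, 6), (2, 3))
```

A writer could then hand each inner list to its own task and process the chunks inside one list sequentially, so that no two tasks ever touch the same storage unit.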

This would be orthogonal to the haschunks and eachchunk pair. For example, a single NetCDF or HDF5 file may have chunks internally but consists of only a single storage unit, so eachioblock would return an iterator of length 1. For traditional Zarr arrays, where every chunk is a separate file, eachioblock would have the same length as eachchunk. Sharded arrays would fall somewhere in between.
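The three cases can be compared by counting storage units for one chunk grid. This is an illustrative Python sketch, not DiskArrays.jl code; `n_io_blocks` and the `chunks_per_unit` parameter are hypothetical names, and the layouts are assumed to be regular grids:

```python
from math import ceil, prod


def n_io_blocks(chunk_grid, chunks_per_unit):
    """Number of storage units when each unit holds a regular group of chunks."""
    return prod(ceil(n / u) for n, u in zip(chunk_grid, chunks_per_unit))


chunk_grid = (4, 6)          # 24 chunks in total
n_chunks = prod(chunk_grid)

# Single NetCDF/HDF5 file: every chunk lives in the same storage unit.
single_file = n_io_blocks(chunk_grid, chunk_grid)

# Traditional Zarr: one file per chunk, so one storage unit per chunk.
file_per_chunk = n_io_blocks(chunk_grid, (1, 1))

# Sharded array: 2x3 chunks per shard, somewhere in between.
sharded = n_io_blocks(chunk_grid, (2, 3))
```

Here `single_file` is 1, `file_per_chunk` equals `n_chunks`, and `sharded` lies between the two, which is the amount of write parallelism each layout would allow.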

I would be very happy to hear comments, naming suggestions, or alternative ideas. @mkitti, you had some ideas on the sharding topic as well, and I would very much appreciate your opinion.

@mkitti
Member

mkitti commented Aug 1, 2024

I've been playing with https://google.github.io/tensorstore/ recently which has a completely asynchronous API. Perhaps it would be good to take some inspiration from there.

@meggart
Collaborator Author

meggart commented Aug 2, 2024

Thanks for the pointer to TensorStore. I think there is quite some overlap with the concepts in DiskArrays.jl, in particular when it comes to unifying indexing. I also think that the new trait I am proposing could support the creation of a similar async API in Julia.

@asinghvi17
Member

If people are calling this sharding, should we also call this hasshards and eachshard?

Otherwise this sounds great!
