Create an additional interface for defining chunks for asynchronous IO #182

meggart · 2024-08-01T10:05:32Z

I recently started thinking about implementing v3 of the Zarr spec, which also includes sharding as an extension, in which several chunks are stored in a single storage unit. One problem that arises immediately is that when shards are present, chunks in the same shard must not be written to from different threads/tasks, otherwise data can be corrupted. So ideally downstream applications should be able to query the sharding structure of an AbstractDiskArray. Handling Zarr files is not the only situation in which chunks do not align with storage units. For example, it is quite common to concatenate a list of NetCDF/Tiff files using ConcatDiskArray. When writing to such an array, parallel writes are allowed as long as they end up in different files, but these storage units are different from the actual chunks of the array, which rather represent compression units.

So my suggestion would be to add something like a hasioblocks and eachioblock function to DiskArrays, which DiskArray implementers can optionally extend. The point of eachioblock would be that it returns an iterator of blocks of chunk indices, where all chunks in a block belong to the same storage unit and where members of different storage blocks can be safely mutated in parallel.

This would be orthogonal to the haschunks and eachchunk pair. For example, a single NetCDF or HDF5 file may have chunks internally but consists of only a single storage unit, so eachioblock would return a length-1 iterator. For traditional Zarr arrays, where every chunk is a separate file, eachioblock would have the same length as eachchunk. And of course there would be mixed situations for sharded arrays, where each block contains several chunks.
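To make the three cases concrete, here is a rough, language-agnostic sketch in Python. It is not DiskArrays code: the names each_chunk, each_io_block and chunks_per_block are hypothetical, and it assumes that storage units are regular rectangular groups of chunks, which would not hold for every backend (for example, concatenated files of different sizes).

```python
from collections import defaultdict
from itertools import product
from math import ceil


def each_chunk(array_shape, chunk_shape):
    """Return the grid index of every chunk, e.g. (0, 0), (0, 1), ..."""
    counts = [ceil(n / c) for n, c in zip(array_shape, chunk_shape)]
    return list(product(*(range(k) for k in counts)))


def each_io_block(array_shape, chunk_shape, chunks_per_block):
    """Group chunk grid indices by the storage unit that holds them.

    chunks_per_block is a hypothetical parameter: the number of chunks
    per dimension that share one storage unit (file, shard, ...).
    """
    blocks = defaultdict(list)
    for idx in each_chunk(array_shape, chunk_shape):
        # Integer division maps a chunk index to its storage-unit index.
        block_id = tuple(i // s for i, s in zip(idx, chunks_per_block))
        blocks[block_id].append(idx)
    return [blocks[key] for key in sorted(blocks)]


# An 8x8 array with 2x2 chunks has a 4x4 grid of 16 chunks.
single_file = each_io_block((8, 8), (2, 2), (4, 4))  # one NetCDF/HDF5 file
per_chunk = each_io_block((8, 8), (2, 2), (1, 1))    # one file per chunk
sharded = each_io_block((8, 8), (2, 2), (2, 2))      # 2x2 chunks per shard
```

In this sketch a writer could hand each returned block to a separate task and process the chunks inside a block sequentially, which is the guarantee the proposed trait is meant to expose.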

I would be very happy if people have comments or naming suggestions or alternative ideas here. @mkitti had some ideas on the sharding topic as well, I would very much appreciate your opinion.

mkitti · 2024-08-01T15:47:27Z

I've been playing with https://google.github.io/tensorstore/ recently which has a completely asynchronous API. Perhaps it would be good to take some inspiration from there.

meggart · 2024-08-02T09:29:14Z

Thanks for the pointer to Tensorstore. I think there is quite some overlap with the concepts in DiskArrays.jl, in particular when it comes to unifying indexing. I also think that the new trait I am proposing could support the creation of a similar async API in Julia.

asinghvi17 · 2024-11-09T21:40:41Z

If people are calling this sharding, should we also call this hasshards and eachshard?

Otherwise this sounds great!
