I recently started thinking about implementing v3 of the Zarr specs, which also includes sharding as an extension, where several chunks are stored in a single storage unit. One problem that arises immediately is that when shards are present, chunks in the same shard cannot be written to from different threads/tasks without risking data corruption, so ideally downstream applications should be able to query the sharding structure of an `AbstractDiskArray`. Handling Zarr files is not the only situation in which chunks do not align with storage units. For example, it is quite common to concatenate a list of NetCDF/TIFF files using a `ConcatDiskArray`. When writing to such an array, parallel writes are allowed as long as they end up in different files, but these storage units differ from the actual chunks of the array, which rather represent compression units.
So my suggestion would be to add something like a `hasioblocks`/`eachioblock` pair of functions to DiskArrays, which DiskArray implementers can optionally extend. The point of `eachioblock` would be to return an iterator of blocks of chunk indices that belong to the same storage unit, where members of different storage blocks can be safely mutated in parallel.
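To make this concrete, here is a minimal sketch of what the fallback definitions could look like. All names (`hasioblocks`, `eachioblock`, the trait types) are placeholders for discussion, not a worked-out design:

```julia
using DiskArrays: AbstractDiskArray, eachchunk

# Placeholder trait types, analogous to the existing Chunked/Unchunked
struct HasIOBlocks end
struct NoIOBlocks end

# Conservative default: assume the whole array is a single storage unit
hasioblocks(a::AbstractDiskArray) = NoIOBlocks()

# Fallback: a length-1 iterator whose only element contains every
# chunk index, i.e. no two chunks may be written in parallel
eachioblock(a::AbstractDiskArray) = (collect(eachindex(eachchunk(a))),)
```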
This would be orthogonal to the `haschunks`/`eachchunk` pair. For example, a single NetCDF or HDF5 file may have chunks internally but consist of only a single storage unit, so `eachioblock` would return a length-1 iterator. For traditional Zarr arrays, where every chunk is a separate file, `eachioblock` would have the same length as `eachchunk`, and of course there would be mixed situations for sharded arrays.
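As an illustration of the two extremes described above, building on the sketch from before (again, purely hypothetical names, and assuming `eachchunk` elements can be splatted into an indexing expression): a Zarr-like backend could return one IO block per chunk, and a downstream consumer could then parallelize across blocks while keeping writes within a block serial:

```julia
# Hypothetical extension for a Zarr-like array where every chunk is
# its own file: one IO block per chunk index
# eachioblock(a::SomeZarrArray) = ((ci,) for ci in eachindex(eachchunk(a)))

# Hypothetical consumer: spawn one task per IO block and write the
# chunks inside a block serially, so each storage unit is only ever
# touched by a single task
function fill_chunks!(a, value)
    chunks = eachchunk(a)
    @sync for block in eachioblock(a)
        Threads.@spawn for ci in block
            inds = chunks[ci]                        # tuple of index ranges
            a[inds...] = fill(value, length.(inds))  # write one whole chunk
        end
    end
end
```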
I would be very happy about comments, naming suggestions, or alternative ideas here. @mkitti had some ideas on the sharding topic as well; I would very much appreciate your opinion.
I've been playing with https://google.github.io/tensorstore/ recently, which has a completely asynchronous API. Perhaps it would be good to take some inspiration from there.
Thanks for the pointer to Tensorstore. I think there is quite a bit of overlap with the concepts in DiskArrays.jl, in particular when it comes to unifying indexing. I also think that the new trait I am proposing could support the creation of a similar async API in Julia.