About this package #1

Open
meggart opened this issue Dec 19, 2019 · 4 comments

meggart commented Dec 19, 2019

Hi @rafaqz,

As discussed on Discourse, you showed some interest in defining something like an AbstractDiskArray. I am not sure if you have started something like this already, but since I am trying to adapt the indexing behavior of NetCDF.jl and Zarr.jl and want to avoid duplicate work, I started this package, which might be merged with your efforts. The basic idea is that all a user has to provide is a function that reads a Cartesian subrange from an array that is on disk, and this package takes care of the remaining index details, such as translating trailing and missing singleton dimensions.
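To make the division of labour concrete, here is a minimal sketch of the idea, written in Python purely for illustration and not taken from this package. The names `DiskArraySketch`, `read_block`, and `make_reader` are hypothetical, indices are 0-based, and the backend is an in-memory list standing in for a file. The backend supplies only `read_block`, which reads a contiguous Cartesian block; the wrapper handles extra trailing indices and omitted trailing singleton dimensions.

```python
from itertools import product


class DiskArraySketch:
    """Hypothetical wrapper; `read_block` is the only backend-specific piece."""

    def __init__(self, shape, read_block):
        self.shape = tuple(shape)
        # read_block(ranges) -> flat list of values in row-major order
        self.read_block = read_block

    def __getitem__(self, index):
        if not isinstance(index, tuple):
            index = (index,)
        ndim = len(self.shape)
        # Extra trailing indices are accepted only if they select element 0.
        if any(i != 0 for i in index[ndim:]):
            raise IndexError("extra trailing indices must be 0")
        index = index[:ndim]
        # Omitted trailing dimensions are accepted only if they are singleton.
        missing = self.shape[len(index):]
        if any(n != 1 for n in missing):
            raise IndexError("only trailing singleton dimensions may be omitted")
        index = index + (0,) * len(missing)
        # Translate every index into a contiguous range for the backend.
        ranges = []
        for i, n in zip(index, self.shape):
            if isinstance(i, slice):
                start, stop, step = i.indices(n)
                if step != 1:
                    raise IndexError("this sketch supports contiguous slices only")
                ranges.append(range(start, stop))
            else:
                if not 0 <= i < n:
                    raise IndexError("index out of bounds")
                ranges.append(range(i, i + 1))
        return self.read_block(tuple(ranges))


def make_reader(flat, shape):
    """Simulated backend: a row-major flat list plays the role of the file."""
    strides, acc = [], 1
    for n in reversed(shape):
        strides.append(acc)
        acc *= n
    strides.reverse()

    def read_block(ranges):
        return [flat[sum(i * s for i, s in zip(idx, strides))]
                for idx in product(*ranges)]

    return read_block


shape = (2, 3, 1)
a = DiskArraySketch(shape, make_reader(list(range(6)), shape))
row_part = a[1, 0:2]        # third (singleton) dimension omitted
single = a[0, 2, 0, 0]      # one extra trailing index
```

For simplicity the sketch always returns a flat list and never drops dimensions; a real implementation would also have to decide on the shape of the result.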

Currently this supports getindex only, but it should give you an idea of what I have in mind for such a package. Your opinion would be highly appreciated.

Next, I would remove the getindex implementations from Zarr and NetCDF and use this interface instead, so that any indexing improvement benefits both packages.

rafaqz commented Dec 19, 2019

Thanks for starting this. I haven't done anything yet. I have a lot of packages midway to release, so my capacity is pretty low for a while, but I would like to contribute to this, especially if it can remove code from GeoData.jl.

Moving code from Zarr here sounds like a good strategy. Also, I've just fixed the HDF5 indexing interface and written one for ArchGDAL, which could help integrate those too at some stage. We should aim high for this :)

In terms of extra features, I was thinking it would be great to handle broadcast (?), reducing methods, and show in sane ways, so you can run code written for Array on an AbstractDiskArray and it works without crashing or stalling. So summing a disk-based array should sum chunk by chunk, so that larger-than-RAM files just work without ever having to think about it.
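As a rough illustration of the chunk-by-chunk sum, here is a minimal sketch in Python (illustrative only, not code from any of the packages mentioned). The names `chunk_ranges`, `chunked_sum`, and `read_block` are hypothetical, and the "disk" is simulated by an in-memory list. Only one chunk's worth of values is held at a time, which is what would let larger-than-RAM data work.

```python
from itertools import product


def chunk_ranges(length, chunk_size):
    """Split one axis of the given length into consecutive index ranges."""
    for start in range(0, length, chunk_size):
        yield range(start, min(start + chunk_size, length))


def chunked_sum(read_block, shape, chunks):
    """Sum an array by reading one chunk at a time through `read_block`."""
    per_axis = [list(chunk_ranges(n, c)) for n, c in zip(shape, chunks)]
    total = 0
    for block in product(*per_axis):
        # Only this block's values are in memory at this point.
        total += sum(read_block(block))
    return total


# Simulated 4x6 on-disk array stored row-major in a flat list.
shape = (4, 6)
flat = list(range(24))
reads = []  # number of values fetched by each read, for inspection


def read_block(ranges):
    rows, cols = ranges
    values = [flat[r * shape[1] + c] for r in rows for c in cols]
    reads.append(len(values))
    return values


total = chunked_sum(read_block, shape, chunks=(2, 4))
```

Here the array is read in four blocks, and `max(reads)` gives the largest number of values that was ever loaded at once. The chunk sizes are passed in by hand; where they should come from is exactly the chunking question this thread goes on to raise.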

view is also interesting... I've had to write a bunch of windowing code in GeoData.jl to deal with lazy loading propagating views from stacks to arrays, and SubArrays not working with non-Arrays. So it would be good if DiskArray <: AbstractArray and we have methods to cover the cases where Array methods break with disk-based arrays.

meggart commented Dec 19, 2019

> Moving code from Zarr here sounds like a good strategy. Also, I've just fixed the HDF5 indexing interface and written one for ArchGDAL, which could help integrate those too at some stage. We should aim high for this :)

Yes, I was hoping this could help more packages, and as soon as we demonstrate that it has some value, we can try to get more packages backed by it.

> In terms of extra features, I was thinking it would be great to handle broadcast (?), reducing methods, and show in sane ways, so you can run code written for Array on an AbstractDiskArray and it works without crashing or stalling. So summing a disk-based array should sum chunk by chunk, so that larger-than-RAM files just work without ever having to think about it.

Yes, this would definitely be on the roadmap. To tackle the low-hanging fruit first, I would start with show and reduce. Broadcast can already get a bit tricky, because multiple arrays are involved and you have to start thinking about chunk alignment and similar issues. Since I have implemented a lot of this functionality already in ESDL (though in a much less principled way), I would not make it a priority, but would definitely support anyone else trying to tackle it.

I would also be happy to move any other pieces from GeoData that you think might be useful to a broader class of disk-based arrays. So far I was a bit hesitant to subtype AbstractArray, but maybe it would make sense to be ambitious and simply show that we ultimately want to provide a complete array experience, so I don't mind doing this.

meggart commented Dec 19, 2019

I forgot to mention: in order to implement reductions, we would need some concept of chunking. Do you think we should make that part of this package (i.e. move code from ChunkedArrayBase here), or should we depend on ChunkedArrayBase instead?

rafaqz commented Dec 20, 2019

I agree broadcast will be the hardest part; it's a good idea to leave it until last.

I was imagining chunking would be integral to a lot of this too, but I'm not sure how your packages work. Just depending on ChunkedArrayBase could be fine? Maybe a lot of these methods would even live in ChunkedArrayBase? I'm not sure what the best plan is there.
