Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing in other packages #2

Open
4 of 5 tasks
meggart opened this issue Jan 14, 2020 · 19 comments
Open
4 of 5 tasks

Implementing in other packages #2

meggart opened this issue Jan 14, 2020 · 19 comments

Comments

@meggart
Copy link
Collaborator

meggart commented Jan 14, 2020

I think the general indexing functionality for this package is there now, so that one could use this package as a dependency for NetCDF, HDF5 or Zarr without losing functionality. Here we can colltect a list to the packages that might profit from DiskArrays and link to the respective PRs:

@visr @rafaqz @Alexander-Barth any suggestions and feedback would be highly appreciated

@visr
Copy link
Contributor

visr commented Jan 14, 2020

Nice work! Do you think it would also make sense to apply this for GDAL raster grids in https://github.com/yeesian/ArchGDAL.jl, or at least https://github.com/evetion/GeoArrays.jl? @yeesian @evetion.

@meggart
Copy link
Collaborator Author

meggart commented Jan 14, 2020

Thanks for the feedback. Yes, both of them should definitely be on the radar. I just looked at the ArchGDAL raster interface and implementation there should be straightforward.

@visr
Copy link
Contributor

visr commented Jan 14, 2020

Great. I guess GeoArrays.jl can be removed from the list. Right know it only supports reading the entire grid. And if it would add support for subsets, since it uses ArchGDAL for IO, it could profit from ArchGDAL's use of DiskArrays.

@rafaqz
Copy link
Collaborator

rafaqz commented Jan 14, 2020

This looks good. Seems pretty comprehensive for indexing it will be great to have everything standardised.

My vote is also to implement it in ArchGDAL, then GeoData.jl can use it too. I recently added some basic array indexing to ArchGDAL, we could build off that.

@meggart
Copy link
Collaborator Author

meggart commented Jan 15, 2020

Ok, I created an initial PR for ArchGDAL. Maybe it would be a good idea to start registering ChunkedArrayBase and DiskArrays so that testing gets easier for dependent packages?

@visr
Copy link
Contributor

visr commented Jan 15, 2020

Great! Yeah I think registering may be good. Since both ChunkedArrayBase and DiskArrays are pure julia, no-deps, small projects, why do you prefer to split it into two packages? May be a good choice, but just wondering.

@meggart
Copy link
Collaborator Author

meggart commented Jan 15, 2020

I don't know what the best strategy would be here. I thought there might be packages that want to use the concept of chunking (maybe some DistributedArray-like package) that don't have the slow-access characteristics to be tackled here, but from a maintenance point of view I think you are right and it is better to merge these packages. I will copy the code from ChunkedArrayBase here then.

Are there any suggestions where DiskArrays should live? Is JuliaGeo the right place for it or should it stay in my namespace for now.

@visr
Copy link
Contributor

visr commented Jan 15, 2020

Ok yeah then I agree on merging them.

Since it is potentially quite a foundational package with many reverse dependencies, placing it in an organization is probably a good idea. Since it is created for and often used with geospatial rasters, JuliaGeo could be a good place for it to live. Another good option would be in https://github.com/JuliaArrays, living along the other special Array types with package names *Arrays.jl.

@rafaqz
Copy link
Collaborator

rafaqz commented Jan 15, 2020

I was also wondering if it should be in JuliaArrays... HDF5 is a lot more general than geospatial data, and there are probably other non geo applications.

@meggart
Copy link
Collaborator Author

meggart commented Jan 28, 2020

I think since testing the PRs depends so much on this package being registered, I decided to register the package as it is for now. We can still move to an org later.

@oschulz
Copy link

oschulz commented Feb 3, 2020

Support for HDF5.jl would be great!

@meggart
Copy link
Collaborator Author

meggart commented Apr 2, 2020

@oschulz I just opened JuliaIO/HDF5.jl#615 to discuss this

@oschulz
Copy link

oschulz commented Apr 2, 2020

Oh, thanks - almost forgot about this! I should also check if I can use this in UpROOT.jl.

@evetion
Copy link

evetion commented Jan 2, 2021

I've implemented this in https://github.com/evetion/GeoArrays.jl, based on the ArchGDAL.jl implementation. Only thing that it breaks is my support for missing values, for which I normally read the arrays into memory and convert them to Array{Union{Missing, T}}.

But Missing support is hard, the GDAL ecosystem supports a nodata value as well as complete masks.

@meggart
Copy link
Collaborator Author

meggart commented Jan 4, 2021

Great to hear about the GeoArrays implementation. Is there anything specific that could be done about the missing values on the DiskArrays side. I am struggling with this now and then as well and have some attempts at solving this e.g. here https://github.com/meggart/DiskArrayTools.jl/blob/6728512ed9c33e1fa261bf38b2de3b97589ce310/src/DiskArrayTools.jl#L155-L192
where the conversion from T -> Union{T,Missing} happens inside the readblock! function for CFDiskArrays, creating a copy of the data as well.

@Alexander-Barth
Copy link
Contributor

Last time I tried, DiskArrays in NCDatasets I had some issues with types of undefined size like String and probable also NetCDF4 variable-length arrays:

julia> sizeof(String)
ERROR: Type String does not have a definite size.
Stacktrace:
 [1] sizeof(::Type{T} where T) at ./essentials.jl:449
 [2] top-level scope at REPL[3]:1

See here: Alexander-Barth/NCDatasets.jl#79 (comment)

Can DiskArrays now be used with an arrays of Strings?

@meggart
Copy link
Collaborator Author

meggart commented Feb 23, 2021

@Alexander-Barth have you seen and tried this: #24 sorry for not notifying you earlier, let me know if this works for you.

@Alexander-Barth
Copy link
Contributor

Thanks a lot @meggart for let me know. I hope that I can find time soon to test this, but right now it is a bit difficult for me.

I had just a look to your patch:

"""
A fallback element size for arrays to determine a where elements have unknown
size like strings. Defaults to 100MB
"""
const fallback_element_size = Ref(100)

Should this be 100 bytes in doc strings ?

@Alexander-Barth
Copy link
Contributor

@visr thank you for looking back into this issue in your test here. I am wondering if there is still the plan to have variables with unlimited dimensions (I think this is in the branch resizable). For NCDatasets with the DiskArrays branch resizable I am hitting this issue:

Alexander-Barth/NCDatasets.jl#79 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants