fsspec source #467
Comments
Thank you for the suggestion. Just a remark: if you want to convert GRIB to xarray, you first need to scan the whole file or files (all the messages) for metadata. This is very different from the NetCDF and Zarr use case, where this information is available "instantly".
I didn’t know that. Is it related to GRIB’s format? In that case, would the index files help? If so, would it be possible to check whether a pre-generated index file is available as well (e.g. a file-like object passed alongside, or a relative fsspec URI) and use it to skip the initial full-file scan? In any case, it would be important that, once the metadata has been extracted, the actual data is not kept in memory until the user requests it (so slices would still be loaded lazily).
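To illustrate the index-file idea, here is a minimal sketch, not an earthkit or cfgrib API. It assumes a hypothetical index format (one JSON object per line with `offset` and `length` keys) and hypothetical helpers `load_index` and `read_message`. Real GRIB index files use their own formats. The file only needs to be a seekable file-like object, which is what fsspec typically hands back when a remote file is opened in binary mode.

```python
import io
import json


def load_index(index_file):
    """Parse a hypothetical JSON-lines index: one {"offset", "length"} object per line."""
    return [json.loads(line) for line in index_file if line.strip()]


def read_message(f, index, i):
    """Read only message `i` from a seekable file-like object, using the index."""
    entry = index[i]
    f.seek(entry["offset"])
    data = f.read(entry["length"])
    if len(data) != entry["length"]:
        raise EOFError("file is shorter than the index claims")
    return data


# Stand-in for a remote file: three opaque "messages" stored back to back.
payload = io.BytesIO(b"AAAA" + b"BBBBBB" + b"CC")
index_text = io.StringIO(
    '{"offset": 0, "length": 4}\n'
    '{"offset": 4, "length": 6}\n'
    '{"offset": 10, "length": 2}\n'
)

index = load_index(index_text)
second = read_message(payload, index, 1)  # touches only bytes 4..9
```

With such an index, no initial scan is needed, and each access costs one seek plus one bounded read.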
I like the idea of fsspec support. I see fsspec as the de facto standard for data access in the Python ecosystem, particularly because widely adopted tools like xarray support it natively. I would therefore argue that, instead of earthkit developing access to the fsspec drivers, fsspec drivers for FDB, MARS, and CDS should themselves be developed. Such stand-alone drivers would be truly general and could be used across a variety of tools. I just started developing ecmwfspec for this reason: to provide a general fsspec driver for interacting with ECFS (ECMWF File Storage). I could imagine PRs that would extend this to e.g. FDB.
Is your feature request related to a problem? Please describe.
I haven't yet found a good way to open large (exceeds RAM) remote (not on my local file system) GRIB files in xarray.
Describe the solution you'd like
A new source would be added, e.g.
that would be similar to the "file" source in making use of random access but use Python's file-like interface (so perhaps "file-like" would be another name) and thus add support for fsspec's numerous backends to earthkit for free.
This new source should also support loading large GRIB datasets without reading the entire file. Ideally, loading the GRIB file into xarray would read as little data as possible and defer all data reads until the user explicitly asks for the data (similar to how NetCDF and Zarr support lazy loading).
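As a rough sketch of what "as little data as possible" could mean with random access, the snippet below walks a seekable file-like object and reads only the 16-byte indicator section of each message, then seeks past the rest. It assumes GRIB edition 2 messages stored back to back with no padding. The names `scan_grib2_offsets` and `make_fake_message` are hypothetical, and the fake messages carry opaque payloads rather than real GRIB sections. A full metadata scan would also need to read the other header sections of each message.

```python
import io
import struct


def make_fake_message(payload):
    """Build a synthetic GRIB2-shaped message (for demonstration only)."""
    total = 16 + len(payload) + 4  # indicator section + payload + "7777"
    # Octets 1-4 "GRIB", 5-6 reserved, 7 discipline, 8 edition, 9-16 total length.
    header = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total)
    return header + payload + b"7777"


def scan_grib2_offsets(f):
    """Return (offset, length) per message, reading 16 bytes per message."""
    index = []
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(16)
        if len(header) < 16:
            break  # end of file
        if header[:4] != b"GRIB":
            raise ValueError(f"no GRIB marker at offset {offset}")
        if header[7] != 2:
            raise ValueError("this sketch only handles GRIB edition 2")
        (length,) = struct.unpack(">Q", header[8:16])
        index.append((offset, length))
        offset += length  # skip the data without reading it
    return index


buf = io.BytesIO(make_fake_message(b"a" * 10) + make_fake_message(b"b" * 100))
offsets = scan_grib2_offsets(buf)
```

On a remote backend, each iteration becomes one small ranged read instead of a full download, which is the property a stream-only reader cannot offer.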
Describe alternatives you've considered
(inspired by ecmwf/cfgrib#326 (comment)) provides the closest current solution but treats the file pessimistically as only a stream and not as a random-access file, which results in excessive reads.
Additional context
I am working in an extremely memory-constrained environment and would like to support opening remote GRIB files (in addition to NetCDF and Zarr datasets which already work).
Organisation
University of Helsinki, ESiWACE3 project