
Added a notebook for faster GRIB aggregations. #64

Open
wants to merge 3 commits into main

Conversation

Anu-Ra-g
Contributor

@Anu-Ra-g Anu-Ra-g commented Aug 26, 2024

This notebook was developed as part of Google Summer of Code 2024. It describes how we can build large aggregations of the GRIB files hosted in the NODD program in a small amount of time. The functions and operations used in this notebook will be part of the new version of kerchunk. This notebook is still in a draft phase because it depends on PR #63 being merged first: the old pre-commit configuration is failing on the commits, and some other updates are needed.


github-actions bot commented Aug 26, 2024

👋 Thanks for opening this PR! The Cookbook will be automatically built with GitHub Actions. To see the status of your deployment, click below.
🔍 Git commit SHA: adad038
✅ Deployment Preview URL: In Progress

@norlandrhagen
Collaborator

Happy to merge and/or review this whenever it's ready!

@Anu-Ra-g
Contributor Author

@norlandrhagen The notebook is ready, but it needs review. Could you please review it and suggest any changes?

@Anu-Ra-g Anu-Ra-g marked this pull request as ready for review August 29, 2024 15:14
@norlandrhagen
Collaborator

Looks great @Anu-Ra-g!

A few small comments and copy edits. Thanks for contributing! It is probably worth asking @martindurant for a review of the implementation, since he is the GRIB + Kerchunk guru.

Overview:

In this tutorial we are going to demonstrate building kerchunk aggregations of NODD grib2 weather forecasts fast.
->
In this tutorial we are going to demonstrate building performant/fast/quick (word choice) kerchunk aggregations of NODD grib2 weather forecasts.


This workflow primarily involves xarray-datatree, pandas and grib_tree function released in kerchunkv0.2.3 for the operation.
->
This workflow primarily involves xarray-datatree, pandas and the new grib_tree function released in kerchunk v0.2.3.


For this operation we will be looking at GRIB2 files generated by NOAA Global Ensemble Forecast System (GEFS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days, with an update frequency of 4 times a day, every 6 hours starting at midnight.
->
In this notebook/demo (word choice) we will be looking at GRIB2 files generated by the NOAA Global Ensemble Forecast System (GEFS). This dataset is a weather forecast model made up of 21 separate forecasts, or ensemble members. GEFS has global coverage and is produced four times a day, every 6 hours starting at midnight, with forecasts going out to 16 days.


For building the aggregation, we're going to build a hierarchical data model, to view the whole dataset ,from a set of scanned grib messages with the help of grib_tree function. This data model can be opened directly using either zarr or xarray datatree. This way of building the aggregation is very slow. Here we're going to use xarray-datatree to open and view it.
->
We are using the newly implemented Kerchunk grib_tree function to build a hierarchical data model from a set of scanned GRIB messages. This data model can be opened directly using either zarr or xarray-datatree. Building the aggregation this way is very slow. Here we're going to use xarray-datatree to open and view it:


Every NODD cloud platform stores the grib file along with its .idx(index) file, in text format. The purpose of using the idx files in the aggregation is that the k(erchunk) index data looks a lot like the idx files that are present for every grib file in NODD's GCS and AWS archive though.

This way of building of aggregation only works for a particular horizon file irrespective of the run time of the model.
->
Accompanying each NODD GRIB file is its .idx (index) file, in text format. Kerchunk can use this as a shortcut to build references without scanning the entire GRIB message.

Note: This way of building the aggregation only works for a particular horizon file, irrespective of the run time of the model.
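To make the .idx shortcut concrete, here is a minimal, self-contained sketch of parsing a wgrib2-style .idx inventory into a DataFrame with byte ranges. The sample lines and column names are illustrative, not the notebook's exact schema:

```python
import io

import pandas as pd

# A few illustrative lines in the NODD .idx (wgrib2 inventory) format:
# message_number:byte_offset:d=<run datetime>:variable:level:forecast:...
sample_idx = """\
1:0:d=2017010100:HGT:200 mb:6 hour fcst:
2:48253:d=2017010100:TMP:500 mb:6 hour fcst:
3:95421:d=2017010100:UGRD:850 mb:6 hour fcst:
"""

def parse_idx(text: str) -> pd.DataFrame:
    df = pd.read_csv(
        io.StringIO(text), sep=":", header=None,
        names=["msg", "offset", "date", "varname", "level", "forecast", "extra"],
    )
    # The byte length of each message is the gap to the next offset; the
    # last message runs to the end of the file (unknown here, so NaN).
    df["length"] = df["offset"].shift(-1) - df["offset"]
    return df

idx_df = parse_idx(sample_idx)
print(idx_df[["msg", "offset", "length", "varname", "level"]])
```

Each (offset, length) pair is exactly the byte range a reference index needs, which is why the .idx file can stand in for scanning the GRIB message itself.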


Now we're going to need a mapping from our grib/zarr metadata(stored in the grib_tree output) to the attributes in the idx files. They are unique for each time horizon e.g. we need to build a unique mapping for the 1 hour forecast, the 2 hour forecast and so on. So in this step we're going to create a mapping for a single grib file and its corresponding idx files in order, which will be used in later steps for building the aggregation.

Before that let's see what grib data we're extracting from the datatree. The metadata that we'll be extracting will be static in nature. We're going to use a single node by accessing it.

->

Now we're going to need a mapping from our GRIB/Zarr metadata (stored in the grib_tree output) to the attributes in the .idx files. They are unique for each time horizon, so we need to build a unique mapping for the 1 hour forecast, the 2 hour forecast, and so on. In this step we are going to create a mapping for a single GRIB file and its corresponding .idx files, in order.

We'll start by examining the GRIB data. The metadata that we'll be extracting will be static in nature. We're going to use a single node by accessing it with datatree.
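A toy sketch of what that mapping step amounts to: joining rows parsed from an .idx file to the static metadata extracted from the grib_tree output, on the attributes they share. All column names and values here are hypothetical, chosen only to show the shape of the join:

```python
import pandas as pd

# Hypothetical static metadata extracted once from the grib_tree datatree
# (attribute names are illustrative, not the notebook's exact schema).
metadata = pd.DataFrame({
    "varname": ["HGT", "TMP"],
    "level": ["200 mb", "500 mb"],
    "zarr_path": ["hgt/isobaricInhPa", "tmp/isobaricInhPa"],
})

# Rows parsed from a matching .idx file for the same horizon.
idx_df = pd.DataFrame({
    "varname": ["HGT", "TMP"],
    "level": ["200 mb", "500 mb"],
    "offset": [0, 48253],
    "length": [48253, 47168],
})

# The mapping joins .idx attributes to GRIB/Zarr metadata; because message
# order is fixed for a given horizon, the same mapping can be reused for
# every file at that horizon.
k_index = idx_df.merge(metadata, on=["varname", "level"], how="left")
print(k_index)
```

Because the join keys are static per horizon, this merge is what lets later files be indexed from their .idx text alone, without rescanning GRIB messages.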


Now if we parse the runtime from the idx file , we can build a fully compatible k_index(kerchunk index) for that particular file. Before creating the index, we need to clean some of the data in the mapping and index dataframe for the some variables as they tend to contain duplicate values, as demonstrated below.

->

Now that we have parsed the runtime from the .idx file, we can build a fully compatible Kerchunk index for each file. Before creating the index, we need to clean some of the data in the mapping and index dataframes, as some variables tend to contain duplicate values.


For the final step of the aggregation, we will create an index for each GRIB file to cover a two-month period starting from the specified date and convert it into one combined index and we can store this index for later use. We will be using the 6-hour horizon file for building the aggregation, this will be from 2017-01-01 to 2017-02-28. This is because as we already know this way of aggregation only works for a particular horizon file. With this way of building the aggregation we can index a whole of forecasts.

->

For the final step of the aggregation, we will create an index for each GRIB file to cover a two-month period and convert it into one combined index. We can store this index for later use. We will be using the 6-hour horizon file for building the aggregation, this will be from 2017-01-01 to 2017-02-28.
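Enumerating the per-file indexes over that two-month window can be sketched as below; the bucket layout and file-name template are illustrative assumptions, not necessarily the notebook's exact paths:

```python
import pandas as pd

# Illustrative NODD GCS layout for GEFS; the actual bucket, prefix, and
# file naming used in the notebook may differ.
URL_TEMPLATE = (
    "gs://gfs-ensemble-forecast-system/"
    "gefs.{date:%Y%m%d}/{run:02d}/gec00.t{run:02d}z.pgrb2af006"
)

# Four runs per day from 2017-01-01 through 2017-02-28, always the
# 6-hour horizon file (f006), as in the aggregation above.
runtimes = pd.date_range("2017-01-01 00:00", "2017-02-28 18:00", freq="6h")
urls = [URL_TEMPLATE.format(date=ts, run=ts.hour) for ts in runtimes]

# One per-file index is built for each URL, then all are concatenated
# (e.g. with pd.concat) into the single combined index that gets stored.
print(len(urls))
print(urls[0])
```

Holding the horizon fixed at f006 while varying only the run time is what keeps the .idx-based shortcut valid across the whole window.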


The difference between idx and k_index(kerchunk index) that we built in the above in the above step, is that the former indexes the grib messages and the latter indexes the variables in those messages. Now we'll need a tree model from grib_tree function to reinflate the part or the whole of the index i.e. variables in the messages as per our needs. The important point to note here is that the tree model should be made from the grib file(s) of the repository that we are indexing.

->

The difference between the .idx file and the Kerchunk index that we built is that the former indexes the GRIB messages while the latter indexes the variables in those messages. Now we'll need a tree model from the grib_tree function to reinflate part or all of the index (the variables in the messages) as needed. The important point to note here is that the tree model should be made from the GRIB file(s) of the repository that we are indexing.


@Anu-Ra-g
Contributor Author

@norlandrhagen I've made the suggested changes and updated some parts of the suggestions.

@norlandrhagen
Collaborator

Nice. It looks like the book build is failing with:

ImportError: cannot import name 'AggregationType' from 'kerchunk.grib2' (/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/grib2.py)

Maybe the kerchunk version needs to be bumped?

@martindurant
Contributor

@Anu-Ra-g , do you need a release of kerchunk? I'll merge your waiting PRs now, and we can handle any cleanup later.

@Anu-Ra-g
Contributor Author

Anu-Ra-g commented Aug 30, 2024

@martindurant I've made PRs #497, #498, and #499 in order to support this notebook.
