Added a new documentation page for faster GRIB aggregations #495

Merged (7 commits), Sep 9, 2024
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -68,6 +68,7 @@ so that blocks from one or more files can be arranged into aggregate datasets ac
beyond
nonzarr
reference
reference_aggregation
contributing
advanced

191 changes: 191 additions & 0 deletions docs/source/reference_aggregation.rst
@@ -0,0 +1,191 @@
Other Methods of Aggregation
============================

As we have already seen on this `page <https://fsspec.github.io/kerchunk/test_example.html#multi-file-jsons>`_,
the main purpose of ``kerchunk`` is to generate references that let us view whole archives of files
(GRIB2, NetCDF, etc.), allowing *in-situ* access to the data. In this part of the documentation,
we will look at some other, more efficient ways of combining references.

GRIB Aggregations
-----------------

This method of reference aggregation, developed at `Camus Energy <https://www.camus.energy/>`_
(in particular by `@emfdavid <https://github.com/emfdavid>`_), operates on GRIB2 files. It comes
with two restrictions:

- the GRIB files must be accompanied by ``.idx`` files
- it is specialised for time-series data, where each file has an identical message structure

Within those restrictions, it can significantly reduce the time required to combine references,
cutting it down to a fraction of the previous duration. This approach builds upon consolidating
references with ``kerchunk.combine.MultiZarrToZarr``, making it faster. The functions and
operations used here will be part of ``kerchunk``'s API. You can follow this
`video <https://discourse.pangeo.io/t/pangeo-showcase-optimizations-for-kerchunk-aggregation-and-zarr-i-o-at-scale-for-machine-learning/4074>`_
for the initial discussion.

*How is it faster?*

Every GRIB file stored on cloud platforms such as **AWS** and **GCP** is accompanied by a
corresponding ``.idx`` file. This file, otherwise known as an *index* file, contains the key
metadata of the messages in the GRIB file: the `index`, `offset`, `datetime`,
`variable` and `forecast time` of each message.

This metadata is used to build a ``k_index`` entry for every GRIB message to be indexed.
The indexing process primarily involves the `pandas <https://pandas.pydata.org/>`_ library.
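
The per-message byte ranges can be sketched with plain pandas. This is an illustrative
sketch, not ``kerchunk``'s actual implementation; the ``.idx`` lines below are made up,
though real files follow the same ``number:offset:date:variable:level:forecast:`` layout.
Each message's length is the gap to the next message's offset (the last message's length
must come from the file size):

```python
import io

import pandas as pd

# Hypothetical .idx contents; real files use the same colon-separated layout.
idx_text = """\
1:0:d=2017010106:HGT:0.5 mb:6 hour fcst:
2:47493:d=2017010106:TMP:0.5 mb:6 hour fcst:
3:66931:d=2017010106:RH:0.5 mb:6 hour fcst:
"""

df = pd.read_csv(
    io.StringIO(idx_text),
    sep=":",
    header=None,
    names=["idx", "offset", "date", "varname", "level", "forecast", "extra"],
)

# A message ends where the next one begins; the final length needs the file size.
df["length"] = df["offset"].shift(-1) - df["offset"]
print(df[["idx", "offset", "length", "varname"]])
```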

.. list-table:: k_index for a single GRIB file
:header-rows: 1
:widths: 5 10 15 10 20 15 10 20 20 30 10 10 10

* -
- varname
- typeOfLevel
- stepType
- name
- step
- level
- time
- valid_time
- uri
- offset
- length
- inline_value
* - 0
- gh
- isobaricInhPa
- instant
- Geopotential height
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 0
- 47493
- None
* - 1
- t
- isobaricInhPa
- instant
- Temperature
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 47493
- 19438
- None
* - 2
- r
- isobaricInhPa
- instant
- Relative humidity
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 66931
- 10835
- None
* - 3
- u
- isobaricInhPa
- instant
- U component of wind
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 77766
- 22625
- None
* - 4
- v
- isobaricInhPa
- instant
- V component of wind
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 100391
- 20488
- None
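

Each row of the ``k_index`` locates one message as a byte range (``offset``, ``length``)
within the file at ``uri``, so a single variable can be fetched with a plain range read,
with no need to parse the rest of the file. A self-contained sketch of that idea, using a
local file of made-up bytes in place of the remote GRIB object:

```python
import os
import tempfile

# Stand-in bytes whose sizes match the first three rows of the table above
data = b"G" * 47493 + b"T" * 19438 + b"R" * 10835
path = os.path.join(tempfile.mkdtemp(), "fake.grib2")
with open(path, "wb") as f:
    f.write(data)

def read_message(path, offset, length):
    """Fetch one message's bytes, as an HTTP/S3 range request would do remotely."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Row 1 of the table: Temperature, offset 47493, length 19438
temperature_msg = read_message(path, 47493, 19438)
print(len(temperature_msg))  # 19438
```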


.. note::
    The index in the ``.idx`` file indexes the GRIB *messages*, whereas the ``k_index``
    (kerchunk index) built in this workflow indexes the *variables* within those messages.
    In brief, the steps of the workflow are:

    1. parse each GRIB file's ``.idx`` sidecar to obtain every message's offset and length
    2. scan one representative GRIB file in full to map the messages to the variables they contain
    3. combine this per-message metadata into a single pandas DataFrame, the ``k_index``

*What now?*

After creating the k_index for the desired duration, we use the ``DataTree`` model
from `xarray-datatree <https://xarray-datatree.readthedocs.io/en/latest/>`_ to view
part of the aggregation, or all of it. Below is a tree model made from an aggregation of
GRIB files produced by the **GEFS** model and hosted in an AWS S3 bucket.

.. code-block:: text

DataTree('None', parent=None)
├── DataTree('prmsl')
│ │ Dimensions: ()
│ │ Data variables:
│ │ *empty*
│ │ Attributes:
│ │ name: Pressure reduced to MSL
│ └── DataTree('instant')
│ │ Dimensions: ()
│ │ Data variables:
│ │ *empty*
│ │ Attributes:
│ │ stepType: instant
│ └── DataTree('meanSea')
│ Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1,
│ model_horizons: 1, valid_times: 237)
│ Coordinates:
│ * latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
│ * longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
│ meanSea float64 8B ...
│ number (time, step) int64 8B ...
│ step (model_horizons, valid_times) timedelta64[ns] 2kB ...
│ time (model_horizons, valid_times) datetime64[ns] 2kB ...
│ valid_time (model_horizons, valid_times) datetime64[ns] 2kB ...
│ Dimensions without coordinates: model_horizons, valid_times
│ Data variables:
│ prmsl (model_horizons, valid_times, latitude, longitude) float64 124MB ...
│ Attributes:
│ typeOfLevel: meanSea
└── DataTree('ulwrf')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes:
│ name: Upward long-wave radiation flux
└── DataTree('avg')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes:
│ stepType: avg
└── DataTree('nominalTop')
Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1,
model_horizons: 1, valid_times: 237)
Coordinates:
* latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
* longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
nominalTop float64 8B ...
number (time, step) int64 8B ...
step (model_horizons, valid_times) timedelta64[ns] 2kB ...
time (model_horizons, valid_times) datetime64[ns] 2kB ...
valid_time (model_horizons, valid_times) datetime64[ns] 2kB ...
Dimensions without coordinates: model_horizons, valid_times
Data variables:
ulwrf (model_horizons, valid_times, latitude, longitude) float64 124MB ...
Attributes:
typeOfLevel: nominalTop

.. tip::
    For a full tutorial on this workflow, refer to the `kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
    in `Project Pythia <https://projectpythia.org/>`_.

.. raw:: html

<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script>