Merge pull request #15 from CMIP-REF/requirements
lewisjared authored Nov 30, 2024
2 parents 42d93cb + 850b281 commit 594f581
Showing 28 changed files with 1,048 additions and 51 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/ci.yaml
@@ -33,6 +33,7 @@ jobs:
env:
REF_DATA_ROOT: ${{ github.workspace }}/.esgpull/data
REF_OUTPUT_ROOT: ${{ github.workspace }}/out
REF_DATABASE_URL: "sqlite:///${{ github.workspace }}/.ref/db/ref.db"
steps:
- name: Check out repository
uses: actions/checkout@v4
@@ -48,7 +49,10 @@ jobs:
echo "Rerun after cache generation in tests job"
exit 1
- name: docs
run: uv run mkdocs build --strict
run: |
mkdir -p ${{ github.workspace }}/.ref/db
uv run ref ingest --source-type cmip6 .esgpull/data
uv run mkdocs build --strict
tests:
strategy:
5 changes: 3 additions & 2 deletions .readthedocs.yaml
@@ -19,7 +19,8 @@ build:
- asdf global uv latest
- uv sync --frozen
# Fetch test data from ESGF (needed by notebooks)
- uv run esgpull self install $READTHEDOCS_REPOSITORY_PATH/.esgf
- uv run esgpull self install $READTHEDOCS_REPOSITORY_PATH/.esgpull
- uv run python scripts/fetch_test_data.py
- uv run ref ingest $READTHEDOCS_REPOSITORY_PATH/.esgpull/data
# Run a strict build
- NO_COLOR=1 REF_DATA_ROOT=$READTHEDOCS_REPOSITORY_PATH/.esgf/data uv run mkdocs build --strict --site-dir $READTHEDOCS_OUTPUT/html
- NO_COLOR=1 REF_DATA_ROOT=$READTHEDOCS_REPOSITORY_PATH/.esgpull/data uv run mkdocs build --strict --site-dir $READTHEDOCS_OUTPUT/html
22 changes: 17 additions & 5 deletions README.md
@@ -91,14 +91,26 @@ dependency management. To get started, you will need to make sure that uv
is installed
([instructions here](https://docs.astral.sh/uv/getting-started/installation/)).

For all of work, we use our `Makefile`.
We use our `Makefile` to provide an easy way to run common developer commands.
You can read the commands out of the `Makefile` and run them by hand if you wish,
but we generally discourage this because it can be error-prone.
In order to create your environment, run `make virtual-environment`.

If you wish to run the test suite,
some input data must be fetched from ESGF.
To do this, you will need to run `make fetch-data`.
The following steps are required to set up a development environment.
They will install the required dependencies and fetch some test data,
as well as set up the configuration for the REF.

```bash
make virtual-environment
uv run esgpull self install $PWD/.esgpull
uv run ref config list > $PWD/.ref/ref.toml
export REF_CONFIGURATION=$PWD/.ref
make fetch-test-data
uv run ref ingest --source-type cmip6 $PWD/.esgpull/data
```

The local `ref.toml` configuration file will make it easier to play around with settings.
By default, the database is stored in your home directory;
this can be modified by changing the `db.database_url` setting in the `ref.toml` file.
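
For example, to keep the database alongside the rest of the local REF configuration,
the `db.database_url` setting could point at a file inside `.ref`
(an illustrative sketch only; the exact keys and layout of the generated `ref.toml` may differ):

```toml
# Illustrative override: store the database inside the local .ref directory
# rather than the default location in your home directory.
db.database_url = "sqlite:///.ref/db/ref.db"
```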

The test suite can then be run using `make test`.
This will run the test suites for each package and finally the integration test suite.
7 changes: 7 additions & 0 deletions changelog/15.feature.md
@@ -0,0 +1,7 @@
Added a `DataRequirement` class to declare the requirements for a metric.

This provides the ability to:

* filter a data catalog
* group datasets together to be used in a metric calculation
* declare constraints on the data that is required for a metric calculation
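
For illustration, a requirement using these capabilities might be declared along the following lines
(a sketch mirroring the dataset-selection guide added in this PR; constraints are omitted here for brevity):

```python
from ref_core.datasets import FacetFilter, SourceDatasetType
from ref_core.metrics import DataRequirement

requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    # Only consider the "tas" and "rsut" variables ...
    filters=(FacetFilter(facets={"variable_id": ("tas", "rsut")}),),
    # ... and create one group (one potential metric execution) per variable and model
    group_by=("variable_id", "source_id"),
)
```
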
20 changes: 19 additions & 1 deletion docs/explanation.md
@@ -23,7 +23,7 @@ An example implementation of a metric provider is provided in the `ref_metrics_example
### Metrics

A metric represents a specific calculation or analysis that can be performed on a dataset
or set of datasets with the aim for benchmarking the performance of different models.
or group of datasets with the aim of benchmarking the performance of different models.
These metrics often represent a specific aspect of the Earth system and are compared against
observations of the same quantities.

@@ -40,6 +40,24 @@ The Earth System Metrics and Diagnostics Standards
provide a community standard for reporting outputs.
This makes it possible to generate standardised outputs that can be distributed.

## Datasets

The REF aims to support a variety of input datasets,
including CMIP6, CMIP7+, Obs4MIPs, and other observational datasets.

When ingesting these datasets into the REF,
the metadata used to uniquely describe the datasets is stored in a database.
This metadata includes information such as:

* the model that produced the dataset
* the experiment that was run
* the variable and units of the data
* the time period of the data

The facets (or dimensions) of the metadata depend on the dataset type.
This metadata, in combination with the data requirements from a Metric,
is used to determine which new metric executions are required.
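
For example, a single CMIP6 dataset might be described by facet values along the following lines
(illustrative values only; the exact facet names depend on the dataset type):

```python
# Illustrative facets for a hypothetical CMIP6 dataset
facets = {
    "source_id": "ACCESS-ESM1-5",   # the model that produced the dataset
    "experiment_id": "historical",  # the experiment that was run
    "variable_id": "tas",           # the variable
    "frequency": "mon",             # the temporal sampling
    "member_id": "r1i1p1f1",        # the ensemble member
    # ... plus further facets such as the units and the time period covered
}
```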

## Execution Environments

The REF aims to support the execution of metrics in a variety of environments.
216 changes: 216 additions & 0 deletions docs/how-to-guides/dataset-selection.py
@@ -0,0 +1,216 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.16.4
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Dataset Selection
# A metric defines the requirements for the data it needs to run.
# The requirements are defined in the `data_requirements` attribute of the metric class.
#
# This notebook provides some examples of querying and filtering datasets.

# %% tags=["hide_code"]
import pandas as pd
from IPython.display import display
from ref_core.datasets import FacetFilter, SourceDatasetType
from ref_core.metrics import DataRequirement

from ref.cli.config import load_config
from ref.database import Database

# %% tags=["hide_code"]
config = load_config()
db = Database.from_config(config)

# %% [markdown]
#
# Each source dataset type has a corresponding adapter that can be used to load the data catalog.
#
# The adapter provides a consistent interface for ingesting
# and querying datasets across different dataset types.
# It contains information such as the columns that are expected.
# %%
from ref.datasets import get_dataset_adapter

adapter = get_dataset_adapter("cmip6")
adapter

# %% [markdown]
# Below is an example of a data catalog of the CMIP6 datasets that have already been ingested.
#
# This data catalog contains information about the datasets that are available for use in the metrics.
# The data catalog is a pandas DataFrame in which each row represents an individual NetCDF file,
# with columns for the variable, source_id, and the other metadata associated with that file.
# There are ~36 different **facets** of metadata for a CMIP6 data file.
# Each of these facets can be used to refine the datasets that are needed for a given metric execution.

# %%
data_catalog = adapter.load_catalog(db)
data_catalog
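
# %% [markdown]
# As a quick way to see which facets were ingested locally,
# the catalog's columns and the distinct variable/model/member combinations can be inspected directly
# (the exact output depends on whatever test data has been ingested).

# %%
# List the available facet columns, then show the distinct combinations of a few of them
print(sorted(data_catalog.columns))
data_catalog[["variable_id", "source_id", "member_id"]].drop_duplicates()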


# %% [markdown]
# A dataset may consist of more than one file. In the case of CMIP6 datasets,
# the modelling centers that produce the data may chunk a dataset along the time axis.
# The size of these chunks is at the discretion of the modelling center.
#
# Datasets share a common set of metadata (see `adapter.dataset_specific_metadata`)
# which does not vary across the files of a given dataset,
# while other facets vary from file to file (`adapter.file_specific_metadata`).
#
# Each data catalog will have a facet that can be used to split the catalog into unique datasets
# (See `adapter.slug_column`).

# %%
adapter.slug_column

# %%
for unique_id, dataset_files in data_catalog.groupby(adapter.slug_column):
    print(unique_id)
    display(dataset_files)
    print()
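
# %% [markdown]
# The split between dataset-level and file-level facets described above can also be
# inspected directly on the adapter
# (shown here as a simple illustration; the exact contents depend on the adapter implementation).

# %%
adapter.dataset_specific_metadata, adapter.file_specific_metadata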

# %% [markdown]
# Each metric may be run multiple times with different groups of datasets.
#
# Determining which metric executions should be performed is a three-step process:
# 1. Filter the data catalog based on the metric's requirements
# 2. Group the filtered data catalog using unique metadata fields
# 3. Apply constraints to the groups to ensure the correct data is available
#
# Each group that passes the constraints is a valid group for the metric to be executed.
#
# ## Examples
# Below are some examples showing different data requests
# and the corresponding groups of datasets that would be executed.
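
# %% [markdown]
# As a rough guide to what these steps do under the hood, here is a hand-rolled sketch in plain pandas.
# This is illustrative only and assumes a hypothetical `min_datasets` constraint;
# the real logic lives in `ref.solver.extract_covered_datasets`
# and supports arbitrary filters, groupings and constraints.

# %%
def sketch_covered_datasets(catalog, variable_ids, group_facets, min_datasets=1):
    """Illustrative sketch of filter -> group -> constrain (not the real implementation)."""
    # 1. Filter: keep only the rows for the requested variables
    filtered = catalog[catalog["variable_id"].isin(variable_ids)]

    # 2. Group: one candidate group per unique combination of the grouping facets
    candidate_groups = [group for _, group in filtered.groupby(list(group_facets))]

    # 3. Constrain: keep only groups containing enough unique datasets
    return [g for g in candidate_groups if g["instance_id"].nunique() >= min_datasets]


len(sketch_covered_datasets(data_catalog, ("tas", "rsut"), ("variable_id", "source_id")))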

# %%
from ref.solver import extract_covered_datasets


# %% tags=["hide_code"]
def display_groups(frames):
    for frame in frames:
        display(frame[["instance_id", "source_id", "variable_id"]].drop_duplicates())


# %% [markdown]
# The simplest data request is a `FacetFilter`.
# This filters the data catalog to include only the data required for a given metric run.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(
        # Only include "tas" and "rsut"
        FacetFilter(facets={"variable_id": ("tas", "rsut")}),
    ),
    group_by=None,
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)

# %% [markdown]
# The `group_by` field can be used to split the filtered data into multiple groups,
# each of which has a unique set of values in the specified facets.
# This results in multiple groups of datasets, each of which would correspond to a metric execution.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(
        # Only include "tas" and "rsut"
        FacetFilter(facets={"variable_id": ("tas", "rsut")}),
    ),
    group_by=(
        "variable_id",
        "source_id",
    ),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)


# %% [markdown]
# A data requirement can optionally specify `Constraint`s.
# These constraints are applied to each group independently, either modifying the group or excluding it.
# All constraints must hold for a group to be executed.
#
# One type of constraint is a `GroupOperation`.
# This constraint allows for the manipulation of a given group.
# This can be used to remove datasets or include additional datasets from the catalog,
# which is useful for selecting datasets common to all groups (e.g. cell areas).
#
# Below, an `IncludeTas` `GroupOperation` is defined which adds the corresponding `tas` dataset to each group.


# %%
class IncludeTas:
    def apply(self, group: pd.DataFrame, data_catalog: pd.DataFrame) -> pd.DataFrame:
        # we will probably need to include some helpers
        tas = data_catalog[
            (data_catalog["variable_id"] == "tas")
            & data_catalog["source_id"].isin(group["source_id"].unique())
        ]

        return pd.concat([group, tas])


data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(FacetFilter(facets={"frequency": "mon"}),),
    group_by=("variable_id", "source_id", "member_id"),
    constraints=(IncludeTas(),),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)


# %% [markdown]
# In addition to operations, a `GroupValidator` constraint can be specified.
# This validator is used to determine if a group is valid or not.
# If the validator does not return True, then the group is excluded from the list of groups for execution.


# %%
class AtLeast2:
    def validate(self, group: pd.DataFrame) -> bool:
        return len(group["instance_id"].drop_duplicates()) >= 2


# %% [markdown]
# Here we add a simple validator which ensures that at least 2 unique datasets are present.
# This removes the tas-only group from above.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(FacetFilter(facets={"frequency": "mon"}),),
    group_by=("variable_id", "source_id", "member_id"),
    constraints=(IncludeTas(), AtLeast2()),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)

# %%
4 changes: 2 additions & 2 deletions docs/how-to-guides/running-metrics-locally.py
@@ -71,7 +71,7 @@
# This can be overridden by specifying the `REF_EXECUTOR` environment variable.

# %%
result = run_metric("example", provider, configuration=configuration, trigger=trigger)
result = run_metric("global_mean_timeseries", provider, configuration=configuration, trigger=trigger)
result

# %%
@@ -87,7 +87,7 @@
# This will not perform any validation/verification of the output results.

# %%
metric = provider.get("example")
metric = provider.get("global_mean_timeseries")

direct_result = metric.run(configuration=configuration, trigger=trigger)
assert direct_result.successful