Merge pull request #15 from CMIP-REF/requirements
lewisjared authored Nov 30, 2024
2 parents 42d93cb + 850b281 commit 594f581
Showing 28 changed files with 1,048 additions and 51 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/ci.yaml
@@ -33,6 +33,7 @@ jobs:
env:
REF_DATA_ROOT: ${{ github.workspace }}/.esgpull/data
REF_OUTPUT_ROOT: ${{ github.workspace }}/out
REF_DATABASE_URL: "sqlite:///${{ github.workspace }}/.ref/db/ref.db"
steps:
- name: Check out repository
uses: actions/checkout@v4
@@ -48,7 +49,10 @@ jobs:
echo "Rerun after cache generation in tests job"
exit 1
- name: docs
run: uv run mkdocs build --strict
run: |
mkdir -p ${{ github.workspace }}/.ref/db
uv run ref ingest --source-type cmip6 .esgpull/data
uv run mkdocs build --strict
tests:
strategy:
5 changes: 3 additions & 2 deletions .readthedocs.yaml
@@ -19,7 +19,8 @@ build:
- asdf global uv latest
- uv sync --frozen
# Fetch test data from ESGF (needed by notebooks)
- uv run esgpull self install $READTHEDOCS_REPOSITORY_PATH/.esgf
- uv run esgpull self install $READTHEDOCS_REPOSITORY_PATH/.esgpull
- uv run python scripts/fetch_test_data.py
- uv run ref ingest $READTHEDOCS_REPOSITORY_PATH/.esgpull/data
# Run a strict build
- NO_COLOR=1 REF_DATA_ROOT=$READTHEDOCS_REPOSITORY_PATH/.esgf/data uv run mkdocs build --strict --site-dir $READTHEDOCS_OUTPUT/html
- NO_COLOR=1 REF_DATA_ROOT=$READTHEDOCS_REPOSITORY_PATH/.esgpull/data uv run mkdocs build --strict --site-dir $READTHEDOCS_OUTPUT/html
22 changes: 17 additions & 5 deletions README.md
@@ -91,14 +91,26 @@ dependency management. To get started, you will need to make sure that uv
is installed
([instructions here](https://docs.astral.sh/uv/getting-started/installation/)).

For all of work, we use our `Makefile`.
We use our `Makefile` to provide an easy way to run common developer commands.
You can read the commands out of the `Makefile` and run them by hand if you wish,
but we generally discourage this because it can be error-prone.
In order to create your environment, run `make virtual-environment`.

If you wish to run the test suite,
some input data must be fetched from ESGF.
To do this, you will need to run `make fetch-data`.
The following steps are required to set up a development environment.
They will install the required dependencies and fetch some test data,
as well as set up the configuration for the REF.

```bash
make virtual-environment
uv run esgpull self install $PWD/.esgpull
uv run ref config list > $PWD/.ref/ref.toml
export REF_CONFIGURATION=$PWD/.ref
make fetch-test-data
uv run ref ingest --source-type cmip6 $PWD/.esgpull/data
```

The local `ref.toml` configuration file will make it easier to play around with settings.
By default, the database is stored in your home directory;
this can be modified by changing the `db.database_url` setting in the `ref.toml` file.
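
For example, to keep the database alongside the rest of the local REF configuration,
the `db.database_url` setting could point at a file inside `.ref`
(an illustrative sketch only; the exact keys and layout of the generated `ref.toml` may differ):

```toml
# Illustrative override: store the database inside the local .ref directory
# rather than the default location in your home directory.
db.database_url = "sqlite:///.ref/db/ref.db"
```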

The test suite can then be run using `make test`.
This will run the test suites for each package and finally the integration test suite.
7 changes: 7 additions & 0 deletions changelog/15.feature.md
@@ -0,0 +1,7 @@
Added a `DataRequirement` class to declare the requirements for a metric.

This provides the ability to:

* filter a data catalog
* group datasets together to be used in a metric calculation
* declare constraints on the data that is required for a metric calculation
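
For illustration, a requirement using these capabilities might be declared along the following lines
(a sketch mirroring the dataset-selection guide added in this PR; constraints are omitted here for brevity):

```python
from ref_core.datasets import FacetFilter, SourceDatasetType
from ref_core.metrics import DataRequirement

requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    # Only consider the "tas" and "rsut" variables ...
    filters=(FacetFilter(facets={"variable_id": ("tas", "rsut")}),),
    # ... and create one group (one potential metric execution) per variable and model
    group_by=("variable_id", "source_id"),
)
```
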
20 changes: 19 additions & 1 deletion docs/explanation.md
@@ -23,7 +23,7 @@ An example implementation of a metric provider is provided in the `ref_metrics_example
### Metrics

A metric represents a specific calculation or analysis that can be performed on a dataset
or set of datasets with the aim for benchmarking the performance of different models.
or group of datasets with the aim of benchmarking the performance of different models.
These metrics often represent a specific aspect of the Earth system and are compared against
observations of the same quantities.

@@ -40,6 +40,24 @@ The Earth System Metrics and Diagnostics Standards
provide a community standard for reporting outputs.
This makes it possible to generate standardised outputs that can be distributed.

## Datasets

The REF aims to support a variety of input datasets,
including CMIP6, CMIP7+, Obs4MIPs, and other observational datasets.

When ingesting these datasets into the REF,
the metadata used to uniquely describe the datasets is stored in a database.
This metadata includes information such as:

* the model that produced the dataset
* the experiment that was run
* the variable and units of the data
* the time period of the data

The facets (or dimensions) of the metadata depend on the dataset type.
This metadata, in combination with the data requirements from a Metric,
is used to determine which new metric executions are required.
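
For example, a single CMIP6 dataset might be described by facet values along the following lines
(illustrative values only; the exact facet names depend on the dataset type):

```python
# Illustrative facets for a hypothetical CMIP6 dataset
facets = {
    "source_id": "ACCESS-ESM1-5",   # the model that produced the dataset
    "experiment_id": "historical",  # the experiment that was run
    "variable_id": "tas",           # the variable
    "frequency": "mon",             # the temporal sampling
    "member_id": "r1i1p1f1",        # the ensemble member
    # ... plus further facets such as the units and the time period covered
}
```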

## Execution Environments

The REF aims to support the execution of metrics in a variety of environments.
216 changes: 216 additions & 0 deletions docs/how-to-guides/dataset-selection.py
@@ -0,0 +1,216 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.16.4
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Dataset Selection
# A metric defines the requirements for the data it needs to run.
# The requirements are defined in the `data_requirements` attribute of the metric class.
#
# This notebook provides some examples of querying and filtering datasets.

# %% tags=["hide_code"]
import pandas as pd
from IPython.display import display
from ref_core.datasets import FacetFilter, SourceDatasetType
from ref_core.metrics import DataRequirement

from ref.cli.config import load_config
from ref.database import Database

# %% tags=["hide_code"]
config = load_config()
db = Database.from_config(config)

# %% [markdown]
#
# Each source dataset type has a corresponding adapter that can be used to load the data catalog.
#
# The adapter provides a consistent interface for ingesting
# and querying datasets across different dataset types.
# It contains information such as the columns that are expected.
# %%
from ref.datasets import get_dataset_adapter

adapter = get_dataset_adapter("cmip6")
adapter

# %% [markdown]
# Below is an example of a data catalog of the CMIP6 datasets that have already been ingested.
#
# This data catalog contains information about the datasets that are available for use in the metrics.
# The data catalog is a pandas DataFrame in which each row represents an individual NetCDF file,
# with columns for the variable, source_id, and the other metadata associated with that file.
# There are ~36 different **facets** of metadata for a CMIP6 data file.
# Each of these facets can be used to refine the datasets that are needed for a given metric execution.

# %%
data_catalog = adapter.load_catalog(db)
data_catalog
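
# %% [markdown]
# As a quick way to see which facets were ingested locally,
# the catalog's columns and the distinct variable/model/member combinations can be inspected directly
# (the exact output depends on whatever test data has been ingested).

# %%
# List the available facet columns, then show the distinct combinations of a few of them
print(sorted(data_catalog.columns))
data_catalog[["variable_id", "source_id", "member_id"]].drop_duplicates()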


# %% [markdown]
# A dataset may consist of more than one file. In the case of CMIP6 datasets,
# the modelling centers that produce the data may chunk a dataset along the time axis.
# The size of these chunks is at the discretion of the modelling center.
#
# Datasets share a common set of metadata (see `adapter.dataset_specific_metadata`)
# which does not vary across the files of a given dataset,
# while other facets vary from file to file (`adapter.file_specific_metadata`).
#
# Each data catalog will have a facet that can be used to split the catalog into unique datasets
# (See `adapter.slug_column`).

# %%
adapter.slug_column

# %%
for unique_id, dataset_files in data_catalog.groupby(adapter.slug_column):
    print(unique_id)
    display(dataset_files)
    print()
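
# %% [markdown]
# The split between dataset-level and file-level facets described above can also be
# inspected directly on the adapter
# (shown here as a simple illustration; the exact contents depend on the adapter implementation).

# %%
adapter.dataset_specific_metadata, adapter.file_specific_metadata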

# %% [markdown]
# Each metric may be run multiple times with different groups of datasets.
#
# Determining which metric executions should be performed is a three-step process:
# 1. Filter the data catalog based on the metric's requirements
# 2. Group the filtered data catalog using unique metadata fields
# 3. Apply constraints to the groups to ensure the correct data is available
#
# Each group that passes the constraints is a valid group for the metric to be executed.
#
# ## Examples
# Below are some examples showing different data requests
# and the corresponding groups of datasets that would be executed.
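
# %% [markdown]
# As a rough guide to what these steps do under the hood, here is a hand-rolled sketch in plain pandas.
# This is illustrative only and assumes a hypothetical `min_datasets` constraint;
# the real logic lives in `ref.solver.extract_covered_datasets`
# and supports arbitrary filters, groupings and constraints.

# %%
def sketch_covered_datasets(catalog, variable_ids, group_facets, min_datasets=1):
    """Illustrative sketch of filter -> group -> constrain (not the real implementation)."""
    # 1. Filter: keep only the rows for the requested variables
    filtered = catalog[catalog["variable_id"].isin(variable_ids)]

    # 2. Group: one candidate group per unique combination of the grouping facets
    candidate_groups = [group for _, group in filtered.groupby(list(group_facets))]

    # 3. Constrain: keep only groups containing enough unique datasets
    return [g for g in candidate_groups if g["instance_id"].nunique() >= min_datasets]


len(sketch_covered_datasets(data_catalog, ("tas", "rsut"), ("variable_id", "source_id")))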

# %%
from ref.solver import extract_covered_datasets


# %% tags=["hide_code"]
def display_groups(frames):
    for frame in frames:
        display(frame[["instance_id", "source_id", "variable_id"]].drop_duplicates())


# %% [markdown]
# The simplest data request is a `FacetFilter`.
# This filters the data catalog to include only the data required for a given metric run.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(
        # Only include "tas" and "rsut"
        FacetFilter(facets={"variable_id": ("tas", "rsut")}),
    ),
    group_by=None,
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)

# %% [markdown]
# The `group_by` field can be used to split the filtered data into multiple groups,
# each of which has a unique set of values in the specified facets.
# This results in multiple groups of datasets, each of which would correspond to a metric execution.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(
        # Only include "tas" and "rsut"
        FacetFilter(facets={"variable_id": ("tas", "rsut")}),
    ),
    group_by=(
        "variable_id",
        "source_id",
    ),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)


# %% [markdown]
# A data requirement can optionally specify `Constraint`s.
# These constraints are applied to each group independently, either modifying the group or excluding it.
# All constraints must hold for a group to be executed.
#
# One type of constraint is a `GroupOperation`.
# This constraint allows for the manipulation of a given group.
# This can be used to remove datasets or include additional datasets from the catalog,
# which is useful for selecting datasets common to all groups (e.g. cell areas).
#
# Below, an `IncludeTas` `GroupOperation` is defined which adds the corresponding `tas` dataset to each group.


# %%
class IncludeTas:
    def apply(self, group: pd.DataFrame, data_catalog: pd.DataFrame) -> pd.DataFrame:
        # we will probably need to include some helpers
        tas = data_catalog[
            (data_catalog["variable_id"] == "tas")
            & data_catalog["source_id"].isin(group["source_id"].unique())
        ]

        return pd.concat([group, tas])


data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(FacetFilter(facets={"frequency": "mon"}),),
    group_by=("variable_id", "source_id", "member_id"),
    constraints=(IncludeTas(),),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)


# %% [markdown]
# In addition to operations, a `GroupValidator` constraint can be specified.
# This validator is used to determine if a group is valid or not.
# If the validator does not return True, then the group is excluded from the list of groups for execution.


# %%
class AtLeast2:
    def validate(self, group: pd.DataFrame) -> bool:
        return len(group["instance_id"].drop_duplicates()) >= 2


# %% [markdown]
# Here we add a simple validator which ensures that at least 2 unique datasets are present.
# This removes the tas-only group from above.

# %%
data_requirement = DataRequirement(
    source_type=SourceDatasetType.CMIP6,
    filters=(FacetFilter(facets={"frequency": "mon"}),),
    group_by=("variable_id", "source_id", "member_id"),
    constraints=(IncludeTas(), AtLeast2()),
)

groups = extract_covered_datasets(data_catalog, data_requirement)

display_groups(groups)

# %%
4 changes: 2 additions & 2 deletions docs/how-to-guides/running-metrics-locally.py
@@ -71,7 +71,7 @@
# This can be overridden by specifying the `REF_EXECUTOR` environment variable.

# %%
result = run_metric("example", provider, configuration=configuration, trigger=trigger)
result = run_metric("global_mean_timeseries", provider, configuration=configuration, trigger=trigger)
result

# %%
@@ -87,7 +87,7 @@
# This will not perform any validation/verification of the output results.

# %%
metric = provider.get("example")
metric = provider.get("global_mean_timeseries")

direct_result = metric.run(configuration=configuration, trigger=trigger)
assert direct_result.successful