Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement coupling of two or more models #76

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open

Conversation

b8raoult
Copy link
Collaborator

@b8raoult b8raoult commented Dec 7, 2024

The new command anemoi-inference couple config.yaml will run two or more models that requires forcing variables from each others. The models can run in different threads, different processes or different MPI ranks. They can use the same GPU (if the fit) or different GPUs.

Below is a sample configuration file, that runs an atmosphere model and an ocean model:

# How to run the model:
transport: processes

# The list of tasks to run
tasks:
  atmos:
    runner:
      config: atmos.yaml
      overrides:
        device: cuda:0
  ocean:
    runner:
      config: ocean.yaml
      overrides:
        device: cuda:1

# List of variables to exchange
couplings:
- atmos -> ocean:
  - lsm
  - 10u
  - 10v
  - 2t
  - 2d
  - ssrd
  - strd
  - tp
  - msl

- ocean -> atmos:
  - avg_tos
  - avg_siconc

@mchantry
Copy link
Member

mchantry commented Dec 7, 2024

Would be good to develop this multi-GPU with the expected multi-GPU inference for a single model in mind. Perhaps @cathalobrien and/or @japols can have a look at this?

@@ -47,7 +48,8 @@ def _init(self, state):
if os.path.exists(self.path):
os.remove(self.path)

self.ncfile = Dataset(self.path, "w", format="NETCDF4")
with LOCK:
Copy link
Contributor

@dietervdb-meteo dietervdb-meteo Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing anemoi-inference run with netcdf output, it hangs forever on with LOCK: (Not even proceeding to the next line). Replacing this with if True: the code proceeds to hang on the next instance of with LOCK. Possibly this is related to something missing in my installation. But no error messages or warnings appear.

I suggest the use of with LOCK is checked, in case it works in certain instances then at least an assert statement, making sure it will work or when not throw a message, needs to be added.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll have a look. Maybe we need an RLock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that solves the problem.

Copy link
Contributor

@dietervdb-meteo dietervdb-meteo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning the reformulation of various types of forcings the code looks correct and an explicit test using a LAM checkpoint leading to inference with boundary forcings is succesful: results identical to those of earlier anemoi-inference versions are produced.

Concerning other parts of the code I suggest another reviewer has a closer look.



class CoupleCmd(Command):
"""Inspect the contents of a checkpoint file."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring should be updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do there seem to be two commands to do the same thing?

  • couple
  • coupled

Comment on lines +18 to +19
from .couple import CoupleConfiguration as CoupleConfiguration
from .run import RunConfiguration as RunConfiguration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these importing as the same name?
All beyond the as could be removed

Comment on lines +41 to +43
transport: str
couplings: list[dict[str, list[str]]]
tasks: dict[str, dict[str, dict[str, Any]]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these have docs like all the other properties?

else:
result = np.stack([fields[v] for v in variables], axis=0)

assert len(result.shape) == 3 and result.shape[0] == len(variables) and result.shape[1] == len(dates), (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how specific this assertion is, an explanation might be useful

and r["time"] in ("0600", "1800")
):

r["stream"] = "scda"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the use of this stream?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that this input does nothing at the moment?
Both functions raise a NotImplementedError, why is it being added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be addressed by #84

floriankrb
floriankrb previously approved these changes Dec 18, 2024
@b8raoult b8raoult dismissed floriankrb’s stale review December 18, 2024 09:45

The merge-base changed after approval.

Comment on lines +73 to +76
if trace_path:
from .trace import Trace

self.trace = Trace(trace_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have forgotten to commit the trace.py file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ATS
Development

Successfully merging this pull request may close these issues.

5 participants