
refactor[cartesian]: use DeviceType from core_defn in LayoutInfo #1867

Draft · wants to merge 2 commits into base: main
Conversation

@romanc romanc commented Feb 14, 2025

Description

Follow-up from PR #1843, which isolates the refactor to re-use core_defn.DeviceType instead of "cpu" | "gpu" literals in cartesian's storage_layout.

I'm leaving this as a draft since I have some questions about whether we're on the right track. I'd like to open the floor for a first round of feedback. I'd be especially happy to hear from @havogt and @egparedes, who have been with this codebase for a long time.

Once we agree on the bigger questions, I'll run the pyFV3 tests with this branch just to be sure we didn't break anything that would only show up later (in which case, we should probably add test coverage in this repo).

So yeah, this is not urgent in any way and might drag along on the side for a while ... @FlorianDeconinck, we'll see how this aligns with other priorities if it gets bigger and/or takes longer. I'd also be fine gathering all the info from the discussions and re-visiting this later, or writing a proper refactor issue for the summer. I don't expect any performance improvements from this refactor.

Requirements

  • All fixes and/or new features come with corresponding tests.
  • Important design decisions have been documented in the appropriate ADR inside the docs/development/ADRs/ folder.

@romanc romanc left a comment

This turned out to be a bit bigger than expected. Main pain point was the test generation (which needs a static way of knowing which backend/layout needs a GPU).

I left a couple of questions & remarks inline. Let's have a discussion ...

@@ -45,9 +45,9 @@
REGISTRY = gt_utils.Registry()


def from_name(name: str) -> Optional[Type[Backend]]:
Contributor Author

Unrelated: The return type is not optional. It raises an exception in case the backend is not found.
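For context, the tightened signature could look like this minimal sketch (the `REGISTRY` contents and the exception type here are illustrative, not gt4py's actual code):

```python
from typing import Dict, Type


class Backend:
    """Stand-in for gt4py's Backend base class (illustrative only)."""


# Hypothetical registry contents; gt4py uses gt_utils.Registry().
REGISTRY: Dict[str, Type[Backend]] = {"numpy": Backend}


def from_name(name: str) -> Type[Backend]:
    # Tightened return type: the function raises for unknown names,
    # so it never returns None and `Optional` is misleading.
    if name not in REGISTRY:
        raise NotImplementedError(f"Backend '{name}' is not registered.")
    return REGISTRY[name]
```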

Comment on lines +439 to +454
def _decorator(cls: Type[Backend]) -> Type[Backend]:
def _no_generate(obj) -> Type[StencilObject]:
raise NotImplementedError(
f"Disabled '{cls.name}' backend: '{message}'\n",
f"You can still enable the backend by hand using the environment variable '{enabled_env_var}=1'",
)

# Replace generate method with raise
if not hasattr(cls, "generate"):
raise ValueError(f"Coding error. Expected a generate method on {cls}")
# Flag that it got disabled for register lookup
cls.disabled = True # type: ignore
cls.generate = _no_generate # type: ignore
return cls

return _decorator
Contributor Author

Unrelated: no else needed after return inside if. Reduces the amount of indentation needed.
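As a standalone illustration of the guard-clause style (names here are made up, not from the PR):

```python
def validate_name(name: str) -> str:
    # Guard clause: raise and bail out early ...
    if not name:
        raise ValueError("empty name")
    # ... so the happy path needs no `else` and one level less indentation.
    return name.strip()
```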

Comment on lines -780 to -787
storage_info = {
"alignment": 1,
"device": "cpu",
"layout_map": gt_storage.layout.layout_maker_factory((0, 1, 2)),
"is_optimal_layout": gt_storage.layout.layout_checker_factory(
gt_storage.layout.layout_maker_factory((0, 1, 2))
),
}
Contributor Author

Moved to gt_storage.layout like all the other layouts

@@ -206,7 +207,7 @@ def hyp_wrapper(test_hyp, hypothesis_data):
)

marks = test["marks"].copy()
if gt4pyc.backend.from_name(test["backend"]).storage_info["device"] == "gpu":
if test["backend"] in GPU_BACKEND_NAMES:
Contributor Author

We come by here during test setup, for CPU and GPU tests alike. If cupy isn't detected (or there's no GPU, e.g. on GHA runners), CUPY_DEVICE defaults to None and this won't correctly trigger the requires_gpu marks.

For the test system, we thus need something independent of cupy availability to denote that certain tests require a GPU while others don't. This could also be done differently, e.g. a requires_gpu property on the base Backend class. Honestly, the longer I think about it, the more it feels like we should be able to tell from the backend. Even if there were multiple layouts per backend (which, as far as I understand, is possible), I'd expect all of them to run on either CPU or GPU (or possibly another accelerator in the future).
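The requires_gpu idea floated above could be sketched roughly like this (class and helper names are hypothetical, not gt4py's actual API):

```python
from typing import List, Type


class Backend:
    """Illustrative base class: GPU requirement is static class data."""

    name: str = ""
    requires_gpu: bool = False  # known without importing cupy or probing devices


class NumpyBackend(Backend):
    name = "numpy"


class GtGpuBackend(Backend):
    name = "gt:gpu"
    requires_gpu = True


def gpu_backend_names(backends: List[Type[Backend]]) -> List[str]:
    # Static replacement for checking storage_info["device"] == "gpu":
    # usable during test collection even on runners without a GPU.
    return [b.name for b in backends if b.requires_gpu]
```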

Contributor

Agreed. I think the earlier design split layout from backend by addressing layouts through their underlying allocator (numpy/cupy). Though it's architecturally correct, I'd argue it's a bit of over-engineering, and repeating data is not an issue. Which leads me to the fact that Backend should be the source of truth, including for requires_gpu.

Contributor

And we could extend it even further by re-routing allocation through it, etc. I'm no fan of a massive class that does it all, but in this case Backend seems underused as a single source of truth.

Contributor

Probably out of scope for this PR

Comment on lines +22 to +24
CUPY_DEVICE: Final[Literal[DeviceType.CUDA, DeviceType.ROCM] | None] = (
None if not cp else (DeviceType.ROCM if cp.cuda.get_hipcc_path() else DeviceType.CUDA)
)
Contributor Author

We could share this between next and cartesian. If so, should it go into _core?
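A shared helper in _core might look something like the sketch below. This is a self-contained illustration: `DeviceType` is redefined locally (the real one lives in core_defs), and `detect_cupy_device` is a hypothetical name; the `get_hipcc_path()` check mirrors the snippet above.

```python
import enum
from typing import Optional


class DeviceType(enum.Enum):
    # Local stand-in for core_defs.DeviceType, trimmed to the GPU flavors.
    CUDA = "cuda"
    ROCM = "rocm"


try:
    import cupy as cp
except ImportError:
    cp = None  # no cupy installed -> no GPU device


def detect_cupy_device() -> Optional[DeviceType]:
    """Return the GPU flavor cupy was built for, or None without cupy."""
    if cp is None:
        return None
    # A cupy built for AMD exposes a hipcc path; NVIDIA builds don't.
    return DeviceType.ROCM if cp.cuda.get_hipcc_path() else DeviceType.CUDA


CUPY_DEVICE: Optional[DeviceType] = detect_cupy_device()
```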

@@ -58,15 +58,13 @@ def register(backend_cls: Type[Backend]) -> Type[Backend]:
assert issubclass(backend_cls, Backend) and backend_cls.name is not None

if isinstance(backend_cls.name, str):
gt_storage.register(backend_cls.name, backend_cls.storage_info)
Contributor Author

Layouts were registered twice: once here and once in storage/cartesian/layout.py. Most layouts were registered under the same name (i.e. the backend name), but some were duplicated. I chose to remove the layout registration here and changed all the names in storage/cartesian/layout.py to match the backend names. This keeps layout tests separate.

Comment on lines -64 to -68
else:
raise ValueError(
"Invalid 'name' attribute ('{name}') in backend class '{cls}'".format(
name=backend_cls.name, cls=backend_cls
)
Contributor Author

Unrelated: no else needed after return in if

@@ -51,22 +52,21 @@ def _compute_domain_origin_cache_key(

@dataclass
class ArgsInfo:
device: str
Contributor Author

Never used, so I removed it

@@ -11,7 +11,7 @@
from . import cartesian
from .cartesian import layout
from .cartesian.interface import empty, from_array, full, ones, zeros
from .cartesian.layout import from_name, register
Contributor Author

Layout registration is now private. Not sure if this is the idea. Happy to revert and register layouts in the base backend instead.

"layout_map": lambda axes: tuple(i for i in range(len(axes))),
"is_optimal_layout": lambda *_: True,
}
register("naive_cpu", NaiveCPULayout)
_register("numpy", NaiveCPULayout)
Contributor Author

Renamed to match the backend names. This seems fragile ... on the other hand, this way we are able to test layouts independently in tests/storage_tests/unit_tests/test_layout.py.

Contributor

Isn't layout yelling to be part of the Backend class, with the name and everything else? Probably not in the scope of this PR

Contributor

I haven't looked at this part of the cartesian codebase in years, but I think the root of the problem is that there is no actual separation in the code between storage layouts and backends, although they are orthogonal. The layout definition inside a backend encodes the optimal layout for that particular backend, but most backends will work with other layouts anyway, and several backends can share the same optimal layout, so they are clearly different concepts.

The registry in this file was probably meant to encode the mapping between layout names and their specifications; it would then make sense for the backend object to contain only the layout name, to be looked up in this registry. But since the backend object contains the whole LayoutInfo instance, this registry doesn't make any sense and should be deleted, since the backend registry is enough to recover the same information:

backend.REGISTRY[backend_name].storage_info == layout.REGISTRY[backend_name]

So I would either add a new indirection layer by storing only the layout name in the backend, or just remove the layout registry and use the backend registry for everything (at this point I think it only makes sense to go with the second option, to simplify the codebase and delete redundant code).
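The second option could be sketched as follows (all names here are simplified stand-ins, not the actual gt4py classes): layout lookups are answered from the backend registry alone, so there is a single source of truth.

```python
from typing import Any, Dict

# Stand-in for backend.REGISTRY: maps backend name -> backend object.
BACKEND_REGISTRY: Dict[str, Any] = {}


class FakeBackend:
    """Minimal backend stand-in carrying its LayoutInfo-like dict."""

    def __init__(self, name: str, storage_info: Dict[str, Any]) -> None:
        self.name = name
        self.storage_info = storage_info
        BACKEND_REGISTRY[name] = self


def layout_from_name(name: str) -> Dict[str, Any]:
    # Replacement for layout.REGISTRY lookups: one registry, one truth.
    return BACKEND_REGISTRY[name].storage_info


numpy_backend = FakeBackend("numpy", {"alignment": 1, "device": "cpu"})
```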

@romanc romanc changed the title refactor[cartesian]: use DeviceType from core_defn refactor[cartesian]: use DeviceType from core_defn in LayoutInfo Feb 14, 2025
@romanc romanc requested review from havogt, FlorianDeconinck and egparedes and removed request for havogt and FlorianDeconinck February 14, 2025 19:43
@romanc
Contributor Author

romanc commented Feb 17, 2025

Digging a bit more, I found that the only remaining reason for storage to depend on cartesian is GT4PY_USE_HIP (see PR #1278), which was added in gt4py/cartesian/config.py instead of next to the allocators in the storage package. If we are doing this, it might be worth getting rid of that dependency cycle too.

To me, it feels like this AMD/NVIDIA device handling should all be centralized and re-used in next and cartesian. Like, the CUPY_DEVICE should live in _core and a potential GT4PY_USE_HIP override too.

Question for @stubbiali: Do you guys actually use the GT4PY_USE_HIP env variable (when running on LUMI)?

@stubbiali
Contributor

IIRC, we introduced GT4PY_USE_HIP following cupy's CUPY_INSTALL_USE_HIP (see here), but the automatic detection using cupy.cuda.get_hipcc_path() has been working well, so there's actually no reason to set GT4PY_USE_HIP via an env variable.

Contributor

@FlorianDeconinck FlorianDeconinck left a comment

Looking good. Tests failing; some musings about where layout metadata should live.



def _get_backends_with_storage_info(storage_info_kind: str):
def _filter_backends(filter: Callable[[Backend], bool]) -> List[str]:
Contributor

Let's make sure the tests generated are the same as before. pytest --collect-only will collect all combinations of tests without running them.



Contributor

@egparedes egparedes left a comment

Just a question and my two cents about the design of the backend-layout relationship.

PS: I'm sorry for the late reply, I took a long vacation earlier this month.

@@ -84,7 +82,7 @@ class Backend(abc.ABC):
#: Backend-specific storage parametrization:
#:
#: - "alignment": in bytes
#: - "device": "cpu" | "gpu"
#: - "device": core_defs.DeviceType | None
Contributor

Just a question: why is None allowed here? It shouldn't be allowed if the old documentation is correct, right?


register("gpu", GPULayout)
_register("gt:gpu", GPULayout)

DaceCPULayout: Final[LayoutInfo] = {
Contributor

The dace layout is useless if the dace backend is not installed...

4 participants