
refactor[cartesian]: use DeviceType from core_defn in LayoutInfo #1867

Draft · wants to merge 2 commits into base: main
Conversation

@romanc romanc commented Feb 14, 2025

Description

Follow-up from PR #1843, which isolates the refactor to re-use core_defn.DeviceType instead of "cpu" | "gpu" literals in cartesian's storage_layout.

I'm leaving this as a draft since I have some questions about whether we're on the right track. I'd like to open the floor for a first round of feedback. I'd be especially happy to hear from @havogt and @egparedes, who have been with this codebase for a long time.

Once we agree on the bigger questions, I'll run the pyFV3 tests with this branch just to be sure we didn't break anything that would only show up later (in which case, we should probably add test coverage in this repo).

So yeah, this is not urgent in any way and might drag along on the side for a while ... @FlorianDeconinck, we'll see how this aligns with other priorities if it gets bigger and/or takes longer. I'd also be fine gathering all the info from the discussions and re-visiting this later, or writing a proper refactor issue for the summer. I don't expect any performance improvements from this refactor.

Requirements

  • All fixes and/or new features come with corresponding tests.
  • Important design decisions have been documented in the appropriate ADR inside the docs/development/ADRs/ folder.

@romanc romanc left a comment

This turned out to be a bit bigger than expected. Main pain point was the test generation (which needs a static way of knowing which backend/layout needs a GPU).

I left a couple of questions & remarks inline. Let's have a discussion ...

@@ -45,9 +45,9 @@
REGISTRY = gt_utils.Registry()


def from_name(name: str) -> Optional[Type[Backend]]:
Contributor Author

Unrelated: The return type is not optional. It raises an exception in case the backend is not found.
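For context, the tightened signature could look like this minimal sketch (the `REGISTRY` contents and the exception type here are illustrative, not gt4py's actual code):

```python
from typing import Dict, Type


class Backend:
    """Stand-in for gt4py's Backend base class (illustrative only)."""


# Hypothetical registry contents; gt4py uses gt_utils.Registry().
REGISTRY: Dict[str, Type[Backend]] = {"numpy": Backend}


def from_name(name: str) -> Type[Backend]:
    # Tightened return type: the function raises for unknown names,
    # so it never returns None and `Optional` is misleading.
    if name not in REGISTRY:
        raise NotImplementedError(f"Backend '{name}' is not registered.")
    return REGISTRY[name]
```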

Comment on lines +439 to +454
def _decorator(cls: Type[Backend]) -> Type[Backend]:
def _no_generate(obj) -> Type[StencilObject]:
raise NotImplementedError(
f"Disabled '{cls.name}' backend: '{message}'\n",
f"You can still enable the backend by hand using the environment variable '{enabled_env_var}=1'",
)

# Replace generate method with raise
if not hasattr(cls, "generate"):
raise ValueError(f"Coding error. Expected a generate method on {cls}")
# Flag that it got disabled for register lookup
cls.disabled = True # type: ignore
cls.generate = _no_generate # type: ignore
return cls

return _decorator
Contributor Author

Unrelated: no else needed after return inside if. Reduces the amount of indentation needed.
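As a standalone illustration of the guard-clause style (names here are made up, not from the PR):

```python
def validate_name(name: str) -> str:
    # Guard clause: raise and bail out early ...
    if not name:
        raise ValueError("empty name")
    # ... so the happy path needs no `else` and one level less indentation.
    return name.strip()
```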

Comment on lines -780 to -787
storage_info = {
"alignment": 1,
"device": "cpu",
"layout_map": gt_storage.layout.layout_maker_factory((0, 1, 2)),
"is_optimal_layout": gt_storage.layout.layout_checker_factory(
gt_storage.layout.layout_maker_factory((0, 1, 2))
),
}
Contributor Author

Moved to gt_storage.layout like all the other layouts

@@ -206,7 +207,7 @@ def hyp_wrapper(test_hyp, hypothesis_data):
)

marks = test["marks"].copy()
if gt4pyc.backend.from_name(test["backend"]).storage_info["device"] == "gpu":
if test["backend"] in GPU_BACKEND_NAMES:
Contributor Author

We come by here during test setup, for CPU and GPU tests alike. If cupy isn't detected (or there's no GPU, e.g. on GHA runners), CUPY_DEVICE defaults to None and this won't correctly trigger the requires_gpu marks.

For the test system, we thus need something independent of cupy availability to denote that certain tests require a GPU while others don't. This could also be done differently, e.g. a requires_gpu property on the base Backend class. Honestly, the longer I think about it, the more it feels like we should be able to tell from the backend. Even if there were multiple layouts per backend (which, as far as I understand, is possible), I'd expect all of them to run on either CPU or GPU (or possibly another accelerator in the future).
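The requires_gpu idea floated above could be sketched roughly like this (class and helper names are hypothetical, not gt4py's actual API):

```python
from typing import List, Type


class Backend:
    """Illustrative base class: GPU requirement is static class data."""

    name: str = ""
    requires_gpu: bool = False  # known without importing cupy or probing devices


class NumpyBackend(Backend):
    name = "numpy"


class GtGpuBackend(Backend):
    name = "gt:gpu"
    requires_gpu = True


def gpu_backend_names(backends: List[Type[Backend]]) -> List[str]:
    # Static replacement for checking storage_info["device"] == "gpu":
    # usable during test collection even on runners without a GPU.
    return [b.name for b in backends if b.requires_gpu]
```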

Contributor

Agreed. I think the earlier design split layout from backend by addressing layouts through their underlying allocator (numpy/cupy). Though it's architecturally correct, I'd argue it's a bit of over-engineering, and repeating data is not an issue. Which leads me to the fact that Backend should be the source of truth, including for requires_gpu.

Contributor

And we could extend it even further by re-routing allocation through it, etc. I'm no fan of a massive class that does it all, but in this case Backend seems underused as a single source of truth.

Contributor

Probably out of scope for this PR

Comment on lines +22 to +24
CUPY_DEVICE: Final[Literal[DeviceType.CUDA, DeviceType.ROCM] | None] = (
None if not cp else (DeviceType.ROCM if cp.cuda.get_hipcc_path() else DeviceType.CUDA)
)
Contributor Author

We could share this between next and cartesian. If so, should it go into _core?
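A shared helper in _core might look something like the sketch below. This is a self-contained illustration: `DeviceType` is redefined locally (the real one lives in core_defs), and `detect_cupy_device` is a hypothetical name; the `get_hipcc_path()` check mirrors the snippet above.

```python
import enum
from typing import Optional


class DeviceType(enum.Enum):
    # Local stand-in for core_defs.DeviceType, trimmed to the GPU flavors.
    CUDA = "cuda"
    ROCM = "rocm"


try:
    import cupy as cp
except ImportError:
    cp = None  # no cupy installed -> no GPU device


def detect_cupy_device() -> Optional[DeviceType]:
    """Return the GPU flavor cupy was built for, or None without cupy."""
    if cp is None:
        return None
    # A cupy built for AMD exposes a hipcc path; NVIDIA builds don't.
    return DeviceType.ROCM if cp.cuda.get_hipcc_path() else DeviceType.CUDA


CUPY_DEVICE: Optional[DeviceType] = detect_cupy_device()
```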

@@ -58,15 +58,13 @@ def register(backend_cls: Type[Backend]) -> Type[Backend]:
assert issubclass(backend_cls, Backend) and backend_cls.name is not None

if isinstance(backend_cls.name, str):
gt_storage.register(backend_cls.name, backend_cls.storage_info)
Contributor Author

Layouts were registered twice: once here and once in storage/cartesian/layout.py. Most layouts were registered under the same name (i.e. the backend name), but some were duplicated. I chose to remove the layout registration here and changed all the names in storage/cartesian/layout.py to match the backend names. This keeps layout tests separate.

Comment on lines -64 to -68
else:
raise ValueError(
"Invalid 'name' attribute ('{name}') in backend class '{cls}'".format(
name=backend_cls.name, cls=backend_cls
)
Contributor Author

Unrelated: no else needed after return in if

@@ -51,22 +52,21 @@ def _compute_domain_origin_cache_key(

@dataclass
class ArgsInfo:
device: str
Contributor Author

Never used, so I removed it

@@ -11,7 +11,7 @@
from . import cartesian
from .cartesian import layout
from .cartesian.interface import empty, from_array, full, ones, zeros
from .cartesian.layout import from_name, register
Contributor Author

Layout registration is now private. Not sure if this is the idea. Happy to revert and register layouts in the base backend instead.

"layout_map": lambda axes: tuple(i for i in range(len(axes))),
"is_optimal_layout": lambda *_: True,
}
register("naive_cpu", NaiveCPULayout)
_register("numpy", NaiveCPULayout)
Contributor Author

Renamed to match the backend names. This seems fragile ... on the other hand, this way we are able to test layouts independently in tests/storage_tests/unit_tests/test_layout.py.

Contributor

Isn't layout yelling to be part of the Backend class, with the name and everything else? Probably not in the scope of this PR

Contributor

I haven't looked at this part of the cartesian codebase in years, but I think the root of the problem is that there is no actual separation in the code between storage layouts and backends, although they are orthogonal. The layout definition inside a backend encodes the optimal layout for that particular backend, but most backends will work with other layouts anyway, and several backends can share the same optimal layout, so they are clearly different concepts.

The registry in this file was probably meant to encode the mapping between layout names and their specifications; it would then make sense for the backend object to contain only the layout name, to be looked up in this registry. But since the backend object contains the whole LayoutInfo instance, this registry doesn't make any sense and should be deleted, since the backend registry is enough to recover the same information:

backend.REGISTRY[backend_name].storage_info == layout.REGISTRY[backend_name]

So I would either add a new indirection layer by storing only the layout name in the backend, or just remove the layout registry and use the backend registry for everything (at this point I think it only makes sense to go with the second option, to simplify the codebase and delete redundant code).
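The second option could be sketched as follows (all names here are simplified stand-ins, not the actual gt4py classes): layout lookups are answered from the backend registry alone, so there is a single source of truth.

```python
from typing import Any, Dict

# Stand-in for backend.REGISTRY: maps backend name -> backend object.
BACKEND_REGISTRY: Dict[str, Any] = {}


class FakeBackend:
    """Minimal backend stand-in carrying its LayoutInfo-like dict."""

    def __init__(self, name: str, storage_info: Dict[str, Any]) -> None:
        self.name = name
        self.storage_info = storage_info
        BACKEND_REGISTRY[name] = self


def layout_from_name(name: str) -> Dict[str, Any]:
    # Replacement for layout.REGISTRY lookups: one registry, one truth.
    return BACKEND_REGISTRY[name].storage_info


numpy_backend = FakeBackend("numpy", {"alignment": 1, "device": "cpu"})
```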

@romanc romanc changed the title refactor[cartesian]: use DeviceType from core_defn refactor[cartesian]: use DeviceType from core_defn in LayoutInfo Feb 14, 2025
@romanc romanc requested review from havogt, FlorianDeconinck and egparedes and removed request for havogt and FlorianDeconinck February 14, 2025 19:43
@romanc
Contributor Author

romanc commented Feb 17, 2025

Digging a bit more, I found that the only remaining reason for storage to depend on cartesian is GT4PY_USE_HIP (see PR #1278), which was added in gt4py/cartesian/config.py instead of next to the allocators in the storage package. If we are doing this, it might be worth getting rid of that dependency cycle too.

To me, it feels like this AMD/NVIDIA device handling should all be centralized and re-used in next and cartesian. Like, the CUPY_DEVICE should live in _core and a potential GT4PY_USE_HIP override too.

Question for @stubbiali: Do you guys actually use the GT4PY_USE_HIP env variable (when running on LUMI)?

@stubbiali
Contributor

IIRC, we introduced GT4PY_USE_HIP following cupy's CUPY_INSTALL_USE_HIP (see here), but the automatic detection using cupy.cuda.get_hipcc_path() has been working well, so there's actually no reason to set GT4PY_USE_HIP via an env variable.

Contributor

@FlorianDeconinck FlorianDeconinck left a comment

Looking good. Tests failing; some musings about where layout metadata should live.



def _get_backends_with_storage_info(storage_info_kind: str):
def _filter_backends(filter: Callable[[Backend], bool]) -> List[str]:
Contributor

Let's make sure the tests generated are the same as before. pytest --collect-only will collect all combinations of tests without running them.



Contributor

@egparedes egparedes left a comment

Just a question and my two cents about the design of the backend-layout relationship.

PS: I'm sorry for the late reply, I took a long vacation earlier this month.

@@ -84,7 +82,7 @@ class Backend(abc.ABC):
#: Backend-specific storage parametrization:
#:
#: - "alignment": in bytes
#: - "device": "cpu" | "gpu"
#: - "device": core_defs.DeviceType | None
Contributor

Just a question: why is None allowed here? It shouldn't be allowed if the old documentation is correct, right?


register("gpu", GPULayout)
_register("gt:gpu", GPULayout)

DaceCPULayout: Final[LayoutInfo] = {
Contributor

The dace layout is useless if the dace backend is not installed...

4 participants