refactor[cartesian]: use DeviceType from core_defn in LayoutInfo #1867
base: main
Conversation
This turned out to be a bit bigger than expected. Main pain point was the test generation (which needs a static way of knowing which backend/layout needs a GPU).
I left a couple of questions & remarks inline. Let's have a discussion ...
@@ -45,9 +45,9 @@
REGISTRY = gt_utils.Registry()


def from_name(name: str) -> Optional[Type[Backend]]:
Unrelated: The return type is not optional. It raises an exception in case the backend is not found.
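A minimal sketch of what the suggested signature change could look like (registry and error type are illustrative, not the actual gt4py implementation): since the lookup raises on an unknown name, the annotation can drop the `Optional`.

```python
# Hypothetical sketch: from_name() raises instead of returning None,
# so its return type need not be Optional. REGISTRY is a stand-in dict.
from typing import Dict, Type


class Backend:  # simplified stand-in for the real Backend base class
    name = "numpy"


REGISTRY: Dict[str, Type[Backend]] = {"numpy": Backend}


def from_name(name: str) -> Type[Backend]:
    """Look up a backend class; raises for unknown names."""
    try:
        return REGISTRY[name]
    except KeyError:
        raise NotImplementedError(f"Backend '{name}' is not registered.") from None
```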
def _decorator(cls: Type[Backend]) -> Type[Backend]:
    def _no_generate(obj) -> Type[StencilObject]:
        raise NotImplementedError(
            f"Disabled '{cls.name}' backend: '{message}'\n",
            f"You can still enable the backend by hand using the environment variable '{enabled_env_var}=1'",
        )

    # Replace generate method with raise
    if not hasattr(cls, "generate"):
        raise ValueError(f"Coding error. Expected a generate method on {cls}")
    # Flag that it got disabled for register lookup
    cls.disabled = True  # type: ignore
    cls.generate = _no_generate  # type: ignore
    return cls

return _decorator
Unrelated: no `else` needed after `return` inside `if`. Reduces the amount of indentation needed.
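To illustrate the pattern the remark refers to (generic toy example, not the PR's code):

```python
# Before: an else after return adds an unnecessary indentation level.
def classify(x: int) -> str:
    if x < 0:
        return "negative"
    else:
        return "non-negative"


# After: the early return makes the else redundant.
def classify_flat(x: int) -> str:
    if x < 0:
        return "negative"
    return "non-negative"
```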
storage_info = {
    "alignment": 1,
    "device": "cpu",
    "layout_map": gt_storage.layout.layout_maker_factory((0, 1, 2)),
    "is_optimal_layout": gt_storage.layout.layout_checker_factory(
        gt_storage.layout.layout_maker_factory((0, 1, 2))
    ),
}
Moved to `gt_storage.layout` like all the other layouts.
@@ -206,7 +207,7 @@ def hyp_wrapper(test_hyp, hypothesis_data):
    )

    marks = test["marks"].copy()
-   if gt4pyc.backend.from_name(test["backend"]).storage_info["device"] == "gpu":
+   if test["backend"] in GPU_BACKEND_NAMES:
We come by here during test setup. This happens for CPU/GPU tests alike. In case `cupy` isn't automatically detected (or there's no GPU, e.g. on GHA runners), `CUPY_DEVICE` will not be defined and defaults to `None`, so this won't correctly trigger the `requires_gpu` marks.
For the test system, we thus need something that is independent of cupy availability to denote that certain tests require GPU while others don't. This could also be done differently, e.g. a `requires_gpu` property on the base `Backend` class. Honestly, the longer I think about it, the more it feels to me that we should be able to tell from the backend. If there were multiple layouts per backend (as far as I understand this is possible), I'd expect all of them to either run on CPU or GPU (or possibly another accelerator in the future).
Agreed. I think the earlier design was to split layout from backend by addressing them by their underlying allocator (numpy/cupy). Though it's correct architecturally, I would argue it's a bit of over-engineering, and repeating data is not an issue. Which leads me to the fact that `Backend` should be the source of truth, including for `requires_gpu`.
And we could extend it even further by re-routing allocation through it, etc. I am no fan of a massive class that does it all, but in this case `Backend` seems underused as a single source of truth.
Probably out of scope for this PR
CUPY_DEVICE: Final[Literal[DeviceType.CUDA, DeviceType.ROCM] | None] = (
    None if not cp else (DeviceType.ROCM if cp.cuda.get_hipcc_path() else DeviceType.CUDA)
)
We could share this between next and cartesian. If so, should it go into `_core`?
@@ -58,15 +58,13 @@ def register(backend_cls: Type[Backend]) -> Type[Backend]:
assert issubclass(backend_cls, Backend) and backend_cls.name is not None

if isinstance(backend_cls.name, str):
    gt_storage.register(backend_cls.name, backend_cls.storage_info)
Layouts were registered twice: once here and once in `storage/cartesian/layout.py`. Most layouts were registered under the same name (i.e. the backend name), however some were duplicated. I chose to remove the layout registration here and changed all the names in `storage/cartesian/layout.py` to match backend names. This keeps layout tests separate.
else:
    raise ValueError(
        "Invalid 'name' attribute ('{name}') in backend class '{cls}'".format(
            name=backend_cls.name, cls=backend_cls
        )
    )
Unrelated: no `else` needed after `return` in `if`.
@@ -51,22 +52,21 @@ def _compute_domain_origin_cache_key(

@dataclass
class ArgsInfo:
    device: str
Never used, so I removed it.
@@ -11,7 +11,7 @@
from . import cartesian
from .cartesian import layout
from .cartesian.interface import empty, from_array, full, ones, zeros
from .cartesian.layout import from_name, register
Layout registration is now private. Not sure if this is the idea. Happy to revert and register layouts in the base backend instead.
    "layout_map": lambda axes: tuple(i for i in range(len(axes))),
    "is_optimal_layout": lambda *_: True,
}
-register("naive_cpu", NaiveCPULayout)
+_register("numpy", NaiveCPULayout)
Renamed to match backend names. This seems fragile... on the other hand, like this we are able to test layouts independently in `tests/storage_tests/unit_tests/test_layout.py`.
Isn't layout yelling to be part of the `Backend` class, with the name and everything else? Probably not in the scope of this PR.
I haven't looked at this part of the cartesian codebase in years but I think the root of the problem is that there is no actual separation in the code between storage layouts and the backend, although they are orthogonal. I mean, the layout definition inside a backend encodes the optimal layout for that particular backend, but most of the backends will work anyway with other layouts, and several backends can have the same optimal layout, so they are clearly different concepts.
The registry in this file was probably meant to encode the mapping between layout names and their specifications, and it would then make sense that the backend object contained the layout name, which should be looked up in this registry. But since the backend object contains the whole `LayoutInfo` instance, this registry doesn't make any sense and should be deleted, since the backend registry is enough to recover the same information:
backend.REGISTRY[backend_name].storage_info == layout.REGISTRY[backend_name]
So I would either add a new indirection layer by storing only the layout name in the backend or just remove the layout registry and only use the backend registry for everything (at this point I think it only makes sense to go with the second option to make the codebase simpler and delete redundant code).
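The second option described above (backend registry as the only source of truth) could be modeled with this toy sketch; registry, class, and function names are illustrative, not the real gt4py API:

```python
# Toy model: resolve storage_info through the backend registry only,
# with no separate layout registry to keep in sync.
from typing import Any, Dict, Type

BACKEND_REGISTRY: Dict[str, Any] = {}


class Backend:  # simplified stand-in for the real Backend class
    name = "numpy"
    storage_info = {"alignment": 1, "device": "cpu"}


BACKEND_REGISTRY[Backend.name] = Backend


def layout_from_backend_name(name: str) -> Dict[str, Any]:
    # Single source of truth: no layout.REGISTRY lookup needed.
    return BACKEND_REGISTRY[name].storage_info
```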
Digging a bit more, I found that the only reason (left) for `DeviceType` from `core_defn` in `LayoutInfo`... To me, it feels like this AMD/NVIDIA device handling should all be centralized and re-used in next and cartesian. Like, the ... Question for @stubbiali: Do you guys actually use the ...
IIRC, we introduced ...
Looking good. Tests failing, some musings about where layout metadata should live.
-def _get_backends_with_storage_info(storage_info_kind: str):
+def _filter_backends(filter: Callable[[Backend], bool]) -> List[str]:
Let's make sure the tests generated are the same as before. `pytest --collect-only` will collect all combinations of tests without running them.
Just a question and my two cents about the design of the backend-layout relationship.
PS: I'm sorry for the late reply, I took a long vacation earlier this month.
@@ -84,7 +82,7 @@ class Backend(abc.ABC):
#: Backend-specific storage parametrization:
#:
#: - "alignment": in bytes
-#: - "device": "cpu" | "gpu"
+#: - "device": core_defs.DeviceType | None
Just a question: why is `None` allowed here? It shouldn't be allowed if the old documentation is correct, right?
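If `None` were indeed disallowed, the parametrization could be typed along these lines. This is a hedged sketch: the `DeviceType` enum and `LayoutInfo` TypedDict below are simplified stand-ins for the real `core_defs.DeviceType` and gt4py's `LayoutInfo`.

```python
# Sketch: storage_info with a non-optional device field, mirroring the
# old "cpu" | "gpu" documentation. Types are illustrative stand-ins.
import enum
from typing import Callable, Tuple, TypedDict


class DeviceType(enum.Enum):
    CPU = "cpu"
    CUDA = "cuda"
    ROCM = "rocm"


class LayoutInfo(TypedDict):
    alignment: int  # in bytes
    device: DeviceType  # non-optional by construction
    layout_map: Callable[..., Tuple[int, ...]]
    is_optimal_layout: Callable[..., bool]


naive_cpu: LayoutInfo = {
    "alignment": 1,
    "device": DeviceType.CPU,
    "layout_map": lambda axes: tuple(range(len(axes))),
    "is_optimal_layout": lambda *_: True,
}
```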
-register("gpu", GPULayout)
+_register("gt:gpu", GPULayout)

DaceCPULayout: Final[LayoutInfo] = {
The dace layout is useless if the dace backend is not installed...
Description
Follow-up from PR #1843, which isolates the refactor to re-use `core_defn.DeviceType` instead of `"cpu" | "gpu"` literals in cartesian's storage_layout.
I leave this as draft since I have some questions about whether or not we are on the right track. I'd like to open the floor for a first round of feedback. I'd be especially happy to get feedback from @havogt and @egparedes, who have been with this codebase for a long time.
Once we agree on the bigger questions, I'll go and run pyFV3 tests with that branch just to be sure that we didn't mess up anything that would only show up later (in which case, we should probably add some test coverage in this repo).
So yeah, this is not urgent in any way and might drag on on the side for a while... @FlorianDeconinck, we'll see how this aligns with other priorities if this gets bigger and/or takes longer. I'd also be fine to have all the discussions / gather all the info and then re-visit this later, or write a proper refactor issue for summer. I don't expect any perf improvements from this refactor.
Requirements