Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pass KernelOps into SharedMemoryCache constructor #1531

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

jcosborn
Copy link
Contributor

@jcosborn jcosborn commented Jan 4, 2025

This will allow the SYCL target to allocate shared memory at kernel launch and pass the pointer in through the KernelOps structure.

@jcosborn jcosborn requested a review from a team as a code owner January 4, 2025 23:00
Copy link
Member

@maddyscientist maddyscientist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments on the PR, but the main issue I have is with
1.) The lack of documentation regarding using KernelOps. E.g., how does someone who is writing a new kernel that uses shared memory know what to do. We need a extensive wiki page describing this.
2.) How KernelOps will be tested in CI without any SYCL target. How can we ensure that KernelOps are developed correctly?

We can discuss in our call.

include/gauge_fix_ovr_hit_devf.cuh Outdated Show resolved Hide resolved
@@ -303,7 +304,7 @@ namespace quda {
#pragma unroll
for (int s = 0; s < uvSpin; s++) {
if constexpr (Arg::compute_max) {
uv_max = fmax(UV[s].abs_max(), uv_max);
uv_max = max(UV[s].abs_max(), uv_max);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file uses a mixed of quda::max and max. Can you make it consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quda::max was needed in some places to avoid conflict with local variables named max.
I've renamed the local variables and replaced quda::max with just max.
768f37b

@@ -205,16 +221,19 @@ namespace quda {
int idx1 = (((x[3] * X[2] + x[2]) * X[1] + x[1]) * X[0] + x[0]) >> 1;
Link link1 = arg.u(mu, idx1, 1 - parity);

switch (Arg::type) {
if constexpr (Arg::type == 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here from switch to if ? The former is more logical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be constexpr now since the different routines being called in each case have different variants of the SharedMemoryCache. This corresponding KernelOps are selected here https://github.com/lattice/quda/pull/1531/files#diff-aa538a7e33ac26c68858653fe4c25222aae7dc96f95459c62bf86ea5acd254b4R135 at compile time. Due to the compile-time consistency checking of the SharedMemoryCache instantiation, we can't compile the branches that don't have the correct KernelOps even if they are never called.

SharedMemoryCache<Float> cache;
auto elems = cache.data();
SharedMemoryCache<array<Float, 4>> cache(ftor);
Float *elems = &(*cache.data())[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not liking this change. Why is this pathology necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there was no need to use an inner array. I've made it similar to the others now.
99632a5

include/kernels/block_transpose.cuh Show resolved Hide resolved
@jcosborn
Copy link
Contributor Author

I've left some comments on the PR, but the main issue I have is with 1.) The lack of documentation regarding using KernelOps. E.g., how does someone who is writing a new kernel that uses shared memory know what to do. We need a extensive wiki page describing this. 2.) How KernelOps will be tested in CI without any SYCL target. How can we ensure that KernelOps are developed correctly?

We can discuss in our call.

I've added some documentation here
https://github.com/lattice/quda/wiki/KernelOps-usage
It will need to be updated as more pieces are merged in from the SYCL branch.
Let me know if anything isn't clear so far.

As we discussed on the call, the KernelOps handling should be fairly robust (once fully merged), so that instantiating tracked operations will require passing in a kernel functor that inherited from the appropriate KernelOps type. This type checking happens at compile time.

@maddyscientist
Copy link
Member

maddyscientist commented Jan 16, 2025

I've added some documentation here https://github.com/lattice/quda/wiki/KernelOps-usage
It will need to be updated as more pieces are merged in from the SYCL branch. Let me know if anything isn't clear so far.

Documentation is exactly what I wanted, thanks @jcosborn.

@weinbe2
Copy link
Contributor

weinbe2 commented Jan 30, 2025

This design looks good to me given the requirements, thanks @jcosborn. The wiki page is greatly appreciated.

I don't "like" the decltype(*this) that needed to be used in one of the coarse-related file, but since it's a one-off in a single file I'll let it go. I understand why template deduction couldn't be used without a bunch of unnecessary function argument rearranging.

My only request is for the wiki page; I think the links to files + line-numbers should be modified to be permalinks to a specific commit as opposed to just develop. This isn't a blocking requirement for this PR; if anything it can be updated with the commit id of the merge.

@jcosborn
Copy link
Contributor Author

Thanks for the review. Making the wiki use permalinks is a good idea. Once this is merged I'll update it with permalinks to develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants