-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support #6916
base: master
Are you sure you want to change the base?
Conversation
this commit is running on sunspot
|
i should have not committed to here turning off compose and mam, will remove once this PR is finalized. |
Looks good. A couple of points:
|
if (Kokkos_ENABLE_SYCL) | ||
#enable_language(SYCL) | ||
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE) | ||
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)
@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "") | |||
set(BUILD_HOMME_PESE OFF CACHE BOOL "") | |||
set(BUILD_HOMME_SWIM OFF CACHE BOOL "") | |||
set(BUILD_HOMME_PRIM OFF CACHE BOOL "") | |||
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | |||
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | |||
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting here this is what needs to be removed from this PR.
|
||
#<<<<<<< HEAD | ||
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib) | ||
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.
i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch. this is another issue to sort out (but i would not know how). |
Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383 @bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM? If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch. More generally, testing across machines will help to flush out issues, if there are any, including any with ekat. |
there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks. |
As an example, here is one commit I'm either confused about or is truly missing from your branch: E3SM-Project/EKAT@5804bfa. This commit was merged in late Nov 2024, which is why I'm thinking there's a fair bit missing. Edit: I think EKAT PR 347 was the last merged into your branch. PRs 349-356 are missing. (357-359 are in EKAT master but not used yet in E3SM, and 348 was closed without merging.) Thus, what I wrote above is probably the correct solution: Rebase your branch on EKAT master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I have a couple of questions/remarks, but nothing too important. I'm going to approve, modulo the fix of EKAT submodule that you are already working on.
string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2") | ||
string(APPEND CMAKE_Fortran_FLAGS_DEBUG " -O0 -g -fpe0") | ||
|
||
#adding -g here leads to linker internal errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yet I see -g
below. Is this comment outdated, and link is fine now? If so, we should remove it to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"just -g" is in debug flags only, and i haven't tried to build debug. so i only changed release flags.
i was able to compile and to run with "just -g" on sunspot, however, the folder size for e3sm build gets from say 4 gb to 140 gb with full -g. last i tried "just -g" on aurora, there were linking issues. so this is all WIP. i'd say keep this as is now and sort it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the comment is confusing: it makes it sound like we should not use -g
b/c of link errors, but we do use it across the board. Maybe the comment needs to be clarified. Something like "WARNING: the -g flag seems to have a wild behavior sometimes, with side effects ranging from VERY large binaries to link errors. If you experience errors, remove it".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done
@@ -426,6 +426,10 @@ do_remap_fwd() | |||
const int team_size = std::min(256, std::min(128*m_num_phys_cols,32*(concurrency/this->m_num_fields+31)/32)); | |||
#endif | |||
|
|||
#ifdef KOKKOS_ENABLE_SYCL | |||
const int team_size = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzled by this. Do we really want just 4 on sycl? It's staggeringly different from other GPU platforms' default...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change is from Nov 2023. i looked at my notes and did not see why it was needed. i will try to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the P-D remapper (not the GllFvPhys one), so its relevance is somewhat limited (doesn't run at runtime, usually). So not a huge deal. But if things work without, no point in limiting team size to 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used hip TS calculation and it worked at least for ne4 run.
#<<<<<<< HEAD | ||
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib) | ||
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share) | ||
#======= | ||
find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While at it, do you mind adding ${NetCDF_C_PATH}/lib64
to the hints? Since lib64
is quite common, we should be proactive and add it to the list of hints (e..g, my laptop and workstation use lib64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done
Looks like you need to merge ekat master into your branch one more time to get the macro name fix. |
ran this commit with kokkos 4.5.01 tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Note that SCREAM nightlies run on master, not next, so I always run e3sm_scream_v1_medres on Frontier, Chrysalis, and pm-gpu before merging a complex PR to avoid having to fix master later. I recommend it to others, as well.
I added Conrad to the reviewers since he's working on upgrading Kokkos for multiple reasons, including Aurora readiness. |
CI is failing with a checkout on the ekat submodule.
Looking at the branch, https://github.com/E3SM-Project/EKAT/commits/oksanaguba/spot/, it's unclear where 1eae1a is coming from. @oksanaguba I suggest checking the submodule state in this PR w.r.t. to your ekat branch. |
@ambrad my ekat branch had to use kokkos main repo. if CI somehow does not know to add a remote (the default remote in ekat is poinitng to e3sm kokkos (whatever it is)), then it would not be able to fetch my commit. |
I'm sure I'm confused, but on your ekat branch I see this as the Kokkos submodule: https://github.com/E3SM-Project/kokkos/tree/e6e5c4598d16d756db62225ab7f937ee833bd660. This appears to be in E3SM-Project/kokkos. |
Re: commit 1eae1a, I don't see where it's coming from. I don't see it in your EKAT spot branch, but I could easily be mistaken. |
Re: Kokkos: Ok, I'm guessing what's happening is you have your local submodule pointed to the Kokkos repo with this commit, but, as you wrote, the submodule metadata points to E3SM-Project/Kokkos. Generally, we have to put some extra commits on top of a Kokkos version, which is why E3SM-Project/Kokkos exists. Thus, I suggest you start by creating a branch in E3SM-Project/Kokkos with the desired Kokkos state and then use this in the submodule. That will let testing proceed. |
i'll ask Conrad as he prob already has a branch |
i pointed the branch to an e3sm kokkos branch, but haven't had a chance to see if CI is happier. also, will do more testing tomorrow. |
Previously, there were two. Now, with two additional template parameters, there are eight. This is to try to work around an Nvidia internal compiler error on one platform.
d4e4894
to
4af0429
Compare
@oksanaguba In order to get the CI running and verify things work, I rebased on master (the conflict prevented CI from running). The only real conflict was the rrtmgp one (the workflow file I knew how to handle): in there, there were two ways to specify the netcdf C path. I stuck with the version from master. If that's not what you think is needed, I also pushed a branch |
4af0429
to
0c944e5
Compare
0c944e5
to
1e0d4a3
Compare
Note: CI tests are failing since we did not regen baselines once Jim's PR got merged. I triggered the generate job now, but it will be a while until it gets through the queue. We can simply re-run the failed job later to check if this PR is bfb or not. |
Ok, some good news and bad news. The CUDA dbg CI job is finally getting past the bld phase, and running tests. The bad news is that, despite the rebase on master, we get some errors. E.g., for
I will have to investigate hopping on blake on monday. The interesting thing is that the error shows up during finalization (as you can see from the fact that the unit test completed 100% of the iterations). The release build only has the baseline_cmp fails, which should go away if we retest after the baseline-gen job runs. |
I think I know the cause. It's some new nullify routines I created to handle Kokkos's new (>= 4.3, I think) way of dealing with view-of-view initi/finalization. I'll push a fix. |
No description provided.