Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support #6916

Draft
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

oksanaguba
Copy link
Contributor

No description provided.

@oksanaguba oksanaguba mentioned this pull request Jan 19, 2025
@oksanaguba
Copy link
Contributor Author

this commit is running on sunspot

~/ess/v1-jan19/run/atm.log.11365537.amn-0001.250119-204314 

@oksanaguba
Copy link
Contributor Author

oksanaguba commented Jan 19, 2025

i should have not committed to here turning off compose and mam, will remove once this PR is finalized.

@ambrad ambrad requested a review from bartgol January 19, 2025 21:29
@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

Looks good. A couple of points:

  1. In addition to any usual pre-merge testing, I strongly recommend running e3sm_scream_v1_medres with this PR on Frontier, pm-gpu, and Chrysalis, as well as homme_integration on Chrysalis.
  2. I'm thinking the EKAT update isn't right. No, it's fine. I looked more closely. I'm seeing some commits from Luca that diff, but I suspect that's intended. Edit 2: Ok, maybe my initial analysis was right: there seems to be some unintended diffs in ekat.

if (Kokkos_ENABLE_SYCL)
#enable_language(SYCL)
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE)
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)

@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "")
set(BUILD_HOMME_PESE OFF CACHE BOOL "")
set(BUILD_HOMME_SWIM OFF CACHE BOOL "")
set(BUILD_HOMME_PRIM OFF CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting here this is what needs to be removed from this PR.


#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.

@oksanaguba
Copy link
Contributor Author

i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch.

this is another issue to sort out (but i would not know how).

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

i id not understant EKAT comment

Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383

@bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM?

If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch.

More generally, testing across machines will help to flush out issues, if there are any, including any with ekat.

@oksanaguba
Copy link
Contributor Author

there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks.

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

As an example, here is one commit I'm either confused about or is truly missing from your branch: E3SM-Project/EKAT@5804bfa. This commit was merged in late Nov 2024, which is why I'm thinking there's a fair bit missing.

Edit: I think EKAT PR 347 was the last merged into your branch. PRs 349-356 are missing. (357-359 are in EKAT master but not used yet in E3SM, and 348 was closed without merging.)

Thus, what I wrote above is probably the correct solution: Rebase your branch on EKAT master.

Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have a couple of questions/remarks, but nothing too important. I'm going to approve, modulo the fix of EKAT submodule that you are already working on.

string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2")
string(APPEND CMAKE_Fortran_FLAGS_DEBUG " -O0 -g -fpe0")

#adding -g here leads to linker internal errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yet I see -g below. Is this comment outdated, and link is fine now? If so, we should remove it to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"just -g" is in debug flags only, and i haven't tried to build debug. so i only changed release flags.

i was able to compile and to run with "just -g" on sunspot, however, the folder size for e3sm build gets from say 4 gb to 140 gb with full -g. last i tried "just -g" on aurora, there were linking issues. so this is all WIP. i'd say keep this as is now and sort it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the comment is confusing: it makes it sound like we should not use -g b/c of link errors, but we do use it across the board. Maybe the comment needs to be clarified. Something like "WARNING: the -g flag seems to have a wild behavior sometimes, with side effects ranging from VERY large binaries to link errors. If you experience errors, remove it".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

@@ -426,6 +426,10 @@ do_remap_fwd()
const int team_size = std::min(256, std::min(128*m_num_phys_cols,32*(concurrency/this->m_num_fields+31)/32));
#endif

#ifdef KOKKOS_ENABLE_SYCL
const int team_size = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzled by this. Do we really want just 4 on sycl? It's staggeringly different from other GPU platforms' default...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change is from Nov 2023. i looked at my notes and did not see why it was needed. i will try to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the P-D remapper (not the GllFvPhys one), so its relevance is somewhat limited (doesn't run at runtime, usually). So not a huge deal. But if things work without, no point in limiting team size to 4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used hip TS calculation and it worked at least for ne4 run.

#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
#=======
find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While at it, do you mind adding ${NetCDF_C_PATH}/lib64 to the hints? Since lib64 is quite common, we should be proactive and add it to the list of hints (e..g, my laptop and workstation use lib64)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

@ambrad
Copy link
Member

ambrad commented Jan 20, 2025

Looks like you need to merge ekat master into your branch one more time to get the macro name fix.

@oksanaguba
Copy link
Contributor Author

ran this commit with kokkos 4.5.01 tag tail ~/ess/v1-jan21/run/atm.log.11682723.amn-0001.250121-164543 on sunspot.

Copy link
Member

@ambrad ambrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

Note that SCREAM nightlies run on master, not next, so I always run e3sm_scream_v1_medres on Frontier, Chrysalis, and pm-gpu before merging a complex PR to avoid having to fix master later. I recommend it to others, as well.

@ambrad ambrad requested a review from tcclevenger January 21, 2025 18:51
@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

I added Conrad to the reviewers since he's working on upgrading Kokkos for multiple reasons, including Aurora readiness.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

CI is failing with a checkout on the ekat submodule.

  Error: fatal: remote error: upload-pack: not our ref 1eae1ad5a01cdb510726624ac647e7f96dd883d0
  Error: fatal: Fetched in submodule path 'externals/ekat', but it did not contain 1eae1ad5a01cdb510726624ac647e7f96dd883d0. Direct fetching of that commit failed.

Looking at the branch, https://github.com/E3SM-Project/EKAT/commits/oksanaguba/spot/, it's unclear where 1eae1a is coming from. @oksanaguba I suggest checking the submodule state in this PR w.r.t. to your ekat branch.

@oksanaguba
Copy link
Contributor Author

@ambrad my ekat branch had to use kokkos main repo. if CI somehow does not know to add a remote (the default remote in ekat is poinitng to e3sm kokkos (whatever it is)), then it would not be able to fetch my commit.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

my ekat branch had to use kokkos main repo.

I'm sure I'm confused, but on your ekat branch I see this as the Kokkos submodule: https://github.com/E3SM-Project/kokkos/tree/e6e5c4598d16d756db62225ab7f937ee833bd660. This appears to be in E3SM-Project/kokkos.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

Re: commit 1eae1a, I don't see where it's coming from. I don't see it in your EKAT spot branch, but I could easily be mistaken.

@ambrad
Copy link
Member

ambrad commented Jan 21, 2025

Re: Kokkos: Ok, I'm guessing what's happening is you have your local submodule pointed to the Kokkos repo with this commit, but, as you wrote, the submodule metadata points to E3SM-Project/Kokkos. Generally, we have to put some extra commits on top of a Kokkos version, which is why E3SM-Project/Kokkos exists. Thus, I suggest you start by creating a branch in E3SM-Project/Kokkos with the desired Kokkos state and then use this in the submodule. That will let testing proceed.

@oksanaguba
Copy link
Contributor Author

i'll ask Conrad as he prob already has a branch

@oksanaguba
Copy link
Contributor Author

i pointed the branch to an e3sm kokkos branch, but haven't had a chance to see if CI is happier. also, will do more testing tomorrow.

@bartgol bartgol force-pushed the oksanaguba/eamxx/spot-clean branch from d4e4894 to 4af0429 Compare February 15, 2025 00:54
@bartgol
Copy link
Contributor

bartgol commented Feb 15, 2025

@oksanaguba In order to get the CI running and verify things work, I rebased on master (the conflict prevented CI from running). The only real conflict was the rrtmgp one (the workflow file I knew how to handle): in there, there were two ways to specify the netcdf C path. I stuck with the version from master. If that's not what you think is needed, I also pushed a branch oksanaguba/eamxx/spot-clean-bartgol-bkp, so you can revert to the state before the rebase.

@bartgol bartgol force-pushed the oksanaguba/eamxx/spot-clean branch from 4af0429 to 0c944e5 Compare February 15, 2025 01:08
@bartgol bartgol force-pushed the oksanaguba/eamxx/spot-clean branch from 0c944e5 to 1e0d4a3 Compare February 15, 2025 01:09
@bartgol
Copy link
Contributor

bartgol commented Feb 15, 2025

Note: CI tests are failing since we did not regen baselines once Jim's PR got merged. I triggered the generate job now, but it will be a while until it gets through the queue. We can simply re-run the failed job later to check if this PR is bfb or not.

@bartgol
Copy link
Contributor

bartgol commented Feb 15, 2025

Ok, some good news and bad news. The CUDA dbg CI job is finally getting past the bld phase, and running tests. The bad news is that, despite the rebase on master, we get some errors. E.g., for homme_shoc_cld_p3_rrtmgp_pg2:

  - Iteration   4 completed       [100%]
[EAMxx] Finalize ...
Constructor for Kokkos::View 'UNMANAGED' has mismatched number of arguments. The number of arguments = 0 neither matches the dynamic rank = 1 nor the total rank = 1

I will have to investigate hopping on blake on monday. The interesting thing is that the error shows up during finalization (as you can see from the fact that the unit test completed 100% of the iterations).

The release build only has the baseline_cmp fails, which should go away if we retest after the baseline-gen job runs.

@ambrad
Copy link
Member

ambrad commented Feb 15, 2025

The interesting thing is that the error shows up during finalization (as you can see from the fact that the unit test completed 100% of the iterations).

I think I know the cause. It's some new nullify routines I created to handle Kokkos's new (>= 4.3, I think) way of dealing with view-of-view initi/finalization. I'll push a fix.

@ambrad ambrad changed the title [wip] aurora config, clean up [wip] Aurora changes with required Kokkos update Feb 15, 2025
@ambrad ambrad changed the title [wip] Aurora changes with required Kokkos update [wip] Aurora changes, required Kokkos update, fixes and workarounds for Kokkos update Feb 15, 2025
@ambrad ambrad changed the title [wip] Aurora changes, required Kokkos update, fixes and workarounds for Kokkos update [wip] EAMxx: Aurora support, required Kokkos update, fixes and workarounds for Kokkos update Feb 15, 2025
@ambrad ambrad changed the title [wip] EAMxx: Aurora support, required Kokkos update, fixes and workarounds for Kokkos update EAMxx: Aurora support, required Kokkos update, fixes and workarounds for Kokkos update Feb 15, 2025
@ambrad ambrad added EAMxx PRs focused on capabilities for EAMxx and removed wip work in progress labels Feb 20, 2025
@mahf708 mahf708 changed the title EAMxx: Aurora support, required Kokkos update, fixes and workarounds for Kokkos update EAMxx: Kokkos 4.5 update, fixes and workarounds for Kokkos update, Aurora support Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: workflow change approved Allow testing on ghci-snl-* machines of PRs that alter a worfklow file EAMxx PRs focused on capabilities for EAMxx Machine Files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants