New cam_dev files cause non-simple physics compsets to fail #55

Closed
gdicker1 opened this issue Jun 7, 2024 · 3 comments
Labels: bug, workaround

gdicker1 commented Jun 7, 2024

Real-data IC and topo files added in PRs EWOrg/CAM #16 and EWOrg/CAM #20 cause run-time failures in compsets that use "full physics", especially those with chemistry and radiation interactions.

"Simple-physics" compsets like FADIAB and any MPAS-A standalone runs are successful. (NOTE: some compsets like FHS94 are also unaffected since they use separate "notopo" files.)

Compsets affected: F2000climo (when requesting these files), F2000dev, and CHAOS2000dev.

Example case on Derecho: /glade/derecho/scratch/gdicker/tstgrids_F2000dev_2_mpasa480_20240606-110547/run

  • See especially the files associated with job number 4744653. The rank104.cesm.log.4744653.txt file may be helpful.

How to reproduce:

  1. Clone the most recent release or the development branch of EarthWorks
  2. Create a compset that uses full physics and mpas grid (e.g. ./create_newcase --compset F2000dev --res mpasa480_mpasa480 )
  3. Submit the case after setting up and building

Symptoms seen:

  • In between the "Dynamics timestep ..." output, there will be lines in the atm.log.* file like "imp_sol: Time step 1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) = 2 32 4 0" (see the first code block below for a longer example).
    • These convergence failures cause the simulation time to increase (due to repeated iterations) and eventually lead to the simulation crashing.
  • There may also be a message from the MPAS-A dycore in the atm.log.* file and in one log.atmosphere.${rank}.err file per rank: "CRITICAL ERROR: NaN detected in 'w' field." This crashes the simulation, but typically only after a few timesteps.
  • There may be a segmentation fault in the rrtmgp radiation, which also causes runs to fail (TODO: fill in these details).
  • The number of timesteps before failure seems to be resolution dependent. 480km tests fail after a few timesteps while 120km test cases require runs longer than 5 days to fail.

Sample output seen in atm.log.* and cesm.log.*:

 imp_sol: Time step   1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) =      2    32     4     0
 imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) =      2    32     4     0  1.1250000000000E+01  1.7887500000000E+03
 num_a1    1.000E+00
 so4_a1    1.000E+00
 imp_sol: Time step   1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) =      2    32     4     0
 imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) =      2    32     4     0  1.1250000000000E+01  1.8000000000000E+03
 num_a1    1.000E+00
 so4_a1    1.000E+00
 imp_sol : @ (lchnk,lev,col) =            2          32           4  failed          165  times
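
To check whether a run shows these symptoms, the logs can be scanned with standard tools. The following is a minimal sketch, assuming the message formats match the sample output above and the "NaN detected" text quoted in the symptoms; the function names are hypothetical helpers, not part of EarthWorks or CESM.

```shell
# Hypothetical helpers for spotting the symptoms described in this issue.
# They assume the log line formats shown in the sample output above.

# Print the number of "Time step ... failed to converge" lines in one atm log.
count_imp_sol_failures() {
    grep -c 'imp_sol: Time step .* failed to converge' "$1"
}

# Print each failing (lchnk, lev, col) triple, prefixed by how many
# "Time step ... failed to converge" lines report it.
summarize_imp_sol_failures() {
    grep 'imp_sol: Time step .* failed to converge' "$1" |
        awk '{ print $(NF-3), $(NF-2), $(NF-1) }' |
        sort | uniq -c
}

# List the per-rank MPAS-A error logs in a run directory that mention a NaN.
find_nan_logs() {
    grep -l 'NaN detected' "$1"/log.atmosphere.*.err 2>/dev/null
}
```

For example, `summarize_imp_sol_failures atm.log.<id>` run in the case's run directory lists the columns where the chemistry solver is struggling, and `find_nan_logs .` shows which ranks reported the NaN in 'w'.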

Workaround

While I continue to investigate, this issue can be mitigated by not using these files. Instead, supply paths to the relevant files through the ncdata and bnd_topo settings in the case's user_nl_cam file. If you wish to use cam_dev physics in a case with the old files, you will also need to set use_gw_front = .false. in the user_nl_cam file.
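
As an illustration of the workaround, the settings can be appended to user_nl_cam from the shell. This is only a sketch: the helper name is hypothetical, and the two file paths are arguments you must supply yourself, since the issue does not name the old files.

```shell
# Hypothetical helper: append the workaround settings to a case's user_nl_cam.
# Arguments:
#   $1  case directory (must already contain, or be able to hold, user_nl_cam)
#   $2  path to the old initial-condition file (value for ncdata)
#   $3  path to the old topography file (value for bnd_topo)
#   $4  optional; pass "cam_dev" to also disable use_gw_front
apply_old_files_workaround() {
    case_dir=$1
    ic_file=$2
    topo_file=$3
    physics=${4:-}
    {
        printf "ncdata = '%s'\n" "$ic_file"
        printf "bnd_topo = '%s'\n" "$topo_file"
        # Only needed when cam_dev physics is used with the old files.
        if [ "$physics" = "cam_dev" ]; then
            printf "use_gw_front = .false.\n"
        fi
    } >> "$case_dir/user_nl_cam"
}
```

A call such as `apply_old_files_workaround /path/to/case /path/to/old_ic.nc /path/to/old_topo.nc cam_dev` (placeholder paths) would add all three lines; omit the last argument for cases that do not use cam_dev physics.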

@adamrher

This is a shot in the dark, but Francis found a bug in cam_dev where the liquid and ice paths being fed into radiation aren't being initialized. Maybe try his changes to physpkg.F90 and cloud_diagnostics.F90 in ESCOMP/CAM#1074?

gdicker1 added the workaround label Jul 10, 2024
gdicker1 commented Aug 2, 2024

The changes in EarthWorksOrg/CAM#23 will address this issue. That PR is waiting on an upstream version of the PR in ESCOMP/CAM (#1095) before it is merged into EarthWorks; afterwards, the files from EarthWorksOrg/CAM#16 and EarthWorksOrg/CAM#20 will be added again.

@gdicker1

Resolved by merge of #73. Please revive this issue or start a new one if problems are noticed.
