There are 8 np loops in the limiter_optim_iter_full subroutine in prim_advection_mod.F90. In most cases np is 4, so most of these loops have trip counts of 4-by-4, 4, or 16. Since the call to this subroutine is already inside a nested OMP parallel region, further improvement should come from SIMD. If vectorization is not possible, we should explore unrolling the loops by a factor of 4.
Are you talking about manually unrolling? I think this is something the compiler should be doing for us, right? Regarding SIMD, a lot of those loops (not all though) are reductions. Can SIMD instructions run on reduction loops? I know that for the GPU port, we don't thread down into the np x np loops because of reductions over these small np x np chunks of data.
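On the reduction question: the OpenMP `simd` construct (OpenMP 4.0 and later) accepts a `reduction` clause, so a reduction loop can at least be marked as a SIMD candidate. The following is a minimal sketch, not code from prim_advection_mod.F90; the program name, the array `x`, and the hard-coded `np = 4` are assumptions made for illustration. Whether the compiler actually emits vector instructions for such a short trip count, and whether that is profitable, still has to be checked in the compiler listing.

```fortran
program simd_reduction_sketch
  implicit none
  integer, parameter :: np = 4             ! assumed compile-time constant
  integer, parameter :: rk = kind(1.0d0)   ! double precision kind
  real(rk) :: x(np, np), total
  integer :: i, j

  ! Fill the np x np block with the values 1..np*np (hypothetical data).
  do j = 1, np
    do i = 1, np
      x(i, j) = real(i + np*(j - 1), rk)
    end do
  end do

  ! Sum over the np x np block. The reduction clause tells the compiler
  ! that the updates to total may be reordered across SIMD lanes.
  total = 0.0_rk
  !$omp simd collapse(2) reduction(+:total)
  do j = 1, np
    do i = 1, np
      total = total + x(i, j)
    end do
  end do

  ! The values 1..16 sum to 136; they are small integers, so the result is exact.
  if (abs(total - 136.0_rk) > 1.0e-12_rk) error stop "unexpected sum"
  print *, "sum over np x np block:", total
end program simd_reduction_sketch
```

The directive only takes effect when OpenMP (or OpenMP SIMD) support is enabled at compile time; otherwise it is treated as a comment and the loop runs as ordinary serial code. Because a SIMD reduction may reorder the additions, results on general floating-point data can differ in the last bits from the serial loop.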
A few months ago I looked at compiler-generated listings for other subroutines in derivative_mod.F90 and saw that neither unrolling nor SIMD vectorization was happening. SIMD was not applied because it was deemed 'not profitable'. IIRC, np was also not deduced to be a compile-time constant, which would have enabled further optimizations. I am logging this issue here to add it to our backlog. This subroutine is called over a million times.
I saw an improvement with manual unroll in edge_mod.F90 based on GPTL timers. Will check how we do after integrating into ACME/models/atm/cam/src/dynamics/se/share/edge_mod.F90.
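For reference, manual unrolling by a factor of 4 on a 16-element reduction could look like the sketch below. It is not the code from edge_mod.F90 or prim_advection_mod.F90: the module, the function names, and the declaration of `np` as a `parameter` are all hypothetical, chosen so that the trip count is known at compile time.

```fortran
module unroll_sketch
  implicit none
  integer, parameter :: np = 4             ! assumed compile-time constant
  integer, parameter :: rk = kind(1.0d0)   ! double precision kind
contains

  ! Reference version: a plain loop over all np*np entries.
  pure function sum_rolled(x) result(s)
    real(rk), intent(in) :: x(np*np)
    real(rk) :: s
    integer :: k
    s = 0.0_rk
    do k = 1, np*np
      s = s + x(k)
    end do
  end function sum_rolled

  ! Unrolled by 4 with independent partial sums, so the four additions
  ! in each iteration do not depend on one another.
  ! Requires np*np to be a multiple of 4 (true for np = 4).
  pure function sum_unrolled4(x) result(s)
    real(rk), intent(in) :: x(np*np)
    real(rk) :: s, s1, s2, s3, s4
    integer :: k
    s1 = 0.0_rk
    s2 = 0.0_rk
    s3 = 0.0_rk
    s4 = 0.0_rk
    do k = 1, np*np, 4
      s1 = s1 + x(k)
      s2 = s2 + x(k + 1)
      s3 = s3 + x(k + 2)
      s4 = s4 + x(k + 3)
    end do
    s = (s1 + s2) + (s3 + s4)
  end function sum_unrolled4

end module unroll_sketch

program unroll_demo
  use unroll_sketch
  implicit none
  real(rk) :: x(np*np)
  integer :: k

  ! Small integer values keep both sums exact, so they can be compared directly.
  do k = 1, np*np
    x(k) = real(k, rk)
  end do

  if (sum_rolled(x) /= sum_unrolled4(x)) error stop "sums differ"
  print *, sum_rolled(x), sum_unrolled4(x)
end program unroll_demo
```

The unrolled version changes the order of the additions, so on general floating-point data it may differ from the rolled loop by rounding. That matters if bit-for-bit reproducibility is required, and any speedup would need to be confirmed with timers on the real code.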