Norm computation issue #31

Open
antoine-morvan opened this issue Aug 14, 2023 · 6 comments
@antoine-morvan

antoine-morvan commented Aug 14, 2023

Hello,

Problem

I tried a configuration that makes the computation go way off (E+03). However, when using --norms to check the deviation, the max error combined is 0. This is because 0 is greater than -999.

image

Still, the computation is wrong and this should get caught.

My wrapper script missed this deviation because it only checks the combined result.

Setup

To reproduce:

  • compile with aocc 4.1.0; OpenMPI 4.1.5; OpenBLAS 0.3.23; FFTW 3.3.10
export CFLAGS="-O3 -march=native -mtune=native"
export CXXFLAGS="$CFLAGS"
export FCFLAGS="$CFLAGS"
  • reproduced on the latest AMD and Intel CPUs
  • run ectrans-benchmark-dp --norms -n 5 -l 137 -t 319 --vordiv --scders

Some leads?

After some investigation, I spotted two potential causes:

  1. The max error is initialized with zmaxerr(:) = -999.0. It would be wiser to initialize the max error with 0.
    zmaxerr(:) = -999.0
  2. When enabling verbosity (and printing the divider), we can observe half of the arrays znormvor(:), znormdiv(:), znormt(:), znormsp(:) coming out with NaN values.
    do ifld = 1, nflevg
      zerr(3) = abs(znormvor1(ifld)/znormvor(ifld) - 1.0d0)
      zmaxerr(3) = max(zmaxerr(3), zerr(3))
      if (verbosity >= 1) then
        write(nout,'("norm zspvor( ",i4,") = ",f20.15," error = ",e10.3)') ifld, znormvor1(ifld), zerr(3)
      endif
    enddo
    do ifld = 1, nflevg
      zerr(2) = abs(znormdiv1(ifld)/znormdiv(ifld) - 1.0d0)
      zmaxerr(2) = max(zmaxerr(2),zerr(2))
      if (verbosity >= 1) then
        write(nout,'("norm zspdiv( ",i4,",:) = ",f20.15," error = ",e10.3)') ifld, znormdiv1(ifld), zerr(2)
      endif
    enddo
    do ifld = 1, nflevg
      zerr(4) = abs(znormt1(ifld)/znormt(ifld) - 1.0d0)
      zmaxerr(4) = max(zmaxerr(4), zerr(4))
      if (verbosity >= 1) then
        write(nout,'("norm zspsc3a(",i4,",:,1) = ",f20.15," error = ",e10.3)') ifld, znormt1(ifld), zerr(4)
      endif
    enddo
    do ifld = 1, 1
      zerr(1) = abs(znormsp1(ifld)/znormsp(ifld) - 1.0d0)
      zmaxerr(1) = max(zmaxerr(1), zerr(1))
      if (verbosity >= 1) then
        write(nout,'("norm zspsc2( ",i4,",:) = ",f20.15," error = ",e10.3)') ifld, znormsp1(ifld), zerr(1)
      endif
    enddo

Use this to print the divider:

  verbosity=1
  zmaxerr(:) = 0
  do ifld = 1, nflevg
    zerr(3) = abs(znormvor1(ifld)/znormvor(ifld) - 1.0d0)
    zmaxerr(3) = max(zmaxerr(3), zerr(3))
    if (verbosity >= 1) then
      write(nout,'("norm zspvor( ",i4,")     = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormvor1(ifld), znormvor(ifld), zerr(3)
    endif
  enddo
  do ifld = 1, nflevg
    zerr(2) = abs(znormdiv1(ifld)/znormdiv(ifld) - 1.0d0)
    zmaxerr(2) = max(zmaxerr(2),zerr(2))
    if (verbosity >= 1) then
      write(nout,'("norm zspdiv( ",i4,",:)   = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormdiv1(ifld), znormdiv(ifld), zerr(2)
    endif
  enddo
  do ifld = 1, nflevg
    zerr(4) = abs(znormt1(ifld)/znormt(ifld) - 1.0d0)
    zmaxerr(4) = max(zmaxerr(4), zerr(4))
    if (verbosity >= 1) then
      write(nout,'("norm zspsc3a(",i4,",:,1) = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormt1(ifld),znormt(ifld), zerr(4)
    endif
  enddo
  do ifld = 1, 1
    zerr(1) = abs(znormsp1(ifld)/znormsp(ifld) - 1.0d0)
    zmaxerr(1) = max(zmaxerr(1), zerr(1))
    if (verbosity >= 1) then
      write(nout,'("norm zspsc2( ",i4,",:)   = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormsp1(ifld),znormsp(ifld), zerr(1)
    endif
  enddo

This could come from these arrays being initialized by a function declared with a C binding that iterates over non-contiguous segments.

But that would require more investigation to confirm :)
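Independently of where the NaNs originate, the accumulation itself could be made robust against them. The following is a minimal, self-contained sketch and not the ecTrans code: the program name, the arrays `ref` and `val` (stand-ins for `znormvor` and `znormvor1`), and the flag `bad` are hypothetical. It assumes the compiler provides the standard `ieee_arithmetic` module.

```fortran
program nan_aware_maxerr
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nfld = 4        ! hypothetical number of fields
  real(dp) :: ref(nfld), val(nfld)      ! stand-ins for znormvor / znormvor1
  real(dp) :: err, maxerr
  logical  :: bad
  integer  :: ifld

  ref = [1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp]
  val = ref
  ! Simulate one corrupted reference norm
  ref(2) = ieee_value(1.0_dp, ieee_quiet_nan)

  maxerr = 0.0_dp   ! errors are non-negative, so 0 is a safe starting value
  bad = .false.
  do ifld = 1, nfld
    err = abs(val(ifld)/ref(ifld) - 1.0_dp)
    if (ieee_is_nan(err)) then
      bad = .true.  ! record the NaN explicitly instead of relying on max()
    else
      maxerr = max(maxerr, err)
    end if
  end do

  print '("max error = ",e10.3)', maxerr
  if (bad) print '(a)', 'WARNING: NaN encountered in norm comparison'
end program nan_aware_maxerr
```

Tracking NaNs in a separate flag avoids depending on how `max` treats a NaN operand, which the Fortran standard leaves to the processor. Note that aggressive floating-point flags may allow a compiler to assume NaNs never occur, in which case `ieee_is_nan` itself may not be reliable.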

Best.

@wdeconinck
Collaborator

Thank you for this report. We will look into this.

@samhatfield
Collaborator

Hi Antoine,

I don't understand how it's possible to get a max error of -999. The initial value of -999 is compared against the output of abs, which is non-negative. Even when e.g. znormvor(ifld) is NaN, and therefore zerr(3) is NaN, I found that max then also produces a NaN, which should then appear as the max error combined. This was with the Intel compiler.

Could this be compiler-specific? Does the operation max(-999, NaN) return -999 for AOCC?

@antoine-morvan
Author

antoine-morvan commented Oct 3, 2023

Hello,

I observed this behavior with AOCC 4.1 only (well, I did not try the whole thing with all the SW stacks at hand). Despite the max operation behaving similarly with other compilers (see below), something must be wrong somewhere else too.

Regarding the result of this specific max operation, here are the results with several compilers:

! print NaN too
print *, max(-999, NaN), NaN

With default flags (e.g., gcc input.F90)

compiler           max(-999, NaN)   NaN
aocc:4.1.0         0                0
gcc:13.2.0         1064675189       1064675189
nvhpc:23.7         0                0
llvm:16.0.6        0                0
oneapi:2023.1.0    -999             0
ifort:2023.1.0     0                0

With aggressive flags (e.g., gcc -O3 -fastmath -march=native -mtune=native input.F90)

compiler           max(-999, NaN)   NaN
aocc:4.1.0         0                0
gcc:13.2.0         0                0
nvhpc:23.7         -999             15208769
llvm:16.0.6        1423361856       858928177
oneapi:2023.1.0    -999             0
ifort:2023.1.0     0                0

@samhatfield
Collaborator

Is this pseudo-code, or should I actually be able to compile this?:

print *, max(-999, NaN), NaN

I ask because I don't recognise NaN as a Fortran keyword. And indeed, ifort gives

main.f90(4): error #6404: This name does not have a type, and must have an explicit type.   [NAN]
    print *, max(-999, NaN), NaN
-----------------------^

Also I'm not sure what result you should expect when comparing -999 (an integer literal) with NaN (a floating-point value).
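For reference, a compilable version of this test might look like the sketch below. It is an illustration, not code from the thread: the program and variable names are hypothetical, and it assumes the compiler supports the standard `ieee_arithmetic` module. It compares two double-precision values, so no integer is involved. Without `implicit none`, a name such as `NaN` that begins with N is implicitly typed as a default integer, so an undeclared `NaN` would be an uninitialized integer rather than a floating-point NaN.

```fortran
program max_nan_test
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: nan, r1, r2

  ! Build a genuine quiet NaN; the first argument only selects the kind
  nan = ieee_value(1.0_dp, ieee_quiet_nan)

  ! Try both argument orders, since some implementations are order-sensitive
  r1 = max(-999.0_dp, nan)
  r2 = max(nan, -999.0_dp)

  print *, 'NaN              = ', nan
  print *, 'max(-999.0, NaN) = ', r1
  print *, 'max(NaN, -999.0) = ', r2
end program max_nan_test
```

No expected output is given here because the result of `max` with a NaN argument is processor-dependent, and it may also change with optimization flags that relax IEEE semantics.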

Going back to the problem, I would need to be able to reproduce it exactly to figure out what's going wrong. Could you share the modifications you've made to produce these error norms? Perhaps I can reproduce the problem with ifort?

I tried running your benchmark command after building ecTrans with intel/2021.4.0 but it gives this

======= End of spectral transforms  =======

max error zspvor(1:nlev,:)    =  0.999E-14
max error zspdiv(1:nlev,:)    =  0.999E-14
max error zspsc3a(1:nlev,:,1) =  0.173E-13
max error zspsc2(1:1,:)       =  0.173E-13

max error combined =          =  0.173E-13

======= Start of time step stats =======

Again I can't see how it's possible for this calculation to give -0.999E+03:

! MUST be >= 0.0
zerr(3) = abs(znormvor1(ifld)/znormvor(ifld) - 1.0d0)
! Also MUST be >= 0.0
zmaxerr(3) = max(zmaxerr(3), zerr(3))

I think the problem must be related to NaNs somehow but unless I can reproduce it with ifort I'm not much help :(
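One way a value of -999 could survive the loop is if `max` returns the non-NaN operand when the other operand is NaN, as the IEEE 754-2008 maxNum operation does. Whether AOCC's `max` behaves this way is not confirmed in this thread. The sketch below emulates that convention with a hypothetical helper, `max_ignoring_nan`, and uses scalar stand-ins for `zerr` and `zmaxerr`; it assumes the standard `ieee_arithmetic` module is available.

```fortran
program nan_ignoring_max_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: zerr, zmaxerr   ! scalar stand-ins for zerr(3), zmaxerr(3)
  integer  :: ifld

  zmaxerr = -999.0_dp
  do ifld = 1, 3
    ! Every error is NaN, as if every reference norm were NaN
    zerr = abs(ieee_value(1.0_dp, ieee_quiet_nan) - 1.0_dp)
    zmaxerr = max_ignoring_nan(zmaxerr, zerr)
  end do

  ! zmaxerr was never updated, so it still holds its initial value
  print '("max error = ",e10.3)', zmaxerr

contains

  ! Hypothetical helper emulating a max that ignores NaN operands
  pure function max_ignoring_nan(a, b) result(m)
    real(dp), intent(in) :: a, b
    real(dp) :: m
    if (ieee_is_nan(a)) then
      m = b
    else if (ieee_is_nan(b)) then
      m = a
    else
      m = max(a, b)
    end if
  end function max_ignoring_nan

end program nan_ignoring_max_demo
```

Under this assumption, the negative initial value is reported only when every error in the loop is NaN; a single finite error would replace it with a non-negative number.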

@wdeconinck
Collaborator

@antoine-morvan is there any update on this issue?

@antoine-morvan
Author

Hello,

I did not have time to work on this lately.

Best regards.
