Opened 7 years ago
Closed 7 years ago
#1954 closed Bug (fixed)
Floating invalid in zdftmx
Reported by: | mcastril | Owned by: | gsamson |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | ZDF | Version: | v3.6 |
Severity: | major | Keywords: | |
Cc: |
Description (last modified by nemo)
Context
We have been having a floating invalid error in zdftmx.f90:
forrtl: error (65): floating invalid
Image PC Routine Line Source
nemo.exe 0000000001B61F92 Unknown Unknown Unknown
libpthread-2.22.s 00002B6FC2922B10 Unknown Unknown Unknown
nemo.exe 000000000068A1EE zdftmx_mp_tmx_itf 345 zdftmx.f90
nemo.exe 0000000000682C77 zdftmx_mp_zdf_tmx 245 zdftmx.f90
nemo.exe 000000000048474A step_mp_stp_ 198 step.f90
nemo.exe 0000000000438E0B nemogcm_mp_nemo_g 147 nemogcm.f90
nemo.exe 0000000000438D5D MAIN__ 18 nemo.f90
nemo.exe 0000000000438D1E Unknown Unknown Unknown
libc-2.22.so 00002B6FC2E4C6E5 __libc_start_main Unknown Unknown
nemo.exe 0000000000438C29 Unknown Unknown Unknown
Analysis
I debugged it and saw that, on that line, zdn2dz values are traversed from 0 to jk:
zcoef = 0.5 - SIGN( 0.5, zdn2dz(ji,jj,jk) ) ! =0 if dN2/dz > 0, =1 otherwise
However, the 3rd dimension of that array is only set from 0 to jk-1 (jkm1)
zdn2dz (:,:,jk) = rn2(:,:,jk) - rn2(:,:,jk+1) ! Vertical profile of dN2/dz
I looked into the zdn2dz values, and the ones in (:,:,75) are totally random. The same thing could happen with other variables in that routine, which I saw are used in the same way.
Fix
The last jk value should be set to a proper value.
Commit History (2)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
8789 | gsamson | 2017-11-22T19:04:59+01:00 | fix ticket #1954 in trunk |
8788 | gsamson | 2017-11-22T19:01:02+01:00 | fix ticket #1954 in nemo_v3_6_STABLE |
Change History (7)
comment:1 Changed 7 years ago by clevy
comment:2 Changed 7 years ago by nemo
- Description modified (diff)
comment:3 Changed 7 years ago by mcastril
Hi. We are using a version from early this year, but I checked that the routine has not been updated for 20 months. The floating invalid error appears for runs with ORCA025L75; it occurs with certain numbers of processors and does not reproduce with other combinations.
We are not using such an option, only -f and -fpe0 to stop on floating point errors. But the fact that the program does not stop without -fpe0 does not mean it is correct. What is happening is that the last jk value of the array is not being filled, and when it is accessed it holds arbitrary values (not NaN); in fact, it holds whatever was previously stored in that memory region (garbage). Some compilers initialize arrays by default, and there are compiler options to initialize arrays (-init=zero -init=arrays in Intel), but I think it is a mistake to rely on the compiler, and these flags also reduce performance. In general, all values have to be initialized before being read to avoid unexpected behavior.
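One way to make a never-assigned level show up deterministically, without depending on compiler initialization flags, is to poison the array with NaN in a debug build and check it before use. The following is only a minimal sketch under stated assumptions: the program name, the toy size jpk = 5, the one-dimensional arrays and the rn2 values are all hypothetical and are not taken from zdftmx.f90.

```fortran
program nan_poison_sketch
   ! Hypothetical debugging sketch, not NEMO code.
   use, intrinsic :: ieee_arithmetic, only : ieee_value, ieee_quiet_nan, ieee_is_nan
   implicit none
   integer, parameter :: jpk = 5, jpkm1 = jpk - 1   ! toy vertical size (assumption)
   real    :: rn2(jpk), zdn2dz(jpk)
   integer :: jk

   ! Arbitrary illustrative profile (assumption)
   rn2 = (/ 1.0e-5, 3.0e-5, 2.0e-5, 2.5e-5, 0.5e-5 /)

   ! Poison every level so that a missing assignment is detectable
   zdn2dz(:) = ieee_value( 1.0, ieee_quiet_nan )

   ! Same loop bounds as in the ticket: only levels 1..jpkm1 are written
   do jk = 1, jpkm1
      zdn2dz(jk) = rn2(jk) - rn2(jk+1)
   end do

   ! Any level that is still NaN was never assigned
   do jk = 1, jpk
      if ( ieee_is_nan( zdn2dz(jk) ) ) print '(a,i0,a)', 'level ', jk, ' was never assigned'
   end do
end program nan_poison_sketch
```

In this sketch only the last level (jpk) is reported, which matches the behavior described above: the level is read but never written.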
comment:4 Changed 7 years ago by mcastril
I have been looking at the other variables in the routine and, contrary to my first assumption, I think they are all initialized to 0 (zsum, zempba_3d...), with the exception of zdn2dz, so it seems it was simply forgotten.
comment:5 Changed 7 years ago by clevy
- Owner set to gsamson
- Severity changed from critical to major
- Status changed from new to assigned
I have not been able to reproduce this floating invalid error while running SETTE tests (= all reference configurations), even in debug mode. Summary of results available here: https://forge.ipsl.jussieu.fr/nemo/wiki/SystemTeam/Validation/Validation_3_6_STABLE_release. I guess this is because of a different memory allocation, and maybe linked to the fact that no reference configuration has as many vertical levels as ORCA025L75.
Nevertheless, it seems reasonable to set some initial value (=0) to zdn2dz in zdftmx routine.
comment:6 Changed 7 years ago by mcastril
Thanks Claire,
In our case it was happening with some domain distributions and not with others. For example, it was failing with 1008 and 1056 cores, but not with 1920. The fix was easy: just add the line
zdn2dz(:,:,jpk) = 0.e0
before
zempba_3d_1(:,:,jpk) = 0.e0
zempba_3d_2(:,:,jpk) = 0.e0
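To illustrate the effect of that added line, here is a small standalone sketch. It is not the actual zdftmx.f90 code: the program name, the toy dimensions (jpi = jpj = 1, jpk = 5) and the rn2 values are assumptions chosen for illustration; only the three statements involving zdn2dz and zcoef follow the lines quoted in this ticket.

```fortran
program zdn2dz_fix_sketch
   ! Hypothetical standalone sketch, not the NEMO routine.
   implicit none
   integer, parameter :: jpi = 1, jpj = 1, jpk = 5   ! toy sizes (assumption)
   integer, parameter :: jpkm1 = jpk - 1
   real    :: rn2(jpi,jpj,jpk), zdn2dz(jpi,jpj,jpk), zcoef
   integer :: ji, jj, jk

   ! Arbitrary illustrative N2 profile (assumption)
   rn2(1,1,:) = (/ 1.0e-5, 3.0e-5, 2.0e-5, 2.5e-5, 0.5e-5 /)

   DO jk = 1, jpkm1
      zdn2dz(:,:,jk) = rn2(:,:,jk) - rn2(:,:,jk+1)   ! Vertical profile of dN2/dz
   END DO
   zdn2dz(:,:,jpk) = 0.e0                            ! added line: last level is now defined

   DO jk = 1, jpk                                    ! the read loop covers the last level too
      DO jj = 1, jpj
         DO ji = 1, jpi
            zcoef = 0.5 - SIGN( 0.5, zdn2dz(ji,jj,jk) )   ! =0 if dN2/dz > 0, =1 otherwise
            print '(a,i0,a,f3.1)', 'jk=', jk, ' zcoef=', zcoef
         END DO
      END DO
   END DO
end program zdn2dz_fix_sketch
```

With the last level set to 0.e0, SIGN( 0.5, 0.e0 ) returns 0.5, so zcoef evaluates to 0 at jpk and no undefined value is read.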
comment:7 Changed 7 years ago by gsamson
- Resolution set to fixed
- Status changed from assigned to closed
What is the revision number of 3_6_STABLE used, and for which reference configuration(s)?
I guess you are also activating a compiler debugging option setting undefined values to NaN. Or are you implying there could be an out of bounds in memory?