Opened 10 months ago

Closed 9 months ago

#1954 closed Bug (fixed)

Floating invalid in zdftmx

Reported by: Miguel Castrillo
Owned by: Guillaume Samson
Component: ZDF
Version: release-3.6
Severity: major
Priority: normal
Milestone:
Keywords:
Cc:

Description (last modified by nemo)

Context

We have been getting a floating invalid error in zdftmx.f90:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
nemo.exe           0000000001B61F92  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B6FC2922B10  Unknown               Unknown  Unknown
nemo.exe           000000000068A1EE  zdftmx_mp_tmx_itf         345  zdftmx.f90
nemo.exe           0000000000682C77  zdftmx_mp_zdf_tmx         245  zdftmx.f90
nemo.exe           000000000048474A  step_mp_stp_              198  step.f90
nemo.exe           0000000000438E0B  nemogcm_mp_nemo_g         147  nemogcm.f90
nemo.exe           0000000000438D5D  MAIN__                     18  nemo.f90
nemo.exe           0000000000438D1E  Unknown               Unknown  Unknown
libc-2.22.so       00002B6FC2E4C6E5  __libc_start_main     Unknown  Unknown
nemo.exe           0000000000438C29  Unknown               Unknown  Unknown

Analysis

I debugged it and found that the failing line reads zdn2dz at every vertical level, from the first to the last (jpk):

zcoef = 0.5 - SIGN( 0.5, zdn2dz(ji,jj,jk) )       ! =0 if dN2/dz > 0, =1 otherwise

However, the third dimension of that array is only assigned up to the second-to-last level (jpk-1, i.e. jpkm1):

zdn2dz     (:,:,jk) = rn2(:,:,jk) - rn2(:,:,jk+1)           ! Vertical profile of dN2/dz

I looked into the zdn2dz values and the ones in (:,:,75) are totally random. The same thing could happen with other variables in that routine, which I saw are used in the same way.
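To make the role of that line concrete, here is a minimal standalone sketch of the zcoef expression on its own. The sample values are made up for illustration and are not taken from a NEMO run. It only shows that zcoef is a 0/1 switch driven by the sign of zdn2dz, so an undefined value at the last level feeds an arbitrary input into SIGN.

```fortran
program zcoef_demo
   implicit none
   ! Hypothetical dN2/dz samples: positive, exactly zero, negative
   real, parameter :: samples(3) = [ 2.0e-6, 0.0e0, -2.0e-6 ]
   real    :: zcoef
   integer :: i

   do i = 1, size(samples)
      ! SIGN(a, b) returns |a| carrying the sign of b,
      ! so zcoef is 0 for non-negative input and 1 for negative input
      zcoef = 0.5 - sign(0.5, samples(i))
      print '(a, es10.2, a, f4.1)', 'dN2/dz =', samples(i), '  zcoef =', zcoef
   end do
end program zcoef_demo
```

Note that for a positive zero the Fortran SIGN intrinsic returns +0.5, so a level explicitly set to 0.e0 gives zcoef = 0.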

Fix

The last level (jk = jpk) of zdn2dz should be set to a defined value.

Commit History (2)

Changeset  Author   Time                       ChangeLog
8789       gsamson  2017-11-22T19:04:59+01:00  fix ticket #1954 in trunk
8788       gsamson  2017-11-22T19:01:02+01:00  fix ticket #1954 in nemo_v3_6_STABLE

Change History (7)

comment:1 Changed 10 months ago by Claire Levy

What is the revision number of 3_6_STABLE used, and for which reference configuration(s)?
I guess you are also activating a compiler debugging option that sets undefined values to NaN. Or are you implying there could be an out-of-bounds memory access?

comment:2 Changed 10 months ago by nemo

  • Description modified (diff)

comment:3 Changed 10 months ago by Miguel Castrillo

Hi. We are using a version from early this year, but I checked and the routine has not been updated for 20 months. The floating invalid error appears in runs with ORCA025L75; it occurs with some processor counts and does not reproduce with others.

We are not using such an option, only -f and -fpe0 to stop on floating point errors. But the fact that the program does not stop without -fpe0 does not mean it is correct. What is happening is that the last jk level of the array is never filled, so when it is accessed it holds random values (not NaN): in fact, whatever was previously stored in that memory region (garbage). Some compilers initialize arrays by default, and there are compiler options to initialize arrays (-init=zero -init=arrays in Intel), but I think it is a mistake to rely on the compiler, and these flags also reduce performance. In general, every value has to be initialized before being read to avoid unexpected behavior.
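One way to make a never-assigned level visible without depending on compiler initialization flags is to poison the work array with NaN before filling it and then check what is left. The sketch below is only an illustration of that idea on a single hypothetical column (the sizes and the N2 profile are invented); it is not code from zdftmx and uses only the standard ieee_arithmetic intrinsic module.

```fortran
program nan_sentinel_demo
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
   implicit none
   integer, parameter :: jpk = 5            ! hypothetical number of levels
   real    :: rn2(jpk), zdn2dz(jpk)
   integer :: jk

   rn2 = [ 4.0e-5, 3.0e-5, 2.5e-5, 1.0e-5, 0.0e0 ]   ! made-up N2 profile

   ! Poison the work array: any level that is never assigned stays NaN
   zdn2dz(:) = ieee_value(0.0, ieee_quiet_nan)

   ! Same bounds as described in this ticket: the last level is skipped
   do jk = 1, jpk - 1
      zdn2dz(jk) = rn2(jk) - rn2(jk+1)
   end do

   ! Report every level that still holds the sentinel
   do jk = 1, jpk
      if (ieee_is_nan(zdn2dz(jk))) print '(a, i0, a)', 'level ', jk, ' was never assigned'
   end do
end program nan_sentinel_demo
```

A check like this is meant for debugging only; the actual remedy is still to assign every element before it is read.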

comment:4 Changed 10 months ago by Miguel Castrillo

I have been looking at the other variables in the routine and, contrary to my first assumption, I think all of them are initialized to 0 (zsum, zempba_3d…), with the exception of zdn2dz, so it seems it was simply forgotten.

comment:5 Changed 10 months ago by Claire Levy

  • Owner set to gsamson
  • Severity changed from critical to major
  • Status changed from new to assigned

I have not been able to reproduce this floating invalid error while running SETTE tests (= all reference configurations), even in debug mode. Summary of results available here: https://forge.ipsl.jussieu.fr/nemo/wiki/SystemTeam/Validation/Validation_3_6_STABLE_release. I guess this is because of a different memory allocation, and maybe linked to the fact that no reference configuration has as many vertical levels as ORCA025L75.

Nevertheless, it seems reasonable to give zdn2dz an initial value (=0) in the zdftmx routine.

comment:6 Changed 10 months ago by Miguel Castrillo

Thanks Claire,

In our case it happened with some domain decompositions and not with others. For example, it failed with 1008 and 1056 cores, but not with 1920. The fix was easy: just add the line

zdn2dz(:,:,jpk) = 0.e0

before

zempba_3d_1(:,:,jpk) = 0.e0
zempba_3d_2(:,:,jpk) = 0.e0
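For readers who want to see the pattern in isolation, here is a small standalone sketch of the fix on a hypothetical 2x2x5 domain with an invented N2 profile. The array names mirror the ones quoted in this ticket, but the program is an illustration under those assumptions, not an excerpt from zdftmx.

```fortran
program last_level_fix_demo
   implicit none
   integer, parameter :: jpi = 2, jpj = 2, jpk = 5   ! hypothetical tiny domain
   real    :: rn2(jpi,jpj,jpk), zdn2dz(jpi,jpj,jpk)
   real    :: zcoef
   integer :: jk

   ! Made-up N2 profile, decreasing with depth, identical in every column
   do jk = 1, jpk
      rn2(:,:,jk) = 1.0e-5 * real(jpk - jk)
   end do

   ! Levels 1..jpk-1 get a vertical difference; jk+1 stays within bounds
   do jk = 1, jpk - 1
      zdn2dz(:,:,jk) = rn2(:,:,jk) - rn2(:,:,jk+1)
   end do

   ! The fix proposed in this ticket: give the last level a defined value
   zdn2dz(:,:,jpk) = 0.e0

   ! Reading every level, including jpk, now only touches defined data
   do jk = 1, jpk
      zcoef = 0.5 - sign(0.5, zdn2dz(1,1,jk))
      print '(a, i0, a, es10.2, a, f4.1)', 'jk = ', jk, '  dN2/dz =', zdn2dz(1,1,jk), '  zcoef =', zcoef
   end do
end program last_level_fix_demo
```

Setting only the last level, rather than zeroing the whole array first, avoids writing the other levels twice.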



comment:7 Changed 9 months ago by Guillaume Samson

  • Resolution set to fixed
  • Status changed from assigned to closed