Opened 5 years ago

Closed 5 years ago

Last modified 3 years ago

#1561 closed Bug (duplicate)

NEMO-LIM3-BDY crashes unexpectedly after less than 150,000 time steps

Reported by: clem Owned by: nemo
Priority: low Milestone:
Component: LIM3 Version: release-3.6
Severity: Keywords: LIM*
Cc:

Description

I have been struggling with simulation crashes for quite a few days now, so I think it is time to share my experience and see if anyone else has similar issues.

I am running a regional configuration (OPA + LIM3 + BDY) at very high resolution (2 km) for which I get random crashes with no explicit error message (no error in ocean.output), except this one from the job output:

Error [void noMemory(void)] : In file '/workgpfs/rech/omr/romr008/XIOS/src/memory.cpp', line 9 → Out of memory
ERROR: 0031-250 task 64: Segmentation fault
Insufficient memory to allocate Fortran RTL message buffer, message #78 = hex 0000004e.

I first thought of a memory leak in xios but it also crashes without it (without the xios error message then), so I do not think we can blame xios. So, let's go back to NEMO.

I tested 3 time steps (60 s, 90 s, 120 s). For each of them, the simulation runs for about 130,000 time steps (give or take 10,000) before crashing. Basically, I can run 3, 4.5 or 6 months depending on the time step. So I have been able to run 1 year, but only with restarts every 4 months. A 1-year simulation without restarts does not work (by the way, my simulations are restartable).
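The relation between step count and simulated time quoted above can be checked with a few lines. This is only a sketch: the 30-day month is an assumption made for the conversion, and `simulated_months` is a hypothetical helper, not a NEMO routine.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: nominal 30-day month, for a rough conversion only


def simulated_months(n_steps: int, rdt_seconds: float) -> float:
    """Simulated time, in nominal months, covered by n_steps steps of rdt_seconds."""
    return n_steps * rdt_seconds / SECONDS_PER_DAY / DAYS_PER_MONTH


# About 130,000 steps are completed for each of the three time steps tested.
for rdt in (60, 90, 120):
    print(f"rdt = {rdt:3d} s -> {simulated_months(130_000, rdt):.1f} months")
```

The three durations differ by a factor of two while the step count stays the same, which suggests the crash tracks the number of steps rather than the simulated date.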

I am now testing ORCA2-LIM3 (100 years without restarts) to see if I can reproduce this behavior in a reference configuration.

Clem

ps: I use 124 cores on IBM Power6 (Ada): 120 for nemo and 4 for xios. It takes about 200h CPU per month (<2h elapsed).

Commit History (2)

Changeset: 5564 Author: clem Time: 2015-07-08T14:10:46+02:00

stop the simulation if the max number of seconds between each restart is larger than the largest integer the computer can represent (kind 4). It avoids unexpected crashes due to bad forcing fields reading (see part of ticket #1561).

Changeset: 5563 Author: clem Time: 2015-07-08T14:10:24+02:00

stop the simulation if the max number of seconds between each restart is larger than the largest integer the computer can represent (kind 4). It avoids unexpected crashes due to bad forcing fields reading (see part of ticket #1561).

Change History (17)

comment:1 Changed 5 years ago by clem

ORCA2-LIM3 (revision 5519, so basically the 3.6 stable) is crashing after about 60 years of simulation in the Gulf of Aden (i=101,j=87) with ocean velocities > 20 m/s

The only difference I have with the reference configuration is nn_fsbc = 1, and ln_mskland = true, and I do not initialize the ice.

I am currently running a series of tests before and after the merge.

comment:2 Changed 5 years ago by clevy

On my side, I have 30 years of ORCA1_LIM3 (NEMO rev 5507, right before 3_6_STABLE) which are apparently ok.
I will go on for 70 more years and see what happens…

comment:3 Changed 5 years ago by jchanut

Maybe not related, but the time splitting parameters have not been tuned and are set to their defaults.
It is by far better to use ln_bt_fw=.FALSE.: the other scheme (ln_bt_fw=.TRUE.) loses part of its second-order accuracy through the off-centering of forcing terms, and it is also weakly unstable with respect to advection. ln_bt_fw=.FALSE. allows local and global conservation of tracers, although a fix exists for ln_bt_fw=.TRUE. (still under discussion). Lots of advantages. It is, however, twice as expensive… The chosen maximum Courant number (0.8) may also be too high; a value below 0.5 may be safer.

comment:4 Changed 5 years ago by clem

Ok, a little update. There may be two different bugs here.
Let's first consider ORCA2-LIM3. The crash after 68 years of simulation in a row is due to a problem in reading the surface forcing fields, more precisely the monthly files (runoff, chla, etc.). Until time step 372533, everything goes as normal. The model reads:

read U_10_MOD (rec: 63) in ./u_10.15JUNE2009_fill.nc ok
read V_10_MOD (rec: 63) in ./v_10.15JUNE2009_fill.nc ok
read Q_10_MOD (rec: 63) in ./q_10.15JUNE2009_fill.nc ok
read T_10_MOD (rec: 63) in ./t_10.15JUNE2009_fill.nc ok
read sorunoff (rec: 2) in ./runoff_core_monthly.nc ok
read sss (rec: 2) in ./sss_data.nc ok
read CHLA (rec: 2) in ./chlorophyll.nc ok
read votemper (rec: 2) in ./data_1m_potential_temperature_nomask.nc ok
read vosaline (rec: 2) in ./data_1m_salinity_nomask.nc ok

Then, at the next time step, it reads the following, which is wrong:

read sorunoff (rec: 1) in ./runoff_core_monthly.nc ok
read sorunoff (rec: 2) in ./runoff_core_monthly.nc ok
read sss (rec: 1) in ./sss_data.nc ok
read sss (rec: 2) in ./sss_data.nc ok
read CHLA (rec: 1) in ./chlorophyll.nc ok
read CHLA (rec: 2) in ./chlorophyll.nc ok
read votemper (rec: 1) in ./data_1m_potential_temperature_nomask.nc ok
read votemper (rec: 2) in ./data_1m_potential_temperature_nomask.nc ok
read vosaline (rec: 1) in ./data_1m_salinity_nomask.nc ok
read vosaline (rec: 2) in ./data_1m_salinity_nomask.nc ok

I am no specialist in input/output, but it looks like fldread is messing things up. Is it a known issue? I can look into it in more detail if necessary.

comment:5 Changed 5 years ago by clem

Now concerning the regional simulation (with BDY and LIM3) that crashes for no clear reason after 5 months: it appears that setting nn_fsbc=5 instead of 1 solves the issue, and I have been able to run 1 year in a row. But I still have no clue why.
Jerome's advice on time splitting did not change the problem (it crashes almost exactly at the same time step with or without the change in the namelist_ref).

comment:6 Changed 5 years ago by rblod

About the ORCA2_LIM stuff (even if you may prefer a BDY answer): rdt * 372533 (rdt=5760) is about 2.146e9 seconds, right at the biggest integer the computer can represent for kind=4 (2^31-1, about 2.147e9). nrec_a and nrec_b in fldread are actually integers. I am not sure the problem corresponds exactly to these variables, but it is definitely a precision issue.
Rachid

comment:7 Changed 5 years ago by smasson

Rachid is right, 68 years rings a bell: it exceeds the number of seconds that can be represented with a 4-byte integer.
There is an old ticket about this: #1146
Are you also doing 68 years without any restart?
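The magnitudes discussed in the two comments above can be checked with plain arithmetic. The sketch below assumes 365-day years and uses hypothetical helper names. It shows that the elapsed time at step 372533 is roughly 20 days short of 2^31 - 1, so any quantity that looks a few weeks ahead of the current time would exceed the limit. Whether fldread computes such a look-ahead time is an assumption, not something confirmed in this ticket.

```python
INT32_MAX = 2**31 - 1      # largest value of a 4-byte signed integer
SECONDS_PER_DAY = 86_400


def years_until_overflow(days_per_year: float = 365.0) -> float:
    """Years of simulated time that fit in a 4-byte count of seconds."""
    return INT32_MAX / (SECONDS_PER_DAY * days_per_year)


def elapsed_seconds(n_steps: int, rdt: int) -> int:
    """Seconds elapsed after n_steps steps of rdt seconds (exact integer)."""
    return n_steps * rdt


rdt = 5760                 # ORCA2 time step quoted in the ticket
last_good_step = 372_533   # last step with normal forcing reads

elapsed = elapsed_seconds(last_good_step, rdt)
margin_days = (INT32_MAX - elapsed) / SECONDS_PER_DAY

print(f"limit         : {INT32_MAX} s (~{years_until_overflow():.2f} years)")
print(f"elapsed       : {elapsed} s")
print(f"margin to max : {margin_days:.1f} days")
```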

comment:8 Changed 5 years ago by clem

Ok. If I understand correctly, a fix was made to avoid this mistake in NEMO 3.4 (about 2 years ago) at revision r4086, but it does not seem to work now. A misplaced statement, maybe?

comment:9 Changed 5 years ago by timgraham

Am I missing something subtle with the test that was added? Doesn't the test rely on nsec1jan000 being correctly calculated? If it is already bigger than the largest possible integer, then the test will probably fail.

               IF( nsec1jan000 >= 2 * (2**30 - nsecd * nyear_len(1) / 2 ) ) THEN   ! test integer 4 max value
                  CALL ctl_stop( 'The number of seconds between Jan. 1st 00h of nit000 year and Jan. 1st 00h ',   &
                     &           'of the current year is exceeding the INTEGER 4 max VALUE: 2^31-1 -> 68.09 years in seconds', &
                     & 'You must do a restart at higher frequency (or remove this STOP and recompile everything in I8)' )
               ENDIF
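The concern about this test can be illustrated outside Fortran. The sketch below assumes that an overflowing 4-byte integer wraps around in two's complement, which is common in practice but not guaranteed by the Fortran standard. It also assumes 365-day years, and `wrap_int32` and `stop_would_trigger` are hypothetical names. The threshold equals 2^31 minus one year in seconds, so the test only fires while the true value sits in the last year below the limit.

```python
NSECD = 86_400        # seconds per day
YEAR_LEN = 365        # assumption: nyear_len(1) for a non-leap year
SECONDS_PER_YEAR = NSECD * YEAR_LEN

# Same expression as in the Fortran test: 2**31 minus one year, written so
# that no intermediate value exceeds the 4-byte range.
THRESHOLD = 2 * (2**30 - NSECD * YEAR_LEN // 2)


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


def stop_would_trigger(true_seconds: int) -> bool:
    """Apply the test to the value as it would be stored in 4 bytes."""
    return wrap_int32(true_seconds) >= THRESHOLD


for years in (67, 68, 69):
    true_seconds = years * SECONDS_PER_YEAR
    print(years, wrap_int32(true_seconds), stop_would_trigger(true_seconds))
```

Under these assumptions the stop is reached at 68 years but not at 69, where the stored value has already wrapped to a negative number.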

comment:10 Changed 5 years ago by clem

Can't we write something like this at the beginning of day_init instead, so that the simulation is stopped before it has even started (which would avoid wasting CPU):

IF( REAL(nstock) * rdt > REAL( HUGE(1) ) ) THEN   ! HUGE(1) = 2**31 - 1 for kind 4; writing 2**31 directly would overflow
      CALL ctl_stop( 'The number of seconds between Jan. 1st 00h of nit000 year and Jan. 1st 00h ',   &
         &           'of the current year is exceeding the INTEGER 4 max VALUE: 2^31-1 -> 68.09 years in seconds', &
         & 'You must do a restart at higher frequency (or remove this STOP and recompile everything in I8)' )
ENDIF

comment:11 Changed 5 years ago by timgraham

That seems like a more sensible solution but I don't think it should use nstock for the test.
Sometimes people output restart files part way through a run so this wouldn't always catch the problem.

Would REAL(nitend) * rdt be a better option?

comment:12 Changed 5 years ago by clem

If there is no objection, here is the fix I propose at the beginning of day_init:

      IF( REAL( nitend - nit000 + 1 ) * rdt > REAL( HUGE( nsec1jan000 ) ) ) THEN
         CALL ctl_stop( 'The number of seconds between each restart exceeds the integer 4 max value: 2^31-1. ',   &
            &           'You must do a restart at higher frequency (or remove this stop and recompile the code in I8)' )
      ENDIF
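The effect of this check can be sketched numerically. The Python below is an illustration under stated assumptions, not the NEMO code: `run_length_overflows` and `max_steps` are hypothetical helpers, and the time step is taken as a whole number of seconds.

```python
INT32_MAX = 2**31 - 1   # HUGE() of a kind-4 integer


def run_length_overflows(nit000: int, nitend: int, rdt: int) -> bool:
    """True if the run length in seconds does not fit in a 4-byte integer."""
    return (nitend - nit000 + 1) * rdt > INT32_MAX


def max_steps(rdt: int) -> int:
    """Largest number of steps per run segment that passes the check."""
    return INT32_MAX // rdt


for rdt in (60, 90, 120, 5760):
    print(f"rdt = {rdt:4d} s -> at most {max_steps(rdt)} steps between restarts")

# ORCA2 case from the ticket (rdt = 5760 s): the limit is 372827 steps.
print(run_length_overflows(1, 372_827, 5760))   # last run length that fits
print(run_length_overflows(1, 372_828, 5760))   # one step more no longer fits
```

With rdt = 60 s the limit is tens of millions of steps, far beyond the roughly 130,000 steps at which the regional configuration fails, which fits the earlier remark that there may be two different bugs.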

comment:13 Changed 5 years ago by clem

The stop in ORCA2 in case of too low a restart frequency is now activated in revisions r5563 and r5564.

comment:14 Changed 5 years ago by clem

  • Resolution set to duplicate
  • Status changed from new to closed

comment:15 Changed 3 years ago by nemo

  • Keywords LIM* added

comment:16 Changed 3 years ago by nemo

  • Keywords release-3.6* added

comment:17 Changed 3 years ago by nemo

  • Keywords release-3.6* removed