#1402 (AMM12 SETTE failed on Archer) – NEMO

Opened 8 years ago

Closed 8 years ago

Last modified 5 years ago

#1402 closed Bug (fixed)

AMM12 SETTE failed on Archer

Reported by: mathiot
Owned by: timgraham
Priority: low
Milestone:
Component: OCE
Version: trunk
Severity:
Keywords: restartability
Cc:

Description

I was unable to run the AMM12 SETTE tests (LONG as well as REPRO) from the trunk on Archer with the files provided on the NEMO website (http://dodsp.idris.fr/gaya/reee451/NEMO/AMM12_v3.6.tar).
I tried 3 different restart files:

  • With the web files, AMM12 blows up after 1 time step (solver.stat: it : 1 ssh2: 0.7507457299E+03 Umax: 0.6441274809E+00 Smin: -Infinity).

In the next two tests, I changed only the restart file amm12_restart_oce.nc.

  • With the full restart provided by Andrew C. (the web restart file amm12_restart_oce.nc is a reduced restart, it contains only tn, sn, un, vn and sshn), AMM12 passed SETTE tests.
  • With a reduced restart (tn, sn, un, vn and sshn) extracted from the full restart provided by Andrew C., AMM12 failed to pass SETTE. As in the first test, it blows up after 1 time step with -Infinity salinity.

It seems AMM12 is not compatible with the reduced restart (at least on Archer).
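The blow-up described above shows up as a non-finite value in solver.stat. A minimal Python sketch of how such a line could be checked automatically, assuming the `name: value` layout quoted in the first test (the function names are hypothetical and not part of NEMO or SETTE):

```python
import math
import re

# Matches "name: value" pairs such as "ssh2: 0.7507457299E+03" or "Smin: -Infinity".
# Assumption: every value on the line is something Python's float() can parse.
_FIELD = re.compile(r"(\w+)\s*:\s*(\S+)")


def parse_solver_stat_line(line):
    """Return a dict mapping each field name on one solver.stat line to a float."""
    return {name: float(value) for name, value in _FIELD.findall(line)}


def non_finite_fields(line):
    """Return the sorted names of fields whose value is infinite or NaN."""
    fields = parse_solver_stat_line(line)
    return sorted(name for name, value in fields.items() if not math.isfinite(value))


# The line quoted in the first test of this ticket.
line = "it : 1 ssh2: 0.7507457299E+03 Umax: 0.6441274809E+00 Smin: -Infinity"
print(non_finite_fields(line))
```

For the quoted line, only `Smin` is reported, which matches the -Infinity salinity seen after 1 time step.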

Useful information:
AMM12 is compiled with the arch-XC_ARCHER_INTEL.fcm arch file.
The trunk revision used to test AMM12 is r4816.

Commit History (3)

Changeset  Author     Time                       ChangeLog

4839       timgraham  2014-11-10T15:43:23+01:00

Corrected bug fix for uninitialised variables in GLS scheme when using AMM12 with reduced restarts - see ticket #1402. Also set nn_istate=0 for AMM12 as it uses too much disk space in SETTE tests.

4822       timgraham  2014-10-23T15:32:41+02:00

An extra change required in fix to #1402. taum must also be initialised.

4821       timgraham  2014-10-23T15:15:56+02:00

Fix for ticket #1402 - Uninitialised variables in zdfgls when using a reduced restart file that does not contain avt, avm etc.

Change History (10)

comment:1 Changed 8 years ago by timgraham

  • Owner changed from NEMO team to timgraham

I have discussed this with Pierre, as we get different behaviour on our IBM PW7 at the Met Office. Using the restart file from the dodsp server, the AMM12 SETTE test runs and is reproducible at the head of the trunk (r4816) but is not reproducible at r4650, where we started our development branches. Using an old restart file in which all variables are present, both versions of the trunk pass SETTE.

I have isolated the change that causes the difference (on our machine) to a variable declaration in SBC/fldread.F90:

CHARACTER(len = 256) ::   wname       ! generic name of a NetCDF weights file to be used, blank if not

at the head of the trunk and

CHARACTER(len = 34) ::   wname       ! generic name of a NetCDF weights file to be used, blank if not

in r4650.
I've tried to check for any occasions when we could have a weights file name longer than this, but that hasn't solved the problem. Other than that, I don't really see why this variable should break reproducibility, so it looks like there could be some underlying memory issues. Furthermore, this doesn't have any effect on Archer.

I've also tried compiling with all variables initialised to zero and with all variables initialised to NaNs; the tests still complete but still aren't reproducible across different decompositions.

I've taken ownership of the ticket for now but I'd appreciate any suggestions for a solution.

comment:2 Changed 8 years ago by cetlod

I've obtained exactly the same behavior with the AMM12 configuration running on our IBM x3750 computer in Paris. The model (revision 4614 of the trunk) blows up after 1 time step with -Infinity salinity when using the files provided on the dodsp server.

comment:3 Changed 8 years ago by timgraham

  • Resolution set to fixed
  • Status changed from new to closed

Working with Pierre, we have tracked this down to uninitialised variables (avt_k, avm_k, avmu_k, avmv_k) in gls_rst when using a reduced restart file that does not contain these variables. This did not cause a failure on the IBM because the compiler was initialising them to zero, but on Archer the Intel compiler initialised them to 10200.

Pierre has tested this on Archer and it now works. This is now on the trunk at r4821.

This fix solves the issue of AMM12 not running on Archer, so I will close the ticket, but it still doesn't explain the change in behaviour after r4650 on the IBM (as described in my comment above).
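The general pattern behind this kind of fix is to give every field a defined value when the restart file does not provide it. A minimal Python sketch of that idea follows. It is not the actual r4821 change: the function name, the plain dict standing in for the restart file, and the `background` fill value are all assumptions made for illustration.

```python
# Field names taken from the comment above; everything else is hypothetical.
GLS_FIELDS = ("avt_k", "avm_k", "avmu_k", "avmv_k")


def read_gls_fields(restart, shape, background=0.0):
    """Return the GLS fields, filling any that the restart lacks.

    restart    -- dict of variable name -> 2D list, standing in for the restart file
    shape      -- (rows, cols) of the arrays to create for missing fields
    background -- placeholder fill value (an assumption, not NEMO's actual default)
    """
    rows, cols = shape
    fields = {}
    for name in GLS_FIELDS:
        if name in restart:
            # The restart provides the field: use it as-is.
            fields[name] = restart[name]
        else:
            # Missing from a reduced restart: build a fresh, fully defined array
            # instead of leaving the contents to the compiler.
            fields[name] = [[background] * cols for _ in range(rows)]
    return fields


# A reduced restart that holds only one of the prognostic variables.
reduced = {"tn": [[10.0, 10.0], [10.0, 10.0]]}
fields = read_gls_fields(reduced, shape=(2, 2))
print(fields["avt_k"])
```

With an explicit fallback, the result no longer depends on whether a given compiler happens to zero the memory.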

comment:4 Changed 8 years ago by flavoni

  • Resolution fixed deleted
  • Status changed from closed to reopened

I think it will be necessary to find out why old restart files no longer work and, if there are uninitialized variables, to fix them in the code rather than with a compilation option (otherwise we don't know which ones they are). For this reason I think it's important to re-open the ticket.

I'm running the SETTE tests on revision 4836 of the trunk to prepare the MERGE.

I have the same problem with trunk revision 4836, for AMM12 compiled on the IBM x3750 with the "-DCPP_PARA -i4 -r8 -O0 -xAVX -fp-model precise" options.

At which point did the old restarts stop working? Why do we have "reduced restarts"?

Thanks, Simona

comment:5 Changed 8 years ago by timgraham

Simona,

The old restarts do work (at least on our IBM and ARCHER) but in NEMO 3.6 modifications were made to allow the use of reduced restarts which only contain the variables sn, sshn, tn, un and vn.
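To make the distinction concrete, here is a small Python sketch that classifies a restart file from its list of variable names. The five reduced-restart names come from this ticket; the function name, the category labels and the idea of passing the "extra" variables explicitly are assumptions for illustration, not NEMO code.

```python
# The five variables a reduced restart contains, as listed in this ticket.
REDUCED_VARS = frozenset({"sn", "sshn", "tn", "un", "vn"})


def classify_restart(var_names, extra_vars):
    """Classify a restart by which variables it holds.

    var_names  -- iterable of variable names found in the file
    extra_vars -- names a full restart is expected to hold in addition
                  (supplied by the caller; no complete list is assumed here)
    Returns "incomplete", "reduced", "partial" or "full".
    """
    present = set(var_names)
    extras = set(extra_vars)
    if not REDUCED_VARS <= present:
        return "incomplete"  # even the five basic fields are not all there
    found = present & extras
    if not found:
        return "reduced"     # only the basic fields, none of the extras
    if found == extras:
        return "full"        # every expected extra field is present
    return "partial"         # some extras present, some missing


# Extra fields named in this ticket's comment and commit log.
extras = ["avt_k", "avm_k", "avmu_k", "avmv_k"]
print(classify_restart(["tn", "sn", "un", "vn", "sshn"], extras))
```

A check like this could flag a reduced restart before the run starts, rather than after a blow-up at the first time step.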

The commits at r4821 and r4822 are code changes to initialize the variables correctly in the code (not just compiler changes).

Are you saying that you are still unable to run the SETTE using the latest files on the dodsp server at r4836 of the trunk?

comment:6 Changed 8 years ago by flavoni

Hi Tim,

yes, exactly,

I ran SETTE today and found the same problem as Christian (same machine) and mathiot.

The first time, I ran with the old restart and it did not work. Then I downloaded the new "reduced restarts" and could run 1 time step, but I get a salinity problem at kt=1.

Simona

comment:7 Changed 8 years ago by timgraham

  • Resolution set to fixed
  • Status changed from reopened to closed

Fixed in r4839.

comment:8 Changed 8 years ago by jchanut

Hi,

A quick comment about the problems encountered in this ticket.
AMM12 makes use of a reduced restart for initialization, although the code as a whole has not been designed for that: I mean for starting from a state other than rest with ln_rstart=.TRUE. It seems that this capability suddenly appeared in the AMM12 settings without any consultation. It raised a number of problems when transitioning to the new code version. We have tried to adapt new developments to it as much as possible, but it is likely that this only works with the AMM12 settings as they are today. We will certainly face similar issues in future versions.

In short, I think it is dangerous to allow users to "hot start" with ln_rstart=T until a specific task is scheduled.

Jérôme

comment:9 Changed 6 years ago by nicolasmartin

  • Milestone NEMO Validation: SETTE (previously NVTK) deleted

comment:10 Changed 5 years ago by nemo

  • Keywords restartability added; restart removed