Opened 5 months ago

Last modified 6 weeks ago

#2418 reopened Bug

model does not stop properly in stpctl

Reported by: smasson
Owned by: systeam
Priority: low
Milestone:
Component: MULTIPLE
Version: trunk
Severity: minor
Keywords:
Cc:

Description

Context

If an error is detected in stpctl, whether or not the model stops properly depends on the namelist choices made for sn_cfctl.

Analysis

For example, with sn_cfctl%l_glochk = .true. and all other sn_cfctl% flags set to .false., the model does not stop because of a deadlock. The variable lsomeoce is also a potential source of deadlock.

Fix

Large debugging/cleaning of stpctl is needed…

Commit History (23)

Changeset  Author   Time                       ChangeLog
13136      smasson  2020-06-22T08:29:44+02:00  trunk: fix maxval values on land subdomains for stpctl, see #2418
13115      smasson  2020-06-16T20:58:06+02:00  trunk: fix potential deadlock, see #2418
13011      smasson  2020-06-03T09:56:28+02:00  trunk: make sure error messages are visible, see #2418
12935      smasson  2020-05-15T14:15:31+02:00  delete #2418 branch
12933      smasson  2020-05-15T10:06:25+02:00  trunk: merge back r12581_ticket2418 branch into the trunk, see #2418
12932      smasson  2020-05-15T10:01:11+02:00  r12581_ticket2418: update sette version in svn externals definition, see #2418
12931      smasson  2020-05-15T09:59:05+02:00  sette: suppress set_namelist for sn_cfctl%l_config, see #2418
12930      smasson  2020-05-15T09:51:24+02:00  r12581_ticket2418: update with trunk @12929, see #2418
12858      smasson  2020-05-03T11:04:27+02:00  r12581_ticket2418: bugfix not seen on X64_IRENE, see #2418
12856      smasson  2020-05-02T11:55:39+02:00  r12581_ticket2418: stupid bugfix following [12855], see #2418
12855      smasson  2020-05-01T19:09:33+02:00  r12581_ticket2418: add check for Infinity, see #2418
12853      smasson  2020-05-01T18:56:02+02:00  r12581_ticket2418: merge with trunk@12852, see #2418
12846      smasson  2020-05-01T14:07:29+02:00  r12581_ticket2418: merge with trunk@12845, see #2418
12844      smasson  2020-05-01T12:57:50+02:00  r12581_ticket2418: merge with trunk@12843, see #2418
12840      smasson  2020-05-01T10:58:58+02:00  r12581_ticket2418: improve stpctl error messages and release the max of 9999 MPI tasks in files names, see #2418
12835      smasson  2020-04-30T08:55:37+02:00  r12581_ticket2418: suppress l_allon and l_config namelist parameters, see #2418
12718      smasson  2020-04-08T17:21:05+02:00  r12581_ticket2418: bugfix for C1D and STATION_ASF, see #2418
12685      smasson  2020-04-06T11:52:15+02:00  r12581_ticket2418: end cleaning, see #2418
12684      smasson  2020-04-05T18:47:37+02:00  r12581_ticket2418: additional cleaning, see #2418
12655      smasson  2020-04-03T11:35:09+02:00  r12581_ticket2418: merge with trunk@12654, see #2418
12623      smasson  2020-03-28T08:38:26+01:00  r12581_ticket2418: merge with trunk@12622, see #2418
12593      smasson  2020-03-24T16:52:17+01:00  r12581_ticket2418, first commit see #2418
12582      smasson  2020-03-21T11:58:26+01:00  r12581_ticket2418: create branch from trunk@12581, see #2418

Change History (35)

comment:1 Changed 5 months ago by smasson

In 12582:

r12581_ticket2418: create branch from trunk@12581, see #2418

comment:2 Changed 4 months ago by smasson

In 12593:

r12581_ticket2418, first commit see #2418

comment:3 Changed 4 months ago by smasson

This is quite a large commit for a bugfix, so I used a branch to share it before merging it back into the trunk.

One of the complexities of this ticket comes from the duplication of the "same" routines in different directories. Maintaining and synchronizing these directories is not always done properly, especially when these configurations/test cases are not tested by sette…

Here are the main points of this commit:

  • fix the reported bug
  • report the number of the process on which the error is found
  • add a "CALL SLEEP (60)" in urgent and imperative stop to make sure that all processes have time to write their error messages and their abort file
  • some minor bugfixes: order of the arguments in one call to dia_wri_state
  • some minor optimisations: use of llmsk, do not look for min/max if not needed etc…
  • general cleaning of stpctl (with more comments to follow which process is doing what)
  • synchronization of nemogcm, step and stpctl in OCE, SAS, C1D. The OFF version seems to be OK.
  • add error tests in SAS/stpctl
  • partial rewrite of stpctl to minimize the differences between stpctl in OCE and SAS.
  • suppress sn_cfctl%l_glochk which had no real use (or I missed it)

Here are the points still missing from this commit:

  • C1D and STATION_ASF are not in the sette tests and must therefore be tested (and included in the sette tests, I guess). One must check that these configurations are still working AND that they stop properly when an error has to be detected. This last test must be done by forcing an error in step, just before calling stp_ctl
  • I would like to suppress sn_cfctl%l_allon and sn_cfctl%l_config which I do not find very useful
  • the test on ice temperature has been moved from -100 to -101 as errors were detected in ICE_AGRIF (could not get sette reproducibility). This error in ICE_AGRIF should be solved independently of this ticket
  • we should try to limit the number of subroutines in STATION_ASF/MY_SRC to make it sustainable…

This branch passes all sette tests and gives the same results as trunk@12563:

-bash-4.2$ ./sette_rpt.sh

Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12582  ( last change @ r12582 )

SETTE validation report generated for :

       NEMO/branches/2020/r12581_ticket2418 @ r12582+ (last changed revision)

       on X64_IRENE arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12582+
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12582+
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12582+
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12582+
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12582+
WAMM12_ST                    run.stat    restartability  passed :  12582+
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12582+
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12582+
WSPITZ12_ST                  run.stat    restartability  passed :  12582+
WISOMIP_ST                   run.stat    restartability  passed :  12582+
WOVERFLOW_ST                 run.stat    restartability  passed :  12582+
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12582+
WVORTEX_ST                   run.stat    restartability  passed :  12582+
WICE_AGRIF_ST                run.stat    restartability  passed :  12582+

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12582+
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12582+
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12582+
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12582+
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12582+
WAMM12_ST                    run.stat    reproducibility passed :  12582+
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12582+
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12582+
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12582+
WSPITZ12_ST                  run.stat    reproducibility passed :  12582+
WISOMIP_ST                   run.stat    reproducibility passed :  12582+
WVORTEX_ST                   run.stat    reproducibility passed :  12582+
WICE_AGRIF_ST                run.stat    reproducibility passed :  12582+

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12582+ 12582+

   !----result comparison check----!

check result differences between :
VALID directory : /ccc/scratch/cont005/ra0542/massons/r12581_ticket2418/NEMO_VALIDATION at rev 12582+
and
REFERENCE directory : /ccc/scratch/cont005/ra0542/massons/trunk/NEMO_VALIDATION at rev 12563

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:4 Changed 4 months ago by smasson

In 12623:

r12581_ticket2418: merge with trunk@12622, see #2418

comment:5 Changed 4 months ago by smasson

In 12655:

r12581_ticket2418: merge with trunk@12654, see #2418

comment:6 Changed 4 months ago by smasson

In 12684:

r12581_ticket2418: additional cleaning, see #2418

comment:7 Changed 4 months ago by smasson

This branch passes all sette tests with GCC and gives the same results as trunk@12650:

Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12683  ( last change @ r12655 )

SETTE validation report generated for :

       NEMO/branches/2020/r12581_ticket2418 @ r12655+ (last changed revision)

       on X64_IRENE_GCC arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12655+
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12655+
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12655+
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12655+
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12655+
WAMM12_ST                    run.stat    restartability  passed :  12655+
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12655+
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12655+
WSPITZ12_ST                  run.stat    restartability  passed :  12655+
WISOMIP_ST                   run.stat    restartability  passed :  12655+
WOVERFLOW_ST                 run.stat    restartability  passed :  12655+
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12655+
WVORTEX_ST                   run.stat    restartability  passed :  12655+
WICE_AGRIF_ST                run.stat    restartability  passed :  12655+

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12655+
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12655+
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12655+
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12655+
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12655+
WAMM12_ST                    run.stat    reproducibility passed :  12655+
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12655+
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12655+
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12655+
WSPITZ12_ST                  run.stat    reproducibility passed :  12655+
WISOMIP_ST                   run.stat    reproducibility passed :  12655+
WVORTEX_ST                   run.stat    reproducibility passed :  12655+
WICE_AGRIF_ST                run.stat    reproducibility passed :  12655+

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12655+ 12655+

   !----result comparison check----!

check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12655+
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12650

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:8 Changed 4 months ago by smasson

In 12685:

r12581_ticket2418: end cleaning, see #2418

comment:9 Changed 4 months ago by smasson

In 12718:

r12581_ticket2418: bugfix for C1D and STATION_ASF, see #2418

comment:10 Changed 3 months ago by smasson

In 12835:

r12581_ticket2418: suppress l_allon and l_config namelist parameters, see #2418

comment:11 Changed 3 months ago by smasson

In 12840:

r12581_ticket2418: improve stpctl error messages and release the max of 9999 MPI tasks in files names, see #2418

comment:12 Changed 3 months ago by smasson

This version passes all the sette tests and gives the same results as the trunk at rev 12650.

Note that, to use sette, I had to remove all the following lines

 set_namelist namelist_cfg sn_cfctl%l_config .true.

in sette_reference-configurations.sh and sette_test-cases.sh, as sn_cfctl%l_config has been removed from the namelist.

-bash-4.2$ ./sette_rpt.sh

Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12840  ( last change @ r12840 )

SETTE validation report generated for :

       NEMO/branches/2020/r12581_ticket2418 @ r12840 (last changed revision)

       on X64_IRENE arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12840
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12840
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12840
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12840
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12840
WAMM12_ST                    run.stat    restartability  passed :  12840
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12840
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12840
WSPITZ12_ST                  run.stat    restartability  passed :  12840
WISOMIP_ST                   run.stat    restartability  passed :  12840
WOVERFLOW_ST                 run.stat    restartability  passed :  12840
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12840
WVORTEX_ST                   run.stat    restartability  passed :  12840
WICE_AGRIF_ST                run.stat    restartability  passed :  12840

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12840
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12840
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12840
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12840
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12840
WAMM12_ST                    run.stat    reproducibility passed :  12840
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12840
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12840
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12840
WSPITZ12_ST                  run.stat    reproducibility passed :  12840
WISOMIP_ST                   run.stat    reproducibility passed :  12840
WVORTEX_ST                   run.stat    reproducibility passed :  12840
WICE_AGRIF_ST                run.stat    reproducibility passed :  12840

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12840 12840

   !----result comparison check----!

check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12840
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12650

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:13 Changed 3 months ago by smasson

In 12844:

r12581_ticket2418: merge with trunk@12843, see #2418

comment:14 Changed 3 months ago by smasson

In 12846:

r12581_ticket2418: merge with trunk@12845, see #2418

comment:15 Changed 3 months ago by smasson

In 12853:

r12581_ticket2418: merge with trunk@12852, see #2418

comment:16 Changed 3 months ago by smasson

In 12855:

r12581_ticket2418: add check for Infinity, see #2418
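
For readers unfamiliar with this kind of test, a check for Infinity can be sketched with the standard ieee_arithmetic module. This is only an illustration, not the code committed in the branch: the program name, the array zfld and its values are hypothetical.

```fortran
PROGRAM inf_check_sketch
   ! Sketch only: the array name and its values are hypothetical, not taken from stpctl.
   USE, INTRINSIC :: ieee_arithmetic, ONLY : ieee_is_finite, ieee_value, ieee_positive_inf
   IMPLICIT NONE
   DOUBLE PRECISION :: zfld(4)

   zfld(:) = (/ 1.5d0, -2.0d0, 0.0d0, 3.0d0 /)
   zfld(3) = ieee_value( zfld(3), ieee_positive_inf )   ! inject an Infinity on purpose

   ! ieee_is_finite is elemental and returns .false. for both Infinity and NaN
   IF( .NOT. ALL( ieee_is_finite( zfld ) ) ) THEN
      PRINT *, 'E R R O R : non-finite value found'
   ENDIF
END PROGRAM inf_check_sketch
```

A test on MAXVAL alone can miss NaN values, which is one reason to test finiteness explicitly.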

comment:17 Changed 3 months ago by smasson

In 12856:

r12581_ticket2418: stupid bugfix following [12855], see #2418

comment:18 Changed 3 months ago by smasson

This version gives the same results as the trunk@12852:

Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12856  ( last change @ r12856 )

SETTE validation report generated for :

       NEMO/branches/2020/r12581_ticket2418 @ r12856 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12856
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12856
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12856
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12856
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12856
WAMM12_ST                    run.stat    restartability  passed :  12856
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12856
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12856
WSPITZ12_ST                  run.stat    restartability  passed :  12856
WISOMIP_ST                   run.stat    restartability  passed :  12856
WOVERFLOW_ST                 run.stat    restartability  passed :  12856
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12856
WVORTEX_ST                   run.stat    restartability  passed :  12856
WICE_AGRIF_ST                run.stat    restartability  passed :  12856

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12856
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12856
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12856
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12856
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12856
WAMM12_ST                    run.stat    reproducibility passed :  12856
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12856
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12856
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12856
WSPITZ12_ST                  run.stat    reproducibility passed :  12856
WISOMIP_ST                   run.stat    reproducibility passed :  12856
WVORTEX_ST                   run.stat    reproducibility passed :  12856
WICE_AGRIF_ST                run.stat    reproducibility passed :  12856

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12856 12856

   !----result comparison check----!

check result differences between :
VALID directory : /gpfsscratch/rech/fqx/reee217/r12581_ticket2418/NEMO_VALIDATION at rev 12856
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_VALIDATION/trunk at rev 12852

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:19 Changed 3 months ago by smasson

I think this branch is ready to be merged back into the trunk.

However, there are still 2 points to discuss:

  • this branch needs a modification of sette as sn_cfctl%l_config has been removed from the namelist
  • the documentation should also be modified because of the modifications done in namctl. It should be easy as we just have to delete some text. But as the compilation of the documentation is not working in the trunk, maybe it would be better to modify the documentation only once its compilation problem has been fixed…

comment:20 Changed 3 months ago by smasson

In 12858:

r12581_ticket2418: bugfix not seen on X64_IRENE, see #2418

comment:21 Changed 3 months ago by smasson

Strange behavior of the WRITE command…

I tested the following program with different compilers:

PROGRAM tst

   CHARACTER(10) char

   WRITE(char,*) 'aaa'
   PRINT *,char
   WRITE(char,*) 'bbb ', TRIM(char)
   PRINT *,char
   
END PROGRAM tst

ifort 16.0.4 20160811 and ifort 18.0.5 20180823 print

  aaa
  bbb  aaa

which was what I expected…
But ifort 19.0.5.281 20190815, gcc/4.8.5, gcc/9.1.0 and pgi/20.1 will print

  aaa
  bbb  bbb

So it looks like the syntax I used was not correct. In any case, it was not a good idea…
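
One way to avoid the problem is to make sure the internal file never appears in its own output list, for example by copying it into a temporary first. A minimal sketch of the same test, where the temporary ctmp and the program name are hypothetical additions:

```fortran
PROGRAM tst_tmp

   CHARACTER(10) :: char
   CHARACTER(10) :: ctmp   ! hypothetical temporary, not in the original test

   WRITE(char,*) 'aaa'
   PRINT *, char
   ctmp = char                         ! copy first, so char is no longer read while being written
   WRITE(char,*) 'bbb ', TRIM(ctmp)
   PRINT *, char

END PROGRAM tst_tmp
```

With this form the output list no longer depends on the variable being written, so the result should not depend on how a given compiler buffers the internal WRITE.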


comment:22 Changed 3 months ago by smasson

In 12930:

r12581_ticket2418: update with trunk @12929, see #2418

comment:23 Changed 3 months ago by smasson

In 12931:

sette: suppress set_namelist for sn_cfctl%l_config, see #2418

comment:24 Changed 3 months ago by smasson

In 12932:

r12581_ticket2418: update sette version in svn externals definition, see #2418

comment:25 Changed 3 months ago by smasson

In 12933:

trunk: merge back r12581_ticket2418 branch into the trunk, see #2418

comment:26 Changed 3 months ago by smasson

  • Resolution set to fixed
  • Status changed from new to closed

fixed in [12933].
[12933] passes all the sette tests and gives the same results as trunk@12925

[reee217@jean-zay3: sette]$ ./sette_rpt.sh

Current code is : NEMO/trunk @ r12933  ( last change @ r12933 )

SETTE validation report generated for :

       NEMO/trunk @ r12933 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12933
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12933
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12933
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12933
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12933
WAMM12_ST                    run.stat    restartability  passed :  12933
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12933
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12933
WSPITZ12_ST                  run.stat    restartability  passed :  12933
WISOMIP_ST                   run.stat    restartability  passed :  12933
WOVERFLOW_ST                 run.stat    restartability  passed :  12933
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12933
WVORTEX_ST                   run.stat    restartability  passed :  12933
WICE_AGRIF_ST                run.stat    restartability  passed :  12933

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12933
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12933
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12933
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12933
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12933
WAMM12_ST                    run.stat    reproducibility passed :  12933
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12933
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12933
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12933
WSPITZ12_ST                  run.stat    reproducibility passed :  12933
WISOMIP_ST                   run.stat    reproducibility passed :  12933
WVORTEX_ST                   run.stat    reproducibility passed :  12933
WICE_AGRIF_ST                run.stat    reproducibility passed :  12933

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12933 12933

   !----result comparison check----!

check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12933
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:27 Changed 3 months ago by smasson

In 12935:

delete #2418 branch

comment:28 Changed 2 months ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

The solution that was coded is not good as it is machine/compiler dependent…

In ctl_stop, when all processes detecting an error write to the same ocean.output file, there is a chance that some error messages are erased/overwritten. This is especially the case when 'STOP' is not the first argument of ctl_stop. In this case, many other things can potentially be written to the ocean.output file, which can erase some error messages…

Solution :

  • each error message is written to its own ocean.output_xxxx file. If this file is not already open when entering ctl_stop, we create it.
  • if 'STOP' is the first argument of ctl_stop: each process detecting an error also adds the following message to the ocean.output file:
     ==>>>   Look for "E R R O R" messages in all existing *ocean.output* files

    This message can be erased/overwritten by all processes detecting an error, but we know that the last process writing to ocean.output will be a process writing this message. So we will get at least 1 line with the message!
  • if 'STOP' is not the first argument of ctl_stop, we know that process 0 will be the last one writing to the ocean.output file at the end of nemo_gcm, when nstop > 0 is tested. The above error message is then written to ocean.output only by process 0.
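
The per-process file described in the first point could look roughly like the sketch below. It is only an illustration under assumptions: the program and function names, the 0-based rank used as suffix and the rule "at least 4 digits, more if the number of tasks requires it" are hypothetical and not taken from ctl_stop.

```fortran
PROGRAM err_file_sketch
   ! Sketch only: names and values are hypothetical, this is not the ctl_stop code.
   IMPLICIT NONE
   INTEGER :: irank = 11        ! hypothetical 0-based process rank
   INTEGER :: isize = 20000     ! hypothetical total number of MPI tasks
   INTEGER :: numerr
   LOGICAL :: llopen
   CHARACTER(len=64) :: clname

   clname = err_file_name( irank, isize )
   ! open the process-specific file only if it is not already connected
   INQUIRE( FILE = TRIM(clname), OPENED = llopen, NUMBER = numerr )
   IF( .NOT. llopen ) THEN
      OPEN( NEWUNIT = numerr, FILE = TRIM(clname), STATUS = 'UNKNOWN',   &
         &  POSITION = 'APPEND', ACTION = 'WRITE' )
   ENDIF
   WRITE(numerr,*) ' E R R O R : detected by process ', irank
   CLOSE(numerr)

CONTAINS

   FUNCTION err_file_name( krank, ksize ) RESULT( cdname )
      INTEGER, INTENT(in) :: krank, ksize
      CHARACTER(len=64)   :: cdname
      CHARACTER(len=16)   :: clfmt
      INTEGER             :: idg, itmp
      ! count the digits needed for the largest rank (ksize-1)
      idg  = 1
      itmp = ksize - 1
      DO WHILE( itmp >= 10 )
         itmp = itmp / 10
         idg  = idg + 1
      END DO
      idg = MAX( 4, idg )                                 ! keep at least 4 digits
      WRITE(clfmt, "('(a,i', i0, '.', i0, ')')") idg, idg ! builds e.g. (a,i5.5)
      WRITE(cdname, clfmt) 'ocean.output_', krank
   END FUNCTION err_file_name

END PROGRAM err_file_sketch
```

Since each process writes only to its own file, no message can be overwritten by another process; the price is that the user has to look through several files, hence the pointer line written to ocean.output.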

comment:29 Changed 2 months ago by smasson

In 13011:

trunk: make sure error messages are visible, see #2418

comment:30 Changed 2 months ago by smasson

  • Resolution set to fixed
  • Status changed from reopened to closed

fixed in [13011]
[13012] passes all sette tests and gives the same results as trunk@12925

Current code is : NEMO/trunk @ r13012  ( last change @ r13012 )

SETTE validation report generated for :

       NEMO/trunk @ r13012 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  13012
WGYRE_PISCES_ST              tracer.stat restartability  passed :  13012
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  13012
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  13012
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  13012
WAMM12_ST                    run.stat    restartability  passed :  13012
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  13012
WAGRIF_DEMO_ST               run.stat    restartability  passed :  13012
WSPITZ12_ST                  run.stat    restartability  passed :  13012
WISOMIP_ST                   run.stat    restartability  passed :  13012
WOVERFLOW_ST                 run.stat    restartability  passed :  13012
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  13012
WVORTEX_ST                   run.stat    restartability  passed :  13012
WICE_AGRIF_ST                run.stat    restartability  passed :  13012

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  13012
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  13012
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  13012
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  13012
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  13012
WAMM12_ST                    run.stat    reproducibility passed :  13012
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  13012
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  13012
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  13012
WSPITZ12_ST                  run.stat    reproducibility passed :  13012
WISOMIP_ST                   run.stat    reproducibility passed :  13012
WVORTEX_ST                   run.stat    reproducibility passed :  13012
WICE_AGRIF_ST                run.stat    reproducibility passed :  13012

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  13012 13012

   !----result comparison check----!

check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 13012
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:31 Changed 7 weeks ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

In stpctl, if

  • nstop > 0 when entering the routine
  • and we don't do collective communication
  • and no other error is found in the tests on min/max values

⇒ We won't call ctl_stop and, once exiting stpctl, some processes will have nstop > 0 while others won't.
This creates an MPI deadlock.
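
To illustrate why all processes must agree on nstop, here is a minimal MPI sketch in which every process joins a reduction on nstop before deciding to stop. It is not the fix applied in the trunk, only one generic way to make the stop decision consistent; all names except nstop are hypothetical.

```fortran
PROGRAM nstop_sync_sketch
   ! Sketch only, NOT the actual fix: it shows one generic way to make sure
   ! that all processes take the same stop decision.
   USE mpi
   IMPLICIT NONE
   INTEGER :: ierr, irank, nstop, nstop_glo

   CALL MPI_Init( ierr )
   CALL MPI_Comm_rank( MPI_COMM_WORLD, irank, ierr )

   nstop = 0
   IF( irank == 0 )   nstop = 1      ! pretend that only process 0 detected an error

   ! every process takes part in the reduction, whatever its local nstop
   CALL MPI_Allreduce( nstop, nstop_glo, 1, MPI_INTEGER, MPI_MAX, MPI_COMM_WORLD, ierr )

   ! the test uses the global value, so all processes take the same branch
   IF( nstop_glo > 0 ) THEN
      PRINT *, 'process ', irank, ' stops: local nstop = ', nstop, ', global nstop = ', nstop_glo
   ENDIF

   CALL MPI_Finalize( ierr )
END PROGRAM nstop_sync_sketch
```

If the test were done on the local nstop instead, only process 0 would leave the time loop while the others would wait forever in their next communication.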

comment:32 Changed 7 weeks ago by smasson

In 13115:

trunk: fix potential deadlock, see #2418

comment:33 Changed 7 weeks ago by smasson

  • Resolution set to fixed
  • Status changed from reopened to closed

fixed in [13115]
[13115] passes all sette tests and gives the same results as trunk@12925

-bash-4.2$ ./sette_rpt.sh

Current code is : NEMO/trunk @ r13115  ( last change @ r13115 )

SETTE validation report generated for :

       NEMO/trunk @ r13115 (last changed revision)

       on X64_IRENE arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  13115
WGYRE_PISCES_ST              tracer.stat restartability  passed :  13115
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  13115
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  13115
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  13115
WAMM12_ST                    run.stat    restartability  passed :  13115
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  13115
WAGRIF_DEMO_ST               run.stat    restartability  passed :  13115
WSPITZ12_ST                  run.stat    restartability  passed :  13115
WISOMIP_ST                   run.stat    restartability  passed :  13115
WOVERFLOW_ST                 run.stat    restartability  passed :  13115
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  13115
WVORTEX_ST                   run.stat    restartability  passed :  13115
WICE_AGRIF_ST                run.stat    restartability  passed :  13115

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  13115
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  13115
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  13115
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  13115
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  13115
WAMM12_ST                    run.stat    reproducibility passed :  13115
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  13115
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  13115
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  13115
WSPITZ12_ST                  run.stat    reproducibility passed :  13115
WISOMIP_ST                   run.stat    reproducibility passed :  13115
WVORTEX_ST                   run.stat    reproducibility passed :  13115
WICE_AGRIF_ST                run.stat    reproducibility passed :  13115

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  13115 13115

   !----result comparison check----!

check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 13115
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:34 Changed 6 weeks ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

MAXVAL with a mask can return -HUGE on land processors (subdomains where the mask has no true element).
When sn_cfctl%l_runstat = F, these values make the infinity test evaluate to true:

ABS( zmax(1) + zmax(2) + zmax(3) ) > HUGE(1._wp)
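
The arithmetic behind this false positive can be reproduced in a few lines of Python. This is a sketch, not NEMO code: `masked_max` and `looks_infinite` are hypothetical helpers, and `sys.float_info.max` stands in for HUGE(1._wp) under the assumption that wp is a 64-bit real.

```python
import sys

# Stand-in for HUGE(1._wp), assuming wp is a 64-bit IEEE real.
HUGE = sys.float_info.max


def masked_max(values, mask):
    """Mimic Fortran MAXVAL(array, mask): when no mask element is true,
    the result is the most negative representable real, i.e. -HUGE."""
    selected = [v for v, m in zip(values, mask) if m]
    return max(selected) if selected else -HUGE


def looks_infinite(zmax):
    """The test quoted above: ABS(zmax(1) + zmax(2) + zmax(3)) > HUGE."""
    return abs(zmax[0] + zmax[1] + zmax[2]) > HUGE


# Land-only subdomain: the mask is false everywhere.
zmax_land = [masked_max([0.0, 0.0, 0.0], [False, False, False]) for _ in range(3)]

# Subdomain with ocean points: ordinary finite maxima.
zmax_ocean = [masked_max([1.5, 0.2, 3.0], [True, True, False]) for _ in range(3)]

land_flagged = looks_infinite(zmax_land)     # sum of three -HUGE overflows to -inf
ocean_flagged = looks_infinite(zmax_ocean)   # finite sum, test stays false
```

Each value is finite on its own, but the sum of three -HUGE values overflows to negative infinity, so the test fires on a subdomain where nothing is actually wrong.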

comment:35 Changed 6 weeks ago by smasson

In 13136:

trunk: fix maxval values on land subdomains for stpctl, see #2418
