Opened 3 months ago

Closed 6 weeks ago

#2456 closed Bug (fixed)

model does not stop properly in stpctl

Reported by: smasson Owned by: systeam
Priority: low Milestone:
Component: MULTIPLE Version: 4.0-HEAD
Severity: minor Keywords:
Cc:

Description

Context

Same as #2418 but for the r4.0-HEAD

Analysis

see #2418

Recommendation

In order to limit the modifications done in the r4.0-HEAD:

  • I ported only the necessary modifications to OCE/stpctl.F90 (without the cleaning and the optimisations).
  • I made only the minimum changes in SAS/stpctl.F90, without adding a check of the min/max in this routine.

Commit History (4)

Changeset  Author   Time                       ChangeLog
13137      smasson  2020-06-22T08:29:57+02:00  r4.0-HEAD: fix maxval values on land subdomains for stpctl, see #2456
13116      smasson  2020-06-16T21:15:19+02:00  r4.0-HEAD: fix potential deadlock, see #2456
13013      smasson  2020-06-03T10:33:06+02:00  r4.0-HEAD: make sure error messages are visible, see #2456
12859      smasson  2020-05-03T11:33:32+02:00  r4-HEAD: stpctl bugfix, see #2456

Change History (11)

comment:1 Changed 3 months ago by smasson

In 12859:

r4-HEAD: stpctl bugfix, see #2456

comment:2 Changed 3 months ago by smasson

  • Resolution set to fixed
  • Status changed from new to closed

Fixed in [12859].
[12859] passes all sette tests and gives the same results as r4.0-HEAD@12857

Current code is : NEMO/releases/r4.0/r4.0-HEAD @ r12859  ( last change @ r12859 )

SETTE validation report generated for :

       NEMO/releases/r4.0/r4.0-HEAD @ r12859 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  12859
WGYRE_PISCES_ST              tracer.stat restartability  passed :  12859
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  12859
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  12859
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  12859
WAMM12_ST                    run.stat    restartability  passed :  12859
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  12859
WAGRIF_DEMO_ST               run.stat    restartability  passed :  12859
WSPITZ12_ST                  run.stat    restartability  passed :  12859
WISOMIP_ST                   run.stat    restartability  passed :  12859
WOVERFLOW_ST                 run.stat    restartability  passed :  12859
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  12859
WVORTEX_ST                   run.stat    restartability  passed :  12859
WICE_AGRIF_ST                run.stat    restartability  passed :  12859

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  12859
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  12859
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  12859
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  12859
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  12859
WAMM12_ST                    run.stat    reproducibility passed :  12859
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  12859
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  12859
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  12859
WSPITZ12_ST                  run.stat    reproducibility passed :  12859
WISOMIP_ST                   run.stat    reproducibility passed :  12859
WVORTEX_ST                   run.stat    reproducibility passed :  12859
WICE_AGRIF_ST                run.stat    reproducibility passed :  12859

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  12859 12859

   !----result comparison check----!

check result differences between :
VALID directory : /gpfsscratch/rech/fqx/reee217/r4.0-HEAD/NEMO_VALIDATION at rev 12859
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_VALIDATION/r4.0 at rev 12857

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

!!---------------2nd pass------------------!!

   !----restart----!

   !----repro----!

   !----agrif check----!

   !----result comparison check----!

check result differences between :
VALID directory : /gpfsscratch/rech/fqx/reee217/r4.0-HEAD/NEMO_VALIDATION at rev 12859
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_VALIDATION/r4.0 at rev 12857

comment:3 Changed 2 months ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:4 Changed 2 months ago by smasson

In 13013:

r4.0-HEAD: make sure error messages are visible, see #2456

comment:5 Changed 2 months ago by smasson

  • Resolution set to fixed
  • Status changed from reopened to closed

Fixed in [13013].
[13013] passes all sette tests and gives the same results as r4.0-HEAD@12926

Current code is : NEMO/releases/r4.0/r4.0-HEAD @ r13013  ( last change @ r13013 )

SETTE validation report generated for :

       NEMO/releases/r4.0/r4.0-HEAD @ r13013 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  13013
WGYRE_PISCES_ST              tracer.stat restartability  passed :  13013
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  13013
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  13013
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  13013
WAMM12_ST                    run.stat    restartability  passed :  13013
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  13013
WAGRIF_DEMO_ST               run.stat    restartability  passed :  13013
WSPITZ12_ST                  run.stat    restartability  passed :  13013
WISOMIP_ST                   run.stat    restartability  passed :  13013
WOVERFLOW_ST                 run.stat    restartability  passed :  13013
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  13013
WVORTEX_ST                   run.stat    restartability  passed :  13013
WICE_AGRIF_ST                run.stat    restartability  passed :  13013

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  13013
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  13013
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  13013
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  13013
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  13013
WAMM12_ST                    run.stat    reproducibility passed :  13013
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  13013
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  13013
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  13013
WSPITZ12_ST                  run.stat    reproducibility passed :  13013
WISOMIP_ST                   run.stat    reproducibility passed :  13013
WVORTEX_ST                   run.stat    reproducibility passed :  13013
WICE_AGRIF_ST                run.stat    reproducibility passed :  13013

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  13013 13013

   !----result comparison check----!

check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 13013
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 12926

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:6 Changed 7 weeks ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

Same as #2418, in stpctl, if

  • nstop > 0 when entering the routine
  • and we don't do collective communication
  • and no other errors are found in the tests on min/max values

⇒ We won't call ctl_stop and, on exiting stpctl, some processes will have nstop > 0 while others won't.
This creates an MPI deadlock.
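
The situation above can be illustrated without MPI. This is only a minimal Python model, not the NEMO code: each list entry stands for the local nstop of one process, and the `collective` flag stands in for a global reduction (for example an MPI all-reduce with MAX). The function names `stop_decisions` and `all_agree` are hypothetical.

```python
def stop_decisions(local_nstop, collective):
    """Return, for each simulated process, whether it decides to stop.

    local_nstop: list of per-process error counters (stand-in for nstop).
    collective:  if True, emulate a global MAX reduction so that every
                 process sees the same value; if False, each process
                 only looks at its own counter.
    """
    if collective:
        # Every process receives the global maximum, so all decide alike.
        global_nstop = max(local_nstop)
        return [global_nstop > 0 for _ in local_nstop]
    # No communication: the decision depends on local information only.
    return [n > 0 for n in local_nstop]


def all_agree(decisions):
    """True when every simulated process took the same decision."""
    return len(set(decisions)) == 1


# One process out of four entered the routine with nstop > 0.
nstop_per_process = [0, 1, 0, 0]

without_comm = stop_decisions(nstop_per_process, collective=False)
with_comm = stop_decisions(nstop_per_process, collective=True)

print(without_comm, all_agree(without_comm))
print(with_comm, all_agree(with_comm))
```

In the model without communication the processes disagree, which corresponds to the case where some processes head for the stop while the others carry on and wait for them. With the emulated reduction all processes take the same decision.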

comment:7 Changed 7 weeks ago by smasson

In 13116:

r4.0-HEAD: fix potential deadlock, see #2456

comment:8 Changed 7 weeks ago by smasson

  • Resolution set to fixed
  • Status changed from reopened to closed

[13116] passes all sette tests and gives the same results as r4.0-HEAD@13095

Current code is : NEMO/releases/r4.0/r4.0-HEAD @ r13116  ( last change @ r13116 )

SETTE validation report generated for :

       NEMO/releases/r4.0/r4.0-HEAD @ r13116 (last changed revision)

       on X64_IRENE arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  13116
WGYRE_PISCES_ST              tracer.stat restartability  passed :  13116
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  13116
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  13116
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  13116
WAMM12_ST                    run.stat    restartability  passed :  13116
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  13116
WAGRIF_DEMO_ST               run.stat    restartability  passed :  13116
WSPITZ12_ST                  run.stat    restartability  passed :  13116
WISOMIP_ST                   run.stat    restartability  passed :  13116
WOVERFLOW_ST                 run.stat    restartability  passed :  13116
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  13116
WVORTEX_ST                   run.stat    restartability  passed :  13116
WICE_AGRIF_ST                run.stat    restartability  passed :  13116

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  13116
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  13116
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  13116
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  13116
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  13116
WAMM12_ST                    run.stat    reproducibility passed :  13116
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  13116
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  13116
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  13116
WSPITZ12_ST                  run.stat    reproducibility passed :  13116
WISOMIP_ST                   run.stat    reproducibility passed :  13116
WVORTEX_ST                   run.stat    reproducibility passed :  13116
WICE_AGRIF_ST                run.stat    reproducibility passed :  13116

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  13116 13116

   !----result comparison check----!

check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 13116
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 13095

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical

comment:9 Changed 6 weeks ago by smasson

  • Resolution fixed deleted
  • Status changed from closed to reopened

Same as #2418:

MAXVAL with a mask can return -HUGE on land processors, i.e. subdomains where the mask is false everywhere.
When sn_cfctl%l_runstat = F, the sum of these values overflows, so the following infinity test is true even though the fields are fine:

ABS( zmax(1) + zmax(2) + zmax(3) ) > HUGE(1._wp)
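
The effect can be reproduced outside NEMO. This is a small Python sketch, assuming IEEE double precision: `sys.float_info.max` plays the role of HUGE(1._wp), and the hypothetical helper `masked_maxval` mimics the Fortran rule that MAXVAL returns -HUGE when the mask selects no element.

```python
import sys

# Stand-in for Fortran HUGE(1._wp) in double precision (assumption).
HUGE = sys.float_info.max


def masked_maxval(values, mask):
    """Mimic Fortran MAXVAL(values, MASK=mask).

    When no element is selected by the mask, Fortran returns the most
    negative representable number, i.e. -HUGE.
    """
    return max((v for v, m in zip(values, mask) if m), default=-HUGE)


def infinity_test(zmax):
    """Same form as the test quoted in the ticket."""
    return abs(zmax[0] + zmax[1] + zmax[2]) > HUGE


field = [1.0, 2.0, 3.0]

# Land-only subdomain: the mask is false everywhere.
zmax_land = [masked_maxval(field, [False, False, False]) for _ in range(3)]

# Subdomain with some ocean points.
zmax_ocean = [masked_maxval(field, [True, True, False]) for _ in range(3)]

print(zmax_land, infinity_test(zmax_land))
print(zmax_ocean, infinity_test(zmax_ocean))
```

On the land-only subdomain the three -HUGE values add up to -Infinity, so the test fires although nothing is wrong with the fields. On the subdomain with ocean points the test stays false.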

comment:10 Changed 6 weeks ago by smasson

In 13137:

r4.0-HEAD: fix maxval values on land subdomains for stpctl, see #2456

comment:11 Changed 6 weeks ago by smasson

  • Resolution set to fixed
  • Status changed from reopened to closed

[13137] fixes the problem. It passes all sette tests and gives the same results as [13095]

SETTE validation report generated for :

        @ r13137 (last changed revision)

       on X64_JEANZAY arch file


!!---------------1st pass------------------!!

   !----restart----!
WGYRE_PISCES_ST              run.stat    restartability  passed :  13137
WGYRE_PISCES_ST              tracer.stat restartability  passed :  13137
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  13137
WORCA2_ICE_PISCES_ST         tracer.stat restartability  passed :  13137
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  13137
WAMM12_ST                    run.stat    restartability  passed :  13137
WORCA2_SAS_ICE_ST            run.stat    restartability  passed :  13137
WAGRIF_DEMO_ST               run.stat    restartability  passed :  13137
WSPITZ12_ST                  run.stat    restartability  passed :  13137
WISOMIP_ST                   run.stat    restartability  passed :  13137
WOVERFLOW_ST                 run.stat    restartability  passed :  13137
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  13137
WVORTEX_ST                   run.stat    restartability  passed :  13137
WICE_AGRIF_ST                run.stat    restartability  passed :  13137

   !----repro----!
WGYRE_PISCES_ST              run.stat    reproducibility passed :  13137
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  13137
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  13137
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  13137
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  13137
WAMM12_ST                    run.stat    reproducibility passed :  13137
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  13137
WORCA2_ICE_OBS_ST            run.stat    reproducibility passed :  13137
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  13137
WSPITZ12_ST                  run.stat    reproducibility passed :  13137
WISOMIP_ST                   run.stat    reproducibility passed :  13137
WVORTEX_ST                   run.stat    reproducibility passed :  13137
WICE_AGRIF_ST                run.stat    reproducibility passed :  13137

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  13137 13137

   !----result comparison check----!

check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 13137
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/r4.0-HEAD/NEMO_VALIDATION at rev 13095

WGYRE_PISCES_ST       run.stat    files are identical
WGYRE_PISCES_ST       tracer.stat files are identical
WORCA2_ICE_PISCES_ST  run.stat    files are identical
WORCA2_ICE_PISCES_ST  tracer.stat files are identical
WORCA2_OFF_PISCES_ST  tracer.stat files are identical
WAMM12_ST             run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WORCA2_SAS_ICE_ST     run.stat    files are identical
WAGRIF_DEMO_ST        run.stat    files are identical
WSPITZ12_ST           run.stat    files are identical
WISOMIP_ST            run.stat    files are identical
WVORTEX_ST            run.stat    files are identical
WICE_AGRIF_ST         run.stat    files are identical