Opened 2 years ago

Closed 23 months ago

Last modified 23 months ago

#2307 closed Enhancement (fixed)

better errors handling

Reported by: smasson
Owned by: systeam
Priority: low
Milestone:
Component: TOP
Version: release-4.0
Severity: minor
Keywords:
Cc:
Branch review:
MP ready?:
Task progress:


In some cases an error is detected without any message being printed. You get something like this in ocean.output, with no further information:

===>>> : E R R O R

    ==>>>   nemo_gcm: a total of            2  errors have been found

This is quite frustrating…


1) Errors occurring in the very first part of nemo_init, before ocean.output is opened, are not all properly handled. We must extend the use of ldtxt in lib_mpp.

2) There are still several STOP statements in the code that should be replaced by calls to ctl_stop. See for example ctl_opn.

NST/agrif_user.F90:         STOP
NST/agrif_user.F90:         STOP
NST/agrif_oce_update.F90:                     STOP
NST/agrif_top_update.F90:                     STOP
OCE/CRS/crsdom.F90:                    STOP
OCE/CRS/crsdom.F90:                 STOP
OCE/CRS/crsdom.F90:                 STOP
OCE/SBC/cpl_oasis3.F90:         CALL oasis_abort( ncomp_id, "cpl_finalize", "NEMO ABORT STOP" )
OCE/DOM/iscplhsb.F90:      STOP ' iscpl_cons:   please modify this module !'
OCE/LBC/lib_mpp.F90:         STOP 'ctl_opn bad opening'
OCE/LBC/lbc_nfd_nogather_generic.h90:        STOP
TOP/TRP/trdmxl_trc.F90:            CASE ( -2  )   ;   STOP 'trdmxl_trc : not ready '     !     -> isopycnal surface (see ???)
TOP/TRP/trdmxl_trc.F90:               STOP 'tmltrd_trc : key_diainstant was never checked within trdmxl. Comment this to proceed.'
TOP/TRP/trdmxl_trc.F90:         STOP 'trd_mxl_trc : this was never checked. Comment this line to proceed...'
TOP/TRP/trdmxl_trc.F90:         STOP 'Error : jpltrd_trc /= jpmxl_trc_atf .OR.  jpltrd_trc - 1 /= jpmxl_trc_radb'  ! see below
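
To illustrate the kind of replacement meant in point 2, here is a minimal sketch. The module demo_errors and the routine demo_ctl_stop are hypothetical stand-ins written for this example; they are not NEMO's ctl_stop, whose actual interface is not shown in this ticket. The idea is only that the error is reported and counted instead of the run dying silently on a bare STOP.

```fortran
MODULE demo_errors
   IMPLICIT NONE
   INTEGER :: nstop = 0   ! running count of reported errors
CONTAINS
   SUBROUTINE demo_ctl_stop( cdmsg )
      ! Hypothetical stand-in for ctl_stop: report the error and count it
      CHARACTER(len=*), INTENT(in) :: cdmsg
      nstop = nstop + 1
      WRITE(*,'(a)') ' ===>>> : E R R O R'
      WRITE(*,'(a)') '          '//TRIM(cdmsg)
   END SUBROUTINE demo_ctl_stop
END MODULE demo_errors

PROGRAM demo_replace_stop
   USE demo_errors
   IMPLICIT NONE
   INTEGER :: ia = 3, ib = 4   ! illustrative values that fail the check
   ! before:  IF( ia /= ib )   STOP
   IF( ia /= ib )   CALL demo_ctl_stop( 'demo_replace_stop: ia /= ib' )
   ! The error count is still available for a final summary
   WRITE(*,'(a,i0,a)') ' a total of ', nstop, ' errors have been found'
END PROGRAM demo_replace_stop
```

With this pattern every counted error comes with a message naming the routine and the failed check, so the final summary is never the only trace left in the output.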


Commit History (2)


dev_r10984_HPC-13 : add missing if(lwp) in [11317], see #2307 and #2285


dev_r10984_HPC-13 : improve error handling, see #2307 and #2285

Change History (5)

comment:1 Changed 23 months ago by smasson

  • Summary changed from better gestion of the errors to better errors handling

comment:2 Changed 23 months ago by smasson

In 11317:

dev_r10984_HPC-13 : improve error handling, see #2307 and #2285

comment:3 Changed 23 months ago by smasson

This bugfix was more complicated than I originally thought…

Here is a summary of the changes:
1) A large cleanup/reorganization of nemo_init

  • Rewrite mynode (renamed mpp_start) so that it performs only basic MPI tasks (get the local communicator, the mpp size and the rank) without any namelist input/output, so we can call it at the very beginning of nemo_init.
  • Remove the old nammpp namelist options cn_mpi_send and nn_buffer. For a long time, lbc_lnk has worked only with cn_mpi_send = 'I', so I removed the possibility of choosing anything else…
  • Move the reading of nammpp into mppini.
  • As we now start nemo_init with a call to mpp_start, we know the process rank and can open the ocean.output file as early as possible to capture all printed output and error messages.
  • Therefore, we can remove the use of cltxt in nemo_init and other routines, as it is no longer needed.

2) In lib_mpp.F90:

  • Replace mynode by mpp_start, remove mpi_init_oce
  • Improve error handling in ctl_opn (remove a troublesome STOP 'ctl_opn bad opening').
  • Cosmetic changes to ctl_stop and ctl_warn (remove cform_err and cform_war from in_out_manager.F90 to force the use of ctl_stop and ctl_warn)
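
As an illustration of the ctl_opn point, a failed file opening can be reported with its cause instead of ending in a bare STOP. This is only a sketch under assumptions: the program name, the file path and the message layout are invented for the example, and it is not the actual ctl_opn code.

```fortran
PROGRAM demo_open_check
   IMPLICIT NONE
   INTEGER            :: inum, ios   ! unit number and I/O status
   CHARACTER(len=256) :: clmsg       ! explanation provided by the runtime
   ! Hypothetical file name, chosen so that the opening is expected to fail
   clmsg = ' '
   OPEN( NEWUNIT=inum, FILE='no_such_dir/demo_namelist', STATUS='OLD',   &
      &  ACTION='READ', IOSTAT=ios, IOMSG=clmsg )
   IF( ios /= 0 ) THEN
      ! Report what went wrong instead of calling STOP
      WRITE(*,'(a)')    ' ===>>> : E R R O R'
      WRITE(*,'(a,i0)') '   demo_open_check: bad opening, iostat = ', ios
      WRITE(*,'(a)')    '   '//TRIM(clmsg)
   ELSE
      CLOSE( inum )
   END IF
END PROGRAM demo_open_check
```

IOSTAT and IOMSG are standard OPEN specifiers; the text returned in IOMSG depends on the compiler.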

3) Modify all versions of usrdef_nam.F90 to write messages to numout instead of to a text variable
4) Remove the useless last argument of ctl_nam (this impacts so many routines…)
5) Remove all STOP statements except those in mppstop and the final stop at the end of nemo_gcm. Force a non-zero error code (123) to be returned in all error cases.
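
The single exit point described in 5) can be sketched as follows. This is a hypothetical standalone program, not the actual end of nemo_gcm; the counter nstop is set by hand here to stand for errors recorded earlier in the run.

```fortran
PROGRAM demo_exit_code
   IMPLICIT NONE
   INTEGER :: nstop   ! hypothetical error counter
   nstop = 2          ! pretend two errors were recorded during the run
   IF( nstop == 0 ) THEN
      STOP 0
   ELSE
      WRITE(*,'(a,i0,a)') ' a total of ', nstop, ' errors have been found'
      STOP 123        ! non-zero stop code signals the failure to the caller
   END IF
END PROGRAM demo_exit_code
```

Note that the Fortran standard only recommends that the stop code be used as the process exit status, so whether a job script sees 123 depends on the compiler and platform.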

Because of the large number of modifications, I decided to do it directly in NEMO/branches/2019/dev_r10984_HPC-13_IRRMANN_BDY_optimization

Below are the results of the sette tests performed on CURIE (without the compilation option -xCORE-AVX512 and with nn_sponge_len = 1 in agrif_oce.F90).
The failed tests are exactly the same as with NEMO/trunk@11258.
All run.stat files are exactly the same as with NEMO/branches/2019/dev_r10984_HPC-13_IRRMANN_BDY_optimization@11267.

SETTE validation report : utils @ r10779  ( last change @ r11265 )

!!---------------1st pass------------------!!

WGYRE_PISCES_ST              run.stat    restartability  passed :  20190721
WGYRE_PISCES_ST              tracer.stat restartability  passed :  20190721
WORCA2_ICE_PISCES_ST         run.stat    restartability  passed :  20190721
WORCA2_ICE_PISCES_ST         tracer.stat restartability  FAILED :  20190721
WORCA2_OFF_PISCES_ST         tracer.stat restartability  passed :  20190721
WAMM12_ST                    run.stat    restartability  passed :  20190721
WORCA2_SAS_ICE_ST            run.stat    restartability  FAILED :  20190722
WAGRIF_DEMO_ST               run.stat    restartability  passed :  20190722
WSPITZ12_ST                  run.stat    restartability  passed :  20190722
WISOMIP_ST                   run.stat    restartability  passed :  20190722
WOVERFLOW_ST                 run.stat    restartability  passed :  20190722
WLOCK_EXCHANGE_ST            run.stat    restartability  passed :  20190722
WVORTEX_ST                   run.stat    restartability  passed :  20190722
WICE_AGRIF_ST                run.stat    restartability  passed :  20190722

WGYRE_PISCES_ST              run.stat    reproducibility passed :  20190721
WGYRE_PISCES_ST              tracer.stat reproducibility passed :  20190721
WORCA2_ICE_PISCES_ST         run.stat    reproducibility passed :  20190721
WORCA2_ICE_PISCES_ST         tracer.stat reproducibility passed :  20190721
WORCA2_OFF_PISCES_ST         tracer.stat reproducibility passed :  20190721
WAMM12_ST                    run.stat    reproducibility passed :  20190721
WORCA2_SAS_ICE_ST            run.stat    reproducibility passed :  20190722
WORCA2_ICE_OBS_ST            directory is MISSING               :  20190722
WAGRIF_DEMO_ST               run.stat    reproducibility passed :  20190722
WSPITZ12_ST                  run.stat    reproducibility passed :  20190722
WISOMIP_ST                   run.stat    reproducibility passed :  20190722
WVORTEX_ST                   run.stat    reproducibility passed :  20190722
WICE_AGRIF_ST                run.stat    reproducibility passed :  20190722

   !----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat    unchanged  -    passed :  20190722 20190722

comment:4 Changed 23 months ago by smasson

  • Resolution set to fixed
  • Status changed from new to closed

comment:5 Changed 23 months ago by smasson

In 11320:

dev_r10984_HPC-13 : add missing if(lwp) in [11317], see #2307 and #2285
