Opened 6 years ago

#343 new enhancement

Stop jobs on post-processing errors in both DEVT and PROD. Triggered by problems on Irene, but required in any case.

Reported by: mafoipsl
Owned by:
Priority: major
Milestone: libIGCM_v2.8.4
Component: AMQP Broker
Version:
Keywords: libIGCM pack_output DEVT PROD
Cc:

Description

On Irene, post-processing jobs can fail because of new kinds of problems. If SpaceName is DEVT in config.card, the job does not stop and the situation cannot be corrected afterwards: you have to redo the simulation. Long, safe runs are needed in DEVT mode too.

Suggestion: stop jobs for both PROD and DEVT.

+++ libIGCM_debug.ksh   (working copy)
@@ -887,7 +887,7 @@
       # If SpaceName is PROD we stop when post_processing failed
-      if [ X${config_UserChoices_SpaceName} = XPROD ] ; then
+      if [ X${config_UserChoices_SpaceName} = XPROD ] || [ X${config_UserChoices_SpaceName} = XDEVT ] ; then
         echo "                        EXIT THE POST-PROCESSING JOB."
@@ -899,7 +899,7 @@
       else
-        echo "In config.card the variable SpaceName is not in PROD"
+        echo "In config.card the variable SpaceName is not in PROD nor DEVT"
         echo "              SO WE DO NOT EXIT THE JOB."
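
For illustration, the same test can be written as a case statement, which stays readable if more SpaceName values need to be added later. This is only a minimal sketch and not the libIGCM code: `should_exit_on_failure` is a hypothetical helper name, and the variable is set by hand here instead of being read from config.card.

```shell
# Hypothetical helper, not part of libIGCM: returns 0 when a failed
# post-processing job should exit for the given SpaceName.
should_exit_on_failure() {
  case "$1" in
    PROD|DEVT) return 0 ;;
    *)         return 1 ;;
  esac
}

# Assumption: set by hand here; normally this comes from config.card.
config_UserChoices_SpaceName=DEVT

if should_exit_on_failure "${config_UserChoices_SpaceName}"; then
  echo "EXIT THE POST-PROCESSING JOB."
else
  echo "SO WE DO NOT EXIT THE JOB."
fi
```

Any SpaceName other than PROD or DEVT still takes the "do not exit" branch, as in the patch above.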

Here is an example, taken from this log file:

/ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/Out/pack_output.18591231.out

2018-12-15 19:01:00 --------Debug2--> IGCM_sys_ncrcat : error code 263
/tmp/jobstart.186772[311]: IGCM_sys_ncrcat: line 1464: 211356: Bus error
2018-12-15 19:01:00 --------Debug2--> IGCM_sys_ncrcat : 0/3 sleep 2 seconds and try again.
2018-12-15 19:01:02 --------Debug2--> IGCM_sys_ncrcat : error code 127
ncrcat: error while loading shared libraries: /ccc/products/ccc_users_env/compil/Atos_7__x86_64/hdf5-1.8.20/intel--17.0.4.196__openmpi--2.0.2/parallel/lib/libhdf5.so.10: cannot read file data: Input/output error
2018-12-15 19:01:02 --------Debug2--> IGCM_sys_ncrcat : 1/3 sleep 2 seconds and try again.
2018-12-15 19:01:04 --------Debug2--> IGCM_sys_ncrcat : error code 127
ncrcat: error while loading shared libraries: /ccc/products/ccc_users_env/compil/Atos_7__x86_64/hdf5-1.8.20/intel--17.0.4.196__openmpi--2.0.2/parallel/lib/libhdf5.so.10: cannot read file data: Input/output error
2018-12-15 19:01:04 --------Debug2--> IGCM_sys_ncrcat : 2/3 sleep 2 seconds and try again.
IGCM_sys_ncrcat : ncrcat error
IGCM_sys_Put_Out : CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc /ccc/store/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/DA/CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc
WARNING : IGCM_sys_Put_Out CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc DOES NOT EXIST .
du: cannot access './POST/monitoring01_opa9_ORCA2.cfg': Cannot send after transport endpoint shutdown
du: cannot access './liste_files.OCE.Output.DA.5D_grid_T.nc.txt': Cannot send after transport endpoint shutdown
du: cannot access './COMP/opa9.card': Cannot send after transport endpoint shutdown

If SpaceName is DEVT in config.card, a pack_output job relaunched by hand reports an error BUT still cleans the files. See the "Ncrcat and cleaning done" line at the end of the example below.

/ccc/cont003/dsku/perle1/home/app/gencmip6/p86maf/IRENE/IPSL-CM6A-MR/IPSLCM6.1.8/T20181203/modipsl/config/IPSLCM6/CM618-MR-pd-TEST-01/POST_REDO/PACKOUTPUT.out_1083216

2018-12-17 09:58:32 --Debug1--> Number of files to process is not equal to what it should be
2018-12-17 09:58:32 --Debug1--> We found 1 files and it should have been 10 files
IGCM_debug_Exit :  ERROR in number of files to process. STOP HERE INCLUDING THE COMPUTING JOB

!!!!!!!!!!!!!!!!!!!!!!!!!!
!!   ERROR TRIGGERED    !!
!!   EXIT FLAG SET      !!
!------------------------!

IGCM_debug_Verif_Exit : Something wrong happened previously.
IGCM_debug_Verif_Exit : ERROR and EXIT keyword will help find out where.
In config.card the variable SpaceName is not in PROD
              SO WE DO NOT EXIT THE JOB.

Mon Dec 17 09:58:32 CET 2018
IGCM_sys_ncrcat : -p /ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/MO CM618-MR-pd-TEST-01_18570101_18571231_1M_diaptr_W.nc --output CM618-MR-pd-TEST-01_18500101_18591231_1M_diaptr_W.nc
IGCM_debug_Verif_Exit : Something wrong happened previously.
IGCM_debug_Verif_Exit : ERROR and EXIT keyword will help find out where.
In config.card the variable SpaceName is not in PROD
              SO WE DO NOT EXIT THE JOB.
...

Mon Dec 17 09:58:32 CET 2018
'''2018-12-17 09:58:32 --Debug1--> Ncrcat and cleaning done for /ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/MO and 1M_diaptr_W.nc'''
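
The behaviour wanted here, no cleaning when the file count is wrong or the concatenation fails, could look like the sketch below. It is an assumption-laden illustration and not libIGCM code: `safe_concat_and_clean` and `CONCAT_CMD` are hypothetical names, and plain `cat` stands in for ncrcat so that the sketch runs on ordinary files.

```shell
# Hypothetical sketch, not libIGCM code: concatenate the inputs, then
# remove them only if the file count was right and the command succeeded.
# Usage: safe_concat_and_clean EXPECTED_COUNT OUTPUT INPUT...
safe_concat_and_clean() {
  expected=$1
  output=$2
  shift 2

  if [ "$#" -ne "${expected}" ]; then
    echo "We found $# files and it should have been ${expected} files" >&2
    return 1                     # inputs are left untouched
  fi

  # CONCAT_CMD is an assumption; "cat" is a stand-in for ncrcat.
  if ! ${CONCAT_CMD:-cat} "$@" > "${output}"; then
    rm -f "${output}"            # do not keep a partial output
    return 1                     # inputs are left untouched
  fi

  rm -f "$@"                     # cleaning happens only after success
}
```

With this ordering, the case from the log above (1 file found, 10 expected) would return an error before anything is removed, so the job could be fixed and relaunched.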

