Opened 6 years ago
#343 new enhancement
DEVT/PROD and jobs stop. Coming from Irene but still required.
Reported by: | mafoipsl | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | libIGCM_v2.8.4 |
Component: | AMQP Broker | Version: | |
Keywords: | libIGCM pack_output DEVT PROD | Cc: |
Description
On Irene, post-processing jobs can stop because of a new type of problem. If SpaceName is set to DEVT in config.card, the situation cannot be corrected afterwards: you have to redo the simulation. Long, safe runs must be possible in DEVT mode too.
Suggestion: stop jobs for both PROD and DEVT.
+++ libIGCM_debug.ksh (working copy)
@@ -887,7 +887,7 @@
 # If SpaceName is PROD we stop when post_processing failed
- if [ X${config_UserChoices_SpaceName} = XPROD ] ; then
+ if [ X${config_UserChoices_SpaceName} = XPROD ] || [ X${config_UserChoices_SpaceName} = XDEVT ] ; then
 echo " EXIT THE POST-PROCESSING JOB."
@@ -899,7 +899,7 @@
 else
- echo "In config.card the variable SpaceName is not in PROD"
+ echo "In config.card the variable SpaceName is not in PROD nor DEVT"
 echo " SO WE DO NOT EXIT THE JOB."
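The same test could also be written as a `case` statement, which keeps the list of SpaceName values that stop the job in one place. This is only a sketch: the function name `should_stop_on_post_failure` is hypothetical and not part of libIGCM, and the default value assigned below is for illustration only. Only `config_UserChoices_SpaceName` and the messages come from the patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of libIGCM): return 0 when a failed
# post-processing job should stop, 1 otherwise.
should_stop_on_post_failure() {
  case "$1" in
    PROD|DEVT) return 0 ;;
    *)         return 1 ;;
  esac
}

# Illustrative default only; the real value comes from config.card.
config_UserChoices_SpaceName=${config_UserChoices_SpaceName:-DEVT}

if should_stop_on_post_failure "${config_UserChoices_SpaceName}" ; then
  echo " EXIT THE POST-PROCESSING JOB."
else
  echo "In config.card the variable SpaceName is not in PROD nor DEVT"
  echo " SO WE DO NOT EXIT THE JOB."
fi
```

Adding another SpaceName value later would then only require extending the `PROD|DEVT` pattern.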
/ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/Out/pack_output.18591231.out
Here is an example:
2018-12-15 19:01:00 --------Debug2--> IGCM_sys_ncrcat : error code 263
/tmp/jobstart.186772[311]: IGCM_sys_ncrcat: line 1464: 211356: Bus error
2018-12-15 19:01:00 --------Debug2--> IGCM_sys_ncrcat : 0/3 sleep 2 seconds and try again.
2018-12-15 19:01:02 --------Debug2--> IGCM_sys_ncrcat : error code 127
ncrcat: error while loading shared libraries: /ccc/products/ccc_users_env/compil/Atos_7__x86_64/hdf5-1.8.20/intel--17.0.4.196__openmpi--2.0.2/parallel/lib/libhdf5.so.10: cannot read file data: Input/output error
2018-12-15 19:01:02 --------Debug2--> IGCM_sys_ncrcat : 1/3 sleep 2 seconds and try again.
2018-12-15 19:01:04 --------Debug2--> IGCM_sys_ncrcat : error code 127
ncrcat: error while loading shared libraries: /ccc/products/ccc_users_env/compil/Atos_7__x86_64/hdf5-1.8.20/intel--17.0.4.196__openmpi--2.0.2/parallel/lib/libhdf5.so.10: cannot read file data: Input/output error
2018-12-15 19:01:04 --------Debug2--> IGCM_sys_ncrcat : 2/3 sleep 2 seconds and try again.
IGCM_sys_ncrcat : ncrcat error
IGCM_sys_Put_Out : CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc /ccc/store/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/DA/CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc
WARNING : IGCM_sys_Put_Out CM618-MR-pd-TEST-01_18500101_18591231_5D_grid_W.nc DOES NOT EXIST .
du: cannot access './POST/monitoring01_opa9_ORCA2.cfg': Cannot send after transport endpoint shutdown
du: cannot access './liste_files.OCE.Output.DA.5D_grid_T.nc.txt': Cannot send after transport endpoint shutdown
du: cannot access './COMP/opa9.card': Cannot send after transport endpoint shutdown
If SpaceName is DEVT in config.card, a pack_output job launched by hand reports an error BUT still cleans the files. See the "Ncrcat and cleaning done" line in the example below.
/ccc/cont003/dsku/perle1/home/app/gencmip6/p86maf/IRENE/IPSL-CM6A-MR/IPSLCM6.1.8/T20181203/modipsl/config/IPSLCM6/CM618-MR-pd-TEST-01/POST_REDO/PACKOUTPUT.out_1083216
2018-12-17 09:58:32 --Debug1--> Number of files to process is not equal to what it should be
2018-12-17 09:58:32 --Debug1--> We found 1 files and it should have been 10 files
IGCM_debug_Exit : ERROR in number of files to process. STOP HERE INCLUDING THE COMPUTING JOB
!!!!!!!!!!!!!!!!!!!!!!!!!!
!! ERROR TRIGGERED !!
!! EXIT FLAG SET !!
!------------------------!
IGCM_debug_Verif_Exit : Something wrong happened previously.
IGCM_debug_Verif_Exit : ERROR and EXIT keyword will help find out where.
In config.card the variable SpaceName is not in PROD
SO WE DO NOT EXIT THE JOB.
Mon Dec 17 09:58:32 CET 2018
IGCM_sys_ncrcat : -p /ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/MO CM618-MR-pd-TEST-01_18570101_18571231_1M_diaptr_W.nc --output CM618-MR-pd-TEST-01_18500101_18591231_1M_diaptr_W.nc
IGCM_debug_Verif_Exit : Something wrong happened previously.
IGCM_debug_Verif_Exit : ERROR and EXIT keyword will help find out where.
In config.card the variable SpaceName is not in PROD
SO WE DO NOT EXIT THE JOB.
...
Mon Dec 17 09:58:32 CET 2018
'''2018-12-17 09:58:32 --Debug1--> Ncrcat and cleaning done for /ccc/scratch/cont003/gen0239/p86maf/IGCM_OUT/IPSLCM6/DEVT/pdControl/CM618-MR-pd-TEST-01/OCE/Output/MO and 1M_diaptr_W.nc'''
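The log above shows cleaning going ahead although only 1 of the 10 expected files was found. One way to picture the desired behaviour is a guard that refuses to clean when the count is wrong. This is a minimal sketch under stated assumptions: `clean_if_complete` and its arguments are hypothetical names, not libIGCM functions, and the actual concatenation and cleaning code is not shown in this ticket.

```shell
#!/bin/sh
# Hypothetical guard (not libIGCM code): remove the input files only when
# the number of files found matches the number expected.
# Usage: clean_if_complete EXPECTED FILE...
clean_if_complete() {
  expected=$1
  shift
  found=0
  for f in "$@" ; do
    if [ -f "$f" ] ; then
      found=$((found + 1))
    fi
  done
  if [ "$found" -ne "$expected" ] ; then
    echo "We found $found files and it should have been $expected files" >&2
    return 1   # keep every file so the job can be rerun later
  fi
  rm -f -- "$@"
}
```

With such a guard, a call like `clean_if_complete 10 *_1M_diaptr_W.nc` would leave the files in place whenever fewer than 10 of them exist, whatever the SpaceName.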