Opened 5 months ago
Last modified 5 months ago
#195 new defect
XIOS process exit failure (hang)
Reported by: | mhedley | Owned by: | ymipsl |
---|---|---|---|
Priority: | minor | Component: | XIOS |
Version: | trunk | Keywords: | |
Cc: |
Description
It appears that there is a non-exit condition, where a set of MPI processes can complete its work but not fully exit.
This issue can be reproduced using the generic standalone test case:
- set the number of atm processes (nb_proc_atm) to 4
trunk/generic_testcase/param.def
&params_run duration='4ts' nb_proc_atm=4 nb_proc_oce=0 /
- submit a job where the ncpus value does not match the number of generic_testcase.exe + xios_server.exe processes
e.g. 4 + 4 != 6
trunk/generic_testcase/sub_run_generic.job
#!/bin/bash
#PBS -q normal
#PBS -W umask=0022
#PBS -l walltime=0:06:00
#PBS -P foundation
#PBS -N XIOS3_generic_testcase
#PBS -l select=1:ncpus=6:mem=2gb
cd $PBS_O_WORKDIR
module switch PrgEnv-cray PrgEnv-cray/8.4.0
module load cpe/23.05
module switch cce cce/15.0.0
module load cray-hdf5-parallel
module load cray-netcdf-hdf5parallel
mpiexec -np 4 ./generic_testcase.exe : -np 4 ./xios_server.exe
Note that it is important to run the correct number of clients, matching the number of generic_testcase.exe processes to nb_proc_atm, and that these are assigned first (4 in this case), with servers assigned after.
In this case the PBS job also launches 4 xios_server.exe processes, while requesting only 6 ncpus (for the 8 processes).
In this scenario, the test case runs, all data is created, and there are no unusual errors (except -> error : WARNING: Unexpected request for buffer to communicate with server (client_err1 not usually fatal))
but the job hangs, hits the wallclock limit and dies.
All data is written within ~1-2 minutes, but the job just sits there until killed.
(note, if numbers match: e.g. 2 xios_server.exe or 8 ncpus, then all is well)
Whilst this is a configuration mismatch, a stall / hang is a somewhat confusing symptom for users.
No exception is raised, although PBS does register a warning:
WARNING: CPU oversubscription detected for application
It would be good to understand this failure mode and explore mechanisms to protect against accidental mis-configuration.
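As one possible user-side safeguard, a pre-flight check could be added to the job script ahead of the mpiexec line. The sketch below is only an illustration: the function names read_nb_proc_atm and check_rank_count and the error message are hypothetical, not part of XIOS or PBS, and the ncpus value is passed in by hand because how to obtain it from the job environment is site-specific. It does not address the hang itself, it only refuses to launch when the counts described above do not match.

```shell
#!/bin/bash
# Hypothetical pre-flight checks; names and messages are illustrative only
# and are not provided by XIOS or PBS.

# Print the nb_proc_atm value found in a namelist file such as param.def.
read_nb_proc_atm() {
    grep -oE 'nb_proc_atm[[:space:]]*=[[:space:]]*[0-9]+' "$1" \
        | grep -oE '[0-9]+$' | head -n 1
}

# Return non-zero unless the MPI rank counts sum to the allocated ncpus.
# Usage: check_rank_count <ncpus> <np> [<np> ...]
check_rank_count() {
    local ncpus=$1
    shift
    local total=0 np
    for np in "$@"; do
        total=$((total + np))
    done
    if [ "$total" -ne "$ncpus" ]; then
        echo "ERROR: $total MPI ranks requested, $ncpus cpus allocated" >&2
        return 1
    fi
}
```

With the values from this ticket, calling check_rank_count 6 4 4 || exit 1 before mpiexec would stop the job immediately instead of letting it run to the walltime limit. The strict equality test mirrors the "numbers match" condition reported above; if running with spare cpus is acceptable at a site, the comparison could be relaxed to flag only oversubscription.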