Opened 5 months ago

Last modified 5 months ago

#195 new defect

XIOS process exit failure (hang)

Reported by: mhedley
Owned by: ymipsl
Priority: minor
Component: XIOS
Version: trunk
Keywords:
Cc:

Description

It appears that there is a non-exit condition, where a set of MPI processes can complete its work but not fully exit.

This issue can be reproduced using the generic standalone test case:

  1. Set the number of atm processes (nb_proc_atm) to 4

trunk/generic_testcase/param.def

&params_run
duration='4ts'
nb_proc_atm=4
nb_proc_oce=0
/
  2. Submit a job where the ncpus value does not match the number of generic_testcase.exe + xios_server.exe processes

e.g. 4 + 4 != 6

trunk/generic_testcase/sub_run_generic.job

#!/bin/bash                                                                                                     

#PBS -q normal                                                                                                  
#PBS -W umask=0022                                                                                              
#PBS -l walltime=0:06:00                                                                                        
#PBS -P foundation                                                                                              
#PBS -N XIOS3_generic_testcase                                                                                  
#PBS -l select=1:ncpus=6:mem=2gb                                                                                

cd $PBS_O_WORKDIR

module switch PrgEnv-cray PrgEnv-cray/8.4.0
module load cpe/23.05
module switch cce cce/15.0.0
module load cray-hdf5-parallel
module load cray-netcdf-hdf5parallel

mpiexec -np 4 ./generic_testcase.exe : -np 4 ./xios_server.exe

Note that the number of clients must be correct: the number of generic_testcase.exe processes must match nb_proc_atm (4 in this case), and these must be assigned first, with the servers assigned after.
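The client-count requirement can be checked before launch. The following is a minimal sketch, assuming param.def keeps the simple name=value namelist layout shown in this ticket. The function names read_nb_proc_atm and clients_match_param are hypothetical helpers, not part of XIOS or the generic test case.

```shell
# Hypothetical helper: print the integer value of nb_proc_atm found in a
# namelist file (assumes one "nb_proc_atm=<integer>" line, as in param.def).
read_nb_proc_atm() {
    sed -n 's/^[[:space:]]*nb_proc_atm[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Hypothetical helper: succeed only if the number of client ranks passed to
# mpiexec equals nb_proc_atm in the given namelist file.
clients_match_param() {
    local np_client="$1" param_file="$2" expected
    expected="$(read_nb_proc_atm "$param_file")"
    [ -n "$expected" ] && [ "$np_client" -eq "$expected" ]
}

# Possible use in the job script, before mpiexec:
#   clients_match_param 4 param.def || { echo "client count != nb_proc_atm" >&2; exit 1; }
```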

In this case the PBS job also launches 4 xios_server.exe processes, whilst only requesting 6 ncpus (for the 8 processes in total).

In this scenario the test case runs, all data is created, and there are no unusual errors (excepting "-> error : WARNING: Unexpected request for buffer to communicate with server" (client_err1), which is not usually fatal),
but the job hangs, hits the wallclock limit and dies.

All data is written within ~1-2 minutes, but the job then just sits there until it is killed.
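Until the cause is understood, one possible mitigation is to bound the launch with a time limit so that a hang surfaces quickly instead of consuming the full walltime. This is a minimal sketch, assuming GNU coreutils timeout is available on the compute node and that mpiexec cleans up its ranks when it receives SIGTERM; neither has been verified on this system. The wrapper name run_with_limit and the 180s value are hypothetical.

```shell
# Hypothetical wrapper: run a command under a time limit and report a
# timeout (exit status 124 from GNU timeout) as a possible hang.
run_with_limit() {
    local limit="$1"
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "command exceeded ${limit}; possible hang at exit" >&2
    fi
    return "$status"
}

# Possible use in the job script (the limit is an illustrative value):
#   run_with_limit 180s mpiexec -np 4 ./generic_testcase.exe : -np 4 ./xios_server.exe
```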

(Note: if the numbers match, e.g. 2 xios_server.exe processes or 8 ncpus, then all is well.)

Change History (1)

comment:1 Changed 5 months ago by mhedley

Whilst this is a configuration mismatch, a stalled or hung process is a somewhat confusing symptom for users.

No exception is raised, although PBS does register a warning:

WARNING: CPU oversubscription detected for application

It would be good to understand this failure mode and to explore mechanisms that protect against accidental misconfiguration.
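One such mechanism could be a pre-launch guard in the job script. This is a minimal sketch, assuming the allocated CPU count is known to the script; on PBS Professional the NCPUS environment variable may provide it, but that is an assumption to check on the target system. The function name check_rank_count is hypothetical. It rejects only oversubscription (more ranks than CPUs), which is the case reported here.

```shell
# Hypothetical guard: fail if the requested MPI ranks (clients + servers)
# exceed the number of CPUs allocated to the job.
check_rank_count() {
    local np_client="$1" np_server="$2" ncpus="$3"
    local total=$((np_client + np_server))
    if [ "$total" -gt "$ncpus" ]; then
        echo "ERROR: ${np_client} clients + ${np_server} servers = ${total} ranks, but only ${ncpus} CPUs allocated" >&2
        return 1
    fi
}

# Possible use in the job script, before mpiexec (NCPUS is an assumption):
#   check_rank_count 4 4 "${NCPUS:?NCPUS not set}" || exit 1
```

With the values from this ticket, check_rank_count 4 4 6 fails, while 4 2 6 and 4 4 8 pass, matching the cases reported as working.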
