Opened 9 months ago
Last modified 8 months ago
#189 new defect
Hangs when attempting to use pools and services with XIOS3 generic_testcase
Reported by: | acc | Owned by: | ymipsl |
---|---|---|---|
Priority: | major | Component: | XIOS |
Version: | trunk | Keywords: | |
Cc: |
Description
A first attempt to use the new pools and services with XIOS3 (trunk) results in a deadlock which hangs until the job times out. This is reproducible with minimal changes to the generic_testcase example.
Changes are based on the example given in:
https://forge.ipsl.jussieu.fr/ioserver/raw-attachment/wiki/WikiStart/Coupling%20workshop-CW2023.pdf
which is still the closest material that I can find to any documentation on the new concepts.
Based on XIOS3 trunk@2540 with these changes:
-
context_atm.xml
[generic_testcase]$ svn diff
 <!-- -->
-<context id="atm" >
+<context id="atm" default_pool_gatherer="pool_atm" default_pool_writer="pool_atm" default_pool_reader="pool_atm">

   <calendar type="Gregorian" time_origin="1850-01-01 00:00:00" />
   ...
   <file_definition type="one_file" >

-    <file id="atm_output" output_freq="1ts" type="one_file" enabled="true" >
+    <file id="atm_output" output_freq="1ts" type="one_file" enabled="true" mode="write" using_server2="true" gatherer="gatherer1" writer="writer1" >
       <field field_ref="field3D" enabled="true"/>
       <field field_ref="field2D" enabled="true"/>
       <field field_ref="field_X" enabled="true"/>
context_oce.xml
 <!-- -->
-<context id="oce" >
+<context id="oce" default_pool_gatherer="pool_ocean" default_pool_writer="pool_ocean" default_pool_reader="pool_ocean">

   <calendar type="Gregorian" time_origin="1850-01-01 00:00:00" />
   ...
   <file_definition type="one_file" >

-    <file id="oce_output" output_freq="1ts" type="one_file" enabled="true" >
+    <file id="oce_output" output_freq="1ts" type="one_file" enabled="true" mode="write" using_server2="true" gatherer="gatherer2" writer="writer2" >
       <field field_ref="pressure" />
       <field field_ref="field3D_resend" />
iodef.xml
       <variable id="check_event_sync" type="bool">true</variable>
     </variable_group>
   </variable_definition>
+  <pool_definition>
+    <pool name="pool_atm" global_fraction="0.5">
+      <service name="gatherer1" nprocs="2" type="gatherer"/>
+      <service name="writer1" nprocs="1" type="writer"/>
+    </pool>
+    <pool name="pool_ocean" global_fraction="0.5" >
+      <service name="gatherer2" nprocs="2" type="gatherer"/>
+      <service name="writer2" nprocs="1" type="writer"/>
+    </pool>
+  </pool_definition>
 </context>

 </simulation>
param.def
 &params_run
 duration='4ts'
-nb_proc_atm=1
-nb_proc_oce=0
+nb_proc_atm=4
+nb_proc_oce=3
+nb_proc_surf=0
 /
XIOS3 has been compiled with the Intel compilers and Intel MPI on an Icelake cluster. generic_testcase.exe tests successfully with this setup using a traditional, XIOS2-style configuration, run with:
mpiexec.hydra -print-rank-map -ppn 1 -np 13 ../bin/generic_testcase.exe
but hangs (seemingly after setting up contexts) when attempted with these xml changes. A listing of the xios_client and xios_server logs is attached. It appears to be trying to work as intended but hangs. I am excited by the prospect of taking control over the distribution of resources, but failing miserably at this first attempt. Am I missing a key ingredient? If there isn't an obvious error, then what is the best way to dig deeper?
Attachments (8)
Change History (18)
Changed 9 months ago by acc
comment:1 follow-up: ↓ 3 Changed 9 months ago by acc
comment:2 Changed 8 months ago by ymipsl
Hi Andrew,
Thank you for trying out these new functionalities, and sorry for the late reaction, but it was the holiday season and no member of the team was available in August.
So the services functionality will be available only in the XIOS3/trunk version, not in the XIOS3_beta version (even though, internally, it uses the same engine). As you can see, these features remain a little experimental, and we had a lot of trouble trying to stabilize these developments on different computers and MPI libraries. Since we now use a large part of the MPI 3 standard and have introduced one-sided passive communication to manage services and the transfer protocol, it turned out that a lot of MPI libraries are not very robust with these new features: many bugs were discovered, depending on the protocol used internally by the MPI library and the hardware network.
Most of our tests were done with OpenMPI, but we are also testing the Intel MPI and Cray MPI libraries, which remain buggy.
I think we have converged towards more stability now, and I encourage beta-testers to send us feedback on what is working well and what is not.
In your case, I have just made a big update that should solve the problem of spurious deadlocks, which are very difficult to detect because they are not always reproducible from one architecture to another.
Your xml files are correct, and I can reproduce your test case. With the last update on trunk it now seems to work as expected, so do an svn update and rerun.
I attached here my xml files.
Changed 8 months ago by ymipsl
comment:3 in reply to: ↑ 1 Changed 8 months ago by ymipsl
Replying to acc:
To be clear this is what I was hoping to achieve:
Some aspects are unclear; e.g. is global_fraction the fraction of MPI_COMM_WORLD or the fraction of available servers?
It is the global fraction of available servers, so your plot is correct.
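To make the arithmetic concrete with the numbers from this ticket (13 MPI ranks in the mpiexec line; nb_proc_atm=4 and nb_proc_oce=3 in param.def leave 6 server ranks), global_fraction="0.5" assigns each pool 0.5 × 6 = 3 server processes, exactly covering its declared services:

```xml
<!-- 13 total ranks - 7 clients (4 atm + 3 oce) = 6 server ranks      -->
<!-- global_fraction="0.5" gives this pool 3 of the 6 server ranks,   -->
<!-- which is consumed by 2 gatherer ranks + 1 writer rank            -->
<pool name="pool_atm" global_fraction="0.5">
  <service name="gatherer1" nprocs="2" type="gatherer"/>
  <service name="writer1" nprocs="1" type="writer"/>
</pool>
```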
comment:4 Changed 8 months ago by ymipsl
Note that you can now mix attached mode and server mode.
Setting writer="attached" or reader="attached" as a file attribute (it can also be inherited from the context) switches that file automatically into attached mode (without using an intermediate server), and all context client processes contribute to writing or reading the file.
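As a sketch of what this might look like (the file id and field here are hypothetical; the writer="attached" attribute is the one described in the comment above):

```xml
<!-- hypothetical example: this file is written in attached mode by the
     client processes themselves, bypassing the gatherer/writer services -->
<file id="atm_diag" output_freq="1ts" type="one_file" enabled="true" writer="attached">
  <field field_ref="field2D" />
</file>
```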
comment:5 Changed 8 months ago by acc
Yes, it works! Thanks Yann, that update has broken the deadlock. My example now works as expected. This is with xios3-trunk@2551 on an Intel cluster using Intel MPI. I'll re-test on a Cray later. I've attached the successful client and server logs (info_level=10), which confirm everything is working as in the schematic. The next task is to try some real applications and see how it copes.
comment:6 Changed 8 months ago by ymipsl
Ok great !
I think a good idea would be to increase the size of the testcase to see the behaviour when running internode with Intel MPI. If a problem occurs, it will be much easier for us to solve it on the generic testcase than in a real application.
For now, we implement 2 different transport protocols to communicate between clients and servers:
- legacy protocol: uses point-to-point communication (MPI_Isend/MPI_Irecv, MPI_Improbe, ...) with a small amount of passive one-sided communication
- one_sided protocol: only passive one-sided communication
You can select one or the other by setting the xios "transport_protocol" variable:
<variable id="transport_protocol" type="string" >legacy</variable> (default)
<variable id="transport_protocol" type="string" >one_sided</variable>
It may be interesting to test this on different architectures.
comment:7 Changed 8 months ago by acc
Confirmation that the same example works on a Cray AMD cluster (ARCHER2) when compiled with the Cray compilers (version 15.0.0) and the Cray MPICH libraries.
It also works successfully when scaled up to 256 cores spread over 4 nodes (150 atm, 74 oce, 15 atm gatherers + 1 atm writer, 15 oce gatherers + 1 oce writer). For the record:
Currently Loaded Modules:
  1) craype-x86-rome
  2) libfabric/1.12.1.2.2.0.0
  3) craype-network-ofi
  4) perftools-base/22.12.0
  5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta
  6) cce/15.0.0
  7) craype/2.7.19
  8) cray-dsmml/0.2.2
  9) cray-libsci/22.12.1.1
 10) PrgEnv-cray/8.3.3
 11) bolt/0.8
 12) epcc-setup-env
 13) load-epcc-module
 14) cray-mpich/8.1.23
 15) cray-hdf5-parallel/1.12.2.1
 16) cray-netcdf-hdf5parallel/4.9.0.1
There is no transport_protocol set explicitly, so I assume this is legacy. iodef.xml in generic_testcase has:
<variable id="pure_one_sided" type="bool">false</variable>
<variable id="check_event_sync" type="bool">true</variable>
Are these related/superseded?
comment:8 Changed 8 months ago by ymipsl
Great, very good news. I am very happy about this.
When using Intel MPI, what is your library version?
Yes, when no transport protocol is specified, legacy is the default.
The variable/parameter "pure_one_sided" exists only for development purposes on the legacy transport (it forces all transfers to use one-sided communication) and must not be specified for common usage (default=false).
The variable/parameter "check_event_sync" is used for debugging. In this case XIOS checks that all events sent to the server are coherent, i.e. same event type and same timeline. Since all xios calls must be collective, it can help to spot when some process misses a call. To do this check, collective communications are involved that synchronize all clients, so in production mode we recommend setting it to false (which is the default).
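So, per the explanation above, a production iodef.xml would either omit these two variables or leave them at their defaults; something like (variable names as in this ticket, values as stated by the defaults above):

```xml
<!-- production settings: both variables default to false -->
<variable id="pure_one_sided" type="bool">false</variable>   <!-- development-only flag for the legacy transport -->
<variable id="check_event_sync" type="bool">false</variable> <!-- debugging check; synchronizes all clients when true -->
```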
Could you try to rerun the same tests using the one_sided protocol by adding:
<variable id="transport_protocol" type="string" >one_sided</variable>
comment:9 Changed 8 months ago by acc
Ok, the 256-core / 4-node example also works on our Intel cluster (legacy). At least, it produces valid output files, but the .err files have content and it does not seem to have ended cleanly. E.g.:
$ cat xios_client_006.err
-> error : WARNING: Unexpected request for buffer to communicate with server 3
-> error : WARNING: Unexpected request for buffer to communicate with server 0
-> error : WARNING: Unexpected request for buffer to communicate with server 6
$ cat xios_server_225.err
-> error : WARNING: Unexpected request for buffer to communicate with server 0
$ tail -10l slurm-248072.out
Server Context destructor
Server Context destructor
Server Context destructor
Server Context destructor
Server Context destructor
Abort(806969871) on node 150 (rank 150 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(216)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1335)..............:
MPIDI_OFI_mpi_finalize_hook(2258): OFI domain close failed (ofi_init.c:2258:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
But, looking back, the Cray test had similar 'errors' so perhaps these don't matter. For the record:
 6) UCX/1.11.2-GCCcore-11.2.0
21) intel-compilers/2021.4.0
22) impi/2021.4.0-intel-compilers-2021.4.0
23) iimpi/2021b
24) slurm/21.08.5
comment:10 Changed 8 months ago by acc
Finally, both Intel and Cray work with the extra:
<variable id="transport_protocol" type="string" >one_sided</variable>
setting. But there is no evidence in any of the output that this has activated anything different.
xios_client and xios_server logs