Opened 9 months ago

Last modified 8 months ago

#189 new defect

Hangs when attempting to use pools and services with XIOS3 generic_testcase

Reported by: acc Owned by: ymipsl
Priority: major Component: XIOS
Version: trunk Keywords:
Cc:

Description

A first attempt to use the new pools and services with XIOS3 (trunk) results in a deadlock which hangs until the job times out. This is reproducible with minimal changes to the generic_testcase example.

Changes are based on the example given in:
https://forge.ipsl.jussieu.fr/ioserver/raw-attachment/wiki/WikiStart/Coupling%20workshop-CW2023.pdf

which is still the closest material that I can find to any documentation on the new concepts.

Based on XIOS3 trunk@2540 with these changes:

  • context_atm.xml

    [generic_testcase]$ svn diff

     <!-- -->
    -<context id="atm">
    +<context id="atm" default_pool_gatherer="pool_atm" default_pool_writer="pool_atm" default_pool_reader="pool_atm">

       <calendar type="Gregorian" time_origin="1850-01-01 00:00:00" />
    ...
     <file_definition  type="one_file" >

    -    <file id="atm_output" output_freq="1ts" type="one_file" enabled="true">
    +    <file id="atm_output" output_freq="1ts" type="one_file" enabled="true" mode="write" using_server2="true" gatherer="gatherer1" writer="writer1" >
           <field field_ref="field3D"    enabled="true"/>
           <field field_ref="field2D"    enabled="true"/>
           <field field_ref="field_X"    enabled="true"/>
  • context_oce.xml

     
     <!-- -->
    -<context id="oce">
    +<context id="oce" default_pool_gatherer="pool_ocean" default_pool_writer="pool_ocean" default_pool_reader="pool_ocean">

       <calendar type="Gregorian" time_origin="1850-01-01 00:00:00" />
    ...
     <file_definition  type="one_file" >

    -    <file id="oce_output" output_freq="1ts" type="one_file" enabled="true">
    +    <file id="oce_output" output_freq="1ts" type="one_file" enabled="true" mode="write" using_server2="true" gatherer="gatherer2" writer="writer2" >
           <field field_ref="pressure"  />
           <field field_ref="field3D_resend" />
  • iodef.xml

     
             <variable id="check_event_sync" type="bool">true</variable>
           </variable_group>
         </variable_definition>
    +    <pool_definition>
    +      <pool name="pool_atm" global_fraction="0.5">
    +        <service name="gatherer1" nprocs="2" type="gatherer"/>
    +        <service name="writer1" nprocs="1" type="writer"/>
    +      </pool>
    +      <pool name="pool_ocean" global_fraction="0.5" >
    +        <service name="gatherer2" nprocs="2" type="gatherer"/>
    +        <service name="writer2" nprocs="1" type="writer"/>
    +      </pool>
    +    </pool_definition>
       </context>

     </simulation>
  • param.def

     
     &params_run
     duration='4ts'
    -nb_proc_atm=1
    -nb_proc_oce=0
    +nb_proc_atm=4
    +nb_proc_oce=3
    +nb_proc_surf=0
     /

XIOS3 has been compiled with the Intel compilers and Intel MPI on an Icelake cluster. generic_testcase.exe tests successfully with this setup using a traditional, XIOS2-style configuration, run with:

mpiexec.hydra -print-rank-map -ppn 1 -np 13 ../bin/generic_testcase.exe

but hangs (seemingly after setting up contexts) when attempted with these xml changes. A listing of the xios_client and xios_server logs is attached. It appears to be trying to work as intended but hangs. I am excited by the prospect of taking control over the distribution of resources but am failing miserably at this first attempt. Am I missing a key ingredient? If there isn't an obvious error, what is the best way to dig deeper?


    

Attachments (8)

xios3out_logs.txt (30.3 KB) - added by acc 9 months ago.
xios_client and xios_server logs
pools_services.png (266.7 KB) - added by acc 9 months ago.
schematic of pools and services
param.def (72 bytes) - added by ymipsl 8 months ago.
iodef.xml (4.6 KB) - added by ymipsl 8 months ago.
context_atm.xml (10.5 KB) - added by ymipsl 8 months ago.
context_grid_dynamico.xml (512 bytes) - added by ymipsl 8 months ago.
context_oce.xml (10.6 KB) - added by ymipsl 8 months ago.
xios3_logs_at_2551.txt (38.6 KB) - added by acc 8 months ago.
Client and server logs for a working example


Change History (18)

Changed 9 months ago by acc

xios_client and xios_server logs

Changed 9 months ago by acc

schematic of pools and services

comment:1 follow-up: Changed 9 months ago by acc

To be clear, this is what I was hoping to achieve:
schematic of pools and services

Some aspects are unclear; e.g. is global_fraction the fraction of MPI_COMM_WORLD or the fraction of available servers?

comment:2 Changed 8 months ago by ymipsl

Hi Andrew,

Thank you for trying out these new functionalities, and sorry for the late reply; it was holiday time and no member of the team was available in August.

The services functionality will be available only in the XIOS3/trunk version, not in the XIOS3_beta version (even though, internally, it uses the same engine). As you have seen, these features remain a little experimental, and we have had a lot of trouble trying to stabilize these developments on different computers and MPI libraries. Since we now use a large part of the MPI3 standard and have introduced passive one-sided communication to manage the services and the transfer protocol, it turned out that many MPI libraries are not very robust with these features; a lot of bugs were discovered depending on the protocol used internally by the MPI library and the hardware network.
Most of our tests were done with OpenMPI, but we are also testing the Intel MPI and Cray MPI libraries, which remain buggy.

I think we have converged to more stability now, and I encourage beta-testers to give us feedback on what is working well and what is not.

In your case, I have just made a big update that should solve the problem of spurious deadlocks, which are very difficult to detect because they are not always reproducible from one architecture to another.

Your xml files are correct, and I can reproduce your test case. With the latest update on trunk, it seems to work as expected now, so do an svn update and rerun.

I have attached my xml files here.


Changed 8 months ago by ymipsl

Changed 8 months ago by ymipsl

Changed 8 months ago by ymipsl

Changed 8 months ago by ymipsl

Changed 8 months ago by ymipsl

comment:3 in reply to: ↑ 1 Changed 8 months ago by ymipsl

Replying to acc:

To be clear this is what I was hoping to achieve:
schematic of pools and services

Some aspects are unclear; e.g. is global_fraction the fraction of MPI_COMM_WORLD or the fraction of available servers?

It is the global fraction of available servers, so your plot is correct.
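As a concrete illustration using the numbers from the description above (13 MPI processes in total, 4 atm clients + 3 oce clients, leaving 6 server processes): each pool declared with global_fraction="0.5" receives 3 of the 6 servers, which matches the 2 gatherer + 1 writer processes requested per pool in the iodef.xml shown earlier:

    <pool name="pool_atm" global_fraction="0.5">             <!-- 0.5 of 6 servers = 3 processes -->
      <service name="gatherer1" nprocs="2" type="gatherer"/> <!-- 2 of the 3 -->
      <service name="writer1" nprocs="1" type="writer"/>     <!-- the remaining 1 -->
    </pool>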

Last edited 8 months ago by ymipsl

comment:4 Changed 8 months ago by ymipsl

Note that you can now mix attached mode and server mode.
If you set writer="attached" or reader="attached" as a file attribute (it can also be inherited from the context), the file automatically switches to attached mode (without using an intermediate server), and all the context's client processes contribute to writing or reading the file.
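For instance, a minimal sketch based on the atm_output file from the test case above (the exact attribute combination is an assumption for illustration, not taken verbatim from the ticket):

    <!-- hypothetical sketch: write atm_output in attached mode, bypassing the writer service -->
    <file id="atm_output" output_freq="1ts" type="one_file" enabled="true" mode="write" writer="attached">
      <field field_ref="field3D" enabled="true"/>
    </file>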

comment:5 Changed 8 months ago by acc

Yes, it works! Thanks Yann, that update has broken the deadlock. My example is now working as expected. This is with xios3-trunk@2551 on an Intel cluster using Intel-MPI. I'll re-test on a Cray later. I've attached the successful client and server logs (info_level=10) which confirm everything is working as in the schematic. Next task is to try some real applications and see how it copes.

Changed 8 months ago by acc

Client and server logs for a working example

comment:6 Changed 8 months ago by ymipsl

Ok, great!

I think a good idea would be to increase the size of the test case to see the behaviour when running in internode mode with Intel MPI. If a problem occurs, it will be easier for us to solve it on the generic test case than on a real application.
For now, we implement two different transport protocols to communicate between clients and servers:

  • legacy protocol: it uses p2p communication (MPI_Isend/MPI_Irecv, MPI_Improbe, ...) with a small amount of passive one-sided communication
  • one_sided protocol: only passive one-sided communication.

You can test one or the other by setting the XIOS "transport_protocol" variable:

<variable id="transport_protocol" type="string" >legacy</variable> (default)

<variable id="transport_protocol" type="string" >one_sided</variable>

It may be interesting to test on different architectures.

comment:7 Changed 8 months ago by acc

Confirmation that the same example works on a Cray AMD cluster (ARCHER2) when compiled with the Cray compilers (version 15.0.0) and the Cray MPICH libraries.
It also works successfully when scaled up to 256 cores spread over 4 nodes (150 atm, 74 oce, 15 atm gatherers + 1 atm writer, 15 oce gatherers + 1 oce writer); a sketch of the corresponding pool_definition is given after the module list below. For the record:

Currently Loaded Modules:
  1) craype-x86-rome                         7) craype/2.7.19          13) load-epcc-module
  2) libfabric/1.12.1.2.2.0.0                8) cray-dsmml/0.2.2       14) cray-mpich/8.1.23
  3) craype-network-ofi                      9) cray-libsci/22.12.1.1  15) cray-hdf5-parallel/1.12.2.1
  4) perftools-base/22.12.0                 10) PrgEnv-cray/8.3.3      16) cray-netcdf-hdf5parallel/4.9.0.1
  5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta  11) bolt/0.8
  6) cce/15.0.0                             12) epcc-setup-env
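For reference, a hedged sketch of what the pool_definition for this scaled-up run could look like (pool and service names reused from the original iodef.xml; nprocs values taken from the counts quoted above, i.e. 32 servers split evenly between the two pools):

    <pool_definition>
      <pool name="pool_atm" global_fraction="0.5">
        <service name="gatherer1" nprocs="15" type="gatherer"/>
        <service name="writer1" nprocs="1" type="writer"/>
      </pool>
      <pool name="pool_ocean" global_fraction="0.5" >
        <service name="gatherer2" nprocs="15" type="gatherer"/>
        <service name="writer2" nprocs="1" type="writer"/>
      </pool>
    </pool_definition>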

There is no transport_protocol set explicitly, so I assume this is legacy. iodef.xml in generic_testcase has:

        <variable id="pure_one_sided" type="bool">false</variable>
        <variable id="check_event_sync" type="bool">true</variable>

Are these related/superseded?

comment:8 Changed 8 months ago by ymipsl

Great, very good news. I am very happy about this.
When using Intel MPI, what is your library version?

Yes, when no transport protocol is specified, legacy is the default.

The variable/parameter "pure_one_sided" is only set for development purposes on the legacy transport (it forces all transfers to use one-sided communication) and must not be specified for common usage (default=false).

The variable/parameter "check_event_sync" is used for debugging. In this case XIOS checks that all events sent to the server are coherent, i.e. same event type and same timeline. Since all XIOS calls must be collective, it can help if some process misses a call. To do this check, collective communications are involved and all clients are synchronized, so in production mode we recommend setting it to false (which is the default).
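To make the recommended production settings concrete, a minimal sketch of the corresponding variable entries (both values shown are the stated defaults, so in practice the two lines could simply be omitted):

    <!-- sketch: debugging aids disabled for production runs, as recommended above -->
    <variable id="pure_one_sided" type="bool">false</variable>
    <variable id="check_event_sync" type="bool">false</variable>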

Could you try to rerun the same tests using the one_sided protocol by adding:
<variable id="transport_protocol" type="string" >one_sided</variable>

comment:9 Changed 8 months ago by acc

Ok, the 256-core / 4-node example also works on our Intel cluster (legacy). At least, it produces valid output files, but the .err files have content and it does not seem to have ended cleanly. E.g.:

cat xios_client_006.err
-> error : WARNING: Unexpected request for buffer to communicate with server 3
-> error : WARNING: Unexpected request for buffer to communicate with server 0
-> error : WARNING: Unexpected request for buffer to communicate with server 6

cat xios_server_225.err
-> error : WARNING: Unexpected request for buffer to communicate with server 0

 tail -10l slurm-248072.out
Server Context destructor
Server Context destructor
Server Context destructor
Server Context destructor
Server Context destructor
Abort(806969871) on node 150 (rank 150 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(216)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1335)..............:
MPIDI_OFI_mpi_finalize_hook(2258): OFI domain close failed (ofi_init.c:2258:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)

But, looking back, the Cray test had similar 'errors' so perhaps these don't matter. For the record:

6) UCX/1.11.2-GCCcore-11.2.0
21) intel-compilers/2021.4.0
22) impi/2021.4.0-intel-compilers-2021.4.0
23) iimpi/2021b
24) slurm/21.08.5

comment:10 Changed 8 months ago by acc

Finally, both Intel and Cray work with the extra:

<variable id="transport_protocol" type="string" >one_sided</variable>

setting. But there is no evidence in any of the output that this has activated anything different.
