= Name and subject of the action

Last edition: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay on preview (or review) is longer than the expected 2 weeks.

[[PageOutline(2, , inline)]]

== Summary

||=Action || MPI3 neighbourhood collective communications instead of point-to-point communications ||
||=PI(S) || Silvia Mocavero and Italo Epicoco ||
||=Digest || MPI-3 provides new neighbourhood collective operations that allow halo exchange to be performed with a single MPI communication call. ||
||=Dependencies || If any ||
||=Branch || dev_r13296_HPC-07_mocavero_mpi3 ||
||=Previewer(s) || Mirek Andrejczuk ||
||=Reviewer(s) || Mirek Andrejczuk ||
||=Ticket || #2496 ||

=== Description

This is the continuation of the work started in 2019 (HPC-12_Mocavero_mpi3). \\
MPI-3 provides new neighbourhood collective operations (i.e. MPI_Neighbor_allgather and MPI_Neighbor_alltoall) that allow halo exchange to be performed with a single MPI communication call.\\
These collective communications were integrated and tested in the NEMO code during 2019 in order to compare the code performance with the traditional point-to-point halo exchange currently implemented in NEMO. The first version of the implementation uses a cartesian topology, so it supports neither the 9-point stencil nor land domain exclusion, and the north fold is handled as usual. The new collective communications have been tested on a representative kernel implementing the FCT advection scheme.\\
Preliminary tests show an improvement in the range of 18%-32% on the GYRE_PISCES configuration (with nn_GYRE=200), depending on the number of allocated cores. The output accuracy is preserved.
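
The difference between the 5-point and 9-point exchange patterns, and the effect of land domain exclusion, can be illustrated with a small neighbour-list sketch. This is not NEMO code: the function name `build_neighbours`, the row-major rank numbering and the representation of land subdomains as a set of grid coordinates are all assumptions made for illustration, and the north fold and cyclic boundaries are ignored.

{{{#!python
# Illustrative sketch only (hypothetical names, not the NEMO implementation).
# Offsets (di, dj) to neighbouring subdomains on a 2D process grid.
FIVE_POINT = [(-1, 0), (1, 0), (0, -1), (0, 1)]               # W, E, S, N
NINE_POINT = FIVE_POINT + [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # plus diagonals


def build_neighbours(ni, nj, land=frozenset(), nine_point=False):
    """Return {rank: [neighbour ranks]} for the ocean subdomains of an ni x nj grid.

    Subdomains listed in `land` get no rank and never appear as neighbours.
    Closed boundaries are assumed (no cyclic exchange, no north fold).
    """
    # Number only the ocean subdomains, row by row (assumed numbering).
    ocean = [(i, j) for j in range(nj) for i in range(ni) if (i, j) not in land]
    rank_of = {cell: rank for rank, cell in enumerate(ocean)}

    offsets = NINE_POINT if nine_point else FIVE_POINT
    graph = {}
    for (i, j), rank in rank_of.items():
        neighbours = []
        for di, dj in offsets:
            cell = (i + di, j + dj)
            # Cells outside the grid or on land are simply absent from rank_of.
            if cell in rank_of:
                neighbours.append(rank_of[cell])
        graph[rank] = neighbours
    return graph


if __name__ == "__main__":
    # 3 x 3 process grid with one land-only subdomain in a corner.
    for nine in (False, True):
        graph = build_neighbours(3, 3, land={(0, 0)}, nine_point=nine)
        label = "9-point" if nine else "5-point"
        print(label, "neighbours of the central subdomain:", sorted(graph[3]))
}}}

In this sketch every rank can have a different number of neighbours, which a regular cartesian topology cannot express. Per-rank lists of this kind are the sort of input that an MPI distributed graph constructor such as MPI_Dist_graph_create_adjacent takes as sources and destinations; whether the branch uses that particular constructor is not stated here.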
\\
During 2020 we intend to integrate the graph topology in order to support, through MPI3 neighbourhood collective communications, the routines that use a 9-point stencil, land domain exclusion and the north fold exchanges. \\

=== Implementation

Step 1: alignment of the dev_r13296_HPC-07_mocavero_mpi3 branch with the new trunk (after July merge party) (done)\\
Step 2: integration of the graph topology to support halo exchange for both 5-point stencil computation (when exchange with only the north, south, east and west processes is enough to preserve data dependency) and 9-point stencil computation (when exchange with diagonal processes is also needed). Land domain exclusion is also handled thanks to the flexibility of the graph topology. A parameter in the lbc_lnk mpi3 routine call selects between the 5-point and 9-point exchange (done)\\
Step 3: add lbc_lnk mpi3 in the traadv_fct.F90 (5-point stencil) and icedyn_rhg_evp.F90 (9-point stencil) files to perform comparability tests. SETTE tests will be executed, also with land domain exclusion activated\\
Step 4: run performance tests to evaluate the gain for both the 5-point and 9-point stencils (done)\\
Step 5: replacement of point-to-point communications with collective ones within the NEMO code. The choice between 5-point and 9-point exchange requires a data dependency analysis.
The replacement will be performed in three steps:\\
step 5.1: all the lbc_lnk calls will be replaced with the 9-point mpi3 exchange (a key_mpi3 will be introduced to preserve the old point-to-point exchange version, to be used on architectures where MPI3 is not supported or does not provide a performance gain) (done)\\
step 5.2: the 5-point stencil exchange is introduced where data dependency is satisfied without diagonal exchange (in 2021)\\
step 5.3: key_mpi3 will be removed (once the ST confirms that the implementation performs better) \\

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).
}}}

''No need for changes in documentation.''

== Preview

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#tests)]]
}}}

''Performance tests have been done on the CMCC Zeus machine, by evaluating the improvement in the communication time of two representative routines, implementing the 5-point and 9-point stencils respectively. With the 5-point version, an improvement in the range of 15-31% is achieved on the GYRE_PISCES configuration, depending on the number of allocated cores. As expected, the 9-point stencil does not provide the same improvement, although a modest gain is still achieved. Restartability and reproducibility tests pass for all the SETTE configurations.
Report:
Current code is : NEMO/branches/2020/dev_r13296_HPC-07_mocavero_mpi3 @ r13855 ( last change @ r13807 )
SETTE validation report generated for :
NEMO/branches/2020/dev_r13296_HPC-07_mocavero_mpi3 @ r13807+ (last changed revision)
on ifort_zeus_xios arch file
!!---------------1st pass------------------!!
!----restart----!
\\ WGYRE_PISCES_ST run.stat restartability passed : 13807+\\
WGYRE_PISCES_ST tracer.stat restartability passed : 13807+\\
WORCA2_ICE_PISCES_ST run.stat restartability passed : 13807+\\
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 13807+\\
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 13807+\\
WAMM12_ST run.stat restartability passed : 13807+\\
WORCA2_SAS_ICE_ST run.stat restartability passed : 13807+\\
WAGRIF_DEMO_ST run.stat restartability passed : 13807+\\
WSPITZ12_ST run.stat restartability passed : 13807+\\
WISOMIP_ST run.stat restartability passed : 13807+\\
WOVERFLOW_ST run.stat restartability passed : 13807+\\
WLOCK_EXCHANGE_ST run.stat restartability passed : 13807+\\
WVORTEX_ST run.stat restartability passed : 13807+\\
WICE_AGRIF_ST run.stat restartability passed : 13807+\\
!----repro----! \\
WGYRE_PISCES_ST run.stat reproducibility passed : 13807+\\
WGYRE_PISCES_ST tracer.stat reproducibility passed : 13807+\\
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 13807+\\
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 13807+\\
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 13807+\\
WAMM12_ST run.stat reproducibility passed : 13807+\\
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 13807+\\
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 13807+\\
WAGRIF_DEMO_ST run.stat reproducibility passed : 13807+\\
WSPITZ12_ST run.stat reproducibility passed : 13807+\\
WISOMIP_ST run.stat reproducibility passed : 13807+\\
WVORTEX_ST run.stat reproducibility passed : 13807+\\
WICE_AGRIF_ST run.stat reproducibility passed : 13807+\\
!----agrif check----! \\
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 13807+ 13807+\\
!----result comparison check----!
\\ check result differences between :\\
VALID directory : /work/asc/sm31219/NEMO_work/mpi3/merge_trunk/dev_r13296_HPC-07_mocavero_mpi3/../NEMO_VALIDATION_TRUNK at rev 13807+\\
and\\
REFERENCE directory : /work/asc/sm31219/NEMO_work/mpi3/merge_trunk/NEMO_VALIDATION_TRUNK at rev ref\\
WGYRE_PISCES_ST run.stat files are identical \\
WGYRE_PISCES_ST tracer.stat files are identical \\
WORCA2_ICE_PISCES_ST run.stat files are identical \\
WORCA2_ICE_PISCES_ST tracer.stat files are identical \\
WORCA2_OFF_PISCES_ST tracer.stat files are identical \\
WAMM12_ST run.stat files are identical \\
WISOMIP_ST run.stat files are identical \\
WORCA2_SAS_ICE_ST run.stat files are identical \\
WAGRIF_DEMO_ST run.stat files are identical \\
WSPITZ12_ST run.stat files are identical \\
WISOMIP_ST run.stat files are identical \\
WVORTEX_ST run.stat files are identical \\
''

== Review

''Changes are OK. The documentation, however, should be modified to include the new cpp key introduced in this development (including sette). The documentation update should not prevent this development from being merged into the trunk in 2020.''