[[PageOutline]]

Last edited [[Timestamp]]

[[BR]]

'''Author''' : acc

'''ticket''' : #679

'''Branch''' : [https://forge.ipsl.jussieu.fr/nemo/browser/branches/DEV_1879_mpp_sca DEV_1879_mpp_sca ]

----
=== Description ===
This branch introduces code to minimise the use of the mpi_allgather operation during the north-fold exchanges. PRACE investigators found significant performance gains with similar changes when using large numbers of processors.

[[BR]]

'''Method'''[[BR]]

A new routine, opa_northcomms, is introduced into opa.F90. It uses the existing method to work out which other processors are directly involved in the north-fold exchanges. It does this for T, U, V and F points and uses the masks so that a neighbour is not included if its boundary is wholly land.

Once those lists have been established, the mpp_lbc_north routines (in lib_mpp.F90) use them to exchange only with "active" neighbours. These exchanges populate the same ztab array that the mpi_allgather method uses, and the routines then call lbc_nfd to carry out the fold operation. The difference is that instead of filling the whole ztab array (which requires every northern-row processor to communicate with every other northern-row processor), only those gridcells that will be folded onto an individual processor's domain are exchanged. The reduction in communication should lead to performance gains when using large numbers of processors.

The current implementation has been successfully tested in standard ORCA2 and ORCA1 configurations. Test results are identical with and without the modifications. For these configurations, there is no degradation in performance.

Still to do:

 1. Work out how to deal with 'I' points.
 1. Check that the method works successfully when land-only regions have been discarded (i.e. jpnij /= jpni*jpnj).
 1. Demonstrate and quantify the benefit with ORCA025 and ORCA12.

----
=== Testing ===
Testing could consider (where appropriate) other configurations in addition to NVTK.

||NVTK Tested||!'''NO!'''||
||Other model configurations||YES||
||Processor configurations tested||ORCA2: 2x2 and 8x4; ORCA1: 8x4||
||If adding new functionality please confirm that the [[BR]]New code doesn't change results when it is switched off [[BR]]and !''works!'' when switched on||YES||

=== Bit Comparability ===
||Does this change preserve answers in your tested standard configurations (to the last bit) ?||!'''YES/NO !'''||
||Does this change bit compare across various processor configurations. (1xM, Nx1 and MxN are recommended)||!'''YES/NO!'''||
||Is this change expected to preserve answers in all possible model configurations?||!'''YES/NO!'''||
||Is this change expected to preserve all diagnostics? [[BR]]!,,!''Preserving answers in model runs does not necessarily imply preserved diagnostics. !''||!'''YES/NO!'''||

If you answered !'''NO!''' to any of the above, please provide further details:

 * Which routine(s) are causing the difference?
 * Why the changes are not protected by a logical switch or new section-version
 * What is needed to achieve regression with the previous model release (e.g. a regression branch, hand-edits etc). If this is not possible, explain why not.
 * What do you expect to see occur in the test harness jobs?
 * Which diagnostics have you altered and why have they changed?

Please add details here........

----
=== System Changes ===
||Does your change alter namelists?||NO||
||Does your change require a change in compiler options?||NO||

----
=== Resources ===
!''Please !''summarize!'' any changes in runtime or memory use caused by this change......!''

----
=== IPR issues ===
||Has the code been wholly (100%) produced by NEMO developers staff working exclusively on NEMO?||YES||
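
----
=== Illustrative sketch of the neighbour selection ===
The neighbour selection described under '''Method''' can be illustrated with a small, self-contained Python sketch. It is not the Fortran code in opa_northcomms. The function name `fold_neighbours`, its arguments, and the simple mirror mapping `i -> n_global - 1 - i` are assumptions made for illustration only; the real fold uses different index arithmetic for T, U, V and F points and also has to account for halo columns. The sketch only shows the idea of keeping, for each northern-row processor, the processors that own the mirror image of its columns, and of skipping those whose boundary is wholly land.

```python
def fold_neighbours(starts, widths, n_global, land_only=()):
    """List, for each northern-row processor, the processors it must exchange with.

    starts[p], widths[p]: first global column (0-based) and number of columns
                          owned by processor p (hypothetical layout, no halos).
    n_global:             number of columns in the global northern row.
    land_only:            processors whose northern boundary is wholly land.
    """
    neighbours = []
    for s, w in zip(starts, widths):
        # Mirror image of the interval [s, s + w - 1] under the assumed
        # fold mapping i -> n_global - 1 - i.
        lo = n_global - 1 - (s + w - 1)
        hi = n_global - 1 - s
        active = []
        for q, (sq, wq) in enumerate(zip(starts, widths)):
            if q in land_only:
                continue  # wholly-land neighbours are never contacted
            # Keep q if its columns overlap the mirrored interval.
            if sq <= hi and sq + wq - 1 >= lo:
                active.append(q)
        neighbours.append(active)
    return neighbours


if __name__ == "__main__":
    # Eight processors along the northern row, ten columns each.
    starts = [10 * p for p in range(8)]
    widths = [10] * 8
    for p, active in enumerate(fold_neighbours(starts, widths, 80)):
        print(p, active)
```

In this idealised layout each processor ends up with a single active neighbour, whereas the gather-based approach has every northern-row processor communicate with every other one. The actual neighbour counts in a real decomposition depend on the halo width and on the grid-point type.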