New URL for NEMO forge!   http://forge.nemo-ocean.eu

Since March 2022, along with the NEMO 4.2 release, code development has moved to a self-hosted GitLab.
The present forge is now archived and remains online for reference.

HPC WG


Working group leaders (and responsible for wiki pages): Mike Bell & Silvia Mocavero


Members of the Working group:

  • Miroslaw Andrejczuk (Met Office/ST)
  • Mike Bell (Met Office)
  • Miguel Castrillo (BSC)
  • Louis Douriez (ATOS)
  • Matt Glover (Met Office)
  • David Guibert (ATOS)
  • Claire Levy (CNRS/NEMO Project Manager)
  • Eric Maisonnave (CERFACS)
  • Sébastien Masson (CNRS/ST)
  • Francesca Mele (CMCC/ST)
  • Silvia Mocavero (CMCC/ST)
  • Andrew Porter (STFC)
  • Erwan Raffin (ATOS)
  • Oriol Tinto (BSC)
  • Mario Acosta (BSC)

Other members

  • Lucien Anton (CRAY)
  • Tom Bradley (NVIDIA)
  • Clement Bricaud (MERCATOR/ST)
  • Marcin Chrust (ECMWF)
  • Peter Dueben (ECMWF)
  • Marie-Alice Foujols (CNRS)
  • Jason Holt (NOC)
  • Dmitry Kuts (Intel)
  • Michael Lange (ECMWF)
  • Julien Le Sommer (CNRS)
  • Gurvan Madec (CNRS/NEMO Scientific Leader)
  • Yann Meurdesoif (CEA)
  • Kristian Mogensen (ECMWF)
  • Stan Posey (NVIDIA)
  • Martin Price (Met Office/ST)
  • Martin Schreiber (Univ of Exeter)
  • Kim Serradell (BSC)
  • Nils Wedi (ECMWF)

Former members

  • Jeremy Appleyard (NVIDIA)
  • Mondher Chekki (Mercator-Ocean)
  • Tim Graham (Met Office)
  • Cyril Mazauric (ATOS)

Main objectives and activities: 2018

NEMO benchmark

Description: development of a benchmark (idealised geometry, equal MPI sub-domain size, realistic parameters, parametrisable communications, internal timings ...) at different resolutions to identify the bottlenecks to NEMO scalability.
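As a rough illustration of the "internal timings" idea, here is a minimal sketch, not the actual BENCH code: the kernel, array sizes and the single timed region are invented for the example. It times a synthetic loop nest at each time step with MPI_Wtime and reports the slowest rank, which is where load imbalance shows up as subdomains shrink.

{{{#!fortran
! Minimal per-kernel timing sketch (synthetic kernel, not the BENCH code):
! each rank times the same loop nest, the maximum over ranks is reported.
program bench_timing_sketch
   use mpi
   implicit none
   integer, parameter :: jpi = 100, jpj = 100, jpk = 75, nsteps = 10
   real(8), dimension(jpi,jpj,jpk) :: ta, tb
   real(8) :: t0, t_local, t_max
   integer :: ji, jj, jk, jt, ierr, rank

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   tb = 1.0d0 ; ta = 0.0d0 ; t_local = 0.0d0

   do jt = 1, nsteps
      t0 = mpi_wtime()
      do jk = 1, jpk                      ! synthetic "advection-like" kernel
         do jj = 2, jpj-1
            do ji = 2, jpi-1
               ta(ji,jj,jk) = 0.25d0 * ( tb(ji-1,jj,jk) + tb(ji+1,jj,jk)   &
                  &                    + tb(ji,jj-1,jk) + tb(ji,jj+1,jk) )
            end do
         end do
      end do
      t_local = t_local + ( mpi_wtime() - t0 )
   end do

   ! the slowest rank sets the pace: load imbalance shows up here
   call mpi_reduce(t_local, t_max, 1, mpi_double_precision, mpi_max, 0, mpi_comm_world, ierr)
   if( rank == 0 ) print *, 'kernel time (max over ranks, s):', t_max, '  checksum:', ta(2,2,1)
   call mpi_finalize(ierr)
end program bench_timing_sketch
}}}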

Involved people: Sébastien Masson (CNRS/ST), Eric Maisonnave (CERFACS), Louis Douriez (ATOS), David Guibert (ATOS), Erwan Raffin (ATOS)

WP action: HPC09_ESIWACE

Intra-node performance

Description: analysis and improvement of NEMO intra-node performance, to move real performance closer to peak and to reduce intra-node communications. Analysis at kernel level to understand which kernels limit performance; better use of the memory hierarchy; loop vectorisation. Investigation and integration of a second level of parallelism based on the shared-memory paradigm, using OpenMP (fine- and coarse-grained parallelisation).
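A minimal sketch of the fine-grained (loop-level) OpenMP parallelisation mentioned above, applied to a generic NEMO-like triple loop; the array names, sizes and the collapse/schedule choices are illustrative assumptions, not taken from the NEMO source.

{{{#!fortran
! Fine-grained OpenMP sketch: threads share one triple loop; collapse(2)
! exposes enough iterations for a many-core node, the innermost loop is
! left contiguous for SIMD vectorisation.
program omp_finegrain_sketch
   use omp_lib
   implicit none
   integer, parameter :: jpi = 100, jpj = 100, jpk = 75
   real(8), dimension(jpi,jpj,jpk) :: ta, tb
   integer :: ji, jj, jk

   tb = 1.0d0 ; ta = 0.0d0

   !$omp parallel do collapse(2) private(ji,jj,jk) schedule(static)
   do jk = 1, jpk
      do jj = 2, jpj-1
         do ji = 2, jpi-1
            ta(ji,jj,jk) = 0.5d0 * ( tb(ji-1,jj,jk) + tb(ji+1,jj,jk) )
         end do
      end do
   end do
   !$omp end parallel do

   print *, 'threads available:', omp_get_max_threads(), '  checksum:', ta(2,2,1)
end program omp_finegrain_sketch
}}}

Coarse-grained parallelisation would instead keep a single parallel region across several kernels, with each thread working on its own block of the subdomain; the directive-per-loop form above is the simpler entry point.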

Involved people: Silvia Mocavero (CMCC/ST), Francesca Mele (CMCC/ST), Mike Bell (MetO), Miroslaw Andrejczuk (MetO/ST), Matt Glover (MetO)

WP actions: WP2018-01_Silvia Mocavero_singlecoreperf, WP2018-02_Francesca_Mele_hybrid

Inter-node communications

Description: limiting the communication overhead by reducing the number and frequency of MPI communications, and by investigating advanced MPI communication features and techniques to overlap communication and computation.
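To illustrate the "do fewer, bigger communications" direction, the sketch below packs two fields into one buffer so that a single MPI exchange replaces two; the 1-D periodic decomposition and the field names are assumptions made for brevity, and this is not the NEMO lbc_lnk implementation.

{{{#!fortran
! Sketch of aggregating two halo exchanges into one message
! (1-D east-west decomposition, periodic, one halo column per field).
program halo_aggregate_sketch
   use mpi
   implicit none
   integer, parameter :: jpi = 64, jpj = 64
   real(8) :: u(jpi,jpj), v(jpi,jpj)
   real(8) :: sendbuf(2*jpj), recvbuf(2*jpj)
   integer :: rank, nproc, east, west, ierr
   integer :: status(mpi_status_size)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   call mpi_comm_size(mpi_comm_world, nproc, ierr)
   east = mod(rank+1, nproc) ; west = mod(rank-1+nproc, nproc)
   u = real(rank,8) ; v = 2.0d0*real(rank,8)

   ! pack the easternmost inner column of both fields into a single buffer
   sendbuf(1:jpj)       = u(jpi-1,:)
   sendbuf(jpj+1:2*jpj) = v(jpi-1,:)

   ! one message instead of two: half the number of latencies paid
   call mpi_sendrecv(sendbuf, 2*jpj, mpi_double_precision, east, 0,   &
      &              recvbuf, 2*jpj, mpi_double_precision, west, 0,   &
      &              mpi_comm_world, status, ierr)

   ! unpack into the western halo column of each field
   u(1,:) = recvbuf(1:jpj)
   v(1,:) = recvbuf(jpj+1:2*jpj)

   if( rank == 0 ) print *, 'west halo received:', u(1,1), v(1,1)
   call mpi_finalize(ierr)
end program halo_aggregate_sketch
}}}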

Involved people: Silvia Mocavero (CMCC/ST)

WP actions: WP2018-03_Silvia Mocavero_globcomm, WP2018-04_Silvia Mocavero_mpi3, WP2018-05_AndrewC-extendedhaloes

Separation of concerns

Description: investigation of a light-DSL approach to apply a DSL to NEMO without impacting the NEMO coding structure rules. The approach consists of processing the NEMO code to create an internal representation compliant with the PSyclone tool, and of manipulating this intermediate code to perform PSyclone transformations.

Involved people: Andrew Porter (STFC)

WP action: -

I/O

Description: investigation of techniques to reduce the I/O overhead, which increases with the number of processors. Extension of the use of XIOS for reading and writing restart files.
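The sketch below is not the XIOS API; it is a generic MPI-IO illustration of the underlying idea, namely letting every process write its own part of a restart field into one shared file instead of funnelling all data through a single writer. The file name, field size and layout are invented for the example.

{{{#!fortran
! Generic parallel-write sketch (plain MPI-IO, not the XIOS API): every
! rank writes its own chunk of a "restart" field into one shared file,
! avoiding a gather onto a single writing process.
program parallel_restart_sketch
   use mpi
   implicit none
   integer, parameter :: nloc = 1000            ! local points per rank (illustrative)
   real(8) :: field(nloc)
   integer :: rank, fh, ierr
   integer(kind=mpi_offset_kind) :: offset

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   field = real(rank, 8)                        ! stand-in for a local restart field

   call mpi_file_open(mpi_comm_world, 'restart_sketch.bin',              &
      &               mpi_mode_wronly + mpi_mode_create, mpi_info_null, fh, ierr)
   offset = int(rank, mpi_offset_kind) * nloc * 8_mpi_offset_kind        ! byte offset of this rank's chunk
   call mpi_file_write_at_all(fh, offset, field, nloc, mpi_double_precision, &
      &                       mpi_status_ignore, ierr)
   call mpi_file_close(fh, ierr)
   call mpi_finalize(ierr)
end program parallel_restart_sketch
}}}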

Involved people: Miroslaw Andrejczuk (Met Office/ST)

WP actions: WP2018-06_andmirek-XIOSread, WP2018-07_andmirek_XIOSwrite

Mixed-precision

Description: optimization of the NEMO model using a mixed-precision approach. Study of the precision needed by the different processes in NEMO. Implementation of the mixed-precision approach in NEMO.
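A minimal sketch of the mechanism a mixed-precision build relies on: a working-precision kind parameter chosen at compile time, with sensitive accumulations kept in double precision. The module and variable names are illustrative and are not the NEMO kind definitions.

{{{#!fortran
! Working-precision switch sketch: most fields use wp (here single),
! a sensitive sum is accumulated in double precision.
module kind_sketch
   implicit none
   integer, parameter :: sp = selected_real_kind( 6,  37)   ! single
   integer, parameter :: dp = selected_real_kind(12, 307)   ! double
   integer, parameter :: wp = sp            ! flip to dp for a fully double build
end module kind_sketch

program mixed_precision_sketch
   use kind_sketch
   implicit none
   integer, parameter :: n = 1000000
   real(wp) :: field(n)
   real(dp) :: acc
   integer  :: i

   field = 1.0e-3_wp
   acc   = 0.0_dp
   do i = 1, n                              ! accumulate in double to limit round-off
      acc = acc + real(field(i), dp)
   end do
   print *, 'working-precision kind:', wp, '  sum:', acc
end program mixed_precision_sketch
}}}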

Involved people: Miguel Castrillo (BSC), Oriol Tinto (BSC)

WP action: WP2018-08_Mixed_precision


Minutes of Meetings

TOC(WorkingGroups/HPC/Mins_2*, depth=1)?

Minutes of Subgroup Meetings

TOC(WorkingGroups/HPC/Mins_s*, depth=1)?

Minutes of Domain Specific Meetings

TOC(WorkingGroups/HPC/Mins_MP*, depth=1)?


Old version of page

Working group leader (and responsible for wiki pages): Sébastien Masson.


Members of the Working group:

  • Sébastien Masson
  • Italo Epicoco
  • Silvia Mocavero
  • Marie-Alice Foujols
  • Jason Holt
  • Gurvan Madec
  • Mondher Chekki

Objectives:

  • make short term recommendations for improving the performance of the existing system
  • propose criteria for taking decisions at Gateway 2025 regarding HPC.
  • provide more detail on Gung-Ho (esp. regarding its implications for mesh discretization)
  • identify other possible strategies and approaches for evolutions in the long term.
  • define a simple configuration (with IO and complex geometry) that will serve as a proof of concept for validating the proposed approach for the future system.

Some ideas...:

Document by Seb detailing some short-term actions to help reduce communications.

A strong improvement of NEMO scalability is needed to take advantage of the new machines. This probably means a deep review/rewrite of the NEMO code at some point in the future (beyond 5 years from now?). At the same time, we already know that CMIP7 won't use an ocean model that has not been thoroughly tested and validated, and will stick to a NEMO model not far from the existing one.
This means that we need to:

  1. keep improving the current structure of NEMO so it works quite efficiently for almost 10 more years (until the end of CMIP7).
  2. start to work on a new structure that would be fully tested and validated at least for CMIP8 in about 10 years.

Based on this, we propose to divide the work according to 3 temporal windows

0-3 years: improvements with existing code:

  1. remove solvers and global sums (to be done in 3.7)
  2. reduce the number of communications: do fewer and bigger communications (group communications, use a larger halo). Main priority: communications in the time splitting and sea-ice rheology.
  3. reduce the number of communications: remove useless communications (a lot of them are simply associated with output...)
  4. introduce asynchronous communications
  5. check code vectorization (SIMD instructions)

0-5 years: improvements through the introduction of OpenMP:

Work initiated by CMCC:

  • implementations such as tiling may be efficient with many-core processors?
  • review lbclnk to be able to deal with MPI and OpenMP
  • OpenMP along the vertical axis?
  • find a way to remove implicit schemes?
  • test different ways to find new sources of parallelism, for example with the help of OpenMP 4
  • test OpenACC (not that far from OpenMP)?

beyond 5 years:

GungHo or not GungHo, that is the question...

Agenda:

For the next 2 years, as a start, a workshop to be organized in 2015 on “NEMO in 2025: routes toward multi-resolution approaches”.


Comments of group members:

gurvan -- (2014 november 11):

  • improving code efficiency implies using more processors for a given application. This means breaking the current limit of 35x35 local horizontal domains. The 3-year propositions go in that direction. One point is missing: a target for an ORCA 1/36° is a 10x10 local domain, to be able to use 1 million cores... In that case the number of horizontal grid points is about the same as the number of vertical levels (about 100 levels is currently what we are running). So, do we have to consider changing the indexation of arrays from i-j-k to k-j-i? (see the sketch after this list)
  • Sea-ice running in parallel with the ocean on its own set of processors (with a 1 time-step asynchronous coupling between ice and ocean).
  • BGC running in parallel with the ocean on its own set of processors.
  • BGC: obviously, on-line coarsening significantly reduces the cost of BGC models; a further improvement can be achieved by considering the SMS terms of BGC as a big 1D vector and computing only over the required area (ocean points only, ocean and euphotic layer only, etc...). Same idea for sea-ice physics...
  • Remark: the version of MOM currently under development (MOM5: switch to C-grid, use of a finite-volume approach,...) is using FMS, a GungHo-type approach... and "There are dozens of scientists and engineers at GFDL focused on meeting the evolving needs of climate scientists pushing the envelope of computational tools for studying climate"
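As a small illustration of the i-j-k versus k-j-i question raised in the first point above, the sketch below declares the same field with both index orderings; the sizes follow the 10x10 horizontal / ~100 level example, everything else is invented.

{{{#!fortran
! i-j-k versus k-j-i sketch: with a 10x10 horizontal subdomain and ~100
! levels, putting k first makes the contiguous (fastest varying) dimension
! the long one. Sizes are illustrative.
program index_order_sketch
   implicit none
   integer, parameter :: jpi = 10, jpj = 10, jpk = 100
   real(8) :: t_ijk(jpi,jpj,jpk)   ! current ordering: stride-1 runs of length jpi = 10
   real(8) :: t_kji(jpk,jpj,jpi)   ! candidate ordering: stride-1 runs of length jpk = 100
   integer :: ji, jj, jk

   ! current layout: innermost loop over ji for stride-1 access (short runs)
   do jk = 1, jpk
      do jj = 1, jpj
         do ji = 1, jpi
            t_ijk(ji,jj,jk) = dble(ji + jj + jk)
         end do
      end do
   end do

   ! k-first layout: innermost loop over jk gives long stride-1 runs,
   ! which vectorise better when the horizontal subdomain is tiny
   do ji = 1, jpi
      do jj = 1, jpj
         do jk = 1, jpk
            t_kji(jk,jj,ji) = dble(ji + jj + jk)
         end do
      end do
   end do

   print *, sum(t_ijk), sum(t_kji)
end program index_order_sketch
}}}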

Sebastien -- (2014 november 17): some ideas I heard about asynchronous communications:

  • compute inner domain during communication of the halo:

Today in NEMO, we first do loops from 2 to jpi-1 and then do a communication to get the values at 1 and jpi that will be needed for the next loop involving neighbouring points. So, by default, we compute over the full domain (including the halo) and make a communication after the incomplete loops.
We could change this paradigm. The halo exists only so that we can compute over the inner domain, so we don't really need to compute over the full domain. We could compute only over the inner domain and make a communication before the loops that involve neighbouring points, to update the halo. This is what is done in WRF, for example. If we do communications before the loops, we can start a non-blocking communication, do the computation for 3 to jpi-2, receive the communication and finally do the computation for 2 and jpi-1 (note that in this case, by default, the halo does not hold updated data when the computation is finished). A minimal sketch of this pattern is given after the list below.

  • larger halo: could be done only on some variables, for example those using neighbours of neighbours (see what was done on the SOR solver).
  • in 3D loops: hide the communication at each level behind the computation of the next level... -> do not do one 3D communication but n 2D asynchronous communications. Good only if the communications are really hidden by the computation at each level (which will be less and less the case as the size of the subdomain decreases)... Is it really a good idea?
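A minimal sketch of the overlap pattern described above: post the halo exchange, compute the deep interior (3 to jpi-2) while the messages are in flight, then finish columns 2 and jpi-1 once the halo has arrived. The 1-D periodic decomposition, the buffers and the field names are assumptions for the example, not NEMO code.

{{{#!fortran
! Overlap sketch: post non-blocking halo exchange, compute the deep
! interior ji = 3 .. jpi-2 while messages are in flight, then finish
! columns 2 and jpi-1 once the halo has arrived.
program overlap_sketch
   use mpi
   implicit none
   integer, parameter :: jpi = 64, jpj = 64
   real(8) :: tb(jpi,jpj), ta(jpi,jpj)
   real(8) :: sw(jpj), se(jpj), rw(jpj), re(jpj)   ! send/recv buffers, west/east
   integer :: rank, nproc, east, west, ierr, ji, jj
   integer :: req(4), stats(mpi_status_size,4)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   call mpi_comm_size(mpi_comm_world, nproc, ierr)
   east = mod(rank+1, nproc) ; west = mod(rank-1+nproc, nproc)   ! periodic neighbours
   tb = real(rank,8) ; ta = 0.0d0

   ! 1) start the halo exchange before any computation
   sw = tb(2,:) ; se = tb(jpi-1,:)
   call mpi_irecv(rw, jpj, mpi_double_precision, west, 1, mpi_comm_world, req(1), ierr)
   call mpi_irecv(re, jpj, mpi_double_precision, east, 2, mpi_comm_world, req(2), ierr)
   call mpi_isend(se, jpj, mpi_double_precision, east, 1, mpi_comm_world, req(3), ierr)
   call mpi_isend(sw, jpj, mpi_double_precision, west, 2, mpi_comm_world, req(4), ierr)

   ! 2) compute the part of the inner domain that needs no halo
   do jj = 2, jpj-1
      do ji = 3, jpi-2
         ta(ji,jj) = 0.5d0 * ( tb(ji-1,jj) + tb(ji+1,jj) )
      end do
   end do

   ! 3) wait for the halo, copy it in, and finish the two boundary columns
   call mpi_waitall(4, req, stats, ierr)
   tb(1,:) = rw ; tb(jpi,:) = re
   do jj = 2, jpj-1
      ta(2,jj)     = 0.5d0 * ( tb(1,jj)     + tb(3,jj)   )
      ta(jpi-1,jj) = 0.5d0 * ( tb(jpi-2,jj) + tb(jpi,jj) )
   end do

   if( rank == 0 ) print *, 'checksum:', sum(ta)
   call mpi_finalize(ierr)
end program overlap_sketch
}}}

As noted in the last bullet, the benefit depends on the communications really being progressed and hidden while the interior is computed.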

Silvia -- (2014 november 25): about asynchronous communications:

  • compute inner domain during communication of the halo:

In the past, at CMCC we carried out some optimization activities on a regional configuration of NEMO v3.2 (covering the Mediterranean basin at 1/16°). The performance analysis highlighted the SOR as one of the most computationally intensive kernels, so our optimizations also focused on it. One activity aimed at overlapping communication and computation by changing the algorithm in the following way: (i) halo computation, (ii) asynchronous communications and (iii) computation over the inner domain (overlapped with the communication). The new algorithm was evaluated on the old MareNostrum system (decommissioned in 2013), in the context of an HPC-Europa application. It was theoretically evaluated using the Dimemas tool (developed at BSC), which showed that the new algorithm performed better than the old one, but the experimental results did not confirm the expectations. However, we can plan to test the communication/computation overlap paradigm on new architectures. The idea could be to extract some kernels characterized by the "do loops" you talked about, to change the communication algorithm and to test it before deciding to extend the modification to the entire code.

  • larger halo: could be done only on some variables, for example those using neighbours of neighbours (see what was done on the SOR solver).

A larger halo allows the communication frequency to be decreased, at the cost of computing over a larger domain. We need to identify the best trade-off between the decrease in communication time and the increase in computation time, i.e. the halo width that minimizes the total execution time. This width could depend on the number of MPI processes, the domain size and architectural parameters such as the communication latency, … (we have published a work on this aspect: "The performance model for a parallel SOR algorithm using the red-black scheme", Int. J. of High Performance Systems Architecture, 2012, Vol. 4, No. 2, pp. 101-109).
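As a toy illustration of this trade-off, the sketch below evaluates a simple cost model, computation on an (n+2h)^2 subdomain every step versus one exchange amortised over h steps, and reports the halo width with the smallest modelled time per step. All coefficients are made-up placeholders, not measured values, and this is not the model from the cited paper.

{{{#!fortran
! Toy cost model for the halo-width trade-off: one exchange (latency +
! bandwidth) amortised over h steps versus computation on an enlarged
! (n+2h)^2 subdomain every step. All coefficients are placeholders.
program halo_tradeoff_sketch
   implicit none
   integer, parameter :: n = 30                 ! inner subdomain edge (points)
   real(8), parameter :: t_flop = 1.0d-9        ! s per point update
   real(8), parameter :: t_lat  = 2.0d-6        ! s latency per exchange
   real(8), parameter :: t_byte = 1.0d-9        ! s per byte exchanged
   integer :: h, hbest
   real(8) :: t_comp, t_comm, t_step, tbest

   tbest = huge(1.0d0) ; hbest = 1
   do h = 1, 8
      t_comp = t_flop * dble( (n+2*h)**2 )                            ! per step, enlarged domain
      t_comm = ( t_lat + t_byte * dble( 8*4*h*(n+2*h) ) ) / dble(h)   ! one exchange every h steps
      t_step = t_comp + t_comm
      print '(a,i2,a,es10.3)', ' h =', h, '   modelled time/step (s) =', t_step
      if( t_step < tbest ) then
         tbest = t_step ; hbest = h
      end if
   end do
   print *, 'best halo width under this toy model:', hbest
end program halo_tradeoff_sketch
}}}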
