UPDATED Actions and notes from NEMO HPC working group meeting

Attending: Mike Bell, Miguel Castrillo, Marie-Alice Foujols, Tim Graham, Claire Levy, Gurvan Madec, Silvia Mocavero, Oriol Tinto-Prims

Apologies: Mondher Chekki, Martin Schreiber, Julien le Sommer

  1. Discuss results from various groups
     1. Barcelona (Miguel) – document circulated prior to meeting. The report describes the performance of ORCA2. Miguel is keen to study higher-resolution models. Section 4 highlights improvements achieved by reducing the number of communications between nodes (due to high network latency). The message packing and the reduced frequency of convergence checking have been included in the NEMO trunk at vn 3.6. The other change and an improvement to the message packing for the north fold are available as branches dev_r5302_CNRS18_HPC_scalability & dev_r5546_CNRS19_HPC_scalability.

The Dimemas simulator has not yet been used to simulate the impact of cache misses on performance. Gurvan noted that work is in progress to couple LIM3 through OASIS so that its domain decomposition can differ from NEMO's.

Action: Miguel to provide some information about the domain decompositions used in his report.

     2. CMCC (Silvia) – document and ppt circulated prior to meeting. The report describes a reduction in the communications used for the north pole fold, which has been included in the NEMO trunk at vn 3.6. The ppt describes investigations of the impact of three approaches to implementing OpenMP for the MUSCL advection scheme on a number of machines. Tim mentioned that the Met Office has had more success implementing OpenMP in its Unified Model on its CRAY than on its previous IBM machine.

     3. Met Office (Tim) (preliminary results on our new CRAY) – notes on tests circulated prior to the meeting. Results suggest calculations within a node are memory bound. Tim is exploring the CRAY PAT tool options and checking consistency with the NEMO timing outputs.
  2. What is the current level of performance of NEMO on HPCs?
     1. Compared with other ocean models – ROMS takes timesteps twice as long as NEMO's. It has OpenMP implemented at the level of its second loop (Gurvan gave a more precise description). Performance is very machine dependent. Gurvan – is there a reference on this?

Action: Gurvan to circulate a message from Steve Griffies describing the domain decomposition and performance of the GFDL 1/10 deg coupled model (complete).

Numbers from the GFDL coupled system, for CM2.6 (50 km atmosphere coupled to 1/10th degree MOM):

  • Atmos PEs = 1440
  • Ocean PEs = 17820
  • Ocean decomposition 160x160
  • Land-locked PEs masked out = 7780
  • 300 second ocean timestep
  • 1200 second atmosphere timestep
  • 1200 second coupling step
  • 12-13 hours for one year of simulation
  • 1 Tb of data per year
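The GFDL figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are made up for this note, and it assumes that "Ocean decomposition 160x160" means 160 subdomains in each horizontal direction.

```python
# Figures quoted in the notes for the GFDL CM2.6 coupled system.
ATMOS_PES = 1440
OCEAN_PES = 17820
OCEAN_DECOMPOSITION = (160, 160)   # assumed: subdomains in each direction
LAND_PES_MASKED = 7780
OCEAN_DT_S = 300
COUPLING_DT_S = 1200
WALLCLOCK_HOURS_PER_YEAR = (12, 13)


def active_ocean_subdomains(decomposition, masked):
    """Subdomains left once the all-land ones are dropped."""
    ni, nj = decomposition
    return ni * nj - masked


def simulated_years_per_day(hours_per_year):
    """Throughput in simulated years per wall-clock day."""
    return 24.0 / hours_per_year


active = active_ocean_subdomains(OCEAN_DECOMPOSITION, LAND_PES_MASKED)
masked_share = LAND_PES_MASKED / (OCEAN_DECOMPOSITION[0] * OCEAN_DECOMPOSITION[1])
ocean_steps_per_coupling = COUPLING_DT_S // OCEAN_DT_S
total_pes = ATMOS_PES + OCEAN_PES

print(f"active ocean subdomains      : {active}")
print(f"share of subdomains masked   : {masked_share:.1%}")
print(f"ocean steps per coupling step: {ocean_steps_per_coupling}")
print(f"total PEs (atmos + ocean)    : {total_pes}")
for hours in WALLCLOCK_HOURS_PER_YEAR:
    print(f"{hours} h per year -> {simulated_years_per_day(hours):.2f} simulated years/day")
```

Under that assumption the 160x160 decomposition gives 25600 subdomains, and removing the 7780 land-only ones leaves 17820, which matches the quoted ocean PE count.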

     2. Compared with the best that could reasonably be expected
        1. On a single processor / node: Silvia has looked at this and found calculations running at about 10% of peak. It’s not clear what could reasonably be expected.

Action: Silvia to circulate her document describing roofline modelling of Glob16 with BFM (biology model) on Blue Gene.
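As background to the roofline modelling mentioned above, the basic roofline bound can be sketched in a few lines. It shows how a memory-bound kernel can sit at a small fraction of peak. This is a generic illustration only: the function name and the node figures (100 GFLOP/s peak, 20 GB/s memory bandwidth) are hypothetical and are not taken from Silvia's document or from any machine discussed at the meeting.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable performance under the basic roofline model:
    the lower of the compute peak and bandwidth * arithmetic intensity."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


# Hypothetical node, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 100.0
BANDWIDTH_GB_S = 20.0

for intensity in (0.25, 0.5, 1.0, 5.0, 10.0):
    attainable = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    share = attainable / PEAK_GFLOPS
    print(f"{intensity:5.2f} flop/byte -> {attainable:6.1f} GFLOP/s ({share:.0%} of peak)")
```

On this made-up node, a kernel doing 0.5 flop per byte of memory traffic is capped at 10 GFLOP/s, so running at about 10% of peak would already be at the bound. Whether that applies to NEMO depends on the measured arithmetic intensity and the real machine figures.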

        2. Through parallelisation – Communication bottlenecks have been identified and some have been removed. Running computations and communications between nodes in parallel (overlapping them) has not been explored yet.
  3. What are the priorities for future work?
  • It would be useful to define common benchmark configurations at low resolution and high resolution. The GYRE configuration could also be useful. There is a lot of support for this idea.

A NEMO HPC benchmark already exists: it was built by Sébastien Masson and has already been given to a few computer vendors. It would probably be useful to start from this existing benchmark. It is mainly composed of two pieces: first a GYRE-global 1/12° equivalent, and second a “real ORCA12”. For the GYRE configuration we would need to check and update the physical and numerical choices in its namelist so that the up-to-date options are activated.

  • The short-term work plan for the group proposed by Sébastien Masson describes a number of good practical steps for communications between nodes that would definitely give improvements.
  • A longer term methodology / strategy is not clear. It could be based on continued analysis of bottlenecks. Larger changes to the code organisation would need very careful consideration of impacts on science users. An approach will need to be developed for nodes supporting 100 or 1000 threads; this is why OpenMP is important.
  • It would be useful at some point to write an HPC contribution to the NEMO coding rules, in order to avoid code that will degrade performance (see Gurvan’s point A below)
  • Gurvan suggests:

A) For me, the short-term target for HPC with NEMO is to be able to run the system efficiently with a local domain of 10 by 10. Currently the limit of scalability is a local domain of ~40 by 40. The gain is to be able to use 16 times more processors, and thus probably to be ~10 times faster in elapsed time (up to 16 times in theory). Reaching this target requires (at least) that all 5 tasks proposed by Seb are achieved, but also a 6th one:

6) Restrict the computation to the inner domain everywhere
Description: verify that, wherever possible, the computation is done only in the inner domain (i.e. from 2 to jpi-1 and from 2 to jpj-1). In particular, implicit DO loops using (:,:) have been introduced in many places and should be removed. NB: this is particularly true of LIM3, where most DO loops are over the full domain.
Skills: f90, good knowledge of NEMO.
Amount of work: < 4 months.
Easy to verify: results must be unchanged.
Impact for HPC: potentially significant when the local domain is small (for example, with a 10x10 local domain, full versus inner domain computation means 36% more calculation! Still 10% for a 40x40 local domain).
Impact for the users: none.

B) The performance tests should be done separately on ocean (OPA), bio, and sea-ice. This can be done by running the ocean alone (GYRE), the off-line TOP (OFF), and the Stand Alone Surface module (SAS) with sea-ice activated. Indeed, currently setting nn_components=1 for the ocean and 2 for SAS in the namsbc namelist allows an ocean-ice system (forced or coupled to the atmosphere) to be run as two separate executables, i.e. in parallel, so that the sea-ice computation is masked by the ocean one (shorter elapsed time!). Similarly, the same technique can be introduced for biogeochemistry (especially when not using the on-line coarsening of TOP). In order to make progress on this aspect, two other tasks can be defined:

  1. optimization of SAS (with ice): design the optimal domain decomposition for the ice. Ideas: (1) use a larger jpj for processors that are in the 40°N-40°S band (no sea ice, so faster computation than in icy areas); (2) change the global model domain for SAS so that there is no north-fold communication in SAS (the Pacific sector of the Arctic Ocean moves north of the Atlantic sector).
  2. introduce the possibility of running TOP in parallel with the ocean using the same technique as the one set up for SAS.

Both tasks require a good knowledge of NEMO, and of OASIS for the second.

C) If the north fold is still a blocking issue (despite its recent great improvement), it is possible to increase the overlap area just on the north fold, restricting its application to the strict minimum per time step, without increasing the size of all the halos.
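The percentages quoted in Gurvan's point A can be reproduced with a short calculation. The sketch below is an interpretation, not something stated in the notes: it assumes that "10x10" and "40x40" are the local sizes jpi x jpj including a one-point halo on each side, so that the inner domain runs from 2 to jpi-1 and from 2 to jpj-1. The function names are made up for this illustration.

```python
def halo_share(jpi, jpj, halo=1):
    """Fraction of a full-domain sweep that is spent on halo points."""
    full = jpi * jpj
    inner = (jpi - 2 * halo) * (jpj - 2 * halo)
    return (full - inner) / full


def extra_work_vs_inner(jpi, jpj, halo=1):
    """Extra work of a full-domain sweep, relative to an inner-domain sweep."""
    full = jpi * jpj
    inner = (jpi - 2 * halo) * (jpj - 2 * halo)
    return (full - inner) / inner


def subdomain_ratio(coarse, fine):
    """How many fine local domains tile one coarse local domain."""
    return (coarse[0] * coarse[1]) // (fine[0] * fine[1])


for size in (10, 40):
    print(f"{size}x{size}: halo share {halo_share(size, size):.1%}, "
          f"extra work vs inner {extra_work_vs_inner(size, size):.1%}")

print("40x40 -> 10x10 subdomain ratio:", subdomain_ratio((40, 40), (10, 10)))
```

Under this reading, 36% of the points in a 10x10 local domain lie in the halo (just under 10% for 40x40), which matches the quoted figures. Measured against the inner domain alone, the extra work is larger for the 10x10 case (about 56%). The factor of 16 in processor count is simply (40/10) squared.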

  4. What resources are available for this work?

Miguel could spend up to 50% of his time. Oriol Tinto will be doing a PhD dedicated to research on HPC optimisations.

Silvia could spend 20% of her time on HPC optimisation issues.

Tim could only spend about 10-20% of his time on these issues. Martin Schreiber has applied for a PhD student to work on NEMO optimisation issues.

Mondher could spend about 10-20% of his time on these issues.

There could be EC funding available for work of this sort.

  5. Future meetings

Mike will call another meeting in 4-6 weeks' time.

Specific agenda items:

  • Agree who will own actions identified in Sébastien Masson’s task list
  • Discuss which benchmark configurations to use for future tests
  • Further discussion of existing evidence about bottlenecks (Silvia’s roofline analysis; further results from Tim, …)
  • Others?
Last modified on 2016-03-08T18:47:55+01:00