New URL for NEMO forge!   http://forge.nemo-ocean.eu

Since March 2022, with the NEMO 4.2 release, code development has moved to a self-hosted GitLab.
This forge is now archived and remains online for historical reference.
Ocean Model Performance.txt on WorkingGroups/HPC/Mins_sub_2018_08_01 – Attachment – NEMO


File Ocean Model Performance.txt, 7.0 KB (added by trac, 6 years ago)

============

Questions:

============


1) Is there a scalability / performance / code adaptation problem for global ocean models and, if so, for which models do you consider this a problem and why?

Finite-difference ocean models are characterised by 5- or 9-point stencils, which means that communication overhead has a significant impact on model scalability (communications do not scale with computation). NEMO reaches its scalability limit at subdomain sizes below 10x10 grid points (ocean only, no sea-ice, no I/O). The limit is due both to the communication overhead and to the load imbalance caused by the north-folding operation, introduced to handle the northern boundary of the tripolar ORCA grid. Processes involved in the north folding perform additional exchanges, with a consequent load imbalance among processes. Plots of the scalability and of the effect of the north folding can be provided by CNRS and CMCC.
Moreover, the achieved performance is far from the peak performance of the HPC systems on which the models run. For NEMO, analysis at the routine level shows that many kernels are memory-bound, in particular the tracer advection schemes.
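The subdomain-size limit quoted above follows from a simple surface-to-volume argument, sketched below in illustrative Python (not NEMO code; the subdomain sizes and the unit halo width are examples):

```python
# Sketch: why halo-exchange overhead dominates on small subdomains.
# Numbers are illustrative, not NEMO measurements.

def halo_ratio(n, halo=1):
    """Ratio of halo points exchanged to interior points computed
    for an n x n subdomain with a 5-point stencil (halo width 1)."""
    interior = n * n
    # One halo strip per side; corners are not needed by a 5-point stencil.
    exchanged = 4 * halo * n
    return exchanged / interior

for n in (100, 30, 10, 5):
    print(f"{n:>3} x {n:<3} subdomain: halo/interior = {halo_ratio(n):.2f}")
```

The exchanged data shrinks only linearly with the subdomain edge while the computation shrinks quadratically, so below roughly 10x10 the exchange volume becomes a large fraction of the local work.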


2) What in the structure or in the scientific choices of ocean and sea-ice models (or other non-atmosphere components) poses particular scalability challenges?


As reported above, the communications introduced to handle the north folding are one of the main bottlenecks for NEMO. Moreover, collective communications become a computational issue as the number of processes grows with resolution. Merging global communications (to amortise the startup latency) and reducing their number should have a positive impact on scalability.
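The benefit of merging global communications can be illustrated with the standard latency/bandwidth (alpha-beta) cost model; the constants below are assumed for illustration, not measured on any machine:

```python
# Sketch: alpha-beta model of why merging several small global reductions
# into one larger reduction helps. ALPHA and BETA are assumed values.

ALPHA = 2e-6      # per-message startup latency (s), illustrative
BETA  = 1e-9      # per-byte transfer cost (s/byte), illustrative

def reduce_time(n_messages, bytes_each):
    """Cost of n_messages separate reductions under the alpha-beta model."""
    return n_messages * (ALPHA + bytes_each * BETA)

separate = reduce_time(20, 8)        # 20 scalar (8-byte) all-reduces
merged   = reduce_time(1, 20 * 8)    # one 160-byte all-reduce
print(f"separate: {separate*1e6:.1f} us, merged: {merged*1e6:.1f} us")
```

Because each small reduction pays the full startup latency, batching them pays it only once, which is exactly the "merging global communications" optimisation mentioned above.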


3) Which global ocean model developments are leading efforts on improving performance / adaptation for future HPC?

As regards NEMO, both evolutionary and revolutionary strategies are proposed to maintain the model's competitiveness. They concern:

   • improving the arithmetic intensity and the vectorisation of the main computationally intensive kernels, in order to increase the achieved performance at node level
   • reducing communications in both volume and frequency
   • limiting the I/O overhead by using dedicated servers for both reading and writing operations
   • investigating the impact of mixed precision on accuracy
   • investigating hybrid programming approaches (MPI+X, e.g. OpenMP and OpenACC) to exploit heterogeneous architectures efficiently
   • evaluating DSL approaches to improve performance portability

Moreover, new numerical schemes will be investigated and designed, such as a new time-stepping scheme to reduce the computational cost and make the model integration more robust and stable.
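The kind of accuracy question a mixed-precision study must answer can be sketched as follows (illustrative only, not a NEMO experiment; the step count and increment are hypothetical):

```python
import struct

# Sketch: accumulate a small increment many times in float64 and in
# emulated float32, as a long time integration would, to show the kind
# of drift a mixed-precision investigation has to quantify.

def to_f32(x):
    """Round a Python float (binary64) to the nearest IEEE-754 binary32."""
    return struct.unpack('f', struct.pack('f', x))[0]

n, dx = 1_000_000, 1e-4       # hypothetical step count and increment
exact = n * dx                # 100.0
dx32 = to_f32(dx)             # increment as stored in single precision

acc64 = 0.0
acc32 = 0.0
for _ in range(n):
    acc64 += dx
    acc32 = to_f32(acc32 + dx32)

print(f"float64 error: {abs(acc64 - exact):.3e}")
print(f"float32 error: {abs(acc32 - exact):.3e}")
```

The single-precision accumulator drifts both from the representation error of the increment and from the rounding at every addition; deciding which model variables tolerate this is the core of the mixed-precision work item above.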


4) Which global ocean models are able to run eddy-resolving (say 1/36° or finer) global problems? State also the number of vertical levels.


Nowadays, the main ocean models support a horizontal global resolution of ~10 km. NEMO's highest-resolution simulations are usually at 1/12°. CMCC has developed an eddy-resolving global configuration at 1/16° (ocean + sea-ice) with 98 vertical levels. The target for the near future (three years) is the development of a 1/36° configuration.


5) What computational resources are required to achieve this? State computational performance if possible (ocean only, forecast days / day; number of cores; programming model [e.g. MPI/OpenMP hybrid]; state the MPI task / thread ratio if applicable).

Plots of NEMO scalability on the BENCH-1, BENCH-025 and BENCH-12 configurations can be provided for different machines. Tests have been performed with the pure-MPI model. The development of the hybrid version is still under investigation because of its limited improvement over the pure-MPI one (the fastest time-to-solution is usually achieved with 2 or 4 threads per MPI task).
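For completeness, the "forecast days per day" metric requested in the question is a straightforward conversion from the measured wall-clock time per model step; the step time and model time step below are hypothetical examples, not NEMO measurements:

```python
# Sketch: converting a measured time per model step into the
# "forecast days per wall-clock day" throughput metric.

def forecast_days_per_day(step_seconds, dt_model_seconds):
    """Simulated days produced per wall-clock day of computation."""
    steps_per_wallclock_day = 86400.0 / step_seconds
    simulated_seconds = steps_per_wallclock_day * dt_model_seconds
    return simulated_seconds / 86400.0

# e.g. 0.5 s wall-clock per step with a 360 s model time step (assumed)
print(forecast_days_per_day(0.5, 360.0))   # -> 720.0
```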


6) What do you consider the most scalable ocean model today, e.g. providing the fastest time-to-solution or the best cost-benefit ratio for the global ocean/sea-ice problem? Is this view based on an actual intercomparison of computational performance?

Comparing two or more models is not straightforward: it requires defining the scientific problem in terms of resolution and complexity, and analysing performance on a set of target architectures. It is also important to decide whether resources are to be used in capacity or capability mode. We can say that the main ocean models (HYCOM, MOM, POP, MPAS-O, NEMO) currently target a horizontal resolution of 10 km, with the goal of reaching 2-4 km in the near future. Performance analysis shows good scalability of these models up to on the order of 10^4 cores for high-resolution configurations, usually measured without considering I/O.


7) What do you think should be or is already being done to improve the performance of global ocean models?

If the goal is to preserve the same performance (in terms of the subdomain size at which the scalability limit is reached) while increasing model resolution, the main challenges concern I/O management, memory scalability and communication traffic.
On the other hand, if the goal is to overcome the current scalability limit, evolutionary optimisations offer limited room for improvement and the design of new algorithms is essential.


8) The very important aspect of coupling is considered separately, but if you have any comments on performance / best practice relevant to the computational performance of coupled simulations that you consider important, please state them.

One of the main levers for improving coupled model performance is balancing computing resources among the components in order to minimise idle time. Defining the best coupled configuration requires knowing the scalability curve of each component on the target machine. This work can be very complex because of the number of system components. Moreover, the performance of each component depends on the target architecture, which makes it difficult to establish general rules.
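The resource-balancing idea can be sketched as a small search over core splits, given scalability curves for two concurrently running components; the toy scaling model, costs and core counts below are all assumed for illustration:

```python
# Sketch: pick the core split between two coupled components (run
# concurrently) that minimises the coupled step time, i.e. the time of
# the slower component; the other component idles for the difference.

def runtime(serial_cost, cores, efficiency=0.9):
    """Toy strong-scaling model: time on `cores` with imperfect scaling."""
    return serial_cost / (cores * efficiency)

def best_split(total_cores, cost_a, cost_b):
    best = None
    for cores_a in range(1, total_cores):
        cores_b = total_cores - cores_a
        t_a, t_b = runtime(cost_a, cores_a), runtime(cost_b, cores_b)
        step_time = max(t_a, t_b)     # components advance together
        idle = abs(t_a - t_b)
        if best is None or step_time < best[0]:
            best = (step_time, cores_a, cores_b, idle)
    return best

step, a, b, idle = best_split(1000, cost_a=3000.0, cost_b=1000.0)
print(f"component A: {a} cores, component B: {b} cores, idle: {idle:.3f} s")
```

In practice the scalability curves are measured, not modelled, and are architecture dependent, which is why no general rule can be given.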


========================

Questions/remarks from Nils

========================

You mention global communications but do not state where they are needed in NEMO, or whether this can be changed.

What factor of improvement / speed-up do you expect from the listed changes, such as vectorisation (given that OpenMP is also not showing improvement)? What is your performance goal, and when would you be happy (e.g. 1 year per day at 2 km resolution)? What is maximally achievable without algorithmic or structural change?

What do you realistically expect in terms of scalability / time-to-solution from the new time-stepping scheme?

If the tripolar grid is a major bottleneck for scalability, what are the advantages of keeping it, and what is being done to address this bottleneck? Other models may be considering moving to this grid.

Comparability between models is always difficult, but given a goal we can measure how close we are to it. Comparing 1/12° performance across a range of ocean models would be a start: so far there are very few numbers in the reply, and it would be useful to have some recent time-to-solution figures at 1/12°.