Version 2 (modified by mocavero, 3 years ago)

NEMO HPC subgroup: Mon 21 Apr 2017

Attending: Claire Levy (CNRS), Mike Bell (Met Office), Tim Graham (Met Office), Miroslaw Andrejczuk (Met Office), Matthew Glover (Met Office), Andy Porter (STFC), Miguel Castrillo (BSC), Oriol Tinto (BSC), Martin Schreiber (Uniexe), Cyril Mazauric (ATOS), Silvia Mocavero (CMCC)

1. Actions from previous meetings

1.1 Integration NEMO/perf_regions tool: Tim to share the Python script by including it in the perf_regions repository. People involved to provide the first analysis before the meeting in Barcelona

Tim measured the performance counters available on the MetO system and provided raw data. Martin extracted the L3 cache misses, and some preliminary results were shown at the Barcelona meeting. Martin: the performance counters are not guaranteed to be implemented correctly on every machine (e.g. FLOPS overcounting; cache misses higher than cache accesses, possibly due to the concurrent execution of several instances; mismatch between the measured bandwidth and the data transfer derived from the measured cache misses). Even if the absolute values of the counters are not reliable, PAPI counters can still indicate the impact of code modifications (e.g. an improved cache hit rate due to cache blocking). The analysis should start from the execution of a single instance to avoid the effects of concurrency.
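The consistency checks Martin describes can be sketched as follows. This is a minimal illustration only, not part of the perf_regions tool: the function names, the 64-byte cache line size and the 50% tolerance are assumptions chosen for the example, and the counter values would come from whatever the measurement tool reports.

```python
def check_counters(accesses, misses, measured_bytes, line_size=64, tolerance=0.5):
    """Return a list of warnings for implausible counter readings.

    accesses, misses : cache access and miss counts for one region
    measured_bytes   : memory traffic measured independently, in bytes
    line_size        : assumed cache line size in bytes (hypothetical default)
    tolerance        : assumed maximum relative mismatch before warning
    """
    warnings = []
    # A cache level cannot miss more often than it is accessed.
    if misses > accesses:
        warnings.append("cache misses exceed cache accesses")
    # Each miss should move roughly one cache line from the next level.
    derived_bytes = misses * line_size
    if measured_bytes > 0:
        mismatch = abs(derived_bytes - measured_bytes) / measured_bytes
        if mismatch > tolerance:
            warnings.append("traffic derived from misses disagrees with measured traffic")
    return warnings


def hit_rate(accesses, misses):
    """Cache hit rate in [0, 1], or None when the counters are inconsistent."""
    if accesses <= 0 or misses > accesses:
        return None
    return 1.0 - misses / accesses
```

Comparing hit_rate before and after a code change (e.g. cache blocking) on the same machine gives a relative indication even when the absolute counter values cannot be trusted.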

Mike suggests having some results of the initial analysis ready for the next meeting, to give a clear overview and to speed up the activity. Mike also suggests that a comparable analysis with the Extrae/Paraver tool could be useful.

Actions: Tim to upload the script to the perf_regions repository and the results obtained on the MetO system to Dropbox; Tim and Silvia to test the single-instance execution on the MetO and CMCC systems and to share the results; Cyril to integrate cache blocking into the benchmark branch to evaluate the improvement in the cache hit rate; Miguel and Oriol to perform the same analysis with Extrae/Paraver

1.2 Updates on memory leaks ticket: Silvia to test NEMO-XIOS2 with the Allinea tool; Silvia to analyze the code modifications between the two versions

The test on an old revision (6287) of the NEMO 3.6 stable code showed memory leaks when XIOS1 was used. Following Tim’s suggestion, Silvia analyzed the behavior of the code with XIOS2. In the meantime, the 3.6 code has been updated to revision 7654, and the new revision seems to behave differently: no memory leaks occurred, either with XIOS (1 or 2) or without it. The analysis of the main differences between the two revisions is ongoing.

Action: Silvia to finalize the analysis of the code modifications between the two revisions

1.3 NEMO optimization from BULL: Cyril to test the cache blocking impact on the modified loops; Cyril to test the impact of modifications on restart files (at numerical level); Cyril and Miguel to discuss the integration of Cyril's work within the gathering communications activity

Cyril mainly worked on vectorization, with a good improvement and a low impact on restart files. The improvement is smaller on KNL nodes. Silvia suggests exploring the new feature of the Intel Analyzer that provides the Roofline model analysis, to better understand the main limits. All agree that changes to the restart files are a key issue and that it is not easy to identify which code modification is responsible for the changes in results. Interaction with the scientists is needed to address this issue.
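For reference, the Roofline model bounds the attainable performance of a kernel by the minimum of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity (flops per byte moved). A minimal sketch follows; the function names are hypothetical and the numbers in the usage lines are placeholders, not measurements from any system discussed here.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    peak_gflops   : peak floating-point rate of the node, in GFLOP/s
    bandwidth_gbs : sustained memory bandwidth, in GB/s
    intensity     : arithmetic intensity of the kernel, in flops per byte
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_memory_bound(peak_gflops, bandwidth_gbs, intensity):
    """True when the bandwidth roof lies below the compute roof for this kernel."""
    return bandwidth_gbs * intensity < peak_gflops


# Placeholder machine: 1000 GFLOP/s peak, 100 GB/s bandwidth (illustrative only).
low_intensity_bound = attainable_gflops(1000.0, 100.0, 0.5)
high_intensity_bound = attainable_gflops(1000.0, 100.0, 20.0)
```

A kernel that sits on the bandwidth roof gains little from further vectorization alone, which is one possible reading of a limited improvement on a given node type.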

Action: Cyril to continue to work on the analysis and improvement of performance (i.e. vectorization) on KNL

1.4 Single Precision: Oriol to share detailed numbers on performance improvement; to test the improvement turning off vectorization; to test different precisions by using the Oxford emulator

Oriol started the analysis with the Oxford emulator (a library that allows the precision to be changed by truncating the number of significant bits): the simulations are short because the emulator increases the execution time, diagnostics are used for the accuracy tests, and code parts are activated/deactivated by changing the keys in the namelist. Evaluating the impact on results is the key point (as for vectorization). Miroslaw suggests changing the declaration of the variables directly to test different precisions. Andy reports on a presentation by Tim Palmer on the ECMWF code in which the accuracy/precision issue is addressed. Mike suggests contacting the Oxford people to exchange information on this topic.
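The idea of emulating reduced precision by truncating significant bits can be illustrated with a short sketch. This is not the Oxford emulator's API or implementation: the function name is hypothetical, and the sketch simply zeroes the low-order significand bits of an IEEE 754 double (truncation, with no rounding).

```python
import struct


def truncate_significand(x, bits):
    """Keep only the top `bits` explicit significand bits of a float64.

    bits must be between 0 and 52; the remaining low-order bits are set to
    zero, so the result is truncated toward zero rather than rounded.
    """
    if not 0 <= bits <= 52:
        raise ValueError("bits must be between 0 and 52")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    # Mask that clears the (52 - bits) least significant significand bits.
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (result,) = struct.unpack("<d", struct.pack("<Q", raw & mask))
    return result


# 1.75 is 1.11 in binary; keeping one fraction bit leaves 1.1 in binary.
reduced = truncate_significand(1.75, 1)  # 1.5
```

Keeping 23 bits gives roughly single-precision resolution while the exponent range stays that of a double, which is one reason an emulated run is not identical to a true single-precision build.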

Actions: Miroslaw to provide an example of code that changes the precision through the variable declarations; Andy to share Tim Palmer's presentation

1.5 Hybrid parallelization status: Silvia to discuss with Andy and Martin about the OpenMP approach used in NEMO; the need to combine the model developments and HPC optimization strategies will be addressed during the Enlarged Developer’s Committee meeting

Silvia started developing fine-grain and coarse-grain versions of a couple of kernels. A development branch can be created from the NEMO trunk, and the different versions of the kernels can be integrated into it in order to test them on the different systems and to evaluate the complexity of the code changes. Andy, for his part, developed a fine-grain version of the advection scheme kernel using the automatic PSyclone-like approach. This version could also be integrated, to compare the two development approaches. Tim asks whether coarse-grain parallelization can be supported by the PSyclone-like approach. Andy: it might not be easy. The limited gain of the fine-grain approach is due to thread synchronization. The coarse-grain approach should improve the gain, since thread control is moved to a higher level. A fine-grain approach could be improved by avoiding thread synchronization where it is not needed. Silvia: the first step should be to understand which approach is the most convenient for NEMO hybridization by testing the different manually implemented versions, also considering whether an automatic approach can support the same implementation. Claire: the experience of some development teams with DYNAMICO and CROCO (info from Rachid Benshila) shows that choosing a coarse-grain implementation affects not only the computational development but also the natural science; the suggestion is therefore to start with a small kernel and to discuss the strategy with the NEMO ST.

Action: Silvia to provide one or two kernels implemented with both the fine-grain and coarse-grain approaches and to discuss the impact of these approaches with the ST

2. Outcomes from the Enlarged Developer’s Committee meeting

Mike reports the main outcomes of the meeting: the need to find a way of writing the NEMO code that both supports readability and gives good HPC performance (portable across different architectures); in the short term (6 months), the need to make some progress, starting from the analysis with the perf_regions tool, and to propose some examples of optimizations applied to some strategic kernels, to be discussed with the ST. Tim reports that natural and computational scientists see readability differently: natural scientists prefer a few long subroutines, while computational scientists prefer many small subroutines. Silvia reports the scientists' comment on the PSyclone-like approach: the introduction of the algorithm layer increases the complexity of the code and decreases its readability. Andy comments that some code changes are needed if we want to improve performance and its portability; the need to limit the code changes makes the work of the computational scientists harder. Silvia comments that a compromise is needed between the natural scientists' requirements and the code changes that allow computational scientists to work on performance improvement.

3. Next meeting call

Next meeting will be in the second half of May

Action: Silvia to send the doodle poll for the next meeting.

4. AOB

Tim reports on two funding opportunities: the first from PRACE (BSC people could apply to fund their work on NEMO), the second for UK people who have access to the UK facilities (Andy could apply to work on the halo size extension).