'''NEMO HPC subgroup: Mon 09 Jan 2017'''

Attending: Claire Levy (CNRS), Mike Bell (Met Office), Tim Graham (Met Office), Andy Porter (STFC), Miguel Castrillo (BSC), Oriol Tinto (BSC), Mario Acosta (BSC), Martin Schreiber (Uniexe), Cyril Mazauric (Bull), Silvia Mocavero (CMCC)  



== 1. Actions from previous meetings ==
 
== 1.1  NEMO WP2017 HPC actions ==

The actions to be included in the workplan have to be defined taking into account the available expertise and the time available to carry them out. Claire asks for a list of the actions we would like to carry out in 2017 but do not have the resources for. The list has to be finalized before the discussion with the Steering Committee.

'''Action''': Tim to list the actions and to discuss them with Silvia by email.

== 1.2  FLOPS overcounting on Intel architectures ==

Silvia has downloaded Andy's parser and tested it on the Intel Sandy Bridge architecture.

'''Action''': All to test Andy's parser (on at least some NEMO kernels) before the next meeting.

== 1.3  Integration NEMO/perf_regions tool ==

The discussion on the best and simplest way to profile performance counters on nested regions has not started yet. Tim suggests solving the problem by changing the C code of the tool so that it handles nested regions, rather than modifying the NEMO parser to avoid them.

'''Action''': Martin to investigate whether the C code can be changed to handle performance counters in nested regions and to discuss the solution with Tim and Silvia.
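
The nested-region problem above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the perf_regions C code or its API: the class `RegionProfiler`, its methods and the region labels are hypothetical, and wall-clock time stands in for hardware performance counters. The idea is to keep a stack of open regions so that each region's measurement can be reported both including and excluding its nested children.

```python
import time


class RegionProfiler:
    """Toy profiler (hypothetical) that tolerates nested regions via a stack."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._stack = []     # open regions: [name, start value, time spent in children]
        self.inclusive = {}  # name -> total including nested regions
        self.exclusive = {}  # name -> total excluding nested regions

    def start(self, name):
        self._stack.append([name, self._clock(), 0.0])

    def stop(self, name):
        top_name, start, child_time = self._stack.pop()
        if top_name != name:
            raise RuntimeError(f"region mismatch: open {top_name!r}, closing {name!r}")
        elapsed = self._clock() - start
        self.inclusive[name] = self.inclusive.get(name, 0.0) + elapsed
        self.exclusive[name] = self.exclusive.get(name, 0.0) + elapsed - child_time
        if self._stack:
            # Charge this region's elapsed value to its parent as child time.
            self._stack[-1][2] += elapsed


if __name__ == "__main__":
    prof = RegionProfiler()
    prof.start("outer")          # illustrative region labels
    prof.start("inner")
    sum(range(100_000))          # some work inside the nested region
    prof.stop("inner")
    prof.stop("outer")
    print(prof.inclusive, prof.exclusive)
```

With hardware counters the same bookkeeping would apply to counter values read at region entry and exit, assuming the counters keep running while nested regions are open.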

== 1.4  Perf_regions documentation ==

Andy and Martin have started discussing the integration of POSIX timers.

'''Action''': Andy and Martin to resolve the problem with the integration of POSIX timers.

== 1.5  Single-core performance test plan: Tim to add his comments and to share the document with the other subgroup members (Done) ==

== 1.6  Updates on memory leaks ticket ==

Silvia has performed some tests on the 3.6_stable version, with and without XIOS, and has shared the outputs of the Allinea memory profiler with the subgroup. The tests have been performed with XIOS1. Tim suggests testing XIOS2 as well.

'''Action''': Silvia to test NEMO-XIOS2 with the Allinea tool.

== 2. NEMO optimization from BULL ==
 
Cyril has presented his work on NEMO vectorization, cache blocking, gathering of communications and reduction in the number of mask variables. Cache blocking has been applied to only some loops of the NEMO code, which explains the limited overall improvement (~1%). The impact of the optimization has to be evaluated by restricting the analysis to the modified code. The numerical results have been checked by comparing the solver.stat files; the analysis should be extended to the restart files.

The activity on gathering communications could be included in the HPC-1 shared action of the WP2017. The work done on vectorization and cache blocking could be linked to the single-core performance analysis and improvement action (HPC-4).

It is important to preserve the readability and maintainability of the code. Martin suggested that, if the cache blocking works well, it might be possible to implement it with a precompiler script so that code readability is maintained. Claire suggested updating the coding rules so that developers maintain the code optimizations.

'''Action''': Cyril to test the cache blocking impact on the modified loops; Cyril to test the impact of the modifications on the restart files (at numerical level); Cyril and Miguel to discuss the integration of Cyril's work within the gathering communications activity.
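
As an illustration of the cache blocking transformation discussed above, the following sketch compares a plain loop nest with a blocked (tiled) one. It is not Cyril's code: the 5-point smoothing kernel, the function names and the block sizes `bj`/`bi` are assumptions chosen for the example, and Python is used only to show the loop structure (no speed-up should be expected from it). Because each point is computed with exactly the same expression, the two versions give identical results, which is the kind of numerical check mentioned for solver.stat and the restart files.

```python
def smooth_plain(field, nj, ni):
    """Reference sweep: 5-point average over interior points."""
    out = [row[:] for row in field]
    for j in range(1, nj - 1):
        for i in range(1, ni - 1):
            out[j][i] = 0.2 * (field[j][i] + field[j - 1][i] + field[j + 1][i]
                               + field[j][i - 1] + field[j][i + 1])
    return out


def smooth_blocked(field, nj, ni, bj=8, bi=8):
    """Same computation, visiting the interior in bj x bi blocks."""
    out = [row[:] for row in field]
    for jb in range(1, nj - 1, bj):          # loop over blocks
        for ib in range(1, ni - 1, bi):
            for j in range(jb, min(jb + bj, nj - 1)):   # loop inside one block
                for i in range(ib, min(ib + bi, ni - 1)):
                    out[j][i] = 0.2 * (field[j][i] + field[j - 1][i] + field[j + 1][i]
                                       + field[j][i - 1] + field[j][i + 1])
    return out


if __name__ == "__main__":
    nj, ni = 20, 30
    field = [[float((3 * j + 7 * i) % 11) for i in range(ni)] for j in range(nj)]
    same = smooth_blocked(field, nj, ni) == smooth_plain(field, nj, ni)
    print("blocked result identical to reference:", same)
```

In a compiled code the block sizes would be tuned so that the working set of one block fits in cache; generating the blocked loops with a precompiler script, as Martin suggested, would keep the source close to the plain version.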
 

== 3. Paper on the perf_regions tool ==

Martin has shared a draft of the paper (http://www.martin-schreiber.info/pub/tmp/perf_regions_paper_sketch_ver1.pdf). The editing process has been postponed until the work on the perf_regions tool has been completed and tested on the NEMO code.


== 4. Next meeting call ==
 
The next meeting will be held in the last two weeks of February.

'''Action''': Silvia to send the Doodle poll for the next meeting.

== 5. AOB ==
 
The BSC team has worked on the single-precision execution of NEMO, using the ORCA-LIM configuration. The sea-ice component has been run in double precision. Running ORCA in single precision has given an improvement of ~40%. There are some differences in the outputs that have to be analyzed.

'''Action''': BSC team to check the output differences and to present the work at the next meeting.
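
The kind of output difference to be analyzed can be illustrated with a small sketch. It is not the BSC team's method: the helper names and the input values are made up, and single precision is emulated by rounding to 32-bit floats with the standard `struct` module. It accumulates the same series in emulated single and in double precision and reports the relative difference, a simple metric that could be used when comparing outputs.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, single):
    """Sum values, rounding every intermediate result to float32 if single is True."""
    total = 0.0
    for v in values:
        if single:
            total = to_f32(total + to_f32(v))
        else:
            total += v
    return total


def relative_difference(a, b):
    """Relative difference of a with respect to the reference b."""
    return abs(a - b) / abs(b)


if __name__ == "__main__":
    values = [0.1] * 10000   # made-up input, for illustration only
    double_sum = accumulate(values, single=False)
    single_sum = accumulate(values, single=True)
    print("double:", double_sum)
    print("single:", single_sum)
    print("relative difference:", relative_difference(single_sum, double_sum))
```

Differences of this kind grow with the number of accumulated operations, so small deviations between single- and double-precision outputs are expected and have to be judged against the accuracy required of the results.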