WorkingGroups/HPC/Mins_sub_2017_04_21 (diff) – NEMO

Changes between Version 1 and Version 2 of WorkingGroups/HPC/Mins_sub_2017_04_21


Timestamp:
2017-04-24T13:30:21+02:00
Author:
mocavero

  • WorkingGroups/HPC/Mins_sub_2017_04_21

    v1 v2  
    1 '''NEMO HPC subgroup: Mon 27 Feb 2017''' 
     1'''NEMO HPC subgroup: Fri 21 Apr 2017''' 
    22 
    3 Attending: Claire Levy (CNRS), Mike Bell (Met Office), Tim Graham (Met Office), Miroslaw Andrejczuk (Met Office), Matthew Glover (Met Office), Andy Porter (STFC), Miguel Castrillo (BSC), Oriol Tinto (BSC), Martin Schreiber (Uniexe), Silvia Mocavero (CMCC)   
     3Attending: Claire Levy (CNRS), Mike Bell (Met Office), Tim Graham (Met Office), Miroslaw Andrejczuk (Met Office), Matthew Glover (Met Office), Andy Porter (STFC), Miguel Castrillo (BSC), Oriol Tinto (BSC), Martin Schreiber (Uniexe), Cyril Mazauric (ATOS), Silvia Mocavero (CMCC)   
    44 
    55 
    66== 1.   Actions from previous meetings == 
    77  
    8 == 1.1  NEMO WP2017 == 
     8== 1.1  Integration NEMO/perf_regions tool: Tim to share the python script by including it in the perf_regions repository. Involved people to provide the first analysis before the meeting in Barcelona == 
    99 
    10 Claire asked for a list of the actions we would like to do in 2017 but don't have resource to do (done) 
     10Tim measured the performance counters available on the MetO system and provided raw data. Martin extracted the L3 cache misses and some preliminary results have been shown during the Barcelona meeting. 
     11Martin: correct implementation of the performance counters is not guaranteed across machines (e.g. FLOPS overcounting, cache misses exceeding cache accesses (possibly due to concurrent execution of several instances), and a mismatch between the measured bandwidth and the data transfer derived from the measured cache misses). Even if the absolute counter values cannot be trusted, PAPI counters can still indicate the impact of code modifications (e.g. an improved cache hit rate due to cache blocking). The analysis should start from a single-instance run to avoid concurrency effects. 
    1112 
    12 Feedback from the Steering Committee on the need to increase the manpower for the HPC work. The next version of the development strategy document should be written so that it can easily support the submission of a new project; on the other hand, HPC activities can be funded in the long term as part of the European Infrastructure projects (e.g. IS-ENES, led by Sylvie Joussaume). 
     13Mike suggests having some results of the initial analysis ready for the next meeting, to give a clear overview and speed up the activity. 
     14Mike suggests that a comparable analysis with the Extrae/Paraver tool could be useful. 
     15 
     16'''Actions''': Tim to upload the script to the perf_regions repository and the results obtained on the MetO system to Dropbox; Tim and Silvia to test the single-instance execution on the MetO and CMCC systems and to share results; Cyril to integrate cache blocking on the benchmark branch to evaluate the improvement of the cache hit rate; Miguel and Oriol to perform the same analysis with Extrae/Paraver 
    1317 
    1418 
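The consistency concern raised under 1.1 (cache misses reported as higher than cache accesses) can be screened for before any counter-based comparison. The following is a minimal Python sketch, assuming the raw access and miss counts have already been extracted from the counter output. The function names `cache_hit_rate` and `hit_rate_gain` and all numbers are hypothetical; they are not part of perf_regions or PAPI.

```python
from typing import Optional


def cache_hit_rate(accesses: int, misses: int) -> Optional[float]:
    """Return the hit rate in [0, 1], or None if the counters are inconsistent."""
    # Misses above accesses (or non-positive accesses) indicate unreliable counters.
    if accesses <= 0 or misses < 0 or misses > accesses:
        return None
    return 1.0 - misses / accesses


def hit_rate_gain(before: Optional[float], after: Optional[float]) -> Optional[float]:
    """Difference in hit rate between two runs; None if either run is unusable."""
    if before is None or after is None:
        return None
    return after - before


# Made-up counts for a baseline run and a run with cache blocking.
baseline = cache_hit_rate(accesses=1000, misses=250)
blocked = cache_hit_rate(accesses=1000, misses=125)
print(baseline, blocked, hit_rate_gain(baseline, blocked))

# An inconsistent measurement is rejected rather than reported.
print(cache_hit_rate(accesses=100, misses=150))
```

Comparing relative changes between two runs on the same machine, as sketched here, matches the suggestion to use the counters as indicators rather than as absolute values.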
    15 == 1.2  FLOPS over counting on Intel architectures: all to test Andy's parser (at least some NEMO kernels) before the next meeting == 
     19== 1.2  Updates on memory leaks ticket: Silvia to test NEMO-XIOS2 with the Allinea tool: Silvia to analyze the code modifications between the two versions == 
    1620 
    17 No progress on this point. 
     21The test on an old revision (6287) of the NEMO3.6 stable code showed the presence of memory leaks when XIOS1 was used. Following Tim’s suggestion, Silvia analyzed the behavior of the code with XIOS2. In the meantime, the 3.6 code has been updated to revision 7654, and the behavior of the new version in terms of memory leaks appears to be different: no memory leaks occurred either with XIOS (both 1 and 2) or without it. The analysis of the main differences between the two revisions is ongoing. 
    1822 
    19 == 1.3  Integration NEMO/perf_regions tool == 
     23'''Action''': Silvia to finalize the analysis of the code modifications between the two revisions 
    2024 
    21 Martin has solved the problem of the accuracy which affected the original timing computation in NEMO when nested regions were measured. Moreover, performance counters are now handled also in nested regions. Tim has provided the outputs of a first analysis of GYRE with the perf_regions tool (compiled as static library) and the analysis of the outputs is going on. The main problems to be addressed to perform the roofline analysis are the well known FLOPS over counting and the bandwidth measurement. A first analysis should be completed in two weeks. 
    22 Tim has developed a python script to extract data on performance counters in a more readable way. 
     25== 1.3  NEMO optimization from BULL: Cyril to test the cache blocking impact on the modified loops; Cyril to test the impact of modifications on restart files (at numerical level); Cyril and Miguel to discuss the integration of Cyril's work within the gathering communications activity == 
     26 
     27Cyril mainly worked on vectorization with a good improvement and a low impact on restart files. The improvement is not so good on KNL nodes. 
     28                 
     29Silvia suggests exploring the new feature of the Intel Analyzer that provides Roofline model analysis, to better understand the main limits. 
     30All agree that changes in the restart files are a key issue and that it is not easy to understand which code modification is responsible for the changes in results. Interaction with scientists is needed to address this issue. 
     31 
     32'''Action''': Cyril to continue to work on the analysis and improvement of performance (i.e. vectorization) on KNL 
    2333 
    2434 
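For reference, the Roofline model mentioned under 1.3 bounds the attainable performance of a kernel by the smaller of the machine's peak floating-point rate and the product of the kernel's arithmetic intensity and the memory bandwidth. The following is a minimal Python sketch of that bound. The function name `attainable_gflops` and all numbers are hypothetical and are not measurements from any NEMO run or from any Intel tool.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops: float, bytes_moved: float) -> float:
    """Roofline bound: min(peak, arithmetic intensity * memory bandwidth)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    # Arithmetic intensity in FLOP per byte moved to/from memory.
    intensity = flops / bytes_moved
    return min(peak_gflops, intensity * bandwidth_gbs)


# Made-up machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
# A kernel doing 1 FLOP per 8 bytes is limited by memory bandwidth.
print(attainable_gflops(100.0, 50.0, flops=1.0, bytes_moved=8.0))

# A kernel doing 10 FLOP per byte is limited by the peak floating-point rate.
print(attainable_gflops(100.0, 50.0, flops=10.0, bytes_moved=1.0))
```

Because the bound depends on measured FLOP counts and memory traffic, it inherits any inaccuracy of the underlying performance counters.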
    25 '''Action''': Tim to share the python script by including it in the perf_regions repository. Involved people to provide the first analysis before the meeting in Barcelona. 
     35== 1.4  Single Precision: Oriol to share detailed numbers on performance improvement; to test the improvement turning off vectorization; to test different precisions by using the Oxford emulator == 
     36 
     37Oriol started the analysis using the Oxford emulator (a library that allows the precision to be changed by truncating the number of significant bits): the simulations are short because the emulator increases the execution time, diagnostics are used for the accuracy tests, and parts of the code are activated/deactivated by changing the keys in the namelist. The evaluation of the impact on results is the key point (as for vectorization). 
     38                 
     39Miroslaw suggests changing the variable declarations directly to test different precisions. 
     40Andy reports on a presentation by Tim Palmer on the ECMWF code, in which the accuracy/precision issue is addressed. 
     41Mike suggests contacting the Oxford people to exchange information on this topic. 
     42 
     43'''Actions''': Miroslaw to provide a code example that changes the precision through the variable declarations; Andy to share Tim Palmer's presentation 
    2644 
    2745 
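The idea of emulating reduced precision by truncating the number of significant bits, as described under 1.4, can be illustrated on a single double-precision value. The following is a minimal Python sketch, assuming plain truncation (no rounding) of the IEEE 754 binary64 significand. It is not the Oxford emulator, and the function name `truncate_significand` is hypothetical.

```python
import struct


def truncate_significand(x: float, keep_bits: int) -> float:
    """Keep only the top `keep_bits` of the 52 explicit significand bits of a float64."""
    if not 0 <= keep_bits <= 52:
        raise ValueError("keep_bits must be between 0 and 52")
    # Reinterpret the float64 as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - keep_bits
    # Zero the lowest `drop` significand bits; sign and exponent are untouched.
    mask = ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y


# 23 explicit significand bits correspond to single-precision resolution
# (the exponent range is still that of double precision in this sketch).
value = 0.1
reduced = truncate_significand(value, 23)
print(value, reduced, abs(value - reduced))
```

Applying such a function after each arithmetic operation is what makes an emulated run slower than a native one, which is consistent with the short simulations reported above.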
    28 == 1.4  Perf_regions documentation == 
     46== 1.5  Hybrid parallelization status: Silvia to discuss the OpenMP approach used in NEMO with Andy and Martin; the need to combine the model developments and HPC optimization strategies will be addressed during the Enlarged Developer’s Committee meeting == 
    2947 
    30 Andy has integrated the POSIX timers which improve the measurement accuracy on short runtime. 
    31  
    32 == 1.5  Updates on memory leaks ticket == 
    33  
    34 Silvia has analyzed the behavior of NEMO with XIOS2 and has sent the document with analysis outputs to the group. The analysis has been carried out after updating the code to the last revision of the 3.6 stable version and shows that the execution is not affected by memory leaks. The analysis has been extended also to the same revision of the code executed without XIOS and with XIOS1 and these last tests have confirmed the results achieved with XIOS2. A detailed analysis is needed to understand the changes between the two revisions of the 3.6 stable in order to better understand the different behavior. 
     48Silvia started to develop fine-grain and coarse-grain versions of a couple of kernels. A development branch can be created from the NEMO trunk and the different versions of the kernels can be integrated there, in order to test them on the different systems and to evaluate the complexity of the code changes. In parallel, Andy developed a fine-grain version of the advection scheme kernel using the automatic psyclone-like approach. This version could also be integrated to allow a comparison between the two development approaches. 
     49                 
     50Tim asks whether coarse-grain parallelization can be supported by the psyclone-like approach. 
     51Andy: it may not be easy. The limited gain of the fine-grain approach is due to thread synchronization. The coarse-grain approach should improve the gain, since thread control is moved to a higher level. A fine-grain approach could be improved by avoiding thread synchronization where it is not needed. 
     52Silvia: the first step should be to understand which approach is the most convenient for the NEMO hybridization by testing the different manually implemented versions, also considering whether an automatic approach can support the same implementation. 
     53Claire: the experience of some development teams on DYNAMICO and on CROCO (info from Rachid Benshila) shows that the choice of a coarse-grain implementation affects not only the computational development but also the natural science; the suggestion is therefore to start with a small kernel and to discuss the strategy with the NEMO ST 
    3554 
    3655 
    37 '''Action''': Silvia to analyze the code modifications between the two versions. 
     56'''Action''': Silvia to provide one or two kernels implemented with both the fine-grain and coarse-grain approaches and to discuss the impact of these approaches with the ST 
    3857 
    3958 
    40 == 2.   Presentation on Single Precision (Oriol) == 
     59== 2.   Outcomes from the Enlarged Developer’s Committee meeting == 
    4160  
    42 Oriol has presented the outcomes of the analysis performed on NEMO by running the code with mixed precision. The performance improvement (~40% of SYPD on 256 cores) and the difference on the outputs (~1° on the SST) are reported in the presentation. 
     61Mike reports the main outcomes from the meeting: the need to find a way of writing the NEMO code that both supports readability and gives good HPC performance (portable across different architectures); in the short term (6 months), the need to make progress starting from the analysis with the perf_regions tool and to propose some examples of optimizations applied to strategic kernels, to be discussed with the ST. 
     62Tim reports that natural and computational scientists view readability differently: natural scientists prefer a few long subroutines, while computational scientists prefer many small subroutines. 
     63Silvia reports the scientists' comment on the psyclone-like approach: the introduction of the algorithm layer increases the complexity of the code and decreases its readability 
     64Andy comments that some code changes are needed if we want to improve performance and its portability. The need to limit code changes makes the work of the computational scientists harder 
     65Silvia comments that a compromise is needed between the requirements of the natural scientists and the code changes that computational scientists need in order to work on performance improvement 
     66  
    4367 
    44 Miroslaw suggests testing the improvement with vectorization turned off. 
    45 Martin comments that the improvement achieved by running IFS in reduced precision is due to the normalization of the coefficients to 1; we may not be able to achieve the same results in NEMO. 
    46 Claire highlights the importance of looking at this kind of work given the potential gain. However, the precision needs of the NEMO community can vary widely, since there is a variety of applications. 
    47 Tim suggests providing different kinds of variables in NEMO so that different levels of precision can be set. 
    48 Oriol would like to use an emulator from Oxford to study the behavior of NEMO with different precisions. 
    49 Mike comments that extra precision could be important when increments are accumulated. 
     68== 3.   Next meeting call   == 
     69  
     70Next meeting will be in the second half of May 
    5071 
    5172 
    52 '''Action''': Oriol to share detailed numbers on performance improvement; to test the improvement turning off vectorization; to test different precisions by using the Oxford emulator.    
    53   
     73'''Action''': Silvia to send the doodle poll for the next meeting. 
    5474 
    55 == 3.   PSyclone and NEMO (Andy & Silvia) == 
     75 == 4.  AOB  == 
    5676 
    57 CMCC and STFC are working to test the PSyKAl approach on a NEMO kernel. CMCC will modify a sequential version of the original code to add vectorization and cache blocking. In the meantime, STFC is working on the development of the new sequential PSyKAl version. A performance comparison will be carried out at the end of the development phase. This work makes it possible not only to compare the performance of the two versions but also to evaluate the complexity of the PSyKAl implementation on a NEMO kernel and to report on it to the NEMO development community. The stand-alone code is available on a github repository. 
    58    
    59 == 4.   Hybrid parallelization status (Silvia)   == 
    60   
    61 The OpenMP implementation was discussed during the Merge Party in December and integrated in the trunk. However, some System Team experts are not convinced by the modifications, because of the loss of code readability and the increased complexity introduced by the OpenMP parallelization, together with the limited gain in performance. 
    62  
    63 Martin suggests sharing a document describing the code complexity problem in order to evaluate alternative solutions. 
    64 Silvia asks for a more general discussion of how to address HPC issues without affecting the readability and flexibility of the code. 
    65 Mike suggests discussing this issue during the next Enlarged Developer’s Committee meeting. 
    66 Claire suggests considering different OpenMP strategies (e.g. based on tiling), as in the Dynamico atmospheric model, since they seem more promising from the computational point of view; this could convince the System Team to integrate the new developments. 
    67 Andy highlights that Dynamico and NEMO differ in their data layout, so the tiling approach may not be as efficient for NEMO. 
    68  
    69  
    70 '''Action''': Silvia to discuss the OpenMP approach used in NEMO with Andy and Martin; the need to combine the model developments and HPC optimization strategies will be addressed during the Enlarged Developer’s Committee meeting. 
    71  
    72 == 5.   Next meeting call   == 
    73   
    74 Next meeting will be in the second half of April. 
    75  
    76  
    77 '''Action''': Silvia to send the doodle poll for the next meeting.   
     77Tim reports on two funding opportunities: the first from PRACE (BSC people could apply to fund their work on NEMO), the second for UK people who have access to UK facilities (Andy could apply to work on the halo size extension).