2020WP/HPC-09_epico_Loop_fusion (diff) – NEMO

Changes between Version 6 and Version 7 of 2020WP/HPC-09_epico_Loop_fusion


Timestamp:
2020-11-30T19:49:05+01:00 (3 years ago)
Author:
epico
Comment:

--

  • 2020WP/HPC-09_epico_Loop_fusion

    v6 v7  
    2727}}} 
    2828 
    29 ''...  
     29'' 
     30The peak computational performance of the target parallel architecture can be better exploited by improving the vectorisation level of the code. Many compilers can vectorise automatically, but the code must be written in a way that helps the compiler vectorise more of it. The code will need to be screened in order to limit dependency issues. Directives can also be used to increase the use of SIMD instructions and to get closer to the peak performance of modern cores. 
     31 
     32Single-core performance will be enhanced by changing the structure of the DO loops. Namely, the DO loops will be fused in order to perform as many operations as possible on the current (j, i, k) grid cell before moving on to the next one. This approach will enhance the vectorisation level and the cache reuse. 
     33 
     34Fusing the DO loops also requires moving the halo exchange ahead of the fused loops, which implies exchanging an extra halo.  
     35Part of this action therefore focuses on moving the communication ahead of a routine or kernel execution and extending the halo region accordingly.  
     36 
     37The planned optimisations will be designed so that the scientific quality of the code is not compromised.  
    3038'' 
    3139 
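The relation between loop fusion and halo width described above can be sketched as follows. This is a minimal 1D illustration in Python, not NEMO code: the array names, the sizes and the two-point stencil are assumptions made for the example, and it shows only the loop structure, not the performance gain. The unfused version sweeps the domain twice through an intermediate array `b`; the fused version does all the work for cell `i` in one pass, and in exchange reads the input field two points away, which is why a halo of width 2 has to be available before the loop starts.

```python
# Illustrative 1D sketch only; names and sizes are hypothetical, not NEMO code.
HALO = 2   # halo width required by the fused loop
N = 6      # number of interior points


def unfused(a):
    """Two sweeps over the domain with an intermediate array b."""
    size = len(a)
    b = [0.0] * size
    # Sweep 1: b is also evaluated one point into the halo, standing in for
    # the halo exchange on b that a halo=1 code performs between the sweeps.
    for i in range(HALO - 1, HALO + N + 1):
        b[i] = a[i - 1] + a[i + 1]
    c = [0.0] * size
    # Sweep 2: consumes b on the interior.
    for i in range(HALO, HALO + N):
        c[i] = b[i - 1] + b[i + 1]
    return c


def fused(a):
    """One sweep: all work for cell i is done before moving to cell i + 1."""
    c = [0.0] * len(a)
    for i in range(HALO, HALO + N):
        left = a[i - 2] + a[i]     # value b[i - 1] would have held
        right = a[i] + a[i + 2]    # value b[i + 1] would have held
        c[i] = left + right
    return c


# Input field with a halo of width 2 already filled on both sides,
# as if it had been exchanged before the loops.
a = [float(i * i) for i in range(N + 2 * HALO)]
assert unfused(a) == fused(a)
```

Both functions return the same values on the interior; the fused form simply never stores `b`, so each input value is reused while it is still close at hand.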
     
    3947}}} 
    4048 
    41 ''...'' 
     49'' 
     50DO-loop fusion can be introduced into the NEMO code gradually, but it requires moving the halo exchanges earlier in the code, which is possible thanks to the extended halo (halo=2). 
     51 
     52Unnecessary communications have been cleaned up: 
     531. the communications before writing results have been removed, because only the inner-domain data are stored in the output files 
     542. some communications have been removed by changing the DO loop ranges 
     553. most of the communications have been moved earlier in the code, at the cost of exchanging a wider halo. Support for halo=1 is still maintained in the code 
     56  
     57The loop fusion implementation plan is as follows: 
     581. we fuse the DO loops following the strategy defined during the NEMO HPC-WG meeting (please refer to the slides attached here: https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf). We will start from the advection routines (both for tracers and for ocean dynamics) 
     591a. a compilation key (key_loop_fusion) activates or deactivates the loop fusion optimisation 
     602a. the loop fusion has been applied to traadv_fct and traadv_mus (two new files containing the fused loops have been added) 
     612. proceed with the LDF module (next year) 
     623. we will finish with the remaining routines, from the most to the least computationally intensive (next year) 
     63'' 
    4264 
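The communication cleanup listed above (changing loop ranges, and exchanging a wider halo earlier) can be sketched with a toy 1D domain decomposition. This is a hedged Python illustration under stated assumptions, not the NEMO implementation: the two periodic subdomains, the centred-difference stencil and every function name are invented for the example, and `exchange` merely copies values between Python lists where NEMO would perform a real halo exchange. The halo=1 style needs two exchanges; the halo=2 style needs one wider exchange, because the first sweep is run over an extended loop range.

```python
# Illustrative sketch only; all names and sizes are hypothetical, not NEMO code.
N = 4      # interior points per subdomain
ALLOC = 2  # halo width allocated on each side of every local array


def exchange(parts, width, counter):
    """Fill the `width` halo points nearest the interior from periodic neighbours."""
    counter[0] += 1
    nparts = len(parts)
    for p, loc in enumerate(parts):
        left, right = parts[(p - 1) % nparts], parts[(p + 1) % nparts]
        for k in range(width):
            loc[ALLOC - 1 - k] = left[ALLOC + N - 1 - k]
            loc[ALLOC + N + k] = right[ALLOC + k]


def make_parts():
    """Two subdomains of a periodic 1D field; halos start unset (zero)."""
    parts = []
    for p in range(2):
        loc = [0.0] * (N + 2 * ALLOC)
        for i in range(N):
            g = p * N + i                     # global index of this point
            loc[ALLOC + i] = float(g * g % 7)
        parts.append(loc)
    return parts


def sweep(src, lo, hi):
    """Centred difference of src over local indices lo .. hi - 1."""
    dst = [0.0] * len(src)
    for i in range(lo, hi):
        dst[i] = src[i + 1] - src[i - 1]
    return dst


def narrow_halo():
    """halo=1 style: exchange, sweep, exchange again, sweep."""
    count = [0]
    a = make_parts()
    exchange(a, 1, count)
    b = [sweep(loc, ALLOC, ALLOC + N) for loc in a]
    exchange(b, 1, count)
    c = [sweep(loc, ALLOC, ALLOC + N) for loc in b]
    return c, count[0]


def wide_halo():
    """halo=2 style: one wider exchange up front, first sweep on an extended range."""
    count = [0]
    a = make_parts()
    exchange(a, 2, count)
    b = [sweep(loc, ALLOC - 1, ALLOC + N + 1) for loc in a]
    c = [sweep(loc, ALLOC, ALLOC + N) for loc in b]
    return c, count[0]


c_narrow, n_narrow = narrow_halo()
c_wide, n_wide = wide_halo()
assert c_narrow == c_wide                   # same result on every subdomain
assert (n_narrow, n_wide) == (2, 1)         # one exchange fewer with halo=2
```

The trade-off in this sketch is a little redundant computation in the halo in return for one fewer communication step.
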
    4365=== Documentation updates