Name and subject of the action
The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay in preview (or review) is longer than the expected 2 weeks.
Summary
Action | Loop Fusion |
---|---|
PI(S) | I.Epicoco, F.Mele |
Digest | Adoption of the 'loop fusion' optimization technique to reduce cache misses and improve computational performance |
Dependencies | Extra-halo |
Branch | source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} |
Previewer(s) | TBD |
Reviewer(s) | TBD |
Ticket | #2367 |
Description
The peak computational performance of the target parallel architecture can be better exploited by working on the vectorisation level of the code. Many compilers can perform automatic vectorisation, but the code needs to be written in a way that helps the compiler increase the vectorisation level. A screening of the code will be needed to limit dependency issues. Moreover, directives can be used to increase the share of SIMD instructions executed and to get closer to the peak performance of modern cores.
Single-core performance will be enhanced by changing the structure of the DO-loops. Namely, the DO-loops will be fused in order to perform as many operations as possible on the current (j, i, k) grid cell before moving on to the next one. This approach will enhance the vectorisation level and the cache reuse.
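As an illustration of the loop fusion idea, here is a minimal sketch in C (NEMO itself is written in Fortran). The function names and the toy computation are hypothetical and are not taken from the NEMO sources. The unfused version sweeps the array twice and stores an intermediate array; the fused version performs both operations on each cell before moving to the next one, so the intermediate value can stay in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two separate sweeps. The intermediate array tmp is written
   to memory by the first loop and read back by the second one. */
void scale_then_shift_unfused(size_t n, const double *a, double *tmp, double *out)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        tmp[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; ++i)
        out[i] = tmp[i] + 1.0;
}

/* Fused: a single sweep that does all the work for cell i before
   moving on, so no intermediate array is needed. */
void scale_then_shift_fused(size_t n, const double *a, double *out)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        double t = 2.0 * a[i];   /* intermediate value, kept local */
        out[i] = t + 1.0;
    }
}
```

Both functions compute the same result here; the fused one touches each element of `a` once and avoids the memory traffic on `tmp`.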
Fusing the DO-loops also requires moving the halo exchange ahead of the fused loops, which implies exchanging an extra halo. Part of this action therefore focuses on moving the communication before the execution of a routine/kernel and on extending the halo region accordingly.
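The link between fusion and the wider halo can be sketched with a toy one-dimensional, two-stage stencil in C. Everything here is an assumption made for illustration: `fill_halo` is a hypothetical stand-in for an MPI halo exchange on a periodic single-process domain, and the stencil is not a NEMO kernel. In the unfused version a halo of width 1 is exchanged before each loop; in the fused version a single exchange of width 2 is done up front and no communication is needed between the two stages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a halo exchange: fills h halo cells on each
   side of an array with n inner points (periodic, single process).
   Layout: [h halo cells | n inner points | h halo cells]. */
void fill_halo(double *x, size_t n, size_t h)
{
    assert(h <= n);
    for (size_t k = 0; k < h; ++k) {
        x[k] = x[n + k];           /* left halo  <- last inner points  */
        x[h + n + k] = x[h + k];   /* right halo <- first inner points */
    }
}

/* Unfused, halo width 1: u and f have n + 2 elements, out has n. */
void two_stage_unfused(size_t n, double *u, double *f, double *out)
{
    fill_halo(u, n, 1);
    for (size_t i = 1; i <= n; ++i)
        f[i] = u[i + 1] - u[i - 1];
    fill_halo(f, n, 1);            /* exchange between the two loops */
    for (size_t i = 1; i <= n; ++i)
        out[i - 1] = f[i + 1] - f[i - 1];
}

/* Fused, halo width 2: u has n + 4 elements, out has n.
   One wider exchange before the loop, none inside the kernel. */
void two_stage_fused(size_t n, double *u, double *out)
{
    fill_halo(u, n, 2);
    for (size_t i = 2; i < n + 2; ++i) {
        double f_right = u[i + 2] - u[i];   /* first stage at i + 1 */
        double f_left  = u[i] - u[i - 2];   /* first stage at i - 1 */
        out[i - 2] = f_right - f_left;
    }
}
```

In this sketch the fused loop recomputes first-stage values that the unfused version would have stored; that is the price paid here for removing the intermediate exchange.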
Planned optimisations will be designed with care to ensure that the scientific quality of the code is not compromised.
Implementation
The DO-loop fusion can be introduced into the NEMO code gradually, but it requires moving the halo exchanges earlier in the code, which is possible thanks to the extended halo (halo=2).
A cleanup of the unnecessary communications has been applied:
- the communications before writing results have been removed, because only the inner domain data are stored in the output files
- some communications have been removed by changing the DO-loop ranges
- most of the communications have been moved earlier in the code, while exchanging a wider halo. We still maintain support for halo=1 in the code
The loop fusion implementation plan is as follows:
- we fuse the DO-loops following the strategy defined during the NEMO HPC-WG meeting (please refer to the attached slides: https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf). We start from the advection routines (both for tracers and for ocean dynamics):
  - the compilation key key_loop_fusion activates or deactivates the loop fusion optimization
  - the loop fusion has been applied to traadv_fct and traadv_mus (two new files containing the fused loops have been added)
- proceed with the LDF module (next year)
- complete the remaining routines, from the most to the least computationally intensive (next year)
Documentation updates
The use of the compilation key key_loop_fusion must be described in the documentation. key_loop_fusion activates or deactivates the loop fusion optimization.
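For readers unfamiliar with compile-time keys, the mechanism can be sketched in C with a preprocessor macro. This is only an analogue under stated assumptions: NEMO is Fortran and its key is spelled key_loop_fusion, whereas the macro `KEY_LOOP_FUSION`, the function names and the toy kernel below are hypothetical. Building with `-DKEY_LOOP_FUSION` selects the fused variant; building without it selects the original two-loop variant.

```c
#include <assert.h>
#include <stddef.h>

/* Reports which variant was compiled in (hypothetical helper). */
int loop_fusion_enabled(void)
{
#ifdef KEY_LOOP_FUSION
    return 1;
#else
    return 0;
#endif
}

/* Toy kernel: out[i] = a[i] * a[i] - a[i].
   work must hold n elements; it is only used by the unfused variant. */
void square_minus_self(size_t n, const double *a, double *work, double *out)
{
    assert(n > 0);
#ifdef KEY_LOOP_FUSION
    (void)work;                       /* not needed when loops are fused */
    for (size_t i = 0; i < n; ++i) {
        double sq = a[i] * a[i];
        out[i] = sq - a[i];
    }
#else
    for (size_t i = 0; i < n; ++i)
        work[i] = a[i] * a[i];
    for (size_t i = 0; i < n; ++i)
        out[i] = work[i] - a[i];
#endif
}
```

Keeping both variants behind one key lets the same test suite be run with the optimization on and off.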
Preview
...
Tests
SETTE
The SETTE tests have been executed with halo=1 and with halo=2. With halo=1, all tests passed (Restartability, Reproducibility and Comparison with the trunk). With halo=2, three configurations did not pass the tests (AGRIF_DEMO_ST, OVERFLOW_ST and LOCK_EXCHANGE_ST), but this behaviour is exactly the same as in the trunk. All the other configurations passed the full set of tests.
Compiler: ifort 19.5.281 is used with XIOS in attached mode.
Regular checks
- Can this change be shown to produce expected impact (option activated)? YES, both in terms of numerical changes in the results and in HPC performance
- Can this change be shown to have a null impact (option not activated)? YES
- Results of the required bit comparability tests that have been run: are there no differences when activating the development? NO
- If some differences appear, is reason for the change valid/understood? YES
- If some differences appear, is the impact as expected on model configurations? YES
- Is this change expected to preserve all diagnostics? NO: numerical differences will be introduced due to the different order of the floating-point operations
- If no, is reason for the change valid/understood? YES
- Are there significant changes in run time/memory? No change in the memory footprint; a performance improvement can be seen, depending on the architecture
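The numerical differences mentioned in the checklist come from the fact that floating-point addition is not associative, so reordering operations can change the last bits of a result (or, in extreme cases, much more). A minimal sketch in C, with hypothetical function names and deliberately extreme values chosen to make the effect visible:

```c
/* Same three operands, two evaluation orders. */
double sum_left_first(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right_first(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0 in IEEE 754 double precision:
   sum_left_first  gives 1.0, because a + b cancels exactly first;
   sum_right_first gives 0.0, because b + c rounds back to -1e20. */
```

In a fused loop the differences are normally far smaller than in this contrived case, but they are enough to break bit-for-bit comparison with the unfused code.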
Detailed SETTE results
Tiling is turned on in this report.
Current code is : URL: https://forge.ipsl.jussieu.fr/nemo/svn/NEMO/branches/2020/dev_r13898_Tiling_Cleanup_MPI3 @ r13906 13906 ( last change @ r13906 )
SETTE validation report generated for : URL: https://forge.ipsl.jussieu.fr/nemo/svn/NEMO/branches/2020/dev_r13898_Tiling_Cleanup_MPI3 @ r13906+ (last changed revision) on ifort_zeus_xios arch file

!!---------------1st pass------------------!!

!----restart----!
GYRE_PISCES_ST run.stat restartability passed : 13906+
GYRE_PISCES_ST tracer.stat restartability passed : 13906+
ORCA2_ICE_PISCES_ST run.stat restartability passed : 13906+
ORCA2_ICE_PISCES_ST tracer.stat restartability passed : 13906+
ORCA2_OFF_PISCES_ST tracer.stat restartability passed : 13906+
AMM12_ST run.stat restartability passed : 13906+
ORCA2_SAS_ICE_ST run.stat restartability passed : 13906+
AGRIF_DEMO_ST run.stat restartability FAILED : 13906+ (results are different after 19 time steps)
WED025_ST run.stat restartability passed : 13906+
ISOMIP+_ST run.stat restartability passed : 13906+
OVERFLOW_ST ocean.output MISSING : 13906+
OVERFLOW_ST incomplete test
LOCK_EXCHANGE_ST ocean.output MISSING : 13906+
LOCK_EXCHANGE_ST incomplete test
VORTEX_ST run.stat restartability passed : 13906+
ICE_AGRIF_ST run.stat restartability passed : 13906+

!----repro----!
GYRE_PISCES_ST run.stat reproducibility passed : 13906+
GYRE_PISCES_ST tracer.stat reproducibility passed : 13906+
ORCA2_ICE_PISCES_ST run.stat reproducibility passed : 13906+
ORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 13906+
ORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 13906+
AMM12_ST run.stat reproducibility passed : 13906+
ORCA2_SAS_ICE_ST run.stat reproducibility passed : 13906+
ORCA2_ICE_OBS_ST run.stat reproducibility passed : 13906+
AGRIF_DEMO_ST run.stat reproducibility FAILED : 13906+ (results are different after 8 time steps)
WED025_ST run.stat reproducibility passed : 13906+
ISOMIP+_ST run.stat reproducibility passed : 13906+
VORTEX_ST run.stat reproducibility passed : 13906+
ICE_AGRIF_ST run.stat reproducibility passed : 13906+

!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 13906+ 13906+

!----result comparison check----!
check result differences between :
VALID directory : /work/asc/fm27215/dev_r13898_Tiling_Cleanup_MPI3/NEMO_VALIDATION at rev 13906+
and
REFERENCE directory : /work/asc/fm27215/trunk@r13787/trunk/NEMO_VALIDATION_H2_BERG at rev 13787
GYRE_PISCES_ST run.stat files are identical
GYRE_PISCES_ST tracer.stat files are identical
ORCA2_ICE_PISCES_ST run.stat files are identical
ORCA2_ICE_PISCES_ST tracer.stat files are identical
ORCA2_OFF_PISCES_ST tracer.stat files are identical
AMM12_ST run.stat files are identical
ORCA2_SAS_ICE_ST run.stat files are identical
AGRIF_DEMO_ST run.stat files are DIFFERENT (results are different after 17 time steps)
WED025_ST run.stat files are identical
ISOMIP+_ST run.stat files are identical
VORTEX_ST run.stat files are identical
ICE_AGRIF_ST run.stat files are identical
OVERFLOW_ST incomplete test
LOCK_EXCHANGE_ST incomplete test
Review
...
Attachments (1)
- comm_cleanup.pdf (126.1 KB) - added by francesca 4 years ago.