Form 128 (in 2019WP/HPC-02_Epicoco_SingleCorePerformance)

Saved Values

in subcontext 'abstract'

implementation: 'The DO-loop fusion and the extra halo exchange can be introduced into the NEMO code gradually. During the transitional period, the exchange of either a one-line or a two-line halo is supported.
    The extra-halo implementation plan is as follows.
    Transitional phase: both one-line and two-line halo regions are supported
      1. The jpi and jpj values are computed for a one-line halo (i.e. jpreci=1, jprecj=1).
      2. Where two halo lines must be exchanged, new memory is allocated before the exchange in order to store the second halo line.
      2a. We are aware that this implementation introduces some performance penalty because of the extra memory copy, but it is only transitional.
      3. Modify the mpplnk routines to support the exchange of one or two lines.
      4. Change all of the routines so that the halo exchange is moved before the execution of the routine itself.
    Final phase: the halo region is defined by the jpreci and jprecj variables
      1. The jpi and jpj values are computed for the extended halo region (i.e. jpreci=2, jprecj=2).
      2. Modify the mpplnk routines to support the exchange of the halo region defined by jpreci and jprecj.
      3. Modify all of the DO-loop indexes to take the jpreci and jprecj values into account, and remove the unnecessary memory copy introduced during the transitional phase.
    The loop fusion implementation plan is as follows.
      1. Fuse the DO-loops following the strategy defined during the NEMO HPC-WG meeting (please refer to the slides attached here: https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf), starting from the advection routines (both for tracers and for ocean dynamics).
      2. Proceed with the LDF module.
      3. Complete the remaining routines, from the most to the least computationally intensive.' by epico on 2019-05-15T16:00:25+02:00
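The transitional-phase steps of the implementation plan can be illustrated with a minimal sketch. It is not NEMO code and it does not reproduce the mpplnk routines: it assumes a 1D periodic field split into equally sized subdomains, uses Python lists in place of Fortran arrays and MPI messages, and all function names (make_subdomains, exchange_halo, widen_halo) are hypothetical. The halo argument plays the role that jpreci and jprecj play in the plan.

```python
def make_subdomains(global_field, nparts, halo):
    """Split a periodic 1D field into nparts equal pieces,
    each padded with `halo` empty cells on both sides."""
    n = len(global_field) // nparts
    return [
        [0.0] * halo + list(global_field[p * n:(p + 1) * n]) + [0.0] * halo
        for p in range(nparts)
    ]


def exchange_halo(subdomains, halo):
    """Fill the halo cells of every subdomain from the interior cells
    of its periodic neighbours (stand-in for a one- or two-line exchange)."""
    nparts = len(subdomains)
    for p, local in enumerate(subdomains):
        left = subdomains[(p - 1) % nparts]
        right = subdomains[(p + 1) % nparts]
        n = len(local) - 2 * halo  # number of interior cells
        # Left halo <- last `halo` interior cells of the left neighbour.
        local[:halo] = left[n:n + halo]
        # Right halo <- first `halo` interior cells of the right neighbour.
        local[halo + n:] = right[halo:2 * halo]


def widen_halo(local, old_halo, new_halo):
    """Transitional step: allocate a new array with a wider halo and
    copy the interior into it (this is the extra memory copy)."""
    interior = local[old_halo:len(local) - old_halo]
    return [0.0] * new_halo + interior + [0.0] * new_halo


if __name__ == "__main__":
    field = [float(i) for i in range(8)]

    # Arrays are sized for a one-line halo, as in the transitional phase.
    subs = make_subdomains(field, nparts=2, halo=1)
    exchange_halo(subs, halo=1)
    print(subs[0])

    # Only where two lines are needed: widen, then exchange two lines.
    wide = [widen_halo(s, old_halo=1, new_halo=2) for s in subs]
    exchange_halo(wide, halo=2)
    print(wide[0])
```

In the final phase of the plan the arrays would be allocated with the wider halo from the start, so the widen_halo copy in this sketch would disappear and only the exchange would remain.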
manual: 'The computational optimisations implemented in this action do not affect the usability of NEMO, nor do they change the user interfaces. The reference manual will not be changed.' by epico on 2019-05-14T19:36:32+02:00
description: 'The peak computational performance of the target parallel architecture can be better exploited by improving the vectorisation of the code. Many compilers can vectorise automatically, but the code must be written in a way that helps the compiler to do so. The code will need to be screened in order to limit dependency issues. Moreover, directives can be used to increase the use of SIMD instructions and to get closer to the peak performance of modern cores. Single-core performance will be enhanced by changing the structure of the DO-loops. Namely, the DO-loops will be fused in order to perform as many operations as possible on the current (j, i, k) grid cell before moving on to the next one. This approach will enhance the vectorisation level and cache reuse. DO-loop fusion also requires moving the halo exchange before the fused loops, which implies exchanging an extra halo line. Part of this action therefore also focuses on moving the communication before the execution of a routine/kernel and on extending the halo region. The planned optimisations will be designed so that the scientific quality of the code is not compromised.' by epico on 2019-05-14T19:36:32+02:00
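The link between DO-loop fusion and the wider halo can be shown with a small sketch. It is an illustration under stated assumptions, not the strategy from the HPC-WG slides: the field is 1D, the two stencils (centred differences) are chosen only for the example, and the names HALO, unfused and fused are hypothetical. The local array u is assumed to already hold two valid halo cells on each side.

```python
HALO = 2  # assumed halo width on each side of the local array


def unfused(u):
    """Two separate sweeps with a temporary array f between them."""
    n = len(u)
    f = [0.0] * n
    # Sweep 1: centred difference of u wherever both neighbours exist.
    for i in range(1, n - 1):
        f[i] = u[i + 1] - u[i - 1]
    # With a one-line halo, f could not be computed in the halo cells
    # and its halo would have to be exchanged at this point.
    d = [0.0] * n
    # Sweep 2: centred difference of f on the interior cells.
    for i in range(HALO, n - HALO):
        d[i] = f[i + 1] - f[i - 1]
    return d


def fused(u):
    """One sweep: all the work for cell i is done before moving to i + 1."""
    n = len(u)
    d = [0.0] * n
    for i in range(HALO, n - HALO):
        f_right = u[i + 2] - u[i]  # value of f at i + 1, recomputed locally
        f_left = u[i] - u[i - 2]   # value of f at i - 1, recomputed locally
        d[i] = f_right - f_left
    return d


if __name__ == "__main__":
    u = [float(i * i) for i in range(10)]
    print(fused(u) == unfused(u))
```

The fused loop reads u at i - 2 and i + 2, which is why the exchange of u has to happen before the loop and has to cover two halo lines. In this sketch the fused version recomputes f instead of storing it, trading a few extra operations for the removal of the temporary array and of the intermediate exchange.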

Change History

Changed on 2019-05-15T16:00:25+02:00 by epico:

  • implementation changed from
    [...] (please refer to the slide here attached https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf). [...]
    to
    [...] (please refer to the slides here attached https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf). [...]

Changed on 2019-05-15T15:59:52+02:00 by epico:

  • implementation changed from
    [...] following the strategy defined during the NEMO HPC-WG meeting. We will start from the advection routines [...]
    to
    [...] following the strategy defined during the NEMO HPC-WG meeting (please refer to the slide here attached https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf). We will start from the advection routines [...]

Changed on 2019-05-14T19:36:32+02:00 by epico:

  • implementation changed from
    Describe flow chart of the changes in the code. List the .F90 files and modules to be changed. Detailed list of new variables (including namelists) to be defined. Give for each the chosen name (following coding rules) and definition.
    to
    The DO-loop fusion and the extra halo exchange can be introduced into the NEMO code gradually. During the transitional period, the exchange of either a one-line or a two-line halo is supported.
    The extra-halo implementation plan is as follows.
    Transitional phase: both one-line and two-line halo regions are supported
      1. The jpi and jpj values are computed for a one-line halo (i.e. jpreci=1, jprecj=1).
      2. Where two halo lines must be exchanged, new memory is allocated before the exchange in order to store the second halo line.
      2a. We are aware that this implementation introduces some performance penalty because of the extra memory copy, but it is only transitional.
      3. Modify the mpplnk routines to support the exchange of one or two lines.
      4. Change all of the routines so that the halo exchange is moved before the execution of the routine itself.
    Final phase: the halo region is defined by the jpreci and jprecj variables
      1. The jpi and jpj values are computed for the extended halo region (i.e. jpreci=2, jprecj=2).
      2. Modify the mpplnk routines to support the exchange of the halo region defined by jpreci and jprecj.
      3. Modify all of the DO-loop indexes to take the jpreci and jprecj values into account, and remove the unnecessary memory copy introduced during the transitional phase.
    The loop fusion implementation plan is as follows.
      1. Fuse the DO-loops following the strategy defined during the NEMO HPC-WG meeting, starting from the advection routines (both for tracers and for ocean dynamics).
      2. Proceed with the LDF module.
      3. Complete the remaining routines, from the most to the least computationally intensive.
  • manual changed from
    Using part 1 and 2, define the summary of changes to be done in the NEMO reference manual (tex files), and in the content of web pages.
    to
    The computational optimisations implemented in this action do not affect the usability of NEMO, nor do they change the user interfaces. The reference manual will not be changed.
  • description changed from
    The computational peak performance of the target parallel architecture can be better exploited working on the vectorisation level of the code. Many compilers usually are able to perform automatic vectorisation but the code needs to be written in such a way as to drive the compiler to increase the vectorisation level. A screening of the code will be needed in order to limit the dependency issues. Moreover, directives can also be used to increase the execution of SIMD instructions and to get closer to modern core peak performance. Planned optimisations will be designed taking care to ensure that scientific quality of the code is not compromised.
    to
    [...] Moreover, directives can also be used to increase the execution of SIMD instructions and to get closer to modern core peak performance. Single-core performance will be enhanced by changing the structure of the DO-loops. Namely, the DO-loops will be fused in order to perform as many operations as possible on the current (j, i, k) grid cell before moving on to the next one. This approach will enhance the vectorisation level and cache reuse. DO-loop fusion also requires moving the halo exchange before the fused loops, which implies exchanging an extra halo line. Part of this action therefore also focuses on moving the communication before the execution of a routine/kernel and on extending the halo region. Planned optimisations will be designed [...]