Opened 2 years ago

Closed 2 weeks ago

Last modified 6 days ago

#2009 closed Task (fixed)

HPC-02_Epicoco_Single Core Performance

Reported by: mocavero
Owned by: francesca
Priority: high
Milestone: IMMERSE 2019
Component: OCE
Version: trunk
Severity: minor
Keywords:
Cc:
Review: failed
MP ready?: no
Progress: Developed a proof of concept for evaluating the introduction of the loop fusion approach. The loop fusion optimisation requires the extra-halo management to be introduced first. The work on the extra halo region is ongoing. The loop fusion development will not be included in the 2019 merge party, but it will be ready for the 2020 developments. This activity will be split into two activities in 2020, namely extra-halo and loop fusion.

Description (last modified by epico)

Context

Improvement of the NEMO single-core performance through alternative approaches that enhance data locality and vectorisation

Implementation plan

  1. Development of a portable tool to extract the hardware performance counters needed for the analysis of the NEMO single-core performance (done)
  2. Analysis of memory-bound routines on Met Office and CMCC systems (done)
  3. Development of a proof-of-concept example of the loop fusion approach to enhance cache reuse and the level of vectorisation; see the sketch after this list, and refer to the slides (https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf) on the wiki page of the HPC02 action (done)
  4. Planning the use of an extended halo region.
  5. Evaluation of how the extended halo region impacts the restart and input files (in progress)
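As an illustration of item 3, a minimal loop fusion sketch (assumed names: ta, tb, adv, dif and zdt are illustrative, not the actual NEMO kernels). Fusing two sweeps over the same arrays produces each ta(ji,jj) while its operands are still in cache and leaves the compiler a single loop nest to vectorise:

      SUBROUTINE tra_fused_sketch( ta, tb, adv, dif, zdt, jpi, jpj )
         IMPLICIT NONE
         INTEGER , INTENT(in   ) ::   jpi, jpj
         REAL(8) , INTENT(in   ) ::   zdt
         REAL(8) , INTENT(in   ) ::   tb(jpi,jpj), adv(jpi,jpj), dif(jpi,jpj)
         REAL(8) , INTENT(inout) ::   ta(jpi,jpj)
         INTEGER ::   ji, jj
         ! unfused form: ta = tb + zdt*adv in a first sweep, then
         ! ta = ta + zdt*dif in a second one, so ta streams through
         ! the cache twice; the fused form below does one sweep only
         DO jj = 2, jpj-1
            DO ji = 2, jpi-1
               ta(ji,jj) = tb(ji,jj) + zdt * ( adv(ji,jj) + dif(ji,jj) )
            END DO
         END DO
      END SUBROUTINE tra_fused_sketch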

Commit History (4)

Changeset    Author       Time
12325        francesca    2020-01-15T13:26:22+01:00

replace halo-copy routines - ticket #2009

11720        francesca    2019-10-18T12:58:16+02:00

add halo management file - ticket #2009

11719        francesca    2019-10-18T12:52:29+02:00

add extra halo support - ticket #2009

11516        francesca    2019-09-09T16:35:07+02:00

Create HPC-02 branch - ticket #2009

Change History (17)

comment:1 Changed 19 months ago by nemo

  • Status changed from new to assigned

comment:2 Changed 19 months ago by mocavero

  • Progress modified (diff)

comment:3 Changed 16 months ago by francesca

  • Description modified (diff)
  • Milestone changed from 2018 WP to IMMERSE 2019
  • Owner changed from mocavero to francesca
  • Progress modified (diff)

comment:4 Changed 15 months ago by nicolasmartin

  • Summary changed from HPC-01_HPCWG_singlecoreperf to HPC-01(2018WP)_HPCWG_singlecoreperf

comment:5 Changed 13 months ago by clevy

  • Summary changed from HPC-01(2018WP)_HPCWG_singlecoreperf to HPC-07(2018WP)_HPCWG_singlecoreperf

comment:6 Changed 12 months ago by nemo

  • Priority changed from normal to high

comment:7 Changed 12 months ago by epico

  • Summary changed from HPC-07(2018WP)_HPCWG_singlecoreperf to HPC-02_Epicoco_Single Core Performance

comment:8 Changed 8 months ago by epico

  • Description modified (diff)
  • Progress modified (diff)

comment:9 Changed 5 months ago by francesca

  • Progress modified (diff)

comment:10 Changed 4 months ago by francesca

In 11516:

Create HPC-02 branch - ticket #2009

comment:11 Changed 3 months ago by francesca

In 11719:

add extra halo support - ticket #2009

comment:12 Changed 3 months ago by francesca

In 11720:

add halo management file - ticket #2009

comment:13 follow-up: Changed 2 months ago by smasson

  • Type changed from Task to Bug

Dear Italo,

I have reviewed your developments at @11720.

There are several points that should be discussed/clarified before this branch can be merged.

The first point concerns the array indices. We were thinking of keeping the same logic to define the array dimensions as (1:jpi, 1:jpj), including the halos. As today, the inner part of the domain would remain (1+nn_hls : jpi-nn_hls, 1+nn_hls : jpj-nn_hls). The advantage of using the default Fortran indices (starting at 1) is that you do not have to define the index boundaries when passing an array into a subroutine. See for example what you are forced to do in mpp_lnk_generic.h90 in the declaration of ARRAY_TYPE(1-nn_hls+1:,1-nn_hls+1:,:,:,:).
By keeping the default dimensions (starting at 1), I think that you can keep mpp_lnk_generic.h90 almost as it is. You should just change the default value of ihl from 1 to nn_hls (something that should have been done before):

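      ! use the optional halo width when given, otherwise default to nn_hls (not 1)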
      IF( PRESENT(ihlcom) ) THEN   ;   ihl = ihlcom
      ELSE                         ;   ihl = nn_hls
      END IF

Next, the only real modification concerns the north pole folding, which should be extended to halos larger than 1.
Note that we need to keep the ihlcom optional argument (whose name should in fact start with a k rather than an i), as we want to be able to use a halo size other than the default nn_hls in some specific parts of the code (e.g. dynspg_ts and the ice rheology).
I did not understand the need for the halo_mng routine… the array copies do not look like something to do from an HPC perspective, no?

To me, the main work concerning the extra-halo development is not the communications part (which, to me, is almost ready to do the job) but the following two points.

1) A proper cleaning and preparation of the code so that it can run with nn_hls > 1. We should first be able to run the model with nn_hls = 2 instead of 1 without changing the code results and without optimizations.
To do this, we need to modify the inputs/outputs to correctly initialize the arrays and the output netcdf files. We must first revisit the default strategy used from the beginning: including the periodicity/halo in the input and output files. Today the i/o implicitly assume that nn_hls = 1. We should first ensure that the i/o are completely independent of the value of nn_hls.
E.g. the ORCA2 input files have 182 points in longitude instead of 180 because they include the periodicity based on nn_hls = 1. With nn_hls = 2 we would need 184 points, so it is strange to keep input files with 182 points! The same idea applies to the latitude: we should have 147 points instead of 149. We should exclude all periodicity/halo from the input files, read only the inner part of the domain, and fill the halos with a communication. This is more or less what we already do today, except for the north pole folding. Handling the north pole folding would simply require that fldread and iom_get accept an additional argument specifying the grid type.

2) Optimizations that take advantage of the extended halo size: rewriting the do loops and the declaration of local variables, suppressing/regrouping communications…
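For illustration, a minimal sketch of the communication point under assumed names (zt, zlap and zcff are illustrative). With nn_hls = 2 and arrays dimensioned (1:jpi,1:jpj), a single exchange serves two successive 5-point stencil sweeps: the first sweep computes one line into the halo, so the second needs no new communication:

      CALL lbc_lnk( 'sketch', zt, 'T', 1._wp )   ! illustrative call: fills the 2-line halo of zt
      DO jj = 2, jpj-1                           ! first sweep: one line into the halo
         DO ji = 2, jpi-1
            zlap(ji,jj) =   zt(ji-1,jj) + zt(ji+1,jj) + zt(ji,jj-1) + zt(ji,jj+1)   &
               &          - 4._wp * zt(ji,jj)
         END DO
      END DO
      DO jj = 3, jpj-2                           ! second sweep: inner domain only, no exchange
         DO ji = 3, jpi-2
            zt(ji,jj) = zt(ji,jj) + zcff * zlap(ji,jj)
         END DO
      END DO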

comment:14 in reply to: ↑ 13 Changed 7 weeks ago by epico

Replying to smasson:

Dear Italo,

I have reviewed your developments at @11720.

There are several points that should be discussed/clarified before this branch can be merged.

The first point concerns the array indices. We were thinking of keeping the same logic to define the array dimensions as (1:jpi, 1:jpj), including the halos. As today, the inner part of the domain would remain (1+nn_hls : jpi-nn_hls, 1+nn_hls : jpj-nn_hls). The advantage of using the default Fortran indices (starting at 1) is that you do not have to define the index boundaries when passing an array into a subroutine. See for example what you are forced to do in mpp_lnk_generic.h90 in the declaration of ARRAY_TYPE(1-nn_hls+1:,1-nn_hls+1:,:,:,:).
By keeping the default dimensions (starting at 1), I think that you can keep mpp_lnk_generic.h90 almost as it is. You should just change the default value of ihl from 1 to nn_hls (something that should have been done before):

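      ! use the optional halo width when given, otherwise default to nn_hls (not 1)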
      IF( PRESENT(ihlcom) ) THEN   ;   ihl = ihlcom
      ELSE                         ;   ihl = nn_hls
      END IF

The reason why we need the negative indices in the array bounds is that the north fold communication uses the global position of the subdomain (i.e. nimpp and njmpp). Recall that (nimpp, njmpp) map the global coordinates to the local coordinates (1,1). nimpp and njmpp are computed during the initialisation of the domain decomposition, and they depend on the nn_hls value used at initialization. If we wanted to use a different halo after the domain initialisation, we would have to recompute the nimpp values; but if we keep the local coordinate (1,1) unchanged when the halo size changes, we do not need to recompute nimpp and njmpp.
Moreover, with negative indices the same indices still refer to the internal domain, which is denoted by 1:nlei (nlei and nlej also keep the same values when the halo size changes if you use negative indices).
The use of negative indices is relevant for the north fold communication only, so we can restrict the declaration of such indices to the lbc_lnk-related routines and use the 1:jpi indices in the rest of the code, as the sketch below illustrates.
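A minimal sketch of the idea, with assumed names (not the branch code): declaring the dummy array with a lower bound of 1-khls keeps the first interior point at local index (1,1) whatever the halo width, so nimpp, njmpp, nlei and nlej stay valid when the halo changes:

      SUBROUTINE lbc_sketch( ptab, khls )
         IMPLICIT NONE
         INTEGER , INTENT(in   ) ::   khls                    ! halo width of ptab
         REAL(8) , INTENT(inout) ::   ptab(1-khls:,1-khls:)   ! interior still starts at (1,1)
         ! only the lbc_lnk-related routines need this declaration; the rest
         ! of the code can keep the plain (1:jpi,1:jpj) view of the same array
         ptab(1-khls:0,:) = 0._8                              ! e.g. reset the western halo lines
      END SUBROUTINE lbc_sketch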


Next, the only real modification concerns the north pole folding, which should be extended to halos larger than 1.
Note that we need to keep the ihlcom optional argument (whose name should in fact start with a k rather than an i), as we want to be able to use a halo size other than the default nn_hls in some specific parts of the code (e.g. dynspg_ts and the ice rheology).

In our solution we have introduced the variables jpi_1 and jpj_1 in par_oce; these store the dimensions of the local domain when the halo has only one line. Inside lbc_lnk we compare the actual size of the input array with jpi_1, which tells us how wide the halo of the input array is.
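A minimal sketch of that test (assumed logic): since jpi_1 is the local i-extent for a one-line halo, the inner width is jpi_1 - 2, so an input array of i-extent SIZE(ptab,1) carries the following number of halo lines on each side:

      ihl = ( SIZE( ptab, 1 ) - jpi_1 ) / 2 + 1   ! halo width of the input array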

I did not understand the need for the halo_mng routine… the array copies do not look like something to do from an HPC perspective, no?

The halo_mng module has been introduced only to gather in one module the routines that should be called when we change the halo size, for example because we are in dynspg, in the rheology, or in the FCT advection kernel, where a halo of 3 would be needed to remove the communication.
I agree that the memory copy is something that must be avoided as much as possible, but in this case we have used exactly the same approach you implemented in the DYN_optimization branch, where you allocate new arrays with an extended halo, copy the values from the smaller arrays, and, after the computation, copy back from the extended arrays to the smaller ones.
Other solutions can be adopted to avoid the memory copy, but they can impact the cache misses or the vectorisation.
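A minimal sketch of this copy-in/copy-out pattern (assumed names: zwrk, kext and the halo-filling call are illustrative):

      ALLOCATE( zwrk(1-kext:jpi+kext, 1-kext:jpj+kext) )   ! kext extra halo lines
      zwrk(1:jpi,1:jpj) = pt(:,:)                  ! copy in the standard-halo field
      CALL lbc_lnk( 'sketch', zwrk, 'T', 1._wp )   ! fill the extended halo once
      ! ... several stencil sweeps on zwrk with no further communications ...
      pt(:,:) = zwrk(1:jpi,1:jpj)                  ! copy the result back
      DEALLOCATE( zwrk )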
We started working on halo_mng last July, before seeing your branch; surprisingly, we used the same approach…
In this regard, some misunderstanding probably occurred, since we did not know that you were working on the extension of the halo for the dynspg routine until we saw your DYN_optimization branch.


To me, the main work concerning the extra-halo development is not the communications part (which, to me, is almost ready to do the job) but the following two points.

1) A proper cleaning and preparation of the code so that it can run with nn_hls > 1. We should first be able to run the model with nn_hls = 2 instead of 1 without changing the code results and without optimizations.
To do this, we need to modify the inputs/outputs to correctly initialize the arrays and the output netcdf files. We must first revisit the default strategy used from the beginning: including the periodicity/halo in the input and output files. Today the i/o implicitly assume that nn_hls = 1. We should first ensure that the i/o are completely independent of the value of nn_hls.
E.g. the ORCA2 input files have 182 points in longitude instead of 180 because they include the periodicity based on nn_hls = 1. With nn_hls = 2 we would need 184 points, so it is strange to keep input files with 182 points! The same idea applies to the latitude: we should have 147 points instead of 149. We should exclude all periodicity/halo from the input files, read only the inner part of the domain, and fill the halos with a communication. This is more or less what we already do today, except for the north pole folding. Handling the north pole folding would simply require that fldread and iom_get accept an additional argument specifying the grid type.

To me as well, this represents the most significant part of the development, and also the most challenging one. We have not finished yet, and we will not be able to finish before the merge party. Anyway, a viable solution with the code developed so far is to change the halo to one line just before reading/writing, read the input file, and then change the halo back to two or more lines (see the sketch below).
Now we have to decide whether it is better to postpone the merge to next year, waiting for a complete implementation of the I/O routines, or to merge the branch as it is and use the workaround I have suggested.
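A sketch of this workaround, with assumed names (halo_mng_set and the exact read call are illustrative, not the final interface):

      CALL halo_mng_set( 1 )                        ! shrink to the one-line halo the files assume
      CALL iom_get( inum, jpdom_data, 'tn', ztn )   ! read the field as today
      CALL halo_mng_set( nn_hls )                   ! restore the extended halo
      CALL lbc_lnk( 'sketch', ztn, 'T', 1._wp )     ! refill the wider halo lines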


2) Optimizations that take advantage of the extended halo size: rewriting the do loops and the declaration of local variables, suppressing/regrouping communications…

comment:15 Changed 6 weeks ago by smasson

  • Type changed from Bug to Task

comment:16 Changed 2 weeks ago by epico

  • Progress modified (diff)
  • Resolution set to fixed
  • Status changed from assigned to closed

comment:17 Changed 6 days ago by francesca

In 12325:

replace halo-copy routines - ticket #2009
