New URL for the NEMO forge: http://forge.nemo-ocean.eu

Since March 2022, along with the NEMO 4.2 release, code development has moved to a self-hosted GitLab.
This forge is now archived and remains online for historical reference.
#2009 (HPC-02_Epicoco_Single Core Performance) – NEMO

Opened 6 years ago

Closed 4 years ago

Last modified 4 years ago

#2009 closed Task (fixed)

HPC-02_Epicoco_Single Core Performance

Reported by: mocavero
Owned by: francesca
Priority: high
Milestone: IMMERSE 2019
Component: OCE
Version: trunk
Severity: minor
Keywords:
Cc:

Description (last modified by epico)

Context

Improvement of NEMO single-core performance through alternative approaches that enhance data locality and vectorisation

Implementation plan

  1. Development of a portable tool to extract the hardware performance counters needed for the analysis of NEMO single-core performance (done)
  2. Analysis of memory-bound routines on Met Office and CMCC systems (done)
  3. Development of a proof-of-concept example of the loop fusion approach to enhance cache reuse and the level of vectorisation; a minimal sketch is given after this list. Please refer to the slides (https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2019WP/HPC-02_Epicoco_SingleCorePerformance/HPC02_SingleCorePerformance_proof_of_concept.pdf) in the wiki page of the HPC02 action (done)
  4. Planning the use of an extended halo region
  5. Evaluation of how the extended halo region impacts the restart and input files (in progress)
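As a minimal sketch of the loop fusion idea (hypothetical array names, not the actual NEMO kernels), two loop nests that each sweep the full domain can be merged so that data loaded for the first computation is reused immediately by the second:

      ! Before: two sweeps over the domain; zwrk is written by the first
      ! nest and re-read from memory by the second.
      DO jj = 2, jpj-1
         DO ji = 2, jpi-1
            zwrk(ji,jj) = za(ji,jj) + zb(ji,jj)
         END DO
      END DO
      DO jj = 2, jpj-1
         DO ji = 2, jpi-1
            zc(ji,jj) = zwrk(ji,jj) * zd(ji,jj)
         END DO
      END DO

      ! After: one fused sweep; the intermediate result stays in a
      ! register, each array is traversed once (better cache reuse), and
      ! the compiler sees a single stream to vectorise.
      DO jj = 2, jpj-1
         DO ji = 2, jpi-1
            zc(ji,jj) = ( za(ji,jj) + zb(ji,jj) ) * zd(ji,jj)
         END DO
      END DO

Note that fusing nests separated by a halo exchange is only valid once the halo is wide enough to cover the combined stencil, which is why items 4 and 5 on the extended halo region are prerequisites for the loop fusion development.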

Commit History (4)

Changeset   Author      Time                        ChangeLog
12325       francesca   2020-01-15T13:26:22+01:00   replace halo-copy routines - ticket #2009
11720       francesca   2019-10-18T12:58:16+02:00   add halo managment file- ticket #2009
11719       francesca   2019-10-18T12:52:29+02:00   add extra halo support- ticket #2009
11516       francesca   2019-09-09T16:35:07+02:00   Create HPC-02 branch - ticket #2009

Change History (17)

comment:1 Changed 6 years ago by nemo

  • Status changed from new to assigned

comment:2 Changed 6 years ago by mocavero

  • wp_comment set to The investigation activity on the optimisation techniques is ongoing and the ticket will be closed by the end of this year. A new action on the application of the investigated techniques will be planned for 2019.

comment:3 Changed 5 years ago by francesca

  • Description modified (diff)
  • Milestone changed from 2018 WP to IMMERSE 2019
  • Owner changed from mocavero to francesca
  • wp_comment changed from The investigation activity on the optimisation techniques is ongoing and the ticket will be closed by the end of this year. A new action on the application of the investigated techniques will be planned for 2019. to The investigation activity on the optimisation techniques has started. The activity will continue in the context of the IMMERSE project in 2019.

comment:4 Changed 5 years ago by nicolasmartin

  • Summary changed from HPC-01_HPCWG_singlecoreperf to HPC-01(2018WP)_HPCWG_singlecoreperf

comment:5 Changed 5 years ago by clevy

  • Summary changed from HPC-01(2018WP)_HPCWG_singlecoreperf to HPC-07(2018WP)_HPCWG_singlecoreperf

comment:6 Changed 5 years ago by nemo

  • Priority changed from normal to high

comment:7 Changed 5 years ago by epico

  • Summary changed from HPC-07(2018WP)_HPCWG_singlecoreperf to HPC-02_Epicoco_Single Core Performance

comment:8 Changed 5 years ago by epico

  • Description modified (diff)
  • wp_comment changed from The investigation activity on the optimisation techniques has started. The activity will continue in the context of the IMMERSE project in 2019. to Developed a proof of concept for evaluating the introduction of the loop fusion approach.

comment:9 Changed 5 years ago by francesca

  • wp_comment changed from Developed a proof of concept for evaluating the introduction of the loop fusion approach. to Developed a proof of concept for evaluating the introduction of the loop fusion approach. The loop fusion optimization requires the introduction of the extra halo management in advance. The work on the extra halo region is ongoing. The loop fusion development will not be included in the 2019 merge party, but it will be ready for 2020 developments.

comment:10 Changed 5 years ago by francesca

In 11516:

Create HPC-02 branch - ticket #2009

comment:11 Changed 4 years ago by francesca

In 11719:

add extra halo support- ticket #2009

comment:12 Changed 4 years ago by francesca

In 11720:

add halo managment file- ticket #2009

comment:13 follow-up: Changed 4 years ago by smasson

  • Type changed from Task to Bug

Dear Italo,

I reviewed your developments in @11720

There are several points that should be discussed/clarified before this branch can be merged.

The first point concerns the array indices. We were thinking of keeping the same logic to define the array dimensions as (1:jpi, 1:jpj), including the halos. As today, the "inner part" of the domain would remain (1+nn_hls : jpi-nn_hls, 1+nn_hls : jpj-nn_hls). The advantage of using the default Fortran indices (starting at 1) is that you do not have to define the index boundaries when passing an array into a subroutine. See for example what you are forced to do in mpp_lnk_generic.h90 in the declaration of "ARRAY_TYPE(1-nn_hls+1:,1-nn_hls+1:,:,:,:)".
By keeping the default dimensions (starting at 1), I think that you can keep mpp_lnk_generic.h90 almost as it is. You should just change the default value of ihl from 1 to nn_hls (something that should have been done before):

      IF( PRESENT(ihlcom) ) THEN   ;   ihl = ihlcom
      ELSE                         ;   ihl = nn_hls
      END IF

Next, the only real modification concerns the north pole folding, which should be extended to halos larger than 1.
Note that we need to keep the ihlcom optional argument (whose name should in fact start with a k rather than an i), as we want to be able to use a halo size other than the default nn_hls in some specific parts of the code (e.g. dynspg_ts and ice rheology).
I did not understand the need for the halo_mng routine… the array copies do not look like something one would do from an HPC perspective, no?

To me, the main work concerning the extra-halo development is not the communications part (which to me is almost ready to do the job) but the following 2 points.

1) A proper cleaning and preparation of the code to be able to run with nn_hls > 1. We should first be able to run the model with nn_hls = 2 instead of 1 without changing the code results and without optimizations.
To do this, we need to modify the inputs/outputs to correctly initialize the arrays and the output netCDF files. We must first redefine the default strategy that has been used from the beginning: including the periodicity/halo in the input and output files. Today, I/O implicitly assumes that nn_hls = 1. We should first ensure that I/O is completely independent of the value of nn_hls.
E.g. ORCA2 input files have 182 points in longitude instead of 180 because they include the periodicity based on nn_hls = 1. With nn_hls = 2 we would need 184 points, so it is strange to keep input files with 182 points! The same idea applies to the latitude: we should have 147 instead of 149 points. We should exclude all periodicity/halo from the input files, read only the inner part of the domain, and fill the halos with a communication. This is more or less what we already do today, except for the north pole folding. Including the north pole folding would simply require that fldread and iom_get accept an additional argument specifying the grid type.
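A rough sketch of that reading strategy (hedged: the iom_get and lbc_lnk argument lists vary between NEMO versions, and passing a grid type to the read is the proposed extension, not an existing interface):

      ! Hypothetical sketch: read only the inner part of the domain from
      ! the file, then fill the halos (periodicity and, with the proposed
      ! grid-type argument, the north fold) with one communication.
      CALL iom_get( inum, jpdom_data, 'votemper', ztem(1+nn_hls:jpi-nn_hls,1+nn_hls:jpj-nn_hls) )
      CALL lbc_lnk( 'sketch', ztem, 'T', 1. )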

2) Optimizations taking advantage of the extended halo size: rewriting the DO loops, the declaration of local variables, suppressing/regrouping communications…

comment:14 in reply to: ↑ 13 Changed 4 years ago by epico

Replying to smasson:

> Dear Italo,
>
> I reviewed your developments in @11720
>
> There are several points that should be discussed/clarified before this branch can be merged.
>
> The first point concerns the array indices. We were thinking of keeping the same logic to define the array dimensions as (1:jpi, 1:jpj), including the halos. As today, the "inner part" of the domain would remain (1+nn_hls : jpi-nn_hls, 1+nn_hls : jpj-nn_hls). The advantage of using the default Fortran indices (starting at 1) is that you do not have to define the index boundaries when passing an array into a subroutine. See for example what you are forced to do in mpp_lnk_generic.h90 in the declaration of "ARRAY_TYPE(1-nn_hls+1:,1-nn_hls+1:,:,:,:)".
> By keeping the default dimensions (starting at 1), I think that you can keep mpp_lnk_generic.h90 almost as it is. You should just change the default value of ihl from 1 to nn_hls (something that should have been done before):
>
>       IF( PRESENT(ihlcom) ) THEN   ;   ihl = ihlcom
>       ELSE                         ;   ihl = nn_hls
>       END IF

The reason why we need the negative indices in the array bounds is that the north fold communication uses the global position of the subdomain (i.e. nimpp and njmpp). I recall that (nimpp, njmpp) give the global coordinates of the local point (1,1). nimpp and njmpp are computed during the initialisation of the domain decomposition, and they depend on the nn_hls value used at the initialisation. If we wanted to use a different halo after the domain initialisation, we would have to recompute the nimpp values; but if we keep the local coordinate (1,1) unchanged when the halo size changes, we do not need to recompute nimpp and njmpp.
Moreover, using negative indices also means that the same indices still refer to the internal domain, which is denoted by 1:nlei (nlei and nlej also keep the same values when the halo size changes if you use negative indices).
The use of negative indices is relevant for the north fold communication only, so we can restrict the declaration of such indices to the lbc_lnk related routines and use the 1:jpi indices in the rest of the code; a minimal sketch follows.
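The bookkeeping, sketched (names other than nn_hls, nimpp, njmpp, nlei and nlej are hypothetical): the lower bound moves with the halo width, so the inner point (1,1) stays pinned to the same global point whatever the halo size:

      ! Hypothetical sketch: local (1,1) always maps to global
      ! (nimpp,njmpp), and nlei/nlej are unchanged when ihl changes.
      REAL, ALLOCATABLE :: ptab(:,:)
      ALLOCATE( ptab(1-ihl:nlei+ihl, 1-ihl:nlej+ihl) )
      ! inner domain : ptab(1:nlei, 1:nlej)
      ! halo points  : indices 1-ihl..0 and nlei+1..nlei+ihl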


> Next, the only real modification concerns the north pole folding, which should be extended to halos larger than 1.
> Note that we need to keep the ihlcom optional argument (whose name should in fact start with a k rather than an i), as we want to be able to use a halo size other than the default nn_hls in some specific parts of the code (e.g. dynspg_ts and ice rheology).

In our solution we have introduced the variables jpi_1 and jpj_1 in par_oce; these store the dimensions of the local domain when the halo is only one line wide. Inside lbc_lnk we compare the actual size of the input array with jpi_1, which lets us establish how wide the halo of the input array is.
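For example, with jpi_1 = inner size + 2 (one halo line on each side) and SIZE(ptab,1) = inner size + 2*ihl, the halo width of an incoming array can be recovered as (a sketch; ihl and ptab are illustrative names):

      ! Hypothetical sketch: infer the halo width of the dummy array
      ! from its extent, using jpi_1 as the one-line-halo reference.
      ihl = 1 + ( SIZE(ptab,1) - jpi_1 ) / 2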

> I did not understand the need for the halo_mng routine… the array copies do not look like something one would do from an HPC perspective, no?

The halo_mng module has been introduced only to gather in one module the routines that should be called when we change the halo size, for example because we are in dynspg, or in the rheology, or in the FCT advection kernel, where a halo of 3 would be needed to remove the communication.
I agree that the memory copy is something that must be avoided as much as possible, but in this case we have used the same identical approach you have implemented in the DYN_optimization branch, where you allocate new arrays with an extended halo, copy the values from the smaller arrays, and after computation copy back from the extended arrays to the smaller ones.
Other solutions can be adopted to avoid the memory copy, but they can impact the cache misses or the vectorisation.
We started working on halo_mng last July, before seeing your branch; surprisingly, we used the same approach…
In this regard, probably some misunderstanding happened, since we did not know that you were working on the extension of the halo for the dynspg routine until we saw your DYN_optimization branch.
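That copy-in/copy-out pattern would look roughly as follows (a sketch only: khls, the array names, and the lbc_lnk argument list are illustrative, not the actual halo_mng interface):

      ! Hypothetical sketch of the extended-halo copy-in/copy-out pattern.
      REAL, ALLOCATABLE :: zwide(:,:)
      ALLOCATE( zwide(1-khls:nlei+khls, 1-khls:nlej+khls) )
      zwide(1:nlei,1:nlej) = zt(1:nlei,1:nlej)    ! copy the inner domain in
      CALL lbc_lnk( 'sketch', zwide, 'T', 1. )    ! fill the wide halo once
      ! ... several stencil sweeps with no intermediate communication ...
      zt(1:nlei,1:nlej) = zwide(1:nlei,1:nlej)    ! copy the result back
      DEALLOCATE( zwide )

The trade-off mentioned above is visible here: the two array copies cost memory bandwidth, but the alternatives either rework every loop's bounds or risk degrading cache behaviour and vectorisation in the kernels themselves.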


> To me, the main work concerning the extra-halo development is not the communications part (which to me is almost ready to do the job) but the following 2 points.
>
> 1) A proper cleaning and preparation of the code to be able to run with nn_hls > 1. We should first be able to run the model with nn_hls = 2 instead of 1 without changing the code results and without optimizations.
> To do this, we need to modify the inputs/outputs to correctly initialize the arrays and the output netCDF files. We must first redefine the default strategy that has been used from the beginning: including the periodicity/halo in the input and output files. Today, I/O implicitly assumes that nn_hls = 1. We should first ensure that I/O is completely independent of the value of nn_hls.
> E.g. ORCA2 input files have 182 points in longitude instead of 180 because they include the periodicity based on nn_hls = 1. With nn_hls = 2 we would need 184 points, so it is strange to keep input files with 182 points! The same idea applies to the latitude: we should have 147 instead of 149 points. We should exclude all periodicity/halo from the input files, read only the inner part of the domain, and fill the halos with a communication. This is more or less what we already do today, except for the north pole folding. Including the north pole folding would simply require that fldread and iom_get accept an additional argument specifying the grid type.

To me too, this represents the most significant part of the development, and also the most challenging one. We have not finished yet and we will not be able to finish before the merge party. Anyway, a viable solution that could be used with the code developed so far is to change the halo to one line just before the reading/writing, read the input file, and then change the halo back to two or more lines.
Now we have to decide whether it is better to postpone the merge to next year, waiting for a complete implementation of the I/O routines, or to merge the branch as it is and use the workaround that I have suggested (sketched below).
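The workaround, sketched (halo_mng_set is a hypothetical name for a halo_mng resize routine; the actual interface may differ):

      CALL halo_mng_set( 1 )         ! hypothetical: shrink to a one-line halo
      ! ... read/write files with the historical nn_hls = 1 layout ...
      CALL halo_mng_set( nn_hls )    ! hypothetical: restore the extended halo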


> 2) Optimizations taking advantage of the extended halo size: rewriting the DO loops, the declaration of local variables, suppressing/regrouping communications…

comment:15 Changed 4 years ago by smasson

  • Type changed from Bug to Task

comment:16 Changed 4 years ago by epico

  • Resolution set to fixed
  • Status changed from assigned to closed
  • wp_comment changed from Developed a proof of concept for evaluating the introduction of the loop fusion approach. The loop fusion optimization requires the introduction of the extra halo management in advance. The work on the extra halo region is ongoing. The loop fusion development will not be included in the 2019 merge party, but it will be ready for 2020 developments. to Developed a proof of concept for evaluating the introduction of the loop fusion approach. The loop fusion optimization requires the introduction of the extra halo management in advance. The work on the extra halo region is ongoing. The loop fusion development will not be included in the 2019 merge party, but it will be ready for 2020 developments. This activity will be split into two activities in 2020, namely extra-halo and loop fusion.

comment:17 Changed 4 years ago by francesca

In 12325:

replace halo-copy routines - ticket #2009