= HPC-02: Implement 2D tiling

Last edition: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay on preview (or review) is longer than the expected two weeks.

[[PageOutline(2, , inline)]]

== Summary

||=Action       || Implement 2D tiling (with the LFRA version of NEMO) ||
||=PI(S)        || Daley Calvert, Andrew Coward ||
||=Digest       || Implement 2D tiling to reduce traffic between main memory and L3 cache ||
||=Dependencies || DO loop macros ([wiki:2020WP/KERNEL-02_Coward_DoLoopMacros_part1]), extended haloes (Italo Epicoco, Seb Masson and Francesca Mele), extension of XIOS to accept 2D tiles of data (Yann Meurdesoif & Seb Masson) ||
||=Branch       || source:/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling ||
||=Previewer(s) || Gurvan Madec ||
||=Reviewer(s)  || Gurvan Madec ||
||=Ticket       || #2365 ||

=== Description

Implement tiling over the horizontal dimensions (i and j).

=== Branch

[https://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling dev_r13383_HPC-02_Daley_Tiling]

* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13831%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fsrc&old=13747%40NEMO%2Ftrunk%2Fsrc&sfp_email=&sfph_mail= Changes in src]
* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13831%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fcfgs&old=13747%40NEMO%2Ftrunk%2Fcfgs&sfp_email=&sfph_mail= Changes in cfgs]
* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13831%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Ftests&old=13747%40NEMO%2Ftrunk%2Ftests&sfp_email=&sfph_mail= Changes in tests]

=== Summary of the tiling method

The processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains, which are iterated over asynchronously within the timestepping loop.

These tile domains are defined by a new set of indices representing the internal part of the domain (`ntsi`/`ntei`/`ntsj`/`ntej`). These indices replace those of the processor domain (`Nis0`/`Nie0`/`Njs0`/`Nje0`) in DO loops, array shape declarations, and other appropriate places. These changes are implemented via new and existing CPP macros, allowing the tiling to be implemented with relatively few changes at lower levels.

A number of temporary workarounds are required to preserve results; these will be removed before the 2020 merge party or as part of the 2021 work plan.

=== Details of the implementation

==== Namelist

In `dom_nam` (`domain.F90`) a new namelist (`namtile`) is read and control prints are written to ocean.output:

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .false.   ! Use tiling (T) or not (F)
   nn_ltile_i = 10        ! Length of tiles in i
   nn_ltile_j = 10        ! Length of tiles in j
/
}}}

These variables are declared in `dom_oce.F90`.

==== Setting the tile domain

A new subroutine `dom_tile` (`domain.F90`) sets the values of the tile indices (`ntsi`/`ntei`/`ntsj`/`ntej`) and the active tile number (`ntile`). During initialisation, this subroutine calculates the number of tiles (`nijtile`) and the tile indices, which are stored in public arrays (`ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a`) with lengths equal to `nijtile + 1`.
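For illustration, the initialisation step can be sketched as follows. This is a simplified sketch rather than the actual `dom_tile` code, and it assumes that element 0 of each index array holds the full-domain values used when `ntile = 0`:

{{{
#!fortran
! Hypothetical sketch of the initialisation step of dom_tile
ntsi_a(0) = Nis0   ;   ntei_a(0) = Nie0      ! element 0: the full processor domain (ntile = 0)
ntsj_a(0) = Njs0   ;   ntej_a(0) = Nje0

jtile = 0
DO jj = Njs0, Nje0, nn_ltile_j                ! cover the internal domain in steps of the tile lengths
   DO ji = Nis0, Nie0, nn_ltile_i
      jtile = jtile + 1
      ntsi_a(jtile) = ji   ;   ntei_a(jtile) = MIN( ji + nn_ltile_i - 1, Nie0 )
      ntsj_a(jtile) = jj   ;   ntej_a(jtile) = MIN( jj + nn_ltile_j - 1, Nje0 )
   END DO
END DO
nijtile = jtile                               ! tiles at the domain edge may be smaller than requested
}}}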
When `dom_tile` is otherwise called, these arrays are used to set the tiling indices for the current tile (e.g. `ntsi = ntsi_a(ntile)`). `ntile = 0` indicates that tiling is disabled, i.e. the full domain is to be used.

`dom_tile` is called whenever the active tile needs to be set, when tiling needs to be disabled, and for initialisation (in `dom_init`):

{{{
#!fortran
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 )   ! Work on tile 3
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Work on the full domain
CALL dom_tile( ntsi, ntsj, ntei, ntej )            ! Initialisation (implies ktile=0)
}}}

Variables `ntsi`/`ntei`/`ntsj`/`ntej`/`nijtile` are declared in `par_oce.F90`. Variables `ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a` are declared in `dom_oce.F90`.

==== Changes to CPP macros

In `do_loop_substitute.h90`, the DO loop macros are modified to use the tiling indices instead:

{{{
#!diff
- #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T) ; DO ji = Nis0-(L), Nie0+(R)
+ #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T) ; DO ji = ntsi-(L), ntei+(R)
}}}

A number of new macros have been added that replace `jpi`/`jpj` in DIMENSION and ALLOCATE statements (see the “Local working array declarations” section below):

{{{
#define A1Di(H) ntsi-H:ntei+H                  # H is equivalent to B/T/L/R in DO loop macros
#define A1Dj(H) ntsj-H:ntej+H
#define A2D(H)  A1Di(H),A1Dj(H)

#define A1Di_T(T) (ntsi-nn_hls-1)*T+1:         # T is 1 (= ntsi:) or 0 (= 1:)
#define A1Dj_T(T) (ntsj-nn_hls-1)*T+1:
#define A2D_T(T)  A1Di_T(T),A1Dj_T(T)

#define JPK  :
#define JPTS :
#define KJPT :
}}}

The purpose of the `A1Di`/`A1Dj`/`A2D` macros is to allow local working arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use. Furthermore, the tile-sized arrays will be declared with lower and upper bounds corresponding to the position of the tile in the full domain. Horizontal indices, for example in DO loops, will therefore apply to both tile- and full-sized arrays:

{{{
#!fortran
! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9
REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d
REAL(wp), DIMENSION(jpi,jpj)             :: a2d

DO_2D(1,1,1,1)
   z2d(ji,jj) = a2d(ji,jj)
END_2D
}}}

The `A1Di_T`/`A1Dj_T`/`A2D_T` macros are assumed-shape versions of the `A1Di`/`A1Dj`/`A2D` macros, used in dummy array argument declarations where the shape of the actual array argument is inconsistent between calls to the subroutine (see the “Local working array declarations” section below).

The `JPK`/`JPTS`/`KJPT` macros are used where explicit-shape declarations have been replaced by assumed-shape declarations. Their only purpose is to preserve readability.

==== Changes at the timestepping level

The looping over tiles occurs in the `stp` subroutine. The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration. After exiting the loop (and before it, during initialisation) the tiling is disabled (`ntile == 0`):

{{{
#!fortran
! Loop over tile domains
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )

   ! Tiled region of code
END DO
IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )      ! Revert to full domain
}}}

The tiled code currently encompasses the active tracers (TRA) region of `stp`. The loop over tiles must currently be broken into two separate loops to preserve results. This is due to the temporary workaround implemented in `tra_adv`, which disables tiling for certain options and can therefore change the order in which the tracer trends are updated.
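Schematically, the split structure takes roughly the following form (the comments stand in for the actual routine calls in `stp`):

{{{
#!fortran
! First tile loop: tracer code preceding tra_adv
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   ! ... TRA code before advection ...
END DO

! Second tile loop: tra_adv and the remaining tracer code
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   ! ... tra_adv (which may itself disable tiling) and subsequent TRA code ...
END DO
IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Revert to full domain
}}}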
==== General changes at the module and subroutine levels

__DO loop bounds__

Each tile has an internal area and an overlapping halo, but unlike the MPP domain the halo points are not set by `lbc_lnk`. The internal part of a tile may therefore be partly overwritten by the halo of an adjacent tile, which will change results. In these cases, DO loops must work on the internal part of the tile only. This is generally only an issue for persistent variables (i.e. declared at the module level, or with `SAVE`), e.g. when zeroing an array:

{{{
#!diff
- DO_3D_11_11(1, jpk)
-    akz(ji,jj,jk) = 0._wp
- END_3D
+ DO_3D_00_00(1, jpk)
+    akz(ji,jj,jk) = 0._wp
+ END_3D
}}}

or where an array appears on both sides of the assignment:

{{{
#!diff
- DO_2D( 0, 1, 0, 0 )
+ DO_2D( 0, 0, 0, 0 )
     pts(ji,jj,1,jn,Krhs) = pts(ji,jj,1,jn,Krhs) + zfact * ( sbc_tsc_b(ji,jj,jn) + sbc_tsc(ji,jj,jn) ) / e3t(ji,jj,1,Kmm)
}}}

In some cases (e.g. `tra_bbl_adv` in `trabbl.F90`) DO loops must also work on the outer halo of a processor domain, which requires a slightly different approach:

{{{
#!diff
- DO_2D( 1, 0, 1, 0 )
+ IF( ntsi == Nis0 ) THEN ; isi = 1 ; ELSE ; isi = 0 ; ENDIF   ! Avoid double-counting when using tiling
+ IF( ntsj == Njs0 ) THEN ; isj = 1 ; ELSE ; isj = 0 ; ENDIF
+ DO_2D( isi, 0, isj, 0 )
}}}

__Local working array declarations__

The new CPP macros in `do_loop_substitute.h90` replace references to the full domain in explicit-shape declarations for local working arrays:

{{{
#!diff
- ALLOCATE(jpi,jpj )          DIMENSION(jpi,jpj )
+ ALLOCATE(A2D(nn_hls) )      DIMENSION(A2D(nn_hls) )
}}}

This allows the arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use.

This approach does not work for subroutines that are called with actual array arguments of varying shape; an assumed-shape declaration must be used instead. It is necessary to use the form specifying lower bounds (i.e. `DIMENSION(start_i:, start_j:)`), as this information is required to correctly index the array when tiling is used. However, these bounds must be passed as additional arguments to the subroutine. To avoid widespread changes, the subroutine is replaced by a wrapper subroutine that calculates the bounds and passes them to the original subroutine. For example:

{{{
#!diff
- SUBROUTINE eos_insitu( pts, prd, pdep )
-    REAL(wp), DIMENSION(jpi,jpj,jpk,jpts), INTENT(in   ) :: pts
-    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(  out) :: prd
-    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(in   ) :: pdep
+ SUBROUTINE eos_insitu( pts, prd, pdep )
+    REAL(wp), DIMENSION(:,:,:,:), INTENT(in   ) :: pts
+    REAL(wp), DIMENSION(:,:,:)  , INTENT(  out) :: prd
+    REAL(wp), DIMENSION(:,:,:)  , INTENT(in   ) :: pdep
+
+    CALL eos_insitu_t( pts, is_tile(pts), prd, is_tile(prd), pdep, is_tile(pdep) )
+ END SUBROUTINE eos_insitu
+
+ SUBROUTINE eos_insitu_t( pts, ktts, prd, ktrd, pdep, ktdep )
+    INTEGER, INTENT(in   ) :: ktts, ktrd, ktdep
+    REAL(wp), DIMENSION(A2D_T(ktts) ,JPK,JPTS), INTENT(in   ) :: pts
+    REAL(wp), DIMENSION(A2D_T(ktrd) ,JPK     ), INTENT(  out) :: prd
+    REAL(wp), DIMENSION(A2D_T(ktdep),JPK     ), INTENT(in   ) :: pdep
}}}

Here, `is_tile` is an interface of functions returning 1 or 0 depending on the size of the array, e.g.:

{{{
#!fortran
FUNCTION is_tile_2d( pt )
   REAL(wp), DIMENSION(:,:), INTENT(in) :: pt
   INTEGER :: is_tile_2d

   IF( ln_tile .AND. SIZE(pt, 1) < jpi ) THEN
      is_tile_2d = 1
   ELSE
      is_tile_2d = 0
   ENDIF
END FUNCTION is_tile_2d
}}}

and `A2D_T` is a version of the `A2D` CPP macro that returns `1:,1:` if `is_tile(array) = 0` (`array` is the size of the full domain) or `ntsi:,ntsj:` if `is_tile(array) = 1` (`array` is the size of the tile). The `JPK`/`JPTS` macros each return `:` and are used to preserve readability.

These wrappers are added in the following locations:

* `IOM/prtctl.F90` (`prt_ctl`)
* `TRA/eosbn2.F90` (various subroutines)
* `TRA/traldf_iso.F90` (`tra_ldf_iso`)
* `TRA/traldf_lap_blp.F90` (`tra_ldf_lap`)
* `TRA/traldf_triad.F90` (`tra_ldf_triad`)
* `TRA/zpshde.F90` (`zps_hde`, `zps_hde_isf`)

__`:` array subscripts__

The above array declaration changes may introduce conformance issues when `:` subscripts are used for indexing, or if no indexing is used at all. These are resolved by instead using an equivalent DO loop:

{{{
#!diff
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
- REAL(wp), DIMENSION(jpi,jpj)     :: z2d
- z2d(:,:) = a3d(:,:,1)
+ REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
+ REAL(wp), DIMENSION(A2D(nn_hls)) :: z2d
+ DO_2D(1,1,1,1)
+    z2d(ji,jj) = a3d(ji,jj,1)
+ END_2D
}}}

__Code called once per timestep__

Some code should only be called once per timestep (e.g. ocean.output write statements, initialisation steps, reading data from files) but will be called by each tile. IF statements are used to suppress these calls and generally take the form:

{{{
#!fortran
IF( ntile == 0 .OR. ntile == 1 )  THEN          ! Do only on the first tile
   ! ...
ENDIF

IF( ntile == 0 .OR. ntile == nijtile )  THEN    ! Do only on the last tile
   ! ...
ENDIF
}}}

Sometimes `dom_tile` must also be called to temporarily disable the tiling:

{{{
#!fortran
IF( ntile == 0 .OR. ntile == 1 )  THEN          ! Do only for the full domain
   itile = ntile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = 0 )       ! Use full domain

   CALL fld_read( kt, 1, sf_tsd )               !==   read T & S data at kt time step   ==!

   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = itile )   ! Revert to tile
ENDIF
}}}

==== Special cases

__`diaptr` module__

At present, the following operations are generally performed between `dia_ptr` and `dia_ptr_hst`:

1. Calculate the zonal integral
2. Call `mpp_sum`
3. Perform additional arithmetic or copy the result onto 2D/3D arrays
4. Call `iom_put`

The tiling is not easily implemented here due to steps 2 and 4. The `diaptr` module has therefore been largely restructured to accommodate the tiling, with the added bonus of significantly reducing the number of communications.
* `dia_ptr` has been split into `dia_ptr_zint` (steps 1-2) and `dia_ptr_iom` (steps 3-4)
* `dia_ptr_zint` is called for every tile; `dia_ptr_iom` is called after `dia_ptr_zint` has been run for all tiles
* The zonal integrals that were calculated in `dia_ptr` are now calculated by `dia_ptr_zint` and are stored in two new module arrays, `pvtr_int` and `pzon_int`, which are used by `dia_ptr_iom`
* Zonal integrals are calculated by calling `ptr_sj` (calculates the integral for the tile) and a new subroutine `ptr_sum` (accumulates the `ptr_sj` integrals and calls `mpp_sum` on the last tile only)
* The number of communications is reduced from 90 to 13 (data for all basins are exchanged at once)
* Code relating to `mpp_sum` has been removed from `ptr_sj`
* Pointers and their target arrays have been removed from `ptr_sj`
* `dia_ptr_hst` has been condensed by moving the loop over basins out of the IF statements

__`dia_ar5_hst` subroutine__

This code contains `iom_put` and `lbc_lnk` calls, which cannot be tiled. The code has therefore been rearranged to separate the transport calculations from the `iom_put` calls. The transport calculations are performed for each tile and saved in two new module variables, `hstr_adv` and `hstr_ldf`. The `iom_put` calls are made only on the last tile, and the `lbc_lnk` calls have been removed as per #2367 (clean-up of communications).

__`prt_ctl` subroutine__

This code prints the sums of the input arrays over the processor domain, minus the sums calculated by the previous `prt_ctl` call. It would be possible to implement tiling for this subroutine by aggregating the result over tiles, similar to the approach taken with `dia_ptr`. However, this was difficult to implement cleanly, and bit-level differences in the global sum could not be avoided when using tiling. I have taken a simpler approach: if tiling is used, the sum is output for each tile rather than for the processor domain. The `prt_ctl` utility can therefore be used to diagnose differences on a tile-by-tile basis per processor. However, the `prt_ctl` output cannot be compared between a simulation with tiling and one without.

__`timing` module__

The timing utility already works with the tiling; the only change is to ensure that the call iteration counter is incremented once for all tiles.

__Trends diagnostics__

Tiling has not yet been implemented in these diagnostics, meaning that tiling has to be disabled for the various `trd_tra` calls throughout the TRA modules. Rather than add IF statements around each of these calls, I simply set `ln_tile = .false.` in `trd_init` if the trends diagnostics are used.

==== Temporary workarounds

This code has been marked with a `! TEMP: [tiling]` comment.

__`iom_put` calls__

XIOS does not currently support tiling, so the data must be complete (i.e. all tiles must have finished) at the time of an `iom_put` call. The general workaround for this is to call `iom_put` only on the last tile.

Additional workarounds are required for some local working arrays, which are not preserved between subsequent calls to the subroutine. Some code was rearranged in order to calculate the diagnostic and call `iom_put` at the same time on the last tile (`traldf_triad.F90`). In other cases this was not possible, so the working arrays were declared with `SAVE` so that they could be processed by each tile (`traadv.F90`, `tramle.F90`). In all cases, it is necessary to declare the working arrays with the size of the full domain (`DIMENSION(jpi,jpj)`) instead of the tile (`DIMENSION(A2D(nn_hls))`).
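As an illustration of this pattern, the following sketch combines the `SAVE` workaround with the last-tile `iom_put` call (the `zdiag`/`zwork` arrays and the diagnostic name are hypothetical):

{{{
#!fortran
! TEMP: [tiling] declared with the full domain size and SAVE, so that the
! array persists while each tile fills its own part
REAL(wp), DIMENSION(jpi,jpj), SAVE :: zdiag

DO_2D( 0, 0, 0, 0 )
   zdiag(ji,jj) = zwork(ji,jj)                  ! zwork: hypothetical tile-level result
END_2D

IF( ntile == 0 .OR. ntile == nijtile ) THEN     ! TEMP: [tiling] data complete on the last tile
   CALL iom_put( 'zdiag2d', zdiag )
ENDIF
}}}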
XIOS support for tiling is expected to be fully implemented sometime in December-January; at that point, all of the `iom_put` workarounds can be removed. A preliminary development branch has been provided for testing by Olga Abramkina. It seems likely that the workarounds would need to stay in place for the merge, as I presume that we would not want to merge based on a development version of XIOS. However, the workarounds could certainly be removed post-merge, before the 4.2 release.

__`lbc_lnk` calls__

Similar to `iom_put` calls, `lbc_lnk` can only be called once all tiles have finished and the data is complete. The general workaround is the same as for `iom_put`: call `lbc_lnk` only on the last tile. This is done in a few cases (`tranpc.F90`, `traqsr.F90`), but it was often cleaner to simply disable the tiling, given the frequency of `lbc_lnk` calls. This has been implemented in `tra_adv` for all schemes except 2nd order centred advection (`ln_traadv_cen = .true.` with `nn_cen_h = 2`), in `tra_ldf` for all bi-laplacian schemes, and for calls to `zps_hde`. It was also necessary to split the tiling loop in `step.F90` so that the first loop ended before `tra_adv`, in order to preserve results.

Most of the `lbc_lnk` calls are removed in the `nn_hls = 2` case by #2367 (clean-up of communications), which will be merged with the tiling branch to form the basis of this year’s merge party. #2367 also removes several `lbc_lnk` calls that were used to set the halo points on arrays being passed to `iom_put`; these have already been removed from the tiling branch. Since tiling will only be used with `nn_hls = 2`, most of the remaining `lbc_lnk` workarounds should then be removable. However, there are a number of changes in #2367 that prevent tiling, such as the addition of `lbc_lnk` calls in `tra_adv` and `zps_hde`, as well as the expansion of the bounds of some DO loops so that they work on the halo (see the “DO loop bounds” section above). Removal of the workarounds therefore depends on whether these issues can be resolved before the merge.
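For example, the last-tile form of this workaround looks roughly as follows (the array and arguments are illustrative, loosely based on the `traqsr.F90` case):

{{{
#!fortran
IF( ntile == 0 .OR. ntile == nijtile ) THEN     ! TEMP: [tiling] exchange halos once data is complete
   CALL lbc_lnk( 'traqsr', qsr_hc(:,:,:), 'T', 1._wp )
ENDIF
}}}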
==== List of new variables and functions (excluding local)

* Global variables (`par_oce.F90`, `dom_oce.F90`)
  * `ntsi`, `ntsj` - start index of tile
  * `ntei`, `ntej` - end index of tile
  * `ntsi_a`, `ntsj_a` - start indices of each tile
  * `ntei_a`, `ntej_a` - end indices of each tile
  * `ntile` - current tile number
  * `nijtile` - number of tiles
* Module variables
  * `hstr_adv`, `hstr_ldf` (`diaar5.F90`) - saved transports
  * `pvtr_int`, `pzon_int` (`diaptr.F90`) - zonal integrals
  * `jp_msk`, `jp_vtr` (`diaptr.F90`) - indices for `pvtr_int` & `pzon_int`
  * `nnpcc` (`tranpc.F90`) - replaces local variable `inpcc`
* Namelist `namtile` (`dom_oce.F90`)
  * `ln_tile` - logical control on use of tiling
  * `nn_ltile_i`, `nn_ltile_j` - tile length
* Pre-processor macros (`do_loop_substitute.h90`)
  * `A1Di`/`A1Dj`/`A2D` - substitutions for ALLOCATE or DIMENSION arguments
  * `A1Di_T`/`A1Dj_T`/`A2D_T` - substitutions for ALLOCATE or DIMENSION arguments when the shape of the array is unknown
  * `JPK`/`JPTS`/`KJPT` - placeholders for `:` to preserve readability
* Functions and subroutines
  * `dom_tile` (`domain.F90`) - calculate/set tiling variables
  * `is_tile` (`domutl.F90`) - returns 0 if the array has the dimensions of the full domain, else 1
  * `ptr_sum` (`diaptr.F90`) - sum `ptr_sj` zonal integrals over tiles and processors to get the total
  * The following subroutines have been renamed with a `_t` suffix; the original name is now a wrapper that calls the `_t` version:
    * `eos_insitu`, `eos_insitu_pot`, `eos_insitu_2d`, `rab_3d`, `rab_2d`, `bn2`, `eos_fzp_2d` (`eosbn2.F90`)
    * `tra_ldf_iso` (`traldf_iso.F90`)
    * `tra_ldf_lap` (`traldf_lap_blp.F90`)
    * `tra_ldf_triad` (`traldf_triad.F90`)
    * `prt_ctl` (`prtctl.F90`)
    * `zps_hde`, `zps_hde_isf` (`zpshde.F90`)

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).
}}}

''...''

== Preview

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

=== SETTE

SETTE passes all tests (including with tiling turned on) and compares with the trunk. The Intel compiler (ifort 18.0.5 20180823) was used with XIOS in detached mode. The Cray compiler was not used, due to #2394 (the new Cray compiler does not work with the new way of reading namelists) and because older Cray compilers raise various errors.

==== Regular checks

* Can this change be shown to produce the expected impact (option activated)? __YES__
* Can this change be shown to have a null impact (option not activated)? __YES__
* Have the required bit comparability tests been run: are there no differences when activating the development? __YES (SETTE), NO (other tests)__
* If some differences appear, is the reason for the change valid/understood? __YES (see known failures)__
* If some differences appear, is the impact as expected on model configurations? __YES__
* Is this change expected to preserve all diagnostics? __NO (see known failures)__
* If no, is the reason for the change valid/understood? __YES__
* Are there significant changes in run time/memory? __NO__

==== Detailed SETTE results

Tiling is turned on in this report.

{{{
Current code is : NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling @ r13745 ( last change @ r13745 )

SETTE validation report generated for :
    NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling @ r13745+ (last changed revision)
    on XC40_METO_IFORT arch file

!!---------------1st pass------------------!!

!----restart----!
WGYRE_PISCES_ST        run.stat    restartability  passed : 13745+
WGYRE_PISCES_ST        tracer.stat restartability  passed : 13745+
WORCA2_ICE_PISCES_ST   run.stat    restartability  passed : 13745+
WORCA2_ICE_PISCES_ST   tracer.stat restartability  passed : 13745+
WORCA2_OFF_PISCES_ST   tracer.stat restartability  passed : 13745+
WAMM12_ST              run.stat    restartability  passed : 13745+
WORCA2_SAS_ICE_ST      run.stat    restartability  passed : 13745+
WAGRIF_DEMO_ST         run.stat    restartability  passed : 13745+
WWED025_ST             run.stat    restartability  passed : 13745+
WISOMIP+_ST            run.stat    restartability  passed : 13745+
WOVERFLOW_ST           run.stat    restartability  passed : 13745+
WLOCK_EXCHANGE_ST      run.stat    restartability  passed : 13745+
WVORTEX_ST             run.stat    restartability  passed : 13745+
WICE_AGRIF_ST          run.stat    restartability  passed : 13745+

!----repro----!

WGYRE_PISCES_ST        run.stat    reproducibility passed : 13745+
WGYRE_PISCES_ST        tracer.stat reproducibility passed : 13745+
WORCA2_ICE_PISCES_ST   run.stat    reproducibility passed : 13745+
WORCA2_ICE_PISCES_ST   tracer.stat reproducibility passed : 13745+
WORCA2_OFF_PISCES_ST   tracer.stat reproducibility passed : 13745+
WAMM12_ST              run.stat    reproducibility passed : 13745+
WORCA2_SAS_ICE_ST      run.stat    reproducibility passed : 13745+
WORCA2_ICE_OBS_ST      run.stat    reproducibility passed : 13745+
WAGRIF_DEMO_ST         run.stat    reproducibility passed : 13745+
WWED025_ST             run.stat    reproducibility passed : 13745+
WISOMIP+_ST            run.stat    reproducibility passed : 13745+
WVORTEX_ST             run.stat    reproducibility passed : 13745+
WICE_AGRIF_ST          run.stat    reproducibility passed : 13745+

!----agrif check----!

ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 13745+ 13745+

!----result comparison check----!

check result differences between :
VALID directory     : /home/d00/hadcv/cylc-run/u-bs939/share/sette/output/NEMO_VALIDATION at rev 13745+
and REFERENCE directory : /home/d00/hadcv/cylc-run/u-bs939/share/sette_ref/output/NEMO_VALIDATION at rev 13688

WGYRE_PISCES_ST        run.stat    files are identical
WGYRE_PISCES_ST        tracer.stat files are identical
WORCA2_ICE_PISCES_ST   run.stat    files are identical
WORCA2_ICE_PISCES_ST   tracer.stat files are identical
WORCA2_OFF_PISCES_ST   tracer.stat files are identical
WAMM12_ST              run.stat    files are identical
WORCA2_SAS_ICE_ST      run.stat    files are identical
WAGRIF_DEMO_ST         run.stat    files are identical
WWED025_ST             run.stat    files are identical
WISOMIP+_ST            run.stat    files are identical
WVORTEX_ST             run.stat    files are identical
WICE_AGRIF_ST          run.stat    files are identical
WOVERFLOW_ST           run.stat    files are identical
WLOCK_EXCHANGE_ST      run.stat    files are identical
}}}

=== Development testing

A configuration based on ORCA2_ICE_PISCES (without `key_si3` or `key_top`) was used to test code modified by the tiling development. To facilitate cleaner testing, `ln_trabbc`, `ln_trabbl`, `ln_icebergs`, `ln_rnf`, `ln_ssr`, `ln_tradmp`, `ln_ldfeiv`, `ln_traldf_msc`, `ln_mle`, `ln_zdfddm` and `ln_zdfiwm` were all set to false. `ln_qsr_2bd` was used instead of `ln_qsr_rgb`, `nn_fsbc` was set to 1, and `nn_ice` and `nn_fwb` were set to 0.

Simulations of the tiling branch were run for 10 days with 1-day diagnostic output, for all scientific options relevant to the affected code. Each simulation was repeated with tiling turned on, using square tile sizes of 5 and 50 (the latter being equivalent to one tile over the full domain). Additionally, all simulations (including those with tiling) were repeated with `nn_hls = 2`.
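For reference, the tiled runs therefore used `namtile` settings of the following form (the tile length of 5 is shown; 50 was used for the one-tile case):

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .true.   ! Use tiling (T) or not (F)
   nn_ltile_i = 5        ! Length of tiles in i
   nn_ltile_j = 5        ! Length of tiles in j
/
}}}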
run.stat and diagnostic output were compared with equivalent simulations of the trunk, and with 10-day simulations of the tiling branch run as two 5-day submissions (i.e. testing for restartability).

Version 8.3.4 of the Cray compiler was used with XIOS 2.5 revision 1565. A `jpni = 4`, `jpnj = 8` decomposition was used with 6 XIOS processors.

=== Known failures in development tests

* `ln_trabbl = .true.` with `nn_bbl_adv > 0` gives different results when using tiling

  This is due to a change in the order of computation of `pt_rhs` in `tra_bbl_adv` when using the tiling. The loop over i is broken up by the tiles, which causes the up-slope and down-slope contributions to be added in a different order at the intersections between tiles.

* Some trends diagnostics have slightly different values at one point on the northfold in the development branch

  This may be because I removed an `lbc_lnk` from `tra_zdf`, similar to the effect of removing the `lbc_lnk` for `utr_bbl`/`vtr_bbl` in `tra_bbl`. It could be restored if necessary, since we do not use tiling with the trends diagnostics.

* NEMO fails with a `stp_ctl` error when using `ln_traldf_hor = .true.` or `ln_zdfosm = .true.`, in both the development branch and the trunk

  I assume this is related to my configuration, as I did not have this issue when testing in GYRE.

=== Untested code

* Code (e.g. `zps_hde_isf`) that requires `ln_isfcav = .true.`
* Code in `diaptr.F90` for the `uocetr_vsum_cumul` diagnostic (`ptr_ci_2d` subroutine)

  XIOS hangs when trying to output this diagnostic, which may be due to the XIOS library being too old.

* Code in `tra_adv` that requires `ln_wave .AND. ln_sdw = .true.`
* Code in `tra_adv` that requires `ln_vvl_ztilde .OR. ln_vvl_layer = .true.`
* Code in `tra_asm_inc` that requires `ln_asmdin = .true.`

  I did not have the required input file (assim_background_state_DI.nc) for ORCA2, although I was able to make an idealised one for testing in GYRE.

* Code in `tra_asm_inc` that requires `ln_temnofreeze = .true.` or `ln_seaiceinc = .true.`

  These logicals are hard-coded as false.

== Review

Reviewer: Italo Epicoco (CMCC)

Date: 30/11/2020

* Is the proposed methodology now implemented? YES
* Are the code changes in agreement with the flowchart defined at preview step? YES
* Are the code changes in agreement with the list of routines and variables as proposed at preview step? YES; only the files included in the TRA module have been changed
* Is the in-line documentation accurate and sufficient? YES
* Do the code changes comply with NEMO coding standards? YES
* Is the development documented with sufficient details for others to understand the impact of the change? YES
* Is the project literature (manual, guide, web, …) now updated or completed following the proposed summary in the preview section? NO. The documentation about tiling should refer to the usage of the `ln_tile`, `nn_ltile_i` and `nn_ltile_j` variables added in the namelist. Moreover, some hints should be given to the user to guide an efficient choice of the values for `nn_ltile_i` and `nn_ltile_j`.
* Is the review fully successful? YES