New URL for NEMO forge!   http://forge.nemo-ocean.eu

Since March 2022 along with NEMO 4.2 release, the code development moved to a self-hosted GitLab.
This present forge is now archived and remained online for history.
2020WP/HPC-02_Daley_Tiling (diff) – NEMO

Changes between Version 10 and Version 11 of 2020WP/HPC-02_Daley_Tiling


Ignore:
Timestamp:
2020-11-18T18:37:34+01:00 (3 years ago)
Author:
hadcv
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • 2020WP/HPC-02_Daley_Tiling

    v10 v11  
    2222=== Description 
    2323 
    24 Implement loop tiling over horizontal dimensions (i and j). 
    25  
    26 === Implementation 
    27  
    28 As of 24/09/20, most of the code called by the "active tracers" part of the step subroutine (between `trc_stp` and `tra_atf`) has been tiled. Solutions and workarounds for the issues encountered to date are described in [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf this document]. A progress summary can be found [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_progress_summary_240920.pdf here]. 
    29  
    30 The tiling implementation has been tested using GYRE in benchmark mode with mono-processor and MPI configurations. The tests comprise 10 day simulations using different tile decompositions (including no tiling) and different science options particular to the tiled modules. A test passes if the tiling does not change results at the bit level (`run.stat`) or in the diagnostics.  
    31  
    32 __Summary of method__ 
    33  
    34 The full processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains.  
    35 This is implemented by:  
    36  
    37 '''1. Modifying the DO loop macros in `do_loop_substitute.h90` to use the tile bounds''' 
    38  
    39   The tile domain is defined by a new set of domain indices (`ntsi`, `ntei`, `ntsj`, `ntej`), which represent the internal part of the domain: 
    40  
    41   {{{ 
    42   #!diff 
    43   - #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T)   ;   DO ji = Nis0-(L), Nie0+(R) 
    44   + #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T)   ;   DO ji = ntsi-(L), ntei+(R) 
    45   }}} 
    46  
    47   A new subroutine `dom_tile` (in `domain.F90`) sets the values of these indices.  
    48  
    49   During initialisation, this subroutine calculates and stores the indices in global arrays (`ntsi_a`, `ntei_a`, `ntsj_a`, `ntej_a`) with lengths equal to the number of tiles (`nijtile`) plus one. The zero index is used to store the indices for the full domain: 
    50  
    51   {{{ 
    52   #!fortran 
    53   ntsi_a(0) = Nis0 
    54   ntsj_a(0) = Njs0 
    55   ntei_a(0) = Nie0 
    56   ntej_a(0) = Nje0 
    57   }}} 
    58  
    59   `dom_tile` is called whenever the active tile needs to be set or if tiling needs to be disabled: 
    60  
    61   {{{ 
    62   #!fortran 
    63   CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 ) ! Work on tile 3 
    64   CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 ) ! Work on the full domain 
    65   }}} 
    66  
    67 '''2. Declaring SUBROUTINE-level arrays using the tile bounds''' 
    68  
    69   A new set of substitution macros in `do_loop_substitute.h90`: 
    70  
    71   {{{ 
    72   #define ST_1Di(H) ntsi-H:ntei+H 
    73   #define ST_1Dj(H) ntsj-H:ntej+H 
    74   #define ST_2D(H) ST_1Di(H),ST_1Dj(H) 
    75   }}} 
    76  
    77   replaces references to the full domain in explicit shape and allocatable array declarations: 
    78  
    79   {{{ 
    80   #!diff 
    81   - ALLOCATE(jpi,jpj      ) DIMENSION(jpi,jpj      ) 
    82   + ALLOCATE(ST_2D(nn_hls)) DIMENSION(ST_2D(nn_hls)) 
    83   }}} 
    84  
    85   These arrays then have the same dimensions as the tile if tiling is used, otherwise they will have the same dimensions as the full domain as before. Furthermore, the tile-sized arrays are declared with lower and upper bounds corresponding to the position of the tile in the full domain. Horizontal indices, for example in DO loops, will therefore apply to both tile- and full-sized arrays: 
    86  
    87   {{{ 
    88   #!fortran 
    89   ! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9 
    90   REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d 
    91   REAL(wp), DIMENSION(jpi,jpj) :: a2d 
    92  
    93   DO_2D(1,1,1,1) 
    94     z2d(ji,jj) = a2d(ji,jj) 
    95   END_2D 
    96   }}} 
    97  
    98   This substitution is made for local working arrays where possible to minimise memory consumption when using tiling.  
    99   No further changes are generally required, except in specific cases described in [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf this document] and other common cases described in steps 5 & 6 below. 
     24Implement tiling over horizontal dimensions (i and j). 
     25 
     26=== Branch 
     27 
     28[https://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling dev_r13383_HPC-02_Daley_Tiling] 
     29 
     30* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fsrc&old=13688%40NEMO%2Ftrunk%2Fsrc&sfp_email=&sfph_mail= Changes in src] 
     31* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fcfgs&old=13688%40NEMO%2Ftrunk%2Fcfgs&sfp_email=&sfph_mail= Changes in cfgs] 
     32* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Ftests&old=13688%40NEMO%2Ftrunk%2Ftests&sfp_email=&sfph_mail= Changes in tests] 
     33 
     34=== Summary of the tiling method 
     35 
     36The processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains, which are iterated over asynchronously within the timestepping loop. 
     37 
     38These tile domains are defined by a new set of indices representing the internal part of the domain (`ntsi`/`ntei`/`ntsj`/`ntej`).  
     39These indices replace those of the processor domain (`Nis0`/`Nie0`/`Njs0`/`Nje0`) in DO loops, array shape declarations, and other appropriate places. 
     40 
     41These changes are implemented via new and existing CPP macros, allowing the tiling to be implemented with relatively few changes at lower levels. 
     42A number of temporary workarounds are required to preserve results, but will be removed before the 2020 merge party or as part of the 2021 work plan. 
     43 
     44=== Details of the implementation 
     45 
     46==== Namelist 
     47 
     48In `dom_nam` (`domain.F90`) a new namelist (`namtile`) is read and control prints written to ocean.output: 
     49 
     50    {{{ 
     51    !----------------------------------------------------------------------- 
     52    &namtile        !   parameters of the tiling 
     53    !----------------------------------------------------------------------- 
     54       ln_tile = .false.     !  Use tiling (T) or not (F) 
     55       nn_ltile_i = 10       !  Length of tiles in i 
     56       nn_ltile_j = 10       !  Length of tiles in j 
     57    / 
     58    }}} 
     59 
     60These variables are declared in `dom_oce.F90`. 
     61 
     62==== Setting the tile domain 
     63 
     64A new subroutine `dom_tile` (`domain.F90`) sets the values of the tile indices (`ntsi`/`ntei`/`ntsj`/`ntej`) and the active tile number (`ntile`). 
     65 
     66During initialisation, this subroutine calculates the number of tiles (`nijtile`) and the tile indices, which are stored in public arrays (`ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a`) with lengths equal to `nijtile + 1`. 
     67When `dom_tile` is otherwise called, these arrays are used to set the tiling indices for the current tile (e.g. `ntsi = ntsi_a(ntile)`). 
     68 
     69`ntile = 0` indicates that tiling is disabled, i.e. the full domain is to be used. 
     70 
     71`dom_tile` is called whenever the active tile needs to be set, if tiling needs to be disabled and for initialisation (in `dom_init`): 
     72 
     73    {{{ 
     74    #!fortran 
     75    CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 ) ! Work on tile 3 
     76    CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 ) ! Work on the full domain 
     77    CALL dom_tile( ntsi, ntsj, ntei, ntej )          ! Initialisation (implies ktile=0) 
     78    }}} 
     79 
     80Variables `ntsi`/`ntei`/`ntsj`/`ntej`/`nijtile` are declared in `par_oce.F90`. 
     81Variables `ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a` are declared in `dom_oce.F90` 
     82    
     83==== Changes to CPP macros 
     84 
     85In `do_loop_substitute.h90`, the DO loop macros are modified to instead use the tiling indices: 
     86 
     87    {{{ 
     88    #!diff 
     89    - #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T)   ;   DO ji = Nis0-(L), Nie0+(R) 
     90    + #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T)   ;   DO ji = ntsi-(L), ntei+(R) 
     91    }}} 
     92 
     93A number of new macros have been added that replace `jpi`/`jpj` in DIMENSION and ALLOCATE statements (see “Local working array declarations” section below): 
     94 
     95    {{{ 
     96    #define A1Di(H) ntsi-H:ntei+H                # H is equivalent to B/T/L/R in DO loop macros 
     97    #define A1Dj(H) ntsj-H:ntej+H 
     98    #define A2D(H) A1Di(H),A1Dj(H) 
     99 
     100    #define A1Di_T(T) (ntsi-nn_hls-1)*T+1:       # T is 1 (= ntsi:) or 0 (= 1:) 
     101    #define A1Dj_T(T) (ntsj-nn_hls-1)*T+1: 
     102    #define A2D_T(T) A1Di_T(T),A1Dj_T(T) 
     103 
     104    #define JPK  :                                
     105    #define JPTS  : 
     106    #define KJPT  : 
     107    }}} 
     108 
     109The purpose of the `A1Di`/`A1Dj`/`A2D` macros is to allow local working arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use. 
     110Furthermore, the tile-sized arrays will be declared with lower and upper bounds corresponding to the position of the tile in the full domain.  
     111Horizontal indices, for example in DO loops, will therefore apply to both tile- and full-sized arrays: 
     112 
     113    {{{ 
     114    #!fortran 
     115    ! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9 
     116    REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d 
     117    REAL(wp), DIMENSION(jpi,jpj) :: a2d 
     118 
     119    DO_2D(1,1,1,1) 
     120      z2d(ji,jj) = a2d(ji,jj) 
     121    END_2D 
     122    }}}     
     123 
     124The `A1Di_T`/`A1Dj_T`/`A2D_T` macros are assumed-shape versions of the `A1Di`/`A1Dj`/`A2D` macros, used in dummy array argument declarations where the shape of the actual array argument is inconsistent between calls to the subroutine (see “Local working array declarations” section below). 
     125 
     126The `JPK`/`JPTS`/`KJPT` macros are used where explicit-shape declarations have been replaced by assumed-shape declarations. 
     127Their only purpose is to preserve readability. 
     128 
     129==== Changes at the timestepping level 
     130 
     131The looping over tiles occurs in the `stp` subroutine.  
     132The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration.  
     133After exiting the loop (and before, during initialisation) the tiling is disabled (`ntile == 0`): 
     134 
     135    {{{ 
     136    #!fortran 
     137    ! Loop over tile domains 
     138    DO jtile = 1, nijtile 
     139       IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile ) 
    100140   
    101 '''3. Looping over tiles at the timestepping level''' 
    102  
    103   A loop over tiles has been added to `stp`. The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration. After exiting the loop (and before, during initialisation) the tiling is suppressed (`ntile == 0`): 
    104  
    105   {{{ 
    106   #!fortran 
    107   ! Loop over tile domains 
    108   DO jtile = 1, nijtile 
    109      IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile ) 
    110  
    111      CALL tra_ldf( kstp, Nbb, Nnn, ts, Nrhs )  ! lateral mixing 
    112   END DO 
    113  
    114   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )        ! Revert to full domain 
    115   }}} 
    116  
    117   DO loops within the tiling loop therefore work on the current tile, while those outside the tiling loop work on the full domain. 
    118  
    119 '''4. A new namelist (`namtile`)''' 
    120  
    121 {{{ 
    122    !----------------------------------------------------------------------- 
    123    &namtile        !   parameters of the tiling 
    124    !----------------------------------------------------------------------- 
    125       ln_tile = .false.     !  Use tiling (T) or not (F) 
    126       nn_ltile_i = 10       !  Length of tiles in i 
    127       nn_ltile_j = 10       !  Length of tiles in j 
    128    / 
    129 }}} 
    130  
    131   The number of tiles is calculated from the tile lengths, `nn_ltile_i` and `nn_ltile_j`, with respect to the full domain. 
    132  
    133 '''5. Replacing `:` subscripts with a DO loop macro where appropriate''' 
    134  
    135   This is only necessary when step 2 would introduce conformance issues: 
    136  
    137   {{{ 
    138   #!diff 
    139   - REAL(wp), DIMENSION(jpi,jpj,jpk)   :: a3d 
    140   - REAL(wp), DIMENSION(jpi,jpj)       :: z2d 
    141   - z2d(:,:) = a3d(:,:,1). 
    142   + REAL(wp), DIMENSION(jpi,jpj,jpk)   :: a3d 
    143   + REAL(wp), DIMENSION(ST_2D(nn_hls)) :: z2d 
    144   + DO_2D(1,1,1,1) 
    145   +    z2d(ji,jj) = a3d(ji,jj,1) 
    146   + END_2D 
    147   }}} 
    148  
    149 '''6. Suppressing code that should not be called more than once per timestep''' 
    150  
    151   Examples include ocean.output write statements and initialisation steps outside of an "_ini" routine. 
    152  
    153 __Branch__ 
    154  
    155 [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling] 
    156  
    157 __New subroutines__ 
    158  
    159 * `OCE/DOM/domain/dom_tile`- Calculate/set tiling variables (domain indices, number of tiles) 
    160  
    161 __Modified modules__ 
    162  
    163 * `cfgs/SHARED/namelist_ref`- Add `namtile` namelist 
    164 * `OCE/DOM/dom_oce`- Declare tiling namelist and other tiling variables 
    165 * `OCE/DOM/domain`- Read `namtile` namelist (`dom_nam`), calculate tiling variables and do control print (`dom_tile`) 
    166 * `OCE/DOM/domutl`- `is_tile` functions 
    167 * `OCE/do_loop_substitute`- Modify DO loop macro to use domain indices, add CPP macros 
    168 * `OCE/par_oce`- Declare tiling variables 
    169 * `OCE/step`- Add tiling loop 
    170 * `OCE/step_oce`- Add USE statement for `dom_tile` in `step` 
    171 * Various others.. 
    172  
    173 __New variables (excluding local)__ 
    174  
    175 * Global variables 
    176   * `ntsi`, `ntsj`- start index of tile 
    177   * `ntei`, `ntej`- end index of tile 
    178   * `ntsi_a`, `ntsj_a`- start indices of each tile 
    179   * `ntei_a`, `ntej_a`- end indices of each tile 
    180   * `ntile`- current tile number 
    181   * `nijtile`- number of tiles 
    182 * Namelist (`namtile`) 
    183   * `ln_tile`- logical control on use of tiling 
    184   * `nn_ltile_i`, `nn_ltile_j`- tile length 
    185 * Pre-processor macros 
    186   * `ST_*D`- substitutions for ALLOCATE or DIMENSION arguments 
    187   * `ST_*DT`- substitutions for ALLOCATE or DIMENSION arguments when the shape of the array is unknown 
    188 * Functions 
    189   * `is_tile`- Returns 0 if the array has the dimensions of the full domain, else 1 
     141       ! Tiled region of code 
     142    END DO 
     143 
     144    IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )        ! Revert to full domain 
     145    }}} 
     146 
     147The tiled code currently encompasses the active tracers (TRA) region of `stp`. 
     148The loop over tiles must currently be broken into two separate loops to preserve results.  
     149This is due to the temporary workaround implemented in `tra_adv`, which disables tiling for certain options and therefore can change the order in which the tracer trends are updated. 
     150 
     151==== General changes at the module and subroutine levels 
     152 
     153  __DO loop bounds__ 
     154 
     155  Each tile has an internal area and overlapping halo, but unlike the MPP domain the halo points are not set by `lbc_lnk`.  
     156  The internal part of a tile may therefore be partly overwritten by the halo of an adjacent tile, which will change results. 
     157  In these cases, DO loops must work on the internal part of the tile only. 
     158 
     159  This is generally only an issue for persistent variables (i.e. declared at the module level, or with `SAVE`), e.g. when zeroing an array: 
     160 
     161    {{{ 
     162    #!diff 
     163    - DO_3D_11_11(1, jpk) 
     164    -    akz(ji,jj,jk) = 0._wp 
     165    - END_3D 
     166    + DO_3D_00_00(1, jpk) 
     167    +    akz(ji,jj,jk) = 0._wp 
     168    + END_3D 
     169    }}} 
     170 
     171  or where an array appears on both sides of the assignment: 
     172 
     173    {{{ 
     174    #!diff 
     175    - DO_2D( 0, 1, 0, 0 ) 
     176    + DO_2D( 0, 0, 0, 0 ) 
     177         pts(ji,jj,1,jn,Krhs) = pts(ji,jj,1,jn,Krhs) + zfact * ( sbc_tsc_b(ji,jj,jn) + sbc_tsc(ji,jj,jn) ) / e3t(ji,jj,1,Kmm) 
     178    }}} 
     179 
     180  In some cases (e.g. `tra_bbl_adv` in `trabbl.F90`) DO loops must also work on the outer halo of a processor domain, which requires a slightly different approach: 
     181 
     182    {{{ 
     183    #!diff 
     184    - DO_2D( 1, 0, 1, 0 ) 
     185    + IF( ntsi == Nis0 ) THEN ; isi = 1 ; ELSE ; isi = 0 ; ENDIF    ! Avoid double-counting when using tiling 
     186    + IF( ntsj == Njs0 ) THEN ; isj = 1 ; ELSE ; isj = 0 ; ENDIF 
     187    + DO_2D( isi, 0, isj, 0 ) 
     188    }}} 
     189     
     190  __Local working array declarations__ 
     191 
     192  The new CPP macros in `do_loop_substitute.h90` replace references to the full domain in explicit shape declarations for local working arrays: 
     193     
     194    {{{ 
     195    #!diff 
     196    - ALLOCATE(jpi,jpj      ) DIMENSION(jpi,jpj      ) 
     197    + ALLOCATE(A2D(nn_hls)  ) DIMENSION(A2D(nn_hls)  ) 
     198    }}} 
     199 
     200  This allows the arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use. 
     201 
     202  This approach does not work for subroutines that are called with actual array arguments of varying shape; an assumed-shape declaration must be used instead.  
     203  It is necessary to use the form specifying lower bounds (i.e. `DIMENSION(start_i:, start_j:)`), as this information is required to correctly index the array when tiling is used. 
     204  However, these bounds must be passed as additional arguments to the subroutine. 
     205 
     206  To avoid widespread changes, the subroutine is replaced by a wrapper subroutine that calculates the bounds and passes them to the original subroutine. 
     207  For example: 
     208 
     209    {{{ 
     210    #!diff 
     211    - SUBROUTINE eos_insitu( pts, prd, pdep ) 
     212    -    REAL(wp), DIMENSION(jpi,jpj,jpk,jpts), INTENT(in   ) ::   pts 
     213    -    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(  out) ::   prd 
     214    -    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(in   ) ::   pdep 
     215    + SUBROUTINE eos_insitu( pts, prd, pdep )  
     216    +    REAL(wp), DIMENSION(:,:,:,:), INTENT(in   ) ::   pts 
     217    +    REAL(wp), DIMENSION(:,:,:)  , INTENT(  out) ::   prd 
     218    +    REAL(wp), DIMENSION(:,:,:)  , INTENT(in   ) ::   pdep 
     219    +     
     220    +    CALL eos_insitu_t( pts, is_tile(pts), prd, is_tile(prd), pdep, is_tile(pdep) ) 
     221    + END SUBROUTINE eos_insitu 
     222 
     223    + SUBROUTINE eos_insitu_t( pts, ktts, prd, ktrd, pdep, ktdep ) 
     224    +    INTEGER, INTENT(in   ) ::   ktts, ktrd, ktdep 
     225    +    REAL(wp), DIMENSION(A2D_T(ktts) ,JPK,JPTS), INTENT(in   ) ::   pts 
     226    +    REAL(wp), DIMENSION(A2D_T(ktrd) ,JPK     ), INTENT(  out) ::   prd 
     227    +    REAL(wp), DIMENSION(A2D_T(ktdep),JPK     ), INTENT(in   ) ::   pdep 
     228    }}} 
     229 
     230  Here, `is_tile` is an interface of functions returning 1 or 0 depending on the size of the array, e.g.: 
     231 
     232    {{{ 
     233    #!fortran 
     234    FUNCTION is_tile_2d( pt ) 
     235       REAL(wp), DIMENSION(:,:), INTENT(in) ::   pt 
     236       INTEGER :: is_tile_2d 
     237       IF( ln_tile .AND. SIZE(pt, 1) < jpi ) THEN 
     238          is_tile_2d = 1 
     239       ELSE 
     240          is_tile_2d = 0 
     241       ENDIF 
     242    END FUNCTION is_tile_2d 
     243    }}} 
     244 
     245  and `A2D_T` is a version of the `A2D` CPP macro that returns `1:,1:` if `is_tile(array) = 0` (`array` is the size of the full domain) or `ntsi:,ntsj:` if `is_tile(array) = 1` (`array` is the size of the tile). 
     246  The `JPK`/`JPTS` macros each return `:` and are used to preserve readability. 
     247 
     248  These wrappers are added in the following locations: 
     249      * `IOM/prtctl.F90` (`prt_ctl`) 
     250      * `TRA/eosbn2.F90` (various subroutines) 
     251      * `TRA/traldf_iso.F90` (`tra_ldf_iso`) 
     252      * `TRA/traldf_lap_blp.F90` (`tra_ldf_lap`) 
     253      * `TRA/traldf_triad.F90` (`tra_ldf_triad`) 
     254      * `TRA/zpshde.F90` (`zps_hde`, `zps_hde_isf`) 
     255 
     256  __`:` array subscripts__ 
     257 
     258  The above array declaration changes may introduce conformance issues when `:` subscripts are used for indexing, or if no indexing is used at all. 
     259  These are resolved by instead using an equivalent DO loop: 
     260 
     261    {{{ 
     262    #!diff 
     263    - REAL(wp), DIMENSION(jpi,jpj,jpk)   :: a3d 
     264    - REAL(wp), DIMENSION(jpi,jpj)       :: z2d 
     265    - z2d(:,:) = a3d(:,:,1). 
     266    + REAL(wp), DIMENSION(jpi,jpj,jpk)   :: a3d 
     267    + REAL(wp), DIMENSION(A2D(nn_hls)) :: z2d 
     268    + DO_2D(1,1,1,1) 
     269    +    z2d(ji,jj) = a3d(ji,jj,1) 
     270    + END_2D 
     271    }}} 
     272 
     273  __Code called once per timestep__ 
     274 
     275  Some code should only be called once per timestep (e.g. ocean.output write statements, initialisation steps, reading data from files) but will be called by each tile. 
     276  IF statements are used to suppress these calls and generally take the form: 
     277 
     278    {{{ 
     279    #!fortran 
     280    IF( ntile == 0 .OR. ntile == 1 )  THEN             ! Do only on the first tile 
     281       ! ... 
     282    ENDIF 
     283 
     284    IF( ntile == 0 .OR. ntile == nijtile )  THEN       ! Do only on the last tile 
     285       ! ... 
     286    ENDIF 
     287    }}} 
     288 
     289  Sometimes `dom_tile` must also be called to temporarily disable the tiling: 
     290 
     291    {{{ 
     292    #!fortran 
     293    IF( ntile == 0 .OR. ntile == 1 )  THEN   ! Do only for the full domain 
     294       itile = ntile 
     295       IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = 0 )     ! Use full domain 
     296          CALL fld_read( kt, 1, sf_tsd )     !==   read T & S data at kt time step   ==! 
     297       IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = itile )  ! Revert to tile 
     298    ENDIF 
     299    }}} 
     300 
     301==== Special cases 
     302 
     303  __`diaptr` module__ 
     304 
     305  At present, between `dia_ptr` and `dia_ptr_hst` the following operations are generally performed: 
     306 
     307    1. Calculate the zonal integral 
     308    2. Call `mpp_sum` 
     309    3. Perform additional arithmetic or copy result onto 2D/3D arrays 
     310    4. Call `iom_put` 
     311 
     312  The tiling is not easily implemented here due to steps 2 and 4. 
     313  The `diaptr` module has therefore been largely restructured to accommodate the tiling, with the added bonus of significantly reducing the number of communications. 
     314 
     315    * `dia_ptr` has been split into `dia_ptr_zint` (steps 1-2) and `dia_ptr_iom` (steps 3-4) 
     316       * `dia_ptr_zint` is called for every tile, `dia_ptr_iom` is called after `dia_ptr_zint` has been run for all tiles 
     317    * The zonal integrals that were calculated in `dia_ptr` are now calculated by `dia_ptr_zint` and are stored in two new module arrays, `pvtr_int` and `pzon_int`, which are used by `dia_ptr_iom` 
     318    * Zonal integrals are calculated by calling `ptr_sj` (calculates the integral for the tile) and a new subroutine `ptr_sum` (accumulates the `ptr_sj` integrals and calls `mpp_sum` on the last tile only) 
     319        * The number of communications is reduced from 90 to 13 (data for all basins are exchanged at once) 
     320        * Code relating to `mpp_sum` has been removed from `ptr_sj` 
     321        * Pointers and their target arrays have been removed from `ptr_sj` 
     322    * `dia_ptr_hst` has been condensed by moving the loop over basins out of the IF statements 
     323 
     324  __`dia_ar5_hst` subroutine__ 
     325 
     326  This code contains `iom_put` and `lbc_lnk` calls, which cannot be tiled. 
     327  The code has therefore been rearranged to separate the transport calculations from the `iom_put` calls. 
     328 
     329  The transport calculations are performed for each tile and saved in two new module variables, `hstr_adv` and `hstr_ldf`. 
     330  The `iom_put` calls are called only on the last tile, and the `lbc_lnk` calls have been removed as per #2367 (clean-up of communications). 
     331 
     332  __`prt_ctl` subroutine__ 
     333 
     334  This code prints the sums of the input arrays over the processor domain, minus the sums calculated by the previous `prt_ctl` call. 
     335  It is possible to implement tiling for this subroutine by aggregating the result over tiles, similar to the approach taken with `dia_ptr`. 
     336  However, this was difficult to implement cleanly and bit-level differences in the global sum could not be avoided when using tiling. 
     337     
     338  I have taken a simpler approach where the sum is output for each tile, rather than the processor domain, if tiling is used. 
     339  The `prt_ctl` utility can therefore be used to diagnose differences on a tile-by-tile basis per processor. 
     340  However, the `prt_ctl` output cannot be compared between a simulation with tiling and one without. 
     341 
     342  __`timing` module__ 
     343 
     344  The timing utility already works with the tiling; the only change is to ensure that the call iteration counter is incremented once for all tiles. 
     345 
     346  __Trends diagnostics__ 
     347 
     348  Tiling has not yet been implemented in these diagnostics, meaning that tiling has to be disabled for the various `trd_tra` calls throughout the TRA modules. 
     349 
     350  Rather than add IF statements around each of these calls, I simply set `ln_tile = .false.` in `trd_init` if the trends diagnostics are used. 
     351 
     352==== Temporary workarounds 
     353 
     354This code has been marked with a `! TEMP: [tiling]` comment. 
     355 
     356  __`iom_put` calls__ 
     357 
     358  XIOS does not currently support tiling, so the data must be complete (i.e. all tiles must have finished) at the time of an `iom_put` call. 
     359  The general workaround for this is to call `iom_put` only on the last tile. 
     360  Additional workarounds are required for some local working arrays, which are not preserved between subsequent calls to the subroutine. 
     361 
     362  Some code was rearranged in order to calculate the diagnostic and call `iom_put` at the same time on the last tile (`traldf_triad.F90`). 
     363  In other cases this was not possible, so the working arrays were declared with `SAVE` so that they could be processed by each tile (`traadv.F90`, `tramle.F90`). 
     364  In all cases, it is necessary to declare the working arrays with the size of the full domain (`DIMENSION(jpi,jpj)`) instead of the tile (`DIMENSION(A2D(nn_hls))`). 
     365     
     366  XIOS support for tiling is expected to be fully implemented sometime in December-January. 
     367  At this point all of the `iom_put` workarounds can be removed.  
     368 
     369  A preliminary development branch has been provided for testing by Olga Abramkina. 
     370  It seems likely that the workarounds would need to stay in place for the merge, as I presume that we would not want to merge based on a development version of XIOS. 
     371  However, the workarounds could certainly be removed post-merge before the 4.2 release. 
     372 
     373  __`lbc_lnk` calls__ 
     374 
     375  Similar to `iom_put` calls, `lbc_lnk` can only be called once all tiles have finished and the data is complete. 
     376  The general workaround for this is the same as for `iom_put`; call `lbc_lnk` only on the last tile. 
     377 
     378  This is done in a few cases (`tranpc.F90`, `traqsr.F90`), but often it was cleaner to simply disable the tiling due to the frequency of `lbc_lnk` calls. 
     379  This has been implemented in `tra_adv` for all schemes except 2nd order centred advection (`ln_traadv_cen = .true.` with `nn_cen_h = 2`), in `tra_ldf` for all bi-laplacian schemes, and for calls to `zps_hde`. 
     380  It was also necessary to split the tiling loop in `step.F90` so that the first loop ended before `tra_adv`, in order to preserve results. 
     381 
     382  Most of the `lbc_lnk` calls are removed in the `nn_hls = 2` case by #2367 (clean-up of communications), which will be merged with the tiling branch to form the basis of this year’s merge party. 
     383  #2367 also removes several `lbc_lnk` calls that were used to set the halo points on arrays being passed to `iom_put`; these have already been removed from the tiling branch. 
     384 
     385  Tiling will only be used with `nn_hls = 2`, so most of the remaining workarounds for the `lbc_lnk` calls should be removed. 
     386  However there are a number of changes in #2367 that prevent tiling, such as the addition of `lbc_lnk` calls in `tra_adv` and `zps_hde`, as well as expanding the bounds of some DO loops so that they work on the halo (see the “DO loop bounds” section above). 
     387  Removal of the workarounds therefore depends on whether these issues can be resolved before the merge. 
     388 
     389==== List of new variables and functions (excluding local) 
     390 
     391    * Global variables (`par_oce.F90`, `dom_oce.F90`) 
     392        * `ntsi`, `ntsj`- start index of tile 
     393        * `ntei`, `ntej`- end index of tile 
     394        * `ntsi_a`, `ntsj_a`- start indices of each tile 
     395        * `ntei_a`, `ntej_a`- end indices of each tile 
     396        * `ntile`- current tile number 
     397        * `nijtile`- number of tiles 
     398    * Module variables 
     399        * `hstr_adv`, `hstr_ldf` (`diaar5.F90`)- saved transports 
     400        * `pvtr_int`, `pzon_int` (`diaptr.F90`)- zonal integrals 
     401        * `jp_msk`, `jp_vtr` (`diaptr.F90`)- indices for `pvtr_int` & `pzon_int` 
     402        * `nnpcc` (`tranpc.F90`)- replaces local variable `inpcc` 
     403    * Namelist `namtile` (`dom_oce.F90`) 
     404        * `ln_tile`- logical control on use of tiling 
     405        * `nn_ltile_i`, `nn_ltile_j`- tile length 
     406    * Pre-processor macros (`do_loop_substitute.h90`) 
     407        * `A1Di`/`A1Dj`/`A2D`- substitutions for ALLOCATE or DIMENSION arguments 
     408        * `A1Di_T`/`A1Dj_T`/`A2D_T`- substitutions for ALLOCATE or DIMENSION arguments when the shape of the array is unknown 
     409        * `JPK`/`JPTS`/`KJPT`- placeholders for `:` to preserve readability 
     410    * Functions and subroutines 
     411        * `dom_tile` (`domain.F90`)- Calculate/set tiling variables 
     412        * `is_tile` (`domutl.F90`)- returns 0 if the array has the dimensions of the full domain, else 1 
     413        * `ptr_sum` (`diaptr.F90`)- sum `ptr_sj` zonal integrals over tiles and processors to get total 
     414    * The following subroutines have all been renamed to `<SUBROUTINE>_t`, where `<SUBROUTINE>` is now a wrapper function for `<SUBROUTINE>_t`: 
     415        * `eos_insitu`, `eos_insitu_pot`, `eos_insitu_2d`, `rab_3d`, `rab_2d`, `bn2`, `eos_fzp_2d` (`eosbn2.F90`) 
     416        * `tra_ldf_iso` (`traldf_iso.F90`) 
     417        * `tra_ldf_lap` (`traldf_lap_blp.F90`) 
     418        * `tra_ldf_triad` (`traldf_triad.F90`) 
     419        * `prt_ctl` (`prtctl.F90`) 
     420        * `zps_hde`, `zps_hde_isf` (`zpshde.F90`) 
    190421 
    191422=== Documentation updates