= Name and subject of the action

Last edited: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, and in particular for contacting the NEMO project manager if the preview (or review) takes longer than the expected 2 weeks.

[[PageOutline(2, , inline)]]

== Summary

||=Action || Implement 2D tiling (with the LFRA version of NEMO) ||
||=PI(S) || Daley Calvert, Andrew Coward ||
||=Digest || Implement 2D tiling to reduce traffic between main memory and L3 cache ||
||=Dependencies || DO loop macros ([wiki:2020WP/KERNEL-02_Coward_DoLoopMacros_part1]), extended haloes (Italo Epicoco, Seb Masson and Francesca Mele), extension of XIOS to accept 2D tiles of data (Yann Meurdesoif & Seb Masson) ||
||=Branch || source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} ||
||=Previewer(s) || Gurvan Madec ||
||=Reviewer(s) || Gurvan Madec ||
||=Ticket || #2365 ||

=== Description

Implement loop tiling over the horizontal dimensions (i and j).

=== Implementation

As of 24/09/20, most of the code called by the "active tracers" part of the step subroutine (between `trc_stp` and `tra_atf`) has been tiled.

Solutions and workarounds for the issues encountered to date are described in [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf this document].

The tiling implementation has been tested using GYRE in benchmark mode with mono-processor and MPI configurations. The tests comprise 10-day simulations using different tile decompositions (including no tiling) and different science options particular to the tiled modules. A test passes if the tiling does not change results at the bit level (`run.stat`) or in the diagnostics.

__Summary of method__

The full processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains. This is implemented by:

'''1. Modifying the DO loop macros in `do_loop_substitute.h90` to use the tile bounds'''

The tile domain is defined by a new set of domain indices (`ntsi`, `ntei`, `ntsj`, `ntej`), which represent the internal part of the domain:

{{{
#!diff
- #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T) ; DO ji = Nis0-(L), Nie0+(R)
+ #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T) ; DO ji = ntsi-(L), ntei+(R)
}}}

A new subroutine `dom_tile` (in `domain.F90`) sets the values of these indices.

During initialisation, this subroutine calculates the indices and stores them in global arrays (`ntsi_a`, `ntei_a`, `ntsj_a`, `ntej_a`) with lengths equal to the number of tiles (`nijtile`) plus one. The zero index is used to store the indices for the full domain:

{{{
#!fortran
ntsi_a(0) = Nis0
ntsj_a(0) = Njs0
ntei_a(0) = Nie0
ntej_a(0) = Nje0
}}}

`dom_tile` is called whenever the active tile needs to be set or tiling needs to be disabled:

{{{
#!fortran
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 )   ! Work on tile 3
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Work on the full domain
}}}

'''2. Declaring SUBROUTINE-level arrays using the tile bounds'''

A new set of substitution macros in `do_loop_substitute.h90`:

{{{
#define ST_1Di(H) ntsi-H:ntei+H
#define ST_1Dj(H) ntsj-H:ntej+H
#define ST_2D(H) ST_1Di(H),ST_1Dj(H)
}}}

replaces references to the full domain in explicit-shape and allocatable array declarations:

{{{
#!diff
- ALLOCATE(jpi,jpj      )      DIMENSION(jpi,jpj      )
+ ALLOCATE(ST_2D(nn_hls))      DIMENSION(ST_2D(nn_hls))
}}}

These arrays have the dimensions of the tile if tiling is used; otherwise they have the dimensions of the full domain, as before.

Furthermore, the tile-sized arrays are declared with lower and upper bounds corresponding to the position of the tile in the full domain. Horizontal indices, for example in DO loops, therefore apply to both tile- and full-sized arrays:

{{{
#!fortran
! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9
REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d
REAL(wp), DIMENSION(jpi,jpj) :: a2d

! z2d has no halo points, so loop over the tile interior only
DO_2D(0,0,0,0)
   z2d(ji,jj) = a2d(ji,jj)
END_2D
}}}

This substitution is made for local working arrays where possible, to minimise memory consumption when using tiling.

No further changes are generally required, except in specific cases described in [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf this document] and other common cases described in steps 5 & 6 below.

'''3. Looping over tiles at the timestepping level'''

A loop over tiles has been added to `stp`. The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration. After exiting the loop (and before it, during initialisation) the tiling is suppressed (`ntile == 0`):

{{{
#!fortran
! Loop over tile domains
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   CALL tra_ldf( kstp, Nbb, Nnn, ts, Nrhs )   ! lateral mixing
END DO
IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Revert to full domain
}}}

DO loops within the tiling loop therefore work on the current tile, while those outside the tiling loop work on the full domain.

'''4. A new namelist (`namtile`)'''

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .false.     !  Use tiling (T) or not (F)
   nn_ltile_i = 10          !  Length of tiles in i
   nn_ltile_j = 10          !  Length of tiles in j
/
}}}

The number of tiles is calculated from the tile lengths, `nn_ltile_i` and `nn_ltile_j`, with respect to the full domain.

'''5. Replacing `:` subscripts with a DO loop macro where appropriate'''

This is only necessary when step 2 would introduce conformance issues:

{{{
#!diff
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
- REAL(wp), DIMENSION(jpi,jpj) :: z2d
- z2d(:,:) = a3d(:,:,1)
+ REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
+ REAL(wp), DIMENSION(ST_2D(nn_hls)) :: z2d
+ DO_2D(1,1,1,1)
+    z2d(ji,jj) = a3d(ji,jj,1)
+ END_2D
}}}

'''6. Suppressing code that should not be called more than once per timestep'''

Examples include `ocean.output` write statements and initialisation steps outside of an "_ini" routine.

__Branch__

[http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling]

__New subroutines__

 * `OCE/DOM/domain/dom_tile` - Calculate/set tiling variables (domain indices, number of tiles)

__Modified modules__

 * `cfgs/SHARED/namelist_ref` - Add `namtile` namelist
 * `OCE/DOM/dom_oce` - Declare tiling namelist and other tiling variables
 * `OCE/DOM/domain` - Read `namtile` namelist (`dom_nam`), calculate tiling variables and do control print (`dom_tile`)
 * `OCE/DOM/domutl` - `is_tile` functions
 * `OCE/do_loop_substitute` - Modify DO loop macro to use domain indices, add CPP macros
 * `OCE/par_oce` - Declare tiling variables
 * `OCE/step` - Add tiling loop
 * `OCE/step_oce` - Add USE statement for `dom_tile` in `step`
 * Various others

__New variables (excluding local)__

 * Global variables
   * `ntsi`, `ntsj` - start index of tile
   * `ntei`, `ntej` - end index of tile
   * `ntsi_a`, `ntsj_a` - start indices of each tile
   * `ntei_a`, `ntej_a` - end indices of each tile
   * `ntile` - current tile number
   * `nijtile` - number of tiles
 * Namelist (`namtile`)
   * `ln_tile` - logical control on use of tiling
   * `nn_ltile_i`, `nn_ltile_j` - tile length
 * Pre-processor macros
   * `ST_*D` - substitutions for ALLOCATE or DIMENSION arguments
   * `ST_*DT` - substitutions for ALLOCATE or DIMENSION arguments when the shape of the array is unknown
 * Functions
   * `is_tile` - Returns 0 if the array has the dimensions of the full domain, else 1

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).
}}}

''...''

== Preview

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#tests)]]
}}}

''...''

== Review

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#review)]]
}}}

''...''
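
__Illustrative sketch of the tile index arrays__

The decomposition described in steps 1 and 4 of the method summary can be illustrated with a small stand-alone program. This is only a sketch under stated assumptions, not the `dom_tile` implementation: the program name `tile_sketch`, the example domain bounds, the helper variables (`ntile_i`, `ntile_j`, `iti`, `itj`), the round-up rule for the number of tiles and the i-fastest tile ordering are all hypothetical. Only the array names, `nijtile`, the namelist tile lengths and the use of index 0 for the full domain come from the description on this page.

{{{
#!fortran
PROGRAM tile_sketch
   ! Hypothetical stand-alone illustration; NOT the NEMO dom_tile routine
   IMPLICIT NONE
   ! Assumed internal domain bounds (example values only)
   INTEGER, PARAMETER :: Nis0 = 2, Nie0 = 31, Njs0 = 2, Nje0 = 26
   ! Tile lengths, named as in the namtile namelist
   INTEGER, PARAMETER :: nn_ltile_i = 10, nn_ltile_j = 10
   INTEGER :: ntile_i, ntile_j, nijtile, jtile, iti, itj
   INTEGER, ALLOCATABLE :: ntsi_a(:), ntei_a(:), ntsj_a(:), ntej_a(:)

   ! Assumed rule: round up, so that a partial tile covers any remainder
   ntile_i = ( Nie0 - Nis0 + nn_ltile_i ) / nn_ltile_i
   ntile_j = ( Nje0 - Njs0 + nn_ltile_j ) / nn_ltile_j
   nijtile = ntile_i * ntile_j

   ! Length is the number of tiles plus one; index 0 holds the full domain
   ALLOCATE( ntsi_a(0:nijtile), ntei_a(0:nijtile) )
   ALLOCATE( ntsj_a(0:nijtile), ntej_a(0:nijtile) )
   ntsi_a(0) = Nis0 ; ntei_a(0) = Nie0
   ntsj_a(0) = Njs0 ; ntej_a(0) = Nje0

   DO jtile = 1, nijtile
      iti = MOD( jtile - 1, ntile_i )   ! zero-based tile column (assumed ordering)
      itj = ( jtile - 1 ) / ntile_i     ! zero-based tile row
      ntsi_a(jtile) = Nis0 + iti * nn_ltile_i
      ntsj_a(jtile) = Njs0 + itj * nn_ltile_j
      ! Truncate tiles that would extend past the internal domain
      ntei_a(jtile) = MIN( ntsi_a(jtile) + nn_ltile_i - 1, Nie0 )
      ntej_a(jtile) = MIN( ntsj_a(jtile) + nn_ltile_j - 1, Nje0 )
   END DO

   ! Print the bounds of the full domain (entry 0) and of each tile
   DO jtile = 0, nijtile
      WRITE(*,'(A,I3,4(A,I4))') 'tile', jtile,                   &
         &  ' i:', ntsi_a(jtile), ' to', ntei_a(jtile),          &
         &  ' j:', ntsj_a(jtile), ' to', ntej_a(jtile)
   END DO
END PROGRAM tile_sketch
}}}

With these example values the internal domain is 30 x 25 points, so the assumed round-up rule gives 3 x 3 tiles, and the tiles in the last row are truncated at `Nje0`. Selecting entry `jtile` of the four arrays then corresponds to what a call such as `dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )` is described as doing above.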