= Name and subject of the action

Last edition: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay on preview (or review) is longer than the 2 weeks expected.

[[PageOutline(2, , inline)]]

== Summary

||=Action || Implement 2D tiling (with the LFRA version of NEMO) ||
||=PI(S) || Daley Calvert, Andrew Coward ||
||=Digest || Implement 2D tiling to reduce traffic between main memory and L3 cache ||
||=Dependencies || DO loop macros ([wiki:2020WP/KERNEL-02_Coward_DoLoopMacros_part1]), extended haloes (Italo Epicoco, Seb Masson and Francesca Mele), extension of XIOS to accept 2D tiles of data (Yann Meurdesoif & Seb Masson) ||
||=Branch || source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} ||
||=Previewer(s) || Gurvan Madec ||
||=Reviewer(s) || Gurvan Madec ||
||=Ticket || #2365 ||

=== Description

Implement loop tiling over horizontal dimensions (i and j).

=== Implementation

The current approach to tiling is described below. The issues encountered to date are described in [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf this document].

Several modules have been tiled as of 18/06/20: `tra_ldf`, `tra_zdf`, `tra_adv` and `dia_ptr`.

The tiling implementation has been tested using GYRE with 1 CPU. The tests comprise 10-day simulations using different tile decompositions (including no tiling) and different science options particular to the tiled modules. A test passes if the tiling does not change results at the bit level (`run.stat`) or in the diagnostics.

__Summary of method__

The full processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains. This is implemented by:

'''1. Modifying the DO loop macros in `do_loop_substitute.h90` to use the tile bounds'''

The tile domain is defined by a new set of domain indices (`ntsi`, `ntei`, `ntsj`, `ntej`), which represent the internal part of the domain:

{{{
#!diff
- #define __kIs_ 2
+ #define __kIs_ ntsi
}}}

A new subroutine `dom_tile` (in `domain.F90`) sets the values of these indices. During initialisation, this subroutine calculates and stores the indices in global arrays (`ntsi_a`, `ntei_a`, `ntsj_a`, `ntej_a`) with lengths equal to the number of tiles (`nijtile`) plus one. The zero index is used to store the indices for the full domain:

{{{
#!fortran
ntsi_a(0) = 1 + nn_hls
ntsj_a(0) = 1 + nn_hls
ntei_a(0) = jpi - nn_hls
ntej_a(0) = jpj - nn_hls
}}}

`dom_tile` is called whenever the active tile needs to be set or tiling needs to be suppressed:

{{{
#!fortran
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 )   ! Work on tile 3
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Work on the full domain
}}}

'''2. Declaring SUBROUTINE-level arrays using the tile bounds'''

A new substitution macro in `do_loop_substitute.h90`:

{{{
#define A2D __kIsm1_:__kIep1_,__kJsm1_:__kJep1_
}}}

is used such that:

{{{
#!diff
- ALLOCATE(jpi,jpj)   DIMENSION(jpi,jpj)
+ ALLOCATE(A2D)       DIMENSION(A2D)
}}}

Operations between local working arrays (which have the dimensions of the tile) and global/input arrays (which have the dimensions of either the tile or the full domain) therefore require no further changes, unless they use `:` subscripts as described below.

'''3. Replacing `:` subscripts with a DO loop macro where appropriate'''

This is only necessary when step 2 would introduce conformance issues:

{{{
#!diff
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
- REAL(wp), DIMENSION(jpi,jpj) :: z2d
- z2d(:,:) = a3d(:,:,1)
+ REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
+ REAL(wp), DIMENSION(A2D) :: z2d
+ DO_2D_11_11
+    z2d(ji,jj) = a3d(ji,jj,1)
+ END_2D
}}}

'''4. Looping over tiles at the timestepping level'''

A loop over tiles has been added to `stp`. The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration. After exiting the loop (and before it, during initialisation) tiling is suppressed (`ntile == 0`):

{{{
#!fortran
! Loop over tile domains
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   CALL tra_ldf( kstp, Nbb, Nnn, ts, Nrhs )   ! lateral mixing
END DO
IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Revert to full domain
}}}

DO loops within the tiling loop therefore work on the current tile, while those outside the loop work on the full domain.

'''5. A new namelist (`namtile`)'''

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .false.   !  Use tiling (T) or not (F)
   nn_ltile_i = 10        !  Length of tiles in i
   nn_ltile_j = 10        !  Length of tiles in j
/
}}}

The number of tiles is calculated from the tile lengths, `nn_ltile_i` and `nn_ltile_j`, with respect to the full domain.
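The relationship between the tile lengths, the number of tiles and the stored tile bounds can be sketched as follows. This is only an illustration under stated assumptions and is not the `dom_tile` code from the branch: the module and subroutine names (`tile_sketch`, `tile_bounds`) are hypothetical, tiles are assumed to be numbered with i varying fastest, and the last tile in each direction is assumed to be truncated at the edge of the internal domain. The variable names otherwise follow those used on this page.

```fortran
MODULE tile_sketch
   ! Hypothetical illustration only: NOT the NEMO dom_tile routine
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: tile_bounds
CONTAINS
   SUBROUTINE tile_bounds( jpi, jpj, nn_hls, nn_ltile_i, nn_ltile_j,   &
      &                    nijtile, ntsi_a, ntei_a, ntsj_a, ntej_a )
      INTEGER, INTENT(in)  :: jpi, jpj                 ! full processor domain size
      INTEGER, INTENT(in)  :: nn_hls                   ! halo width
      INTEGER, INTENT(in)  :: nn_ltile_i, nn_ltile_j   ! requested tile lengths
      INTEGER, INTENT(out) :: nijtile                  ! total number of tiles
      INTEGER, ALLOCATABLE, INTENT(out) :: ntsi_a(:), ntei_a(:)   ! i start/end per tile
      INTEGER, ALLOCATABLE, INTENT(out) :: ntsj_a(:), ntej_a(:)   ! j start/end per tile
      INTEGER :: iitile, ijtile   ! number of tiles in i and j
      INTEGER :: ji, jj, jt
      !
      ! Number of tiles in each direction: ceiling of (internal points / tile length)
      iitile = ( jpi - 2*nn_hls + nn_ltile_i - 1 ) / nn_ltile_i
      ijtile = ( jpj - 2*nn_hls + nn_ltile_j - 1 ) / nn_ltile_j
      nijtile = iitile * ijtile
      !
      ALLOCATE( ntsi_a(0:nijtile), ntei_a(0:nijtile),   &
         &      ntsj_a(0:nijtile), ntej_a(0:nijtile) )
      !
      ! Index 0 stores the bounds of the full (internal) domain
      ntsi_a(0) = 1 + nn_hls   ;   ntei_a(0) = jpi - nn_hls
      ntsj_a(0) = 1 + nn_hls   ;   ntej_a(0) = jpj - nn_hls
      !
      ! Indices 1..nijtile store the bounds of each tile (assumed i-fastest ordering)
      jt = 0
      DO jj = 1, ijtile
         DO ji = 1, iitile
            jt = jt + 1
            ntsi_a(jt) = ntsi_a(0) + (ji-1) * nn_ltile_i
            ntsj_a(jt) = ntsj_a(0) + (jj-1) * nn_ltile_j
            ! Assumption: edge tiles are clipped to the internal domain
            ntei_a(jt) = MIN( ntsi_a(jt) + nn_ltile_i - 1, ntei_a(0) )
            ntej_a(jt) = MIN( ntsj_a(jt) + nn_ltile_j - 1, ntej_a(0) )
         END DO
      END DO
   END SUBROUTINE tile_bounds
END MODULE tile_sketch
```

Under these assumptions, `jpi = 27`, `jpj = 22`, `nn_hls = 1` and 10 x 10 tiles give 3 x 2 = 6 tiles, with the last tile in i clipped to 5 points (i = 22 to 26).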
__Branch__

''These branches contain a trial implementation of tiling in `tra_ldf_iso`; there is not yet a formal branch for the development.''

[http://fcm3/projects/NEMO.xm/changeset?reponame=&new=12979%40NEMO%2Fbranches%2FUKMO%2Fdev_r12745_HPC-02_Daley_Tiling_trial_public&old=12740%40NEMO%2Ftrunk Implementation in trunk]

[http://fcm3/projects/NEMO.xm/changeset?reponame=&new=12979%40NEMO%2Fbranches%2FUKMO%2Fdev_r12866_HPC-02_Daley_Tiling_trial_extra_halo&old=12866%40NEMO%2Fbranches%2F2020%2Fdev_r12558_HPC-08_epico_Extra_Halo Implementation in extended haloes branch]

__New subroutines__

 * `OCE/DOM/domain/dom_tile` - Calculate/set tiling variables (domain indices, number of tiles)

__Modified modules__

 * `cfgs/SHARED/namelist_ref` - Add `namtile` namelist
 * `OCE/DOM/dom_oce` - Declare tiling namelist and other tiling variables
 * `OCE/DOM/domain` - Read `namtile` namelist (`dom_nam`), calculate tiling variables and do control print (`dom_tile`)
 * `OCE/IOM/prtctl` - Add IF statement to prevent execution of `prt_ctl` by each tile
 * `OCE/TRA/traldf` - Add IF statements to prevent execution of `trd_tra` by each tile
 * `OCE/TRA/traldf_iso` - Add IF statements (as above), modify local arrays for tiling
 * `OCE/do_loop_substitute` - Modify DO loop macros to use domain indices, add `A2D` macro
 * `OCE/par_oce` - Declare tiling variables
 * `OCE/step` - Add tiling loop
 * `OCE/step_oce` - Add USE statement for `dom_tile` in `step`
 * `OCE/timing` - Add IF statements to prevent execution of `timing_start` and `timing_stop` by each tile

__New variables (excluding local)__

 * Global variables
   * `ntsi`, `ntsj` - start index of tile
   * `ntei`, `ntej` - end index of tile
   * `ntsi_a`, `ntsj_a` - start indices of each tile
   * `ntei_a`, `ntej_a` - end indices of each tile
   * `ntile` - current tile number
   * `nijtile` - number of tiles
 * Namelist
   * `ln_tile` - logical control on use of tiling
   * `nn_ltile_i`, `nn_ltile_j` - tile length
 * Pre-processor macros
   * `A2D` - substitution for ALLOCATE or DIMENSION arguments
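The IF statements listed under the modified modules stop once-per-timestep work (control prints, trends, timing) from being repeated for every tile. This page does not give the exact condition used, so the following is only a guess at one possible form, written in terms of `ntile` and `nijtile`. The module and function names are hypothetical, and the choice of the last tile as the trigger is an assumption; a routine such as `timing_start` might instead need to run on the first tile.

```fortran
MODULE tile_guard_sketch
   ! Hypothetical helper for illustration: not part of NEMO
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: l_run_once
CONTAINS
   LOGICAL FUNCTION l_run_once( ntile, nijtile )
      INTEGER, INTENT(in) :: ntile     ! current tile number (0 = tiling suppressed)
      INTEGER, INTENT(in) :: nijtile   ! number of tiles
      ! True when working on the full domain, or on the final tile
      ! (assumption: the last tile is chosen so all tiles have been processed)
      l_run_once = ( ntile == 0 .OR. ntile == nijtile )
   END FUNCTION l_run_once
END MODULE tile_guard_sketch
```

A call site would then wrap the guarded routine in `IF( l_run_once( ntile, nijtile ) )`, so that it runs once per timestep whether or not tiling is active.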
__Notes__

'''Issues with the tiling implementation'''

See the attached [https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/2020WP/HPC-02_Daley_Tiling/Tiling_code_issues.pdf document].

'''Extended haloes'''

The tiling trial has also been implemented in the [http://fcm3/projects/NEMO.xm/changeset?reponame=&new=12979%40NEMO%2Fbranches%2FUKMO%2Fdev_r12866_HPC-02_Daley_Tiling_trial_extra_halo&old=12866%40NEMO%2Fbranches%2F2020%2Fdev_r12558_HPC-08_epico_Extra_Halo extended haloes branch]. There are few differences between this and the trunk implementation.

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).
}}}

''...''

== Preview

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#tests)]]
}}}

''...''

== Review

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#review)]]
}}}

''...''