Version 8 (modified by hadcv, 5 years ago)
Name and subject of the action
The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay in preview (or review) exceeds the expected 2 weeks.
Summary
Action | Implement 2D tiling (with the LFRA version of NEMO) |
---|---|
PI(S) | Daley Calvert, Andrew Coward |
Digest | Implement 2D tiling to reduce traffic between main memory and L3 cache |
Dependencies | DO loop macros (2020WP/KERNEL-02_Coward_DoLoopMacros_part1), extended haloes (Italo Epicoco, Seb Masson and Francesca Mele), extension of XIOS to accept 2D tiles of data (Yann Meurdesoif & Seb Masson) |
Branch | source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} |
Previewer(s) | Gurvan Madec |
Reviewer(s) | Gurvan Madec |
Ticket | #2365 |
Description
Implement loop tiling over horizontal dimensions (i and j).
Implementation
The current approach to tiling is described below. The issues encountered to date are described in the attached document.
Several modules have been tiled as of 18/06/20: tra_ldf, tra_zdf, tra_adv and dia_ptr.
The tiling implementation has been tested using GYRE with 1 CPU. The tests comprise 10-day simulations using different tile decompositions (including no tiling) and different science options particular to the tiled modules. A test passes if the tiling changes neither the results at the bit level (run.stat) nor the diagnostics.
Summary of method
The full processor domain (dimensions jpi x jpj) is split into one or more tiles/subdomains. This is implemented by:
1. Modifying the DO loop macros in do_loop_substitute.h90 to use the tile bounds
The tile domain is defined by a new set of domain indices (ntsi, ntei, ntsj, ntej), which represent the internal part of the domain:
- #define __kIs_ 2
+ #define __kIs_ ntsi
A new subroutine dom_tile (in domain.F90) sets the values of these indices.
During initialisation, this subroutine calculates and stores the indices in global arrays (ntsi_a, ntei_a, ntsj_a, ntej_a) with lengths equal to the number of tiles (nijtile) plus one. The zero index is used to store the indices for the full domain:
ntsi_a(0) = 1 + nn_hls
ntsj_a(0) = 1 + nn_hls
ntei_a(0) = jpi - nn_hls
ntej_a(0) = jpj - nn_hls
dom_tile is called whenever the active tile needs to be set or if tiling needs to be suppressed:
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 )   ! Work on tile 3
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Work on the full domain
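The bookkeeping described above can be sketched as a small stand-alone program. This is an illustration under stated assumptions, not the NEMO routine: the names tile_sketch, tile_init and tile_set, the domain sizes, the ordering of tiles (i varying fastest, then j) and the truncation of the last tile at the edge of the interior are all assumptions. The text only states that the arrays have nijtile + 1 elements and that element 0 holds the full-domain indices.

```fortran
MODULE tile_sketch
   ! Hypothetical stand-in for the tiling variables. Variable names follow the
   ! text; tile ordering and edge truncation are assumptions of this sketch.
   IMPLICIT NONE
   INTEGER, PARAMETER :: jpi = 32, jpj = 22, nn_hls = 1      ! illustrative sizes
   INTEGER, PARAMETER :: nn_ltile_i = 10, nn_ltile_j = 10    ! tile lengths
   INTEGER :: nijtile                                        ! number of tiles
   INTEGER :: ntile = 0                                      ! active tile (0 = full domain)
   INTEGER, ALLOCATABLE :: ntsi_a(:), ntsj_a(:), ntei_a(:), ntej_a(:)
CONTAINS
   SUBROUTINE tile_init
      ! Fill the lookup arrays once; element 0 describes the full interior
      INTEGER :: iitile, ijtile, jt
      iitile  = ( jpi - 2*nn_hls + nn_ltile_i - 1 ) / nn_ltile_i   ! tiles in i (rounded up)
      ijtile  = ( jpj - 2*nn_hls + nn_ltile_j - 1 ) / nn_ltile_j   ! tiles in j (rounded up)
      nijtile = iitile * ijtile
      ALLOCATE( ntsi_a(0:nijtile), ntsj_a(0:nijtile), ntei_a(0:nijtile), ntej_a(0:nijtile) )
      ntsi_a(0) = 1 + nn_hls
      ntsj_a(0) = 1 + nn_hls
      ntei_a(0) = jpi - nn_hls
      ntej_a(0) = jpj - nn_hls
      DO jt = 1, nijtile
         ntsi_a(jt) = ntsi_a(0) + nn_ltile_i * MOD( jt - 1, iitile )
         ntsj_a(jt) = ntsj_a(0) + nn_ltile_j * ( ( jt - 1 ) / iitile )
         ntei_a(jt) = MIN( ntsi_a(jt) + nn_ltile_i - 1, ntei_a(0) )   ! clip last tile in i
         ntej_a(jt) = MIN( ntsj_a(jt) + nn_ltile_j - 1, ntej_a(0) )   ! clip last tile in j
      END DO
   END SUBROUTINE tile_init

   SUBROUTINE tile_set( ktsi, ktsj, ktei, ktej, ktile )
      ! Select the active tile by copying its bounds out of the lookup arrays
      INTEGER, INTENT(out) :: ktsi, ktsj, ktei, ktej
      INTEGER, INTENT(in)  :: ktile
      ntile = ktile
      ktsi  = ntsi_a(ktile)
      ktsj  = ntsj_a(ktile)
      ktei  = ntei_a(ktile)
      ktej  = ntej_a(ktile)
   END SUBROUTINE tile_set
END MODULE tile_sketch

PROGRAM tile_demo
   USE tile_sketch
   IMPLICIT NONE
   INTEGER :: ntsi, ntsj, ntei, ntej
   CALL tile_init
   CALL tile_set( ntsi, ntsj, ntei, ntej, ktile=3 )   ! work on tile 3
   PRINT *, 'tile 3      i:', ntsi, ntei, ' j:', ntsj, ntej
   CALL tile_set( ntsi, ntsj, ntei, ntej, ktile=0 )   ! back to the full domain
   PRINT *, 'full domain i:', ntsi, ntei, ' j:', ntsj, ntej
END PROGRAM tile_demo
```

With these illustrative sizes the interior is 30 x 20 points, so the sketch produces 6 tiles; tile 3 spans i = 22 to 31 and j = 2 to 11, and tile 0 spans i = 2 to 31 and j = 2 to 21.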
2. Declaring SUBROUTINE-level arrays using the tile bounds
A new substitution macro in do_loop_substitute.h90:
#define A2D __kIsm1_:__kIep1_,__kJsm1_:__kJep1_
is used such that:
- ALLOCATE(jpi,jpj) DIMENSION(jpi,jpj)
+ ALLOCATE(A2D) DIMENSION(A2D)
and therefore operations between local working arrays (which have the dimensions of the tile) and global/input arrays (which have the dimensions of either the tile or full domain) require no further changes, unless using : subscripts as described below.
3. Replacing : subscripts with a DO loop macro where appropriate
This is only necessary when step 2 would introduce conformance issues:
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
- REAL(wp), DIMENSION(jpi,jpj) :: z2d
- z2d(:,:) = a3d(:,:,1)
+ REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
+ REAL(wp), DIMENSION(A2D) :: z2d
+ DO_2D_11_11
+    z2d(ji,jj) = a3d(ji,jj,1)
+ END_2D
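The conformance problem can be reproduced in a stand-alone sketch. The sizes and tile bounds below are made up, and the explicit loops stand in for the DO_2D_11_11 ... END_2D pair; that the macro and A2D both span ntsi-1:ntei+1 and ntsj-1:ntej+1 is an assumption inferred from the macro names, not something stated in the text.

```fortran
PROGRAM conformance_sketch
   ! Illustrative only: sizes and tile bounds are hypothetical.
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 12, jpj = 10, jpk = 3
   INTEGER, PARAMETER :: ntsi = 2, ntei = 6, ntsj = 2, ntej = 5   ! hypothetical tile
   REAL(wp) :: a3d(jpi,jpj,jpk)                                   ! full-domain array
   REAL(wp) :: z2d(ntsi-1:ntei+1, ntsj-1:ntej+1)                  ! tile-sized, as DIMENSION(A2D) is assumed to give
   INTEGER  :: ji, jj, jk

   ! Fill the full-domain array with values that identify each point
   DO jk = 1, jpk
      DO jj = 1, jpj
         DO ji = 1, jpi
            a3d(ji,jj,jk) = REAL( ji + 100*jj + 10000*jk, wp )
         END DO
      END DO
   END DO

   ! z2d(:,:) = a3d(:,:,1) would not conform here: shapes (7,6) and (12,10).
   ! An explicit loop over the tile bounds addresses the same points in both.
   DO jj = ntsj-1, ntej+1
      DO ji = ntsi-1, ntei+1
         z2d(ji,jj) = a3d(ji,jj,1)
      END DO
   END DO

   ! Check that the tile-sized copy matches the corresponding full-domain section
   IF( ANY( z2d /= a3d(ntsi-1:ntei+1, ntsj-1:ntej+1, 1) ) ) ERROR STOP 'tile copy mismatch'
   PRINT *, 'shape(z2d)        =', SHAPE( z2d )
   PRINT *, 'shape(a3d(:,:,1)) =', SHAPE( a3d(:,:,1) )
END PROGRAM conformance_sketch
```

Because z2d keeps the full-domain index values as its bounds, z2d(ji,jj) and a3d(ji,jj,1) refer to the same grid point inside the loop, which is why no index shifting is needed.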
4. Looping over tiles at the timestepping level
A loop over tiles has been added to stp. The domain indices for the current tile (ntile /= 0) are set at the start of each iteration. After exiting the loop (and before, during initialisation) the tiling is suppressed (ntile == 0):
! Loop over tile domains
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   CALL tra_ldf( kstp, Nbb, Nnn, ts, Nrhs )   ! lateral mixing
END DO
IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Revert to full domain
DO loops within the tiling loop therefore work on the current tile, while those outside the loop work on the full domain.
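The effect of the tiling loop can be illustrated with a toy driver that applies the same kernel once over the full domain and once tile by tile, then checks that the two results are identical. Everything here is a sketch under assumptions: set_tile is a hypothetical stand-in for dom_tile, kernel for a tiled routine such as tra_ldf, and the sizes and tile ordering are invented for the example.

```fortran
PROGRAM tile_loop_sketch
   ! Illustrative only: set_tile and kernel are hypothetical stand-ins.
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 32, jpj = 22, nn_hls = 1
   INTEGER, PARAMETER :: nn_ltile_i = 10, nn_ltile_j = 10
   INTEGER, PARAMETER :: iitile  = ( jpi - 2*nn_hls + nn_ltile_i - 1 ) / nn_ltile_i
   INTEGER, PARAMETER :: ijtile  = ( jpj - 2*nn_hls + nn_ltile_j - 1 ) / nn_ltile_j
   INTEGER, PARAMETER :: nijtile = iitile * ijtile
   LOGICAL, PARAMETER :: ln_tile = .TRUE.
   INTEGER  :: ntsi, ntsj, ntei, ntej, jtile
   REAL(wp) :: pin(jpi,jpj), zfull(jpi,jpj), ztiled(jpi,jpj)

   CALL RANDOM_NUMBER( pin )
   zfull(:,:)  = 0._wp
   ztiled(:,:) = 0._wp

   ! Reference: one pass over the full domain (tiling suppressed)
   CALL set_tile( 0 )
   CALL kernel( zfull )

   ! Loop over tile domains
   DO jtile = 1, nijtile
      IF( ln_tile ) CALL set_tile( jtile )
      CALL kernel( ztiled )
   END DO
   IF( ln_tile ) CALL set_tile( 0 )   ! revert to full domain

   IF( ANY( ztiled /= zfull ) ) ERROR STOP 'tiled and full-domain results differ'
   PRINT *, 'tiled and full-domain results are identical over', nijtile, 'tiles'

CONTAINS

   SUBROUTINE set_tile( kt )
      ! Set the loop bounds for tile kt (0 = full interior); assumed tile layout
      INTEGER, INTENT(in) :: kt
      IF( kt == 0 ) THEN
         ntsi = 1 + nn_hls
         ntsj = 1 + nn_hls
         ntei = jpi - nn_hls
         ntej = jpj - nn_hls
      ELSE
         ntsi = 1 + nn_hls + nn_ltile_i * MOD( kt - 1, iitile )
         ntsj = 1 + nn_hls + nn_ltile_j * ( ( kt - 1 ) / iitile )
         ntei = MIN( ntsi + nn_ltile_i - 1, jpi - nn_hls )
         ntej = MIN( ntsj + nn_ltile_j - 1, jpj - nn_hls )
      END IF
   END SUBROUTINE set_tile

   SUBROUTINE kernel( pout )
      ! Five-point stencil applied over whatever bounds are currently active
      REAL(wp), INTENT(inout) :: pout(:,:)
      INTEGER :: ji, jj
      DO jj = ntsj, ntej
         DO ji = ntsi, ntei
            pout(ji,jj) = pin(ji-1,jj) + pin(ji+1,jj) + pin(ji,jj-1) + pin(ji,jj+1)   &
               &        - 4._wp * pin(ji,jj)
         END DO
      END DO
   END SUBROUTINE kernel
END PROGRAM tile_loop_sketch
```

The kernel itself contains no tiling logic; it only reads the active bounds, which mirrors the idea that DO loops inside the tiling loop work on the current tile and those outside work on the full domain.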
5. A new namelist (namtile)
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .false.     !  Use tiling (T) or not (F)
   nn_ltile_i = 10          !  Length of tiles in i
   nn_ltile_j = 10          !  Length of tiles in j
/
The number of tiles is calculated from the tile lengths, nn_ltile_i and nn_ltile_j, with respect to the full domain.
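The text does not give the formula, so the following is one plausible reading, stated as an assumption: the internal part of the domain (excluding nn_hls halo points on each side) is divided by the tile length in each direction, rounding up so that a shorter tile covers any remainder.

```latex
\[
  \mathtt{nijtile} =
  \left\lceil \frac{\mathtt{jpi} - 2\,\mathtt{nn\_hls}}{\mathtt{nn\_ltile\_i}} \right\rceil
  \times
  \left\lceil \frac{\mathtt{jpj} - 2\,\mathtt{nn\_hls}}{\mathtt{nn\_ltile\_j}} \right\rceil
\]
```

For example, with hypothetical values jpi = 32, jpj = 22 and nn_hls = 1, the namelist defaults of 10 x 10 would give 3 x 2 = 6 tiles under this reading.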
Branch
These branches contain a trial implementation of tiling in tra_ldf_iso; there is not yet a formal branch for the development.
Implementation in extended haloes branch
New subroutines
- OCE/DOM/domain/dom_tile - Calculate/set tiling variables (domain indices, number of tiles)
Modified modules
- cfgs/SHARED/namelist_ref - Add namtile namelist
- OCE/DOM/dom_oce - Declare tiling namelist and other tiling variables
- OCE/DOM/domain - Read namtile namelist (dom_nam), calculate tiling variables and do control print (dom_tile)
- OCE/IOM/prtctl - Add IF statement to prevent execution of prt_ctl by each tile
- OCE/TRA/traldf - Add IF statements to prevent execution of trd_tra by each tile
- OCE/TRA/traldf_iso - Add IF statements (as above), modify local arrays for tiling
- OCE/do_loop_substitute - Modify DO loop macros to use domain indices, add A2D macro
- OCE/par_oce - Declare tiling variables
- OCE/step - Add tiling loop
- OCE/step_oce - Add USE statement for dom_tile in step
- OCE/timing - Add IF statements to prevent execution of timing_start and timing_stop by each tile
New variables (excluding local)
- Global variables
  - ntsi, ntsj - start index of tile
  - ntei, ntej - end index of tile
  - ntsi_a, ntsj_a - start indices of each tile
  - ntei_a, ntej_a - end indices of each tile
  - ntile - current tile number
  - nijtile - number of tiles
- Namelist
  - ln_tile - logical control on use of tiling
  - nn_ltile_i, nn_ltile_j - tile length
- Pre-processor macros
  - A2D - substitution for ALLOCATE or DIMENSION arguments
Notes
Issues with the tiling implementation
See the attached document.
Extended haloes
The tiling trial has also been implemented in the extended haloes branch. There are few differences between this and the trunk implementation.
Documentation updates
...
Preview
...
Tests
...
Review
...
Attachments (5)
- tra_ldf_iso trial.pdf (142.0 KB) - added by hadcv 5 years ago. Trial implementation in tra_ldf_iso
- Tiling_call_notes_220420.pdf (20.0 KB) - added by hadcv 5 years ago.
- Tiling_code_issues.pdf (114.8 KB) - added by hadcv 4 years ago. Description of tiling issues
- timing_results.pdf (42.9 KB) - added by hadcv 4 years ago. Tiling performance results (Sept 2020)
- Tiling_progress_summary_240920.pdf (90.1 KB) - added by hadcv 4 years ago. September 2020 progress summary