Version 11 (modified by hadcv, 4 years ago) (diff) |
---|
Name and subject of the action
Last edition: Wikinfo(changed_ts)? by Wikinfo(changed_by)?
The PI is responsible to closely follow the progress of the action, and especially to contact NEMO project manager if the delay on preview (or review) are longer than the 2 weeks expected.
Summary
Action | Implement 2D tiling (with the LFRA version of NEMO) |
---|---|
PI(S) | Daley Calvert, Andrew Coward |
Digest | Implement 2D tiling to reduce traffic between main memory and L3 cache |
Dependencies | DO loop macros (2020WP/KERNEL-02_Coward_DoLoopMacros_part1), extended haloes (Italo Epicoco, Seb Masson and Francesca Mele), extension of XIOS to accept 2D tiles of data (Yann Meurdesoif & Seb Masson) |
Branch | source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} |
Previewer(s) | Gurvan Madec |
Reviewer(s) | Gurvan Madec |
Ticket | #2365 |
Description
Implement tiling over horizontal dimensions (i and j).
Branch
dev_r13383_HPC-02_Daley_Tiling
Summary of the tiling method
The processor domain (dimensions jpi x jpj) is split into one or more tiles/subdomains, which are iterated over asynchronously within the timestepping loop.
These tile domains are defined by a new set of indices representing the internal part of the domain (ntsi/ntei/ntsj/ntej). These indices replace those of the processor domain (Nis0/Nie0/Njs0/Nje0) in DO loops, array shape declarations, and other appropriate places.
These changes are implemented via new and existing CPP macros, allowing the tiling to be implemented with relatively few changes at lower levels. A number of temporary workarounds are required to preserve results, but will be removed before the 2020 merge party or as part of the 2021 work plan.
Details of the implementation
Namelist
In dom_nam (domain.F90) a new namelist (namtile) is read and control prints written to ocean.output:
!----------------------------------------------------------------------- &namtile ! parameters of the tiling !----------------------------------------------------------------------- ln_tile = .false. ! Use tiling (T) or not (F) nn_ltile_i = 10 ! Length of tiles in i nn_ltile_j = 10 ! Length of tiles in j /
These variables are declared in dom_oce.F90.
Setting the tile domain
A new subroutine dom_tile (domain.F90) sets the values of the tile indices (ntsi/ntei/ntsj/ntej) and the active tile number (ntile).
During initialisation, this subroutine calculates the number of tiles (nijtile) and the tile indices, which are stored in public arrays (ntsi_a/ntei_a/ntsj_a/ntej_a) with lengths equal to nijtile + 1. When dom_tile is otherwise called, these arrays are used to set the tiling indices for the current tile (e.g. ntsi = ntsi_a(ntile)).
ntile = 0 indicates that tiling is disabled, i.e. the full domain is to be used.
dom_tile is called whenever the active tile needs to be set, if tiling needs to be disabled and for initialisation (in dom_init):
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 ) ! Work on tile 3 CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 ) ! Work on the full domain CALL dom_tile( ntsi, ntsj, ntei, ntej ) ! Initialisation (implies ktile=0)
Variables ntsi/ntei/ntsj/ntej/nijtile are declared in par_oce.F90. Variables ntsi_a/ntei_a/ntsj_a/ntej_a are declared in dom_oce.F90
Changes to CPP macros
In do_loop_substitute.h90, the DO loop macros are modified to instead use the tiling indices:
- #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T) ; DO ji = Nis0-(L), Nie0+(R) + #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T) ; DO ji = ntsi-(L), ntei+(R)
A number of new macros have been added that replace jpi/jpj in DIMENSION and ALLOCATE statements (see “Local working array declarations” section below):
#define A1Di(H) ntsi-H:ntei+H # H is equivalent to B/T/L/R in DO loop macros #define A1Dj(H) ntsj-H:ntej+H #define A2D(H) A1Di(H),A1Dj(H) #define A1Di_T(T) (ntsi-nn_hls-1)*T+1: # T is 1 (= ntsi:) or 0 (= 1:) #define A1Dj_T(T) (ntsj-nn_hls-1)*T+1: #define A2D_T(T) A1Di_T(T),A1Dj_T(T) #define JPK : #define JPTS : #define KJPT :
The purpose of the A1Di/A1Dj/A2D macros is to allow local working arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use. Furthermore, the tile-sized arrays will be declared with lower and upper bounds corresponding to the position of the tile in the full domain. Horizontal indices, for example in DO loops, will therefore apply to both tile- and full-sized arrays:
! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9 REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d REAL(wp), DIMENSION(jpi,jpj) :: a2d DO_2D(1,1,1,1) z2d(ji,jj) = a2d(ji,jj) END_2D
The A1Di_T/A1Dj_T/A2D_T macros are assumed-shape versions of the A1Di/A1Dj/A2D macros, used in dummy array argument declarations where the shape of the actual array argument is inconsistent between calls to the subroutine (see “Local working array declarations” section below).
The JPK/JPTS/KJPT macros are used where explicit-shape declarations have been replaced by assumed-shape declarations. Their only purpose is to preserve readability.
Changes at the timestepping level
The looping over tiles occurs in the stp subroutine. The domain indices for the current tile (ntile /= 0) are set at the start of each iteration. After exiting the loop (and before, during initialisation) the tiling is disabled (ntile == 0):
! Loop over tile domains DO jtile = 1, nijtile IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile ) ! Tiled region of code END DO IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 ) ! Revert to full domain
The tiled code currently encompasses the active tracers (TRA) region of stp. The loop over tiles must currently be broken into two separate loops to preserve results. This is due to the temporary workaround implemented in tra_adv, which disables tiling for certain options and therefore can change the order in which the tracer trends are updated.
General changes at the module and subroutine levels
DO loop bounds
Each tile has an internal area and overlapping halo, but unlike the MPP domain the halo points are not set by lbc_lnk. The internal part of a tile may therefore be partly overwritten by the halo of an adjacent tile, which will change results. In these cases, DO loops must work on the internal part of the tile only.
This is generally only an issue for persistent variables (i.e. declared at the module level, or with SAVE), e.g. when zeroing an array:
- DO_3D_11_11(1, jpk) - akz(ji,jj,jk) = 0._wp - END_3D + DO_3D_00_00(1, jpk) + akz(ji,jj,jk) = 0._wp + END_3D
or where an array appears on both sides of the assignment:
- DO_2D( 0, 1, 0, 0 ) + DO_2D( 0, 0, 0, 0 ) pts(ji,jj,1,jn,Krhs) = pts(ji,jj,1,jn,Krhs) + zfact * ( sbc_tsc_b(ji,jj,jn) + sbc_tsc(ji,jj,jn) ) / e3t(ji,jj,1,Kmm)
In some cases (e.g. tra_bbl_adv in trabbl.F90) DO loops must also work on the outer halo of a processor domain, which requires a slightly different approach:
- DO_2D( 1, 0, 1, 0 ) + IF( ntsi == Nis0 ) THEN ; isi = 1 ; ELSE ; isi = 0 ; ENDIF ! Avoid double-counting when using tiling + IF( ntsj == Njs0 ) THEN ; isj = 1 ; ELSE ; isj = 0 ; ENDIF + DO_2D( isi, 0, isj, 0 )
Local working array declarations
The new CPP macros in do_loop_substitute.h90 replace references to the full domain in explicit shape declarations for local working arrays:
- ALLOCATE(jpi,jpj ) DIMENSION(jpi,jpj ) + ALLOCATE(A2D(nn_hls) ) DIMENSION(A2D(nn_hls) )
This allows the arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use.
This approach does not work for subroutines that are called with actual array arguments of varying shape; an assumed-shape declaration must be used instead. It is necessary to use the form specifying lower bounds (i.e. DIMENSION(start_i:, start_j:)), as this information is required to correctly index the array when tiling is used. However, these bounds must be passed as additional arguments to the subroutine.
To avoid widespread changes, the subroutine is replaced by a wrapper subroutine that calculates the bounds and passes them to the original subroutine. For example:
- SUBROUTINE eos_insitu( pts, prd, pdep ) - REAL(wp), DIMENSION(jpi,jpj,jpk,jpts), INTENT(in ) :: pts - REAL(wp), DIMENSION(jpi,jpj,jpk ), INTENT( out) :: prd - REAL(wp), DIMENSION(jpi,jpj,jpk ), INTENT(in ) :: pdep + SUBROUTINE eos_insitu( pts, prd, pdep ) + REAL(wp), DIMENSION(:,:,:,:), INTENT(in ) :: pts + REAL(wp), DIMENSION(:,:,:) , INTENT( out) :: prd + REAL(wp), DIMENSION(:,:,:) , INTENT(in ) :: pdep + + CALL eos_insitu_t( pts, is_tile(pts), prd, is_tile(prd), pdep, is_tile(pdep) ) + END SUBROUTINE eos_insitu + SUBROUTINE eos_insitu_t( pts, ktts, prd, ktrd, pdep, ktdep ) + INTEGER, INTENT(in ) :: ktts, ktrd, ktdep + REAL(wp), DIMENSION(A2D_T(ktts) ,JPK,JPTS), INTENT(in ) :: pts + REAL(wp), DIMENSION(A2D_T(ktrd) ,JPK ), INTENT( out) :: prd + REAL(wp), DIMENSION(A2D_T(ktdep),JPK ), INTENT(in ) :: pdep
Here, is_tile is an interface of functions returning 1 or 0 depending on the size of the array, e.g.:
FUNCTION is_tile_2d( pt ) REAL(wp), DIMENSION(:,:), INTENT(in) :: pt INTEGER :: is_tile_2d IF( ln_tile .AND. SIZE(pt, 1) < jpi ) THEN is_tile_2d = 1 ELSE is_tile_2d = 0 ENDIF END FUNCTION is_tile_2d
and A2D_T is a version of the A2D CPP macro that returns 1:,1: if is_tile(array) = 0 (array is the size of the full domain) or ntsi:,ntsj: if is_tile(array) = 1 (array is the size of the tile). The JPK/JPTS macros each return : and are used to preserve readability.
These wrappers are added in the following locations:
- IOM/prtctl.F90 (prt_ctl)
- TRA/eosbn2.F90 (various subroutines)
- TRA/traldf_iso.F90 (tra_ldf_iso)
- TRA/traldf_lap_blp.F90 (tra_ldf_lap)
- TRA/traldf_triad.F90 (tra_ldf_triad)
- TRA/zpshde.F90 (zps_hde, zps_hde_isf)
: array subscripts
The above array declaration changes may introduce conformance issues when : subscripts are used for indexing, or if no indexing is used at all. These are resolved by instead using an equivalent DO loop:
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d - REAL(wp), DIMENSION(jpi,jpj) :: z2d - z2d(:,:) = a3d(:,:,1). + REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d + REAL(wp), DIMENSION(A2D(nn_hls)) :: z2d + DO_2D(1,1,1,1) + z2d(ji,jj) = a3d(ji,jj,1) + END_2D
Code called once per timestep
Some code should only be called once per timestep (e.g. ocean.output write statements, initialisation steps, reading data from files) but will be called by each tile. IF statements are used to suppress these calls and generally take the form:
IF( ntile == 0 .OR. ntile == 1 ) THEN ! Do only on the first tile ! ... ENDIF IF( ntile == 0 .OR. ntile == nijtile ) THEN ! Do only on the last tile ! ... ENDIF
Sometimes dom_tile must also be called to temporarily disable the tiling:
IF( ntile == 0 .OR. ntile == 1 ) THEN ! Do only for the full domain itile = ntile IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = 0 ) ! Use full domain CALL fld_read( kt, 1, sf_tsd ) !== read T & S data at kt time step ==! IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = itile ) ! Revert to tile ENDIF
Special cases
diaptr module
At present, between dia_ptr and dia_ptr_hst the following operations are generally performed:
- Calculate the zonal integral
- Call mpp_sum
- Perform additional arithmetic or copy result onto 2D/3D arrays
- Call iom_put
The tiling is not easily implemented here due to steps 2 and 4. The diaptr module has therefore been largely restructured to accommodate the tiling, with the added bonus of significantly reducing the number of communications.
- dia_ptr has been split into dia_ptr_zint (steps 1-2) and dia_ptr_iom (steps 3-4)
- dia_ptr_zint is called for every tile, dia_ptr_iom is called after dia_ptr_zint has been run for all tiles
- The zonal integrals that were calculated in dia_ptr are now calculated by dia_ptr_zint and are stored in two new module arrays, pvtr_int and pzon_int, which are used by dia_ptr_iom
- Zonal integrals are calculated by calling ptr_sj (calculates the integral for the tile) and a new subroutine ptr_sum (accumulates the ptr_sj integrals and calls mpp_sum on the last tile only)
- The number of communications is reduced from 90 to 13 (data for all basins are exchanged at once)
- Code relating to mpp_sum has been removed from ptr_sj
- Pointers and their target arrays have been removed from ptr_sj
- dia_ptr_hst has been condensed by moving the loop over basins out of the IF statements
dia_ar5_hst subroutine
This code contains iom_put and lbc_lnk calls, which cannot be tiled. The code has therefore been rearranged to separate the transport calculations from the iom_put calls.
The transport calculations are performed for each tile and saved in two new module variables, hstr_adv and hstr_ldf. The iom_put calls are called only on the last tile, and the lbc_lnk calls have been removed as per #2367 (clean-up of communications).
prt_ctl subroutine
This code prints the sums of the input arrays over the processor domain, minus the sums calculated by the previous prt_ctl call. It is possible to implement tiling for this subroutine by aggregating the result over tiles, similar to the approach taken with dia_ptr. However, this was difficult to implement cleanly and bit-level differences in the global sum could not be avoided when using tiling.
I have taken a simpler approach where the sum is output for each tile, rather than the processor domain, if tiling is used. The prt_ctl utility can therefore be used to diagnose differences on a tile-by-tile basis per processor. However, the prt_ctl output cannot be compared between a simulation with tiling and one without.
timing module
The timing utility already works with the tiling; the only change is to ensure that the call iteration counter is incremented once for all tiles.
Trends diagnostics
Tiling has not yet been implemented in these diagnostics, meaning that tiling has to be disabled for the various trd_tra calls throughout the TRA modules.
Rather than add IF statements around each of these calls, I simply set ln_tile = .false. in trd_init if the trends diagnostics are used.
Temporary workarounds
This code has been marked with a ! TEMP: [tiling] comment.
iom_put calls
XIOS does not currently support tiling, so the data must be complete (i.e. all tiles must have finished) at the time of an iom_put call. The general workaround for this is to call iom_put only on the last tile. Additional workarounds are required for some local working arrays, which are not preserved between subsequent calls to the subroutine.
Some code was rearranged in order to calculate the diagnostic and call iom_put at the same time on the last tile (traldf_triad.F90). In other cases this was not possible, so the working arrays were declared with SAVE so that they could be processed by each tile (traadv.F90, tramle.F90). In all cases, it is necessary to declare the working arrays with the size of the full domain (DIMENSION(jpi,jpj)) instead of the tile (DIMENSION(A2D(nn_hls))).
XIOS support for tiling is expected to be fully implemented sometime in December-January. At this point all of the iom_put workarounds can be removed.
A preliminary development branch has been provided for testing by Olga Abramkina. It seems likely that the workarounds would need to stay in place for the merge, as I presume that we would not want to merge based on a development version of XIOS. However, the workarounds could certainly be removed post-merge before the 4.2 release.
lbc_lnk calls
Similar to iom_put calls, lbc_lnk can only be called once all tiles have finished and the data is complete. The general workaround for this is the same as for iom_put; call lbc_lnk only on the last tile.
This is done in a few cases (tranpc.F90, traqsr.F90), but often it was cleaner to simply disable the tiling due to the frequency of lbc_lnk calls. This has been implemented in tra_adv for all schemes except 2nd order centred advection (ln_traadv_cen = .true. with nn_cen_h = 2), in tra_ldf for all bi-laplacian schemes, and for calls to zps_hde. It was also necessary to split the tiling loop in step.F90 so that the first loop ended before tra_adv, in order to preserve results.
Most of the lbc_lnk calls are removed in the nn_hls = 2 case by #2367 (clean-up of communications), which will be merged with the tiling branch to form the basis of this year’s merge party. #2367 also removes several lbc_lnk calls that were used to set the halo points on arrays being passed to iom_put; these have already been removed from the tiling branch.
Tiling will only be used with nn_hls = 2, so most of the remaining workarounds for the lbc_lnk calls should be removed. However there are a number of changes in #2367 that prevent tiling, such as the addition of lbc_lnk calls in tra_adv and zps_hde, as well as expanding the bounds of some DO loops so that they work on the halo (see the “DO loop bounds” section above). Removal of the workarounds therefore depends on whether these issues can be resolved before the merge.
List of new variables and functions (excluding local)
- Global variables (par_oce.F90, dom_oce.F90)
- ntsi, ntsj- start index of tile
- ntei, ntej- end index of tile
- ntsi_a, ntsj_a- start indices of each tile
- ntei_a, ntej_a- end indices of each tile
- ntile- current tile number
- nijtile- number of tiles
- Module variables
- hstr_adv, hstr_ldf (diaar5.F90)- saved transports
- pvtr_int, pzon_int (diaptr.F90)- zonal integrals
- jp_msk, jp_vtr (diaptr.F90)- indices for pvtr_int & pzon_int
- nnpcc (tranpc.F90)- replaces local variable inpcc
- Namelist namtile (dom_oce.F90)
- ln_tile- logical control on use of tiling
- nn_ltile_i, nn_ltile_j- tile length
- Pre-processor macros (do_loop_substitute.h90)
- A1Di/A1Dj/A2D- substitutions for ALLOCATE or DIMENSION arguments
- A1Di_T/A1Dj_T/A2D_T- substitutions for ALLOCATE or DIMENSION arguments when the shape of the array is unknown
- JPK/JPTS/KJPT- placeholders for : to preserve readability
- Functions and subroutines
- dom_tile (domain.F90)- Calculate/set tiling variables
- is_tile (domutl.F90)- returns 0 if the array has the dimensions of the full domain, else 1
- ptr_sum (diaptr.F90)- sum ptr_sj zonal integrals over tiles and processors to get total
- The following subroutines have all been renamed to <SUBROUTINE>_t, where <SUBROUTINE> is now a wrapper function for <SUBROUTINE>_t:
- eos_insitu, eos_insitu_pot, eos_insitu_2d, rab_3d, rab_2d, bn2, eos_fzp_2d (eosbn2.F90)
- tra_ldf_iso (traldf_iso.F90)
- tra_ldf_lap (traldf_lap_blp.F90)
- tra_ldf_triad (traldf_triad.F90)
- prt_ctl (prtctl.F90)
- zps_hde, zps_hde_isf (zpshde.F90)
Documentation updates
...
Preview
...
Tests
...
Review
...
Attachments (5)
-
tra_ldf_iso trial.pdf
(142.0 KB) -
added by hadcv 5 years ago.
Trial implementation in tra_ldf_iso
- Tiling_call_notes_220420.pdf (20.0 KB) - added by hadcv 5 years ago.
-
Tiling_code_issues.pdf
(114.8 KB) -
added by hadcv 4 years ago.
Description of tiling issues
-
timing_results.pdf
(42.9 KB) -
added by hadcv 4 years ago.
Tiling performance results (Sept 2020)
-
Tiling_progress_summary_240920.pdf
(90.1 KB) -
added by hadcv 4 years ago.
September 2020 progress summary
Download all attachments as: .zip