Implement tiling over horizontal dimensions (i and j).

=== Branch

[https://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2020/dev_r13383_HPC-02_Daley_Tiling dev_r13383_HPC-02_Daley_Tiling]

* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fsrc&old=13688%40NEMO%2Ftrunk%2Fsrc&sfp_email=&sfph_mail= Changes in src]
* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Fcfgs&old=13688%40NEMO%2Ftrunk%2Fcfgs&sfp_email=&sfph_mail= Changes in cfgs]
* [https://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=13745%40NEMO%2Fbranches%2F2020%2Fdev_r13383_HPC-02_Daley_Tiling%2Ftests&old=13688%40NEMO%2Ftrunk%2Ftests&sfp_email=&sfph_mail= Changes in tests]

=== Summary of the tiling method

The processor domain (dimensions `jpi` x `jpj`) is split into one or more tiles/subdomains, which are iterated over asynchronously within the timestepping loop.

These tile domains are defined by a new set of indices (`ntsi`/`ntei`/`ntsj`/`ntej`) representing the internal part of the domain.
These indices replace those of the processor domain (`Nis0`/`Nie0`/`Njs0`/`Nje0`) in DO loops, array shape declarations, and other appropriate places.

These changes are implemented via new and existing CPP macros, allowing the tiling to be introduced with relatively few changes at lower levels.
A number of temporary workarounds are currently required to preserve results; these will be removed before the 2020 merge party or as part of the 2021 work plan.

=== Details of the implementation

==== Namelist

In `dom_nam` (`domain.F90`) a new namelist (`namtile`) is read and control prints are written to ocean.output:

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling
!-----------------------------------------------------------------------
   ln_tile    = .false.     !  Use tiling (T) or not (F)
   nn_ltile_i = 10          !  Length of tiles in i
   nn_ltile_j = 10          !  Length of tiles in j
/
}}}

These variables are declared in `dom_oce.F90`.

==== Setting the tile domain

A new subroutine `dom_tile` (`domain.F90`) sets the values of the tile indices (`ntsi`/`ntei`/`ntsj`/`ntej`) and the active tile number (`ntile`).

During initialisation, this subroutine calculates the number of tiles (`nijtile`) and the tile indices, which are stored in public arrays (`ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a`) with lengths equal to `nijtile + 1`.
On subsequent calls, these arrays are used to set the tiling indices for the current tile (e.g. `ntsi = ntsi_a(ntile)`).

`ntile = 0` indicates that tiling is disabled, i.e. the full domain is to be used.

`dom_tile` is called whenever the active tile needs to be set, whenever tiling needs to be disabled, and during initialisation (in `dom_init`):

{{{
#!fortran
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=3 )   ! Work on tile 3
CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Work on the full domain
CALL dom_tile( ntsi, ntsj, ntei, ntej )            ! Initialisation (implies ktile=0)
}}}

Variables `ntsi`/`ntei`/`ntsj`/`ntej`/`nijtile` are declared in `par_oce.F90`.
Variables `ntsi_a`/`ntei_a`/`ntsj_a`/`ntej_a` are declared in `dom_oce.F90`.
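
For illustration, the decomposition performed during initialisation can be sketched as follows. This is a Python sketch, not the NEMO code; `tile_bounds` is a hypothetical helper mirroring what `dom_tile` computes for one horizontal dimension from the processor-domain bounds and the namelist tile length:

```python
def tile_bounds(start, end, tile_len):
    """Split the internal index range [start, end] into chunks of at most
    tile_len points, returning per-tile start and end indices (analogous
    to ntsi_a/ntei_a for the i dimension)."""
    starts, ends = [], []
    i = start
    while i <= end:
        starts.append(i)
        ends.append(min(i + tile_len - 1, end))  # last tile may be shorter
        i += tile_len
    return starts, ends

# Internal i-range 2..21 (20 points) with nn_ltile_i = 10 gives 2 tiles in i
si, ei = tile_bounds(2, 21, 10)   # si = [2, 12], ei = [11, 21]
```

The total tile count is then the product of the counts in i and j, and the zero index of the real arrays additionally stores the full-domain bounds so that `ktile=0` recovers untiled behaviour.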

==== Changes to CPP macros

In `do_loop_substitute.h90`, the DO loop macros are modified to instead use the tiling indices:

{{{
#!diff
- #define DO_2D(B, T, L, R) DO jj = Njs0-(B), Nje0+(T) ; DO ji = Nis0-(L), Nie0+(R)
+ #define DO_2D(B, T, L, R) DO jj = ntsj-(B), ntej+(T) ; DO ji = ntsi-(L), ntei+(R)
}}}

A number of new macros have been added that replace `jpi`/`jpj` in DIMENSION and ALLOCATE statements (see the "Local working array declarations" section below):

{{{
#define A1Di(H)   ntsi-H:ntei+H        # H is equivalent to B/T/L/R in DO loop macros
#define A1Dj(H)   ntsj-H:ntej+H
#define A2D(H)    A1Di(H),A1Dj(H)

#define A1Di_T(T) (ntsi-nn_hls-1)*T+1: # T is 1 (= ntsi:) or 0 (= 1:)
#define A1Dj_T(T) (ntsj-nn_hls-1)*T+1:
#define A2D_T(T)  A1Di_T(T),A1Dj_T(T)

#define JPK  :
#define JPTS :
#define KJPT :
}}}

The purpose of the `A1Di`/`A1Dj`/`A2D` macros is to allow local working arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use.
Furthermore, the tile-sized arrays are declared with lower and upper bounds corresponding to the position of the tile in the full domain.
Horizontal indices, for example in DO loops, therefore apply to both tile- and full-sized arrays:

{{{
#!fortran
! ntsi = 3, ntsj = 7, ntei = 5, ntej = 9
REAL(wp), DIMENSION(ntsi:ntei,ntsj:ntej) :: z2d
REAL(wp), DIMENSION(jpi,jpj)             :: a2d

DO_2D(0,0,0,0)
   z2d(ji,jj) = a2d(ji,jj)
END_2D
}}}

The `A1Di_T`/`A1Dj_T`/`A2D_T` macros are assumed-shape versions of the `A1Di`/`A1Dj`/`A2D` macros, used in dummy array argument declarations where the shape of the actual array argument is inconsistent between calls to the subroutine (see the "Local working array declarations" section below).
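
As a quick check of the `A1Di_T` arithmetic: with `T = 1` the lower bound evaluates to `ntsi - nn_hls` (the lower bound of a tile-sized array declared with `A2D(nn_hls)`; the shorthand `= ntsi:` in the macro comments glosses over the halo offset), and with `T = 0` it collapses to `1`, matching a full-domain array. A small Python sketch of the expression, with illustrative values:

```python
def a1di_t_lower(ntsi, nn_hls, T):
    """Evaluate the lower bound produced by A1Di_T(T): (ntsi - nn_hls - 1)*T + 1."""
    return (ntsi - nn_hls - 1) * T + 1

# Tile-sized actual argument (T = 1): bounds start at ntsi - nn_hls
tile_lb = a1di_t_lower(ntsi=3, nn_hls=1, T=1)   # = 2
# Full-domain actual argument (T = 0): bounds start at 1
full_lb = a1di_t_lower(ntsi=3, nn_hls=1, T=0)   # = 1
```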

The `JPK`/`JPTS`/`KJPT` macros are used where explicit-shape declarations have been replaced by assumed-shape declarations.
Their only purpose is to preserve readability.

==== Changes at the timestepping level

The looping over tiles occurs in the `stp` subroutine.
The domain indices for the current tile (`ntile /= 0`) are set at the start of each iteration.
After exiting the loop (and beforehand, during initialisation) the tiling is disabled (`ntile == 0`):

{{{
#!fortran
! Loop over tile domains
DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )

   ! Tiled region of code
END DO

IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=0 )   ! Revert to full domain
}}}

The tiled code currently encompasses the active tracers (TRA) region of `stp`.
The loop over tiles must currently be broken into two separate loops to preserve results.
This is due to the temporary workaround implemented in `tra_adv`, which disables tiling for certain options and can therefore change the order in which the tracer trends are updated.

==== General changes at the module and subroutine levels

__DO loop bounds__

Each tile has an internal area and an overlapping halo, but unlike the MPP domain the halo points are not set by `lbc_lnk`.
The internal part of a tile may therefore be partly overwritten by the halo of an adjacent tile, which will change results.
In these cases, DO loops must work on the internal part of the tile only.

This is generally only an issue for persistent variables (i.e. those declared at the module level, or with `SAVE`), e.g. when zeroing an array:

{{{
#!diff
- DO_3D_11_11(1, jpk)
-    akz(ji,jj,jk) = 0._wp
- END_3D
+ DO_3D_00_00(1, jpk)
+    akz(ji,jj,jk) = 0._wp
+ END_3D
}}}

or where an array appears on both sides of the assignment:

{{{
#!diff
- DO_2D( 0, 1, 0, 0 )
+ DO_2D( 0, 0, 0, 0 )
     pts(ji,jj,1,jn,Krhs) = pts(ji,jj,1,jn,Krhs) + zfact * ( sbc_tsc_b(ji,jj,jn) + sbc_tsc(ji,jj,jn) ) / e3t(ji,jj,1,Kmm)
}}}

In some cases (e.g. `tra_bbl_adv` in `trabbl.F90`) DO loops must also work on the outer halo of the processor domain, which requires a slightly different approach:

{{{
#!diff
- DO_2D( 1, 0, 1, 0 )
+ IF( ntsi == Nis0 ) THEN ; isi = 1 ; ELSE ; isi = 0 ; ENDIF   ! Avoid double-counting when using tiling
+ IF( ntsj == Njs0 ) THEN ; isj = 1 ; ELSE ; isj = 0 ; ENDIF
+ DO_2D( isi, 0, isj, 0 )
}}}
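
The overwriting problem can be seen in a toy 1-D picture (Python, purely illustrative): two adjacent tiles each run a loop extended by one halo point into a shared persistent array, and the second tile's halo write clobbers a point inside the first tile's interior:

```python
# A persistent 1-D array spanning two adjacent tiles:
# tile A owns indices 2..5, tile B owns indices 6..9.
field = [0] * 12

def tile_loop(ts, te, halo, value):
    """Write 'value' over the tile interior plus 'halo' points on each
    side, like a DO loop called with non-zero bound arguments."""
    for i in range(ts - halo, te + halo + 1):
        field[i] = value

tile_loop(2, 5, 1, 1)   # tile A writes indices 1..6
tile_loop(6, 9, 1, 2)   # tile B writes indices 5..10, including index 5,
                        # which lies in tile A's interior
# field[5] now holds tile B's halo value, not tile A's result
```

With the restricted bounds (the `DO_2D( 0, 0, 0, 0 )`-style loops above), each tile writes only its own interior and the conflict disappears.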

__Local working array declarations__

The new CPP macros in `do_loop_substitute.h90` replace references to the full domain in explicit-shape declarations for local working arrays:

{{{
#!diff
- ALLOCATE( jpi,jpj )        DIMENSION( jpi,jpj )
+ ALLOCATE( A2D(nn_hls) )    DIMENSION( A2D(nn_hls) )
}}}

This allows the arrays to be declared with the size of the tile (or the full domain, if tiling is not used), minimising memory use.

This approach does not work for subroutines that are called with actual array arguments of varying shape; an assumed-shape declaration must be used instead.
It is necessary to use the form specifying lower bounds (i.e. `DIMENSION(start_i:, start_j:)`), as this information is required to correctly index the array when tiling is used.
However, these bounds must be passed as additional arguments to the subroutine.

To avoid widespread changes, the subroutine is replaced by a wrapper subroutine that calculates the bounds and passes them to the original subroutine.
For example:

{{{
#!diff
- SUBROUTINE eos_insitu( pts, prd, pdep )
-    REAL(wp), DIMENSION(jpi,jpj,jpk,jpts), INTENT(in   ) :: pts
-    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(  out) :: prd
-    REAL(wp), DIMENSION(jpi,jpj,jpk     ), INTENT(in   ) :: pdep
+ SUBROUTINE eos_insitu( pts, prd, pdep )
+    REAL(wp), DIMENSION(:,:,:,:), INTENT(in   ) :: pts
+    REAL(wp), DIMENSION(:,:,:)  , INTENT(  out) :: prd
+    REAL(wp), DIMENSION(:,:,:)  , INTENT(in   ) :: pdep
+
+    CALL eos_insitu_t( pts, is_tile(pts), prd, is_tile(prd), pdep, is_tile(pdep) )
+ END SUBROUTINE eos_insitu

+ SUBROUTINE eos_insitu_t( pts, ktts, prd, ktrd, pdep, ktdep )
+    INTEGER , INTENT(in   ) :: ktts, ktrd, ktdep
+    REAL(wp), DIMENSION(A2D_T(ktts) ,JPK,JPTS), INTENT(in   ) :: pts
+    REAL(wp), DIMENSION(A2D_T(ktrd) ,JPK     ), INTENT(  out) :: prd
+    REAL(wp), DIMENSION(A2D_T(ktdep),JPK     ), INTENT(in   ) :: pdep
}}}

Here, `is_tile` is an interface of functions returning 1 or 0 depending on the size of the array, e.g.:

{{{
#!fortran
FUNCTION is_tile_2d( pt )
   REAL(wp), DIMENSION(:,:), INTENT(in) :: pt
   INTEGER :: is_tile_2d

   IF( ln_tile .AND. SIZE(pt, 1) < jpi ) THEN
      is_tile_2d = 1
   ELSE
      is_tile_2d = 0
   ENDIF
END FUNCTION is_tile_2d
}}}

and `A2D_T` is a version of the `A2D` CPP macro that returns `1:,1:` if `is_tile(array) = 0` (`array` has the size of the full domain) or `ntsi:,ntsj:` if `is_tile(array) = 1` (`array` has the size of the tile).
The `JPK`/`JPTS` macros each return `:` and are used to preserve readability.

These wrappers are added in the following locations:
* `IOM/prtctl.F90` (`prt_ctl`)
* `TRA/eosbn2.F90` (various subroutines)
* `TRA/traldf_iso.F90` (`tra_ldf_iso`)
* `TRA/traldf_lap_blp.F90` (`tra_ldf_lap`)
* `TRA/traldf_triad.F90` (`tra_ldf_triad`)
* `TRA/zpshde.F90` (`zps_hde`, `zps_hde_isf`)

__`:` array subscripts__

The above array declaration changes may introduce conformance issues when `:` subscripts are used for indexing, or if no indexing is used at all.
These are resolved by instead using an equivalent DO loop:

{{{
#!diff
- REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
- REAL(wp), DIMENSION(jpi,jpj)     :: z2d
-    z2d(:,:) = a3d(:,:,1)
+ REAL(wp), DIMENSION(jpi,jpj,jpk) :: a3d
+ REAL(wp), DIMENSION(A2D(nn_hls)) :: z2d
+    DO_2D(1,1,1,1)
+       z2d(ji,jj) = a3d(ji,jj,1)
+    END_2D
}}}


__Code called once per timestep__

Some code should only be called once per timestep (e.g. ocean.output write statements, initialisation steps, reading data from files) but will be called by each tile.
IF statements are used to suppress these calls and generally take the form:

{{{
#!fortran
IF( ntile == 0 .OR. ntile == 1 ) THEN                  ! Do only on the first tile
   ! ...
ENDIF

IF( ntile == 0 .OR. ntile == nijtile ) THEN            ! Do only on the last tile
   ! ...
ENDIF
}}}

Sometimes `dom_tile` must also be called to temporarily disable the tiling:

{{{
#!fortran
IF( ntile == 0 .OR. ntile == 1 ) THEN                                   ! Do only on the first tile
   itile = ntile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = 0 )     ! Use full domain
   CALL fld_read( kt, 1, sf_tsd )                                       !==   read T & S data at kt time step   ==!
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = itile ) ! Revert to tile
ENDIF
}}}

==== Special cases

__`diaptr` module__

At present, between `dia_ptr` and `dia_ptr_hst` the following operations are generally performed:

1. Calculate the zonal integral
2. Call `mpp_sum`
3. Perform additional arithmetic or copy the result onto 2D/3D arrays
4. Call `iom_put`

The tiling is not easily implemented here due to steps 2 and 4.
The `diaptr` module has therefore been largely restructured to accommodate the tiling, with the added bonus of significantly reducing the number of communications.

* `dia_ptr` has been split into `dia_ptr_zint` (steps 1-2) and `dia_ptr_iom` (steps 3-4)
* `dia_ptr_zint` is called for every tile; `dia_ptr_iom` is called after `dia_ptr_zint` has been run for all tiles
* The zonal integrals that were calculated in `dia_ptr` are now calculated by `dia_ptr_zint` and stored in two new module arrays, `pvtr_int` and `pzon_int`, which are used by `dia_ptr_iom`
* Zonal integrals are calculated by calling `ptr_sj` (calculates the integral for the tile) and a new subroutine `ptr_sum` (accumulates the `ptr_sj` integrals and calls `mpp_sum` on the last tile only)
* The number of communications is reduced from 90 to 13 (data for all basins are exchanged at once)
* Code relating to `mpp_sum` has been removed from `ptr_sj`
* Pointers and their target arrays have been removed from `ptr_sj`
* `dia_ptr_hst` has been condensed by moving the loop over basins out of the IF statements
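
The accumulate-then-reduce pattern introduced by `ptr_sum` can be sketched as follows (Python, schematic only; `mpp_sum` here is a stand-in for the MPI reduction, and the integrals are plain numbers rather than per-basin arrays):

```python
def total_integral(per_tile_integrals, mpp_sum):
    """Accumulate the per-tile zonal integrals locally, then perform a
    single reduction across processors once the last tile has
    contributed -- one communication instead of one per tile."""
    acc = 0.0
    for part in per_tile_integrals:   # one ptr_sj result per tile
        acc += part
    return mpp_sum(acc)               # single mpp_sum-style call

# Toy two-rank example: the stand-in 'mpp_sum' adds the other rank's
# accumulated value (7.0) to ours.
local = total_integral([1.0, 2.0], mpp_sum=lambda x: x + 7.0)   # = 10.0
```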

__`dia_ar5_hst` subroutine__

This code contains `iom_put` and `lbc_lnk` calls, which cannot be tiled.
The code has therefore been rearranged to separate the transport calculations from the `iom_put` calls.

The transport calculations are performed for each tile and saved in two new module variables, `hstr_adv` and `hstr_ldf`.
The `iom_put` calls are made only on the last tile, and the `lbc_lnk` calls have been removed as per #2367 (clean-up of communications).

__`prt_ctl` subroutine__

This code prints the sums of the input arrays over the processor domain, minus the sums calculated by the previous `prt_ctl` call.
It would be possible to implement tiling for this subroutine by aggregating the result over tiles, similar to the approach taken with `dia_ptr`.
However, this was difficult to implement cleanly, and bit-level differences in the global sum could not be avoided when using tiling.

I have therefore taken a simpler approach: if tiling is used, the sum is output for each tile rather than for the processor domain.
The `prt_ctl` utility can then be used to diagnose differences on a tile-by-tile basis per processor.
However, the `prt_ctl` output cannot be compared between a simulation with tiling and one without.
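
The bit-level discrepancy is the usual floating-point non-associativity: summing per tile and then combining the partial sums visits the values in a different order than a single full-domain sum. A minimal Python illustration:

```python
# Summing the same values grouped per 'tile' need not reproduce the
# single-pass 'processor domain' sum at the bit level, because
# floating-point addition is not associative.
vals = [0.1, 0.2, 0.3, 1e16, -1e16, 0.3]

full_sum = sum(vals)                          # one pass over all points
tile_sums = [sum(vals[:3]), sum(vals[3:])]    # two 'tiles'
tiled_sum = sum(tile_sums)

different = full_sum != tiled_sum             # the two orders disagree
```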

__`timing` module__

The timing utility already works with the tiling; the only change is to ensure that the call iteration counter is incremented once for all tiles.

__Trends diagnostics__

Tiling has not yet been implemented in these diagnostics, meaning that tiling has to be disabled for the various `trd_tra` calls throughout the TRA modules.

Rather than add IF statements around each of these calls, I simply set `ln_tile = .false.` in `trd_init` if the trends diagnostics are used.

==== Temporary workarounds

The affected code has been marked with a `! TEMP: [tiling]` comment.

__`iom_put` calls__

XIOS does not currently support tiling, so the data must be complete (i.e. all tiles must have finished) at the time of an `iom_put` call.
The general workaround is to call `iom_put` only on the last tile.
Additional workarounds are required for some local working arrays, which are not preserved between subsequent calls to the subroutine.

Some code was rearranged in order to calculate the diagnostic and call `iom_put` at the same time on the last tile (`traldf_triad.F90`).
In other cases this was not possible, so the working arrays were declared with `SAVE` so that they could be processed by each tile (`traadv.F90`, `tramle.F90`).
In all cases, it is necessary to declare the working arrays with the size of the full domain (`DIMENSION(jpi,jpj)`) instead of the tile (`DIMENSION(A2D(nn_hls))`).

XIOS support for tiling is expected to be fully implemented sometime in December-January, at which point all of the `iom_put` workarounds can be removed.

A preliminary development branch has been provided for testing by Olga Abramkina.
It seems likely that the workarounds would need to stay in place for the merge, as I presume that we would not want to merge based on a development version of XIOS.
However, the workarounds could certainly be removed post-merge, before the 4.2 release.

__`lbc_lnk` calls__

As with `iom_put`, `lbc_lnk` can only be called once all tiles have finished and the data is complete.
The general workaround is the same as for `iom_put`: call `lbc_lnk` only on the last tile.

This is done in a few cases (`tranpc.F90`, `traqsr.F90`), but it was often cleaner to simply disable the tiling due to the frequency of `lbc_lnk` calls.
This has been implemented in `tra_adv` for all schemes except 2nd order centred advection (`ln_traadv_cen = .true.` with `nn_cen_h = 2`), in `tra_ldf` for all bi-laplacian schemes, and for calls to `zps_hde`.
It was also necessary to split the tiling loop in `step.F90` so that the first loop ends before `tra_adv`, in order to preserve results.

Most of the `lbc_lnk` calls are removed in the `nn_hls = 2` case by #2367 (clean-up of communications), which will be merged with the tiling branch to form the basis of this year's merge party.
#2367 also removes several `lbc_lnk` calls that were used to set the halo points on arrays being passed to `iom_put`; these have already been removed from the tiling branch.

Tiling will only be used with `nn_hls = 2`, so most of the remaining `lbc_lnk` workarounds can then be removed.
However, there are a number of changes in #2367 that prevent tiling, such as the addition of `lbc_lnk` calls in `tra_adv` and `zps_hde`, as well as the expansion of some DO loop bounds so that they work on the halo (see the "DO loop bounds" section above).
Removal of the workarounds therefore depends on whether these issues can be resolved before the merge.

==== List of new variables and functions (excluding local)

* Global variables (`par_oce.F90`, `dom_oce.F90`)
  * `ntsi`, `ntsj` - start indices of the current tile
  * `ntei`, `ntej` - end indices of the current tile
  * `ntsi_a`, `ntsj_a` - start indices of each tile
  * `ntei_a`, `ntej_a` - end indices of each tile
  * `ntile` - current tile number
  * `nijtile` - number of tiles
* Module variables
  * `hstr_adv`, `hstr_ldf` (`diaar5.F90`) - saved transports
  * `pvtr_int`, `pzon_int` (`diaptr.F90`) - zonal integrals
  * `jp_msk`, `jp_vtr` (`diaptr.F90`) - indices for `pvtr_int` & `pzon_int`
  * `nnpcc` (`tranpc.F90`) - replaces local variable `inpcc`
* Namelist `namtile` (`dom_oce.F90`)
  * `ln_tile` - logical control on use of tiling
  * `nn_ltile_i`, `nn_ltile_j` - tile lengths
* Pre-processor macros (`do_loop_substitute.h90`)
  * `A1Di`/`A1Dj`/`A2D` - substitutions for ALLOCATE or DIMENSION arguments
  * `A1Di_T`/`A1Dj_T`/`A2D_T` - substitutions for ALLOCATE or DIMENSION arguments when the shape of the actual array argument varies between calls
  * `JPK`/`JPTS`/`KJPT` - placeholders for `:` to preserve readability
* Functions and subroutines
  * `dom_tile` (`domain.F90`) - calculate/set tiling variables (domain indices, number of tiles)
  * `is_tile` (`domutl.F90`) - returns 0 if the array has the dimensions of the full domain, else 1
  * `ptr_sum` (`diaptr.F90`) - sum `ptr_sj` zonal integrals over tiles and processors to get the total
  * The following subroutines have been renamed to `<SUBROUTINE>_t`, where `<SUBROUTINE>` is now a wrapper subroutine for `<SUBROUTINE>_t`:
    * `eos_insitu`, `eos_insitu_pot`, `eos_insitu_2d`, `rab_3d`, `rab_2d`, `bn2`, `eos_fzp_2d` (`eosbn2.F90`)
    * `tra_ldf_iso` (`traldf_iso.F90`)
    * `tra_ldf_lap` (`traldf_lap_blp.F90`)
    * `tra_ldf_triad` (`traldf_triad.F90`)
    * `prt_ctl` (`prtctl.F90`)
    * `zps_hde`, `zps_hde_isf` (`zpshde.F90`)