= Name and subject of the action

Last edition: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay on preview (or review) is longer than the expected 2 weeks.

[[PageOutline(2, , inline)]]

== Summary

||=Action       || Implement 2D tiling (continuation of [wiki:2020WP/HPC-02_Daley_Tiling 2020 work]) ||
||=PI(S)        || Daley Calvert ||
||=Digest       || Implement 2D tiling in `DYN` and `ZDF` code ||
||=Dependencies || Cleanup of `lbc_lnk` calls (wiki:2021WP/HPC-03_Mele_Comm_Cleanup), including the science-neutral extra-halo changes ([https://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/ticket2607_r14608_halo1_halo2_compatibility ticket2607_r14608_halo1_halo2_compatibility branch]) ||
||=Branch       || source:/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling ||
||=Previewer(s) || Italo Epicoco ||
||=Reviewer(s)  || Italo Epicoco ||
||=Ticket       || #2600 ||

=== Description

Implement tiling for code appearing in `stp` and `stpmlf` (DYN and ZDF modules), and clean up the existing tiling code.

=== Implementation

==== Branch

[http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling?rev=14819 dev_r14273_HPC-02_Daley_Tiling@14819]

[https://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/ticket2607_r14608_halo1_halo2_compatibility ticket2607_r14608_halo1_halo2_compatibility] has been merged into the trunk at r14820.

* [http://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=14819%40NEMO%2Fbranches%2F2021%2Fdev_r14273_HPC-02_Daley_Tiling/src/OCE&old=14820%40NEMO%2Ftrunk/src/OCE Difference vs trunk@14820]

[http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/dev_r14393_HPC-03_Mele_Comm_Cleanup?rev=14776 dev_r14393_HPC-03_Mele_Comm_Cleanup@14776] has been merged into the branch.

* [http://forge.ipsl.jussieu.fr/nemo/changeset?sfp_email=&sfph_mail=&reponame=&new=14805%40NEMO%2Fbranches%2F2021%2Fdev_r14273_HPC-02_Daley_Tiling/src/OCE&old=14776%40NEMO%2Fbranches%2F2021%2Fdev_r14393_HPC-03_Mele_Comm_Cleanup%2Fsrc%2FOCE Difference@14805 vs dev_r14393_HPC-03_Mele_Comm_Cleanup]

The following bug fixes were applied to the trunk post-merge:

* [14840] - Add `ln_tile` to ORCA2_ICE_PISCES/namelist_cfg
* [14845] - Fix diagnostics preventing ORCA2_ICE_PISCES from running with `nn_hls = 2` and tiling
* [14857] - Fixes in MY_SRC for `nn_hls = 2`/tiling and in traadv_fct.F90 for `nn_hls = 1`
* [14882] - Fix diagnostics preventing ORCA2_ICE_PISCES from running with `nn_hls = 2` and tiling (pieces missing from r14845)
* [14903] - Fix a bug with the A1Di/A1Dj/A2D macros, update the standard tiling namelists

==== Changes to tiling framework

__New DO loop macros__

Each tile is effectively a subdomain with the same structure as the full processor domain, i.e. it has an internal part (with `i` indices `ntsi:ntei` and `j` indices `ntsj:ntej`) and a halo. The example below demonstrates that operations performed on the halo points of one tile will affect the internal part of adjacent tiles.

[[Image(tiling_overlap.png)]]

This is a common issue: it occurs whenever a DO loop does work on the halo points of a full-sized array (i.e. one that is the size of the full domain, rather than just the tile) that is persistent in memory (i.e. declared at the module level, or as an allocatable with the `SAVE` attribute).
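A minimal sketch of the problematic pattern is given below; `zacc` is a hypothetical module-level array and the right-hand side is purely illustrative, but any persistent, full-domain array assigned over halo points behaves in the same way:

{{{#!fortran
REAL(wp), ALLOCATABLE, SAVE, DIMENSION(:,:) ::   zacc   ! persistent, full-domain array

! Loop over the tile's internal points plus a 1-point halo: the halo points of
! this tile are internal points of the adjacent tiles, so the accumulation is
! applied a second time at those points when the adjacent tiles are processed.
DO_2D( 1, 1, 1, 1 )
   zacc(ji,jj) = zacc(ji,jj) + qsr(ji,jj)
END_2D
}}}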
Therefore, the [http://forge.ipsl.jussieu.fr/nemo/changeset/14780/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/TRA/traqsr.F90?old=14215&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FTRA%2Ftraqsr.F90 existing workaround must be replaced] by something better, in order to avoid many code changes.

A new set of [http://forge.ipsl.jussieu.fr/nemo/changeset/14797/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/do_loop_substitute.h90?old=14215&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2Fdo_loop_substitute.h90 DO loop macros] has been added as one solution to this problem. However, as noted in a [http://forge.ipsl.jussieu.fr/nemo/wiki/2021WP/HPC-02_Daley_Tiling#Otherchangesofnote later section] (specifically, the changes to `zdfphy.F90` and `zdfevd.F90`), this does not resolve all of the issues caused by calculation overlap when using tiling.

The aim is to avoid repeating operations on particular points by adjusting the DO loop bounds, which has the advantage of reducing unnecessary calculations when using tiling. This is achieved by keeping track of which tiles have been completed (the `l_tilefin` module variable in `domtile.F90`) and using the stencil of the DO loop to work out which points have therefore already been processed:

{{{#!fortran
#define DO_2D_OVR(L, R, B, T)   DO_2D(L-(L+R)*nthl, R-(R+L)*nthr, B-(B+T)*nthb, T-(T+B)*ntht)
}}}

Here, `nthl`/`nthr`/`nthb`/`ntht` are equal to 1 if work on the adjacent left/right/bottom/top tile has finished, and equal to 0 otherwise. If there is no adjacent tile (i.e. the tile is at the edge of the domain), the corresponding integer is also 0.

As a general rule (although it is not always necessary), these new DO loop macros must be used whenever:

* The bounds of the DO loop include the halo (i.e. the offset is greater than 0)
* The DO loop contains an assignment to an array that is persistent in memory (e.g. the state variables `ts`, `uu`, `vv` etc.)

As such, they have been implemented widely in this development (a usage sketch is given below) and have replaced the previous workarounds in:

* TRA/trabbl.F90
* TRA/tranpc.F90
* TRA/traqsr.F90
* TRA/trasbc.F90
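As an illustration of the rule above, a minimal usage sketch is given below; `zwrk` is a hypothetical persistent (module-level), full-domain array and the right-hand side is purely illustrative:

{{{#!fortran
! With DO_2D( 1, 1, 1, 1 ) each tile would also write its halo points, which
! are internal points of the neighbouring tiles.  DO_2D_OVR trims the bounds
! on any side where the adjacent tile has already been processed
! (nthl/nthr/nthb/ntht = 1), so each point is written exactly once.
DO_2D_OVR( 1, 1, 1, 1 )
   zwrk(ji,jj) = 0.5_wp * qsr(ji,jj)
END_2D
}}}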
__Restructuring of core `domtile.F90` routines__

Several [http://forge.ipsl.jussieu.fr/nemo/changeset/14797/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/DOM/domtile.F90?old=14090&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FDOM%2Fdomtile.F90 new routines] have been added to `domtile.F90` in order to implement the new DO loop macros, to make the tiling implementation tidier and easier to use, and to add warnings/errors:

* `dom_tile` - sets the currently active tile
* `dom_tile_init` - contains initialisation steps moved out of `dom_tile`
* `dom_tile_start` - declares the start of a tiled code region
* `dom_tile_stop` - declares the end of a tiled code region

The following variables have been added (see [https://forge.ipsl.jussieu.fr/nemo/wiki/2020WP/HPC-02_Daley_Tiling#Listofnewvariablesandfunctionsexcludinglocal here] for a list of the pre-existing tiling variables):

* `INTEGER, PUBLIC :: nthl, nthr, nthb, ntht` - modifiers on the bounds in the new DO loop macros (see above)
* `LOGICAL, PUBLIC :: l_istiled` - whether tiling is **currently** active
  * This replaces instances of `ntile /= 0` and `ntile == 0` throughout the code
* `LOGICAL, ALLOCATABLE, DIMENSION(:) :: l_tilefin` - whether a specific tile has been processed (size `nijtile`)

Below is an example of how the tiling routines are now used, as well as the action taken by each routine:

{{{#!fortran
IF( ln_tile ) CALL dom_tile_start                                         ! Set l_istiled = .true.

DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = jtile )   ! Set ntile = ktile
                                                                          ! Set ntsi/ntei/ntsj/ntej for the given tile (ktile)
                                                                          ! Set nthl/nthr/nthb/ntht for the given tile (ktile)
                                                                          ! Set l_tilefin = .true. for the previous tile
   ! Tiled code
END DO

IF( ln_tile ) CALL dom_tile_stop                                          ! Set ntile = 0
                                                                          ! Set l_istiled = .false.
                                                                          ! Set ntsi/ntei/ntsj/ntej for the full domain (equal to Nis0/Nie0/Njs0/Nje0)
                                                                          ! Set nthl/nthr/nthb/ntht for the full domain (equal to 0)
                                                                          ! Set l_tilefin(:) = .false.
}}}

In addition, the new tiling routines include a "pause/resume" functionality, which is activated by setting `ldhold = .true.`. This [http://forge.ipsl.jussieu.fr/nemo/changeset/14780/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/DOM/dtatsd.F90?old=14189&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FDOM%2Fdtatsd.F90 replaces the existing workaround] with a tidier implementation. Below is an example of how this is used, in the context of the above example:

{{{#!fortran
IF( ln_tile ) CALL dom_tile_start

DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = jtile )

   ! Tiled code

   IF( ln_tile ) CALL dom_tile_stop( ldhold=.TRUE. )                      ! Set l_istiled = .false.
                                                                          ! Set ntsi/ntei/ntsj/ntej for the full domain (equal to Nis0/Nie0/Njs0/Nje0)
                                                                          ! Set nthl/nthr/nthb/ntht for the full domain (equal to 0)
   ! Untiled code

   IF( ln_tile ) CALL dom_tile_start( ldhold=.TRUE. )                     ! Set l_istiled = .true.
                                                                          ! Set ntsi/ntei/ntsj/ntej for the currently active tile (ntile)
                                                                          ! Set nthl/nthr/nthb/ntht for the currently active tile (ntile)
END DO

IF( ln_tile ) CALL dom_tile_stop
}}}

__XIOS tiling support__

Support for tiled data has been added to the XIOS trunk at [http://forge.ipsl.jussieu.fr/ioserver/changeset/2131 r2131]. These changes have been implemented in [http://forge.ipsl.jussieu.fr/nemo/changeset/14607/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/IOM/iom.F90?old=14239&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FIOM%2Fiom.F90 iom.F90]:

* New XIOS interface arguments added to the `iom_set_domain_attr` and `set_grid` routines
* `ntile` argument added to `xios_send_field` in the 2D/3D/4D routines used by the `iom_put` interface

Data is only passed to XIOS if one of the following holds:

* The array is smaller than the full domain (i.e. it is tile sized)
* The array is the size of the full domain AND tiling is not active
* The array is the size of the full domain AND the current tile is the final tile

This is necessary because not all `iom_put` calls will pass tile-sized data when tiling is active (e.g. global arrays, which are the size of the full domain and must be calculated gradually by each tile). It was also necessary to expand the `is_tile` interface in `DOM/domutl.F90` to include sp and dp versions.

The workarounds required to use `iom_put` with tiling (e.g. in [http://forge.ipsl.jussieu.fr/nemo/changeset/14607/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/TRA/trabbc.F90?old=14072&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FTRA%2Ftrabbc.F90 trabbc.F90]) have been removed.

* **This development therefore requires the use of the [http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/trunk?rev=2131 XIOS trunk at r2131]**

==== Tiling code coverage

__DYN and ZDF coverage__

Tiling coverage in `stp` (`step.F90`) and `stpmlf` (`stpmlf.F90`) has been expanded to include the DYN and ZDF code:

* All routines in the DYN 'block' of code except `ssh_nxt`, `wzv`, `wAimp` and `dyn_spg`
* All routines in `zdf_phy` except `zdf_osm`

__Improved TRA coverage__

Tiling can now be used with most of the TRA code. In particular, the bilaplacian lateral diffusion operator (`ln_traldf_blp = .true.`) and all advection schemes except the FCT scheme can now be used with tiling.

Following the [wiki:2021WP/HPC-03_Mele_Comm_Cleanup extended haloes development], most of the `lbc_lnk` calls affecting the tiled code have been removed in the `nn_hls = 2` case. As a result, the tiling workarounds required to bypass `lbc_lnk` calls (e.g. in [http://forge.ipsl.jussieu.fr/nemo/changeset/14780/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling/src/OCE/TRA/traldf.F90?old=14189&old_path=NEMO%2Ftrunk%2Fsrc%2FOCE%2FTRA%2Ftraldf.F90 traldf.F90]) have mostly been removed.

* **This development therefore requires the use of `nn_hls = 2` when tiling is enabled with `ln_tile = .true.`**
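For reference, a minimal sketch of the namelist settings needed to activate tiling with the extended halo is given below. The parameter names follow those used elsewhere on this page (`ln_tile`, `nn_ltile_i`, `nn_hls`); the `&namtile` and `&nammpp` group names and the `nn_ltile_j` parameter are assumptions that should be checked against the merged namelist_ref:

{{{
!-----------------------------------------------------------------------
&namtile        !   parameters of the tiling                (hedged sketch)
!-----------------------------------------------------------------------
   ln_tile    = .true.    !  activate 2D tiling (requires nn_hls = 2)
   nn_ltile_i = 10        !  tile length in i
   nn_ltile_j = 10        !  tile length in j
/
!-----------------------------------------------------------------------
&nammpp         !   Massively Parallel Processing
!-----------------------------------------------------------------------
   nn_hls     = 2         !  halo width (extended halo required by the tiled code)
/
}}}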
__Untiled code__

The tiling has been implemented in the standard (`step.F90`) and QCO (`stpmlf.F90`) code, as well as in the code relating to loop fusion (`key_loop_fusion`). The code relating to the RK3 scheme (`key_RK3`) has not been tiled.

==== Other changes of note

__Removed workarounds__

* `TRA/traadv.F90` - workarounds for tiling are now only required for the FCT scheme
* `TRA/traadv.F90` - changed the input array dimension declarations for several routines called from `tra_adv`
  * This was done to remove arguments causing "copy-in" to occur, e.g. `CALL tra_mle_trp( kt, nit000, zuu(A2D(nn_hls),:), ...`
  * The "copy-in" for `dia_ptr` is still present
* `TRA/traldf.F90` - workarounds for tiling are no longer required

__Removal of tiling__

* `ASM/asminc.F90` - removed tiling for the code activated by `ln_asmdin = .true.`
  * This code seems to only be called during initialisation, which is not within the tiled code region

__Refactoring__

* `DYN/dynhpg.F90` - parts of `hpg_djc` have been refactored to give consistent results for different `nn_hls`
  * Harmonic averages have been refactored to use the machine epsilon coefficient (`zep`) in a similar way to other code in NEMO
  * **This changes the results with respect to the trunk**

{{{#!fortran
! Old
cffu = 2._wp * zdrhox(ji-1,jj,jk) * zdrhox(ji,jj,jk)
IF( cffu > zep ) THEN
   zdrho_i(ji,jj,jk) = cffu / ( zdrhox(ji-1,jj,jk) + zdrhox(ji,jj,jk) )
ELSE
   zdrho_i(ji,jj,jk ) = 0._wp
ENDIF

! New
cffu = MAX( 2._wp * zdrhox(ji-1,jj,jk) * zdrhox(ji,jj,jk), 0._wp )
z1_cff = zdrhox(ji-1,jj,jk) + zdrhox(ji,jj,jk)
zdrho_i(ji,jj,jk) = cffu / SIGN( MAX( ABS(z1_cff), zep ), z1_cff )
}}}

* `DYN/dynhpg.F90` - `hpg_prj` has been refactored to implement the tiling and to improve readability
* `DYN/dynldf_iso.F90` - several working arrays have been declared as automatic local arrays instead of allocatable module arrays
  * This was seen as a tidier implementation, and there was no need for these arrays to persist in memory
* `DYN/wet_dry.F90` - added a `ctl_warn` when `ln_wd_il = .true.`
  * This option controls several code blocks throughout `dyn_hpg.F90`
  * Tiling has not been tested with this option, but it is apparently due to be deprecated in favor of the less intrusive `ln_wd_dl` option, so there are no plans to test it
* `ZDF/zdfphy.F90` - a read-only copy of `avm_k` (`avm_k_n`) is saved and used by `zdf_sh2` when using tiling
  * The closure schemes (`zdf_tke` etc.) update `avm_k`, which is used for the calculation of `zsh2` by `zdf_sh2` with a stencil that includes halo points. When using tiling, the calculation of `zsh2` would therefore include `avm_k` values that have already been updated by adjacent tiles (and are therefore valid for the next timestep), changing the results
  * To preserve results when using tiling, a read-only copy of the "now" `avm_k` is saved for use in the calculation of `zsh2`
* `ZDF/zdfevd.F90` - `zavt_evd`/`zavm_evd` have been declared as allocatable arrays instead of automatic arrays
  * `p_avt`/`p_avm` are updated on halo points when using `nn_hls > 1`. When using tiling, `p_avt`/`p_avm` will therefore already have been partially updated by adjacent tiles, since the halo of a tile corresponds to internal points of adjacent tiles. `zavt_evd`/`zavm_evd` then evaluate to zero on these points, changing the results
  * To preserve results when using tiling, the diagnostic data is not sent for each tile, but is instead stored for the full domain and sent once after all tiles are complete
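A minimal sketch of the pattern used for such diagnostics is given below (illustrative only, not the actual `zdfevd.F90` code; `zdiag3d` and the output field name `'mydiag'` are hypothetical). The diagnostic array is dimensioned for the full domain, each tile fills only its own points, and `iom_put` forwards full-domain data to XIOS only once the final tile has been processed (see the XIOS tiling support section above), so the call can simply be made on every tile:

{{{#!fortran
DO_3D( 0, 0, 0, 0, 1, jpkm1 )
   zdiag3d(ji,jj,jk) = p_avt(ji,jj,jk)   ! this tile's contribution to the full-domain diagnostic
END_3D
CALL iom_put( 'mydiag', zdiag3d )        ! only forwarded to XIOS on the final tile
}}}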
__Bug fixes__

* `ZDF/zdfmxl.F90` - the calculation of `hmld` has been moved into a new routine `zdf_mxl_turb`
  * This was done to address a bug in `zdftke.F90` that caused the results to change when using tiling with `nn_etau = 2`
  * When using `nn_etau = 2`, `zdf_tke_init` calls `zdf_mxl` to initialise `nmln`, which depends on `rn2b`. However, `rn2b` is not yet initialised at this point, so the `nn_etau = 2` option is not restartable and tiling changes the results. Furthermore, the diagnostics calculated by `zdf_mxl` are not correct for the first timestep
  * To address these issues, the calculation of `hmld` was moved into a new routine `zdf_mxl_turb`. `zdf_mxl` is now called before `zdf_sh2` in `zdfphy.F90`, while `zdf_mxl_turb` is called where `zdf_mxl` was previously called. Additionally, `zdf_mxl` is no longer called by `zdf_tke_init`
  * **This bug fix changes the results with respect to the trunk** when using `ln_zdftke = .true.` with `nn_etau = 2`

==== Outstanding issues

* The new DO loop macros result in line lengths that easily exceed 132 characters
  * This can be overcome by using the appropriate compiler options, but lines longer than 132 characters are not permitted by the NEMO coding standard

==== List of new variables (excluding local) and functions

* Global variables
  * `nthl`/`nthr`/`nthb`/`ntht` (`par_oce.F90`) - modifiers on the bounds in the new DO loop macros
  * `l_istiled` (`DOM/dom_oce.F90`) - whether tiling is currently active
* Module variables
  * `l_tilefin` (`DOM/domtile.F90`) - whether a specific tile has been processed (size `nijtile`)
  * `avm_k_n` (`ZDF/zdfphy.F90`) - copy of `avm_k` passed to `zdf_sh2` when using tiling
* Preprocessor macros
  * `DO_2D_OVR`/`DO_3D_OVR`/`DO_3DS_OVR` (`do_loop_substitute.h90`) - new versions of the DO loop macros that avoid repeating calculations due to the overlap of tiles
* Functions and subroutines
  * `dom_tile_init` (`DOM/domtile.F90`) - initialisation of tiling
  * `dom_tile_start` (`DOM/domtile.F90`) - declares the start of a tiled code region
  * `dom_tile_stop` (`DOM/domtile.F90`) - declares the end of a tiled code region
  * `zdf_mxl_turb` (`ZDF/zdfmxl.F90`) - calculation of `hmld`, separated from `zdf_mxl`
  * `is_tile_*_sp`, `is_tile_*_dp` (`DOM/domutl.F90`) - single and double precision versions of the existing `is_tile` functions
  * The following subroutines have each been renamed with a `_t` suffix, with the original name now a wrapper routine for the `_t` version:
    * `dyn_ldf_lap` (`DYN/dynldf_lap_blp.F90`)
    * `eos_insitu_pot_2d` (`TRA/eosbn2.F90`)

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).
}}}

''...''

== Preview

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

=== SETTE

SETTE ([http://forge.ipsl.jussieu.fr/nemo/browser/utils/CI/sette?rev=14561 r14561]) has been run with the QCO (`NOT_USING_QCO`) and icebergs (`USING_ICEBERGS`) options both turned on and off. These tests are run for [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling?rev=14819 dev_r14273_HPC-02_Daley_Tiling@14819] with the following:

* `nn_hls = 1` (`USING_EXTRA_HALO="no"`)
* `nn_hls = 2` (`USING_EXTRA_HALO="yes"`)
* `nn_hls = 2` (`USING_EXTRA_HALO="yes"`) and `ln_tile = .true.` (using the default 10i x 10j tile size)

and are compared with results from the [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14820 trunk@14820] with `nn_hls = 1` and the same settings for `NOT_USING_QCO`/`USING_ICEBERGS`. The Intel compiler (ifort 18.0.5 20180823, `XC40_METO_IFORT` arch file) is used with XIOS ([http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/trunk?rev=2131 r2131 of the trunk]) in detached mode.

All tests (including SWG) pass. Note, however, that the `USING_EXTRA_HALO` option is only used by ORCA2_ICE_PISCES; all other tests fail when tiling is requested, due to the requirement of `nn_hls = 2`.

All tests give the same results as the trunk when icebergs are turned off. When icebergs are activated, tests using `nn_hls = 2` give different results from the trunk (which is always run using `nn_hls = 1`). This is a known issue when using `nn_hls = 2`.

==== Regular checks

* Can this change be shown to produce expected impact (option activated)? __YES__
* Can this change be shown to have a null impact (option not activated)? __YES__
* Results of the required bit comparability tests been run: are there no differences when activating the development? __YES (SETTE), NO (other tests)__
* If some differences appear, is reason for the change valid/understood? __YES (see [http://forge.ipsl.jussieu.fr/nemo/wiki/2021WP/HPC-02_Daley_Tiling#Testfailures test failures])__
* If some differences appear, is the impact as expected on model configurations? __YES__
* Is this change expected to preserve all diagnostics? __NO__
* If no, is reason for the change valid/understood? __YES (see [http://forge.ipsl.jussieu.fr/nemo/wiki/2021WP/HPC-02_Daley_Tiling#Testfailures test failures])__
* Are there significant changes in run time/memory? __NO__
  * ORCA2_ICE_PISCES/REPRO* changes with respect to the trunk using `nn_hls = 1` (icebergs turned off):
    * Execution time
      * QCO, `nn_hls = 1`: < 5%
      * QCO, `nn_hls = 2`: + 5-7%
      * QCO, `nn_hls = 2` and `ln_tile = .true.`: + 6-7%
      * non-QCO, `nn_hls = 1`: < 5%
      * non-QCO, `nn_hls = 2`: + 4-6%
      * non-QCO, `nn_hls = 2` and `ln_tile = .true.`: + 8-11%
    * Memory
      * QCO, `nn_hls = 1`: < 0.1Gb
      * QCO, `nn_hls = 2`: + 0.4Gb
      * QCO, `nn_hls = 2` and `ln_tile = .true.`: + 0.7-0.8Gb
      * non-QCO, `nn_hls = 1`: < 0.1Gb
      * non-QCO, `nn_hls = 2`: + 0.4-0.5Gb
      * non-QCO, `nn_hls = 2` and `ln_tile = .true.`: + 0.8-0.9Gb
    * The increase in execution time due to tiling is likely because the 10x10 tile size is not optimal (`nn_ltile_i < Ni_0` causes performance loss)
    * The trunk with `nn_hls = 1` uses about 11Gb of memory
    * The increase in memory due to `nn_hls = 2` is probably mostly due to an increase in the size of the domain
    * The increase in memory due to tiling may be because a number of additional arrays are declared when `ln_tile = .true.`

=== SETTE (post merge)

The SETTE tests have been repeated with the [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14922 trunk@14922] in order to include bug fixes that allow all SETTE tests to be run with `nn_hls = 2` and tiling. The tests are the same as detailed above except:

* The [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14922 trunk@14922] is used (but still compared with results from the [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14820 trunk@14820])
* [http://forge.ipsl.jussieu.fr/nemo/browser/utils/CI/sette?rev=14844 SETTE@14844] is used
* Additional tests with `key_loop_fusion` have been performed
* `nn_hls = 2` is set directly in namelist_ref, instead of via `USING_EXTRA_HALO`, in order to run all SETTE tests with the extended haloes (and tiling)
* The default tile size in namelist_ref is 99999i x 10j (to ensure there is always only 1 tile in i)
* Icebergs are not activated

All SETTE tests pass and give the same results as the [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14820 trunk@14820], except AGRIF_DEMO, which differs after 17 timesteps for all `nn_hls = 2` tests. This is thought to be because one of the AGRIF domains in this configuration is not large enough for `nn_hls = 2`.

==== Regular checks

All checks are the same as before, but the run time/memory changes are significant in some cases. These are reported here for increases in time/memory larger than 10% that are present in both REPRO experiments of a configuration:

* QCO, `nn_hls = 1`
  * No significant changes
* QCO, `nn_hls = 2`
  * GYRE_PISCES: time + 13-18%, memory + 13-18%
* QCO, `nn_hls = 2` and `ln_tile = .true.`
  * AMM12: memory + 18%
  * WED025: memory + 17%
* QCO, loop fusion and `nn_hls = 2`
  * AMM12: time + 20%
* QCO, loop fusion, `nn_hls = 2` and `ln_tile = .true.`
  * AGRIF_DEMO: time + 11-15%
  * AMM12: memory + 17-20%
  * WED025: memory + 19%
* non-QCO, `nn_hls = 1`
  * No significant changes
* non-QCO, `nn_hls = 2`
  * No significant changes
* non-QCO, `nn_hls = 2` and `ln_tile = .true.`
  * AGRIF_DEMO: memory + 13%
  * AMM12: memory + 18-20%
  * GYRE_PISCES: time + 11-24%
  * ORCA2_ICE_OBS: memory + 12-16%
  * WED025: memory + 15-16%
* non-QCO, loop fusion and `nn_hls = 2`
  * ORCA2_ICE_OBS: time + 11-17%
* non-QCO, loop fusion, `nn_hls = 2` and `ln_tile = .true.`
  * AGRIF_DEMO: memory + 11-12%
  * AMM12: memory + 21-23%
  * WED025: memory + 17-19%

The time increases do not seem consistent enough to indicate a systematic issue.
However, there is evidence to suggest that tiling increases the memory cost of AGRIF_DEMO (11-13%), AMM12 (17-23%) and WED025 (15-19%). This is partly due to the use of `nn_hls = 2`, which increases the domain size, but in AMM12 and WED025 this is only responsible for up to 7% of the increased memory cost.

=== Development testing

A configuration based on ORCA2_ICE_PISCES (without `key_si3` or `key_top`) was used to test the code modified by the tiling development. To facilitate cleaner testing, `ln_trabbc`, `ln_trabbl`, `ln_icebergs`, `ln_rnf`, `ln_ssr`, `ln_tradmp`, `ln_ldfeiv`, `ln_traldf_msc`, `ln_mle`, `ln_zdfddm` and `ln_zdfiwm` were all set to `.false.`. `ln_qsr_2bd` was used instead of `ln_qsr_rgb`, `nn_havtb`/`nn_etau`/`nn_ice`/`nn_fwb` were set to 0, and `nn_fsbc` was set to 1.

All tests were run with the standard VVL code, the QCO code (`key_qco`) and the new linear free surface code (`key_linssh`). The Intel compiler (ifort 18.0.5 20180823) was used with XIOS ([http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/trunk?rev=2131 r2131 of the trunk]) in detached mode. A `jpni = 4`, `jpnj = 9` decomposition was used with 6 XIOS processors.

Simulations using [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling?rev=14819 dev_r14273_HPC-02_Daley_Tiling@14819] and the [http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/trunk?rev=14820 trunk@14820] were run for 100 days with 1-day diagnostic output, for all scientific options relevant to the affected code. Each simulation of the tiling branch was run with:

1. `nn_hls = 1`
2. `nn_hls = 2`
3. `nn_hls = 2` and `ln_tile = .true.`, using 5x5 tiles
4. `nn_hls = 2` and `ln_tile = .true.`, using 50x50 tiles (equivalent to one tile over the full domain)

`run.stat` and diagnostic output were compared with simulations of the trunk using `nn_hls = 1`, and with equivalent simulations of the tiling branch that were run as two 50-day submissions (i.e. testing for restartability).

* **NOTE**: this testing is not exhaustive and covers only the scientific options required to test the code directly affected by the tiling, although some limited additional testing was included. For example, the `dynspg_ts` scheme is used in all tests as it is the standard setting for ORCA2_ICE_PISCES, but one additional test was included for the `dynspg_exp` scheme.

==== Test failures

This list does not include tests that fail due to pre-existing issues in the trunk (e.g. model crashes or restartability failures).

* Results differ when using `nn_hls = 2`
  * Standard (non-QCO) code
    * `ln_dynldf_blp = .true.` with `ln_dynldf_lev = .true.`
    * `ln_dynldf_blp = .true.` with `ln_dynldf_hor = .true.`
    * `ln_dynldf_lap = .true.` with `ln_dynldf_hor = .true.` and `ln_traldf_triad = .true.`
  * Standard (non-QCO) and `key_linssh` code
    * `ln_traldf_lap = .true.` with `ln_traldf_triad = .true.`
    * `ln_traldf_blp = .true.` with `ln_traldf_triad = .true.` and `ln_traldf_msc = .true.`
    * `ln_traldf_blp = .true.` with `ln_traldf_triad = .true.`, `ln_traldf_msc = .true.` and `ln_botmix_triad = .true.`
  * QCO code
    * `ln_dynldf_blp = .true.` with `ln_dynldf_lev = .true.` and `nn_dynldf_typ = 1`
  * **NOTE**: These differences are very hard to track down, as they seem to disappear when unrelated scientific options (e.g. vertical mixing coefficients) are changed. They are also completely different test failures from those of the previous tests using r14805. This indicates that the differences are very small and sensitive; they could perhaps be investigated at a later point.
__Expected failures__

* Results differ when using tiling
  * `ln_trabbl = .true.` with `nn_bbl_adv > 0`
    * This is a known issue from the [https://forge.ipsl.jussieu.fr/nemo/wiki/2020WP/HPC-02_Daley_Tiling#Knownfailuresindevelopmenttests 2020 development]
  * Diagnostics produced by `dia_ptr`
    * The diagnostics change only very slightly, only for a few tests, and only for a single point in the Indian Ocean basin where there are few zonal points
    * I suspect this is due to the additional integration step over tiles. I don't think it is a major issue, but I note it here for future investigation
* Results differ with respect to the trunk
  * `ln_zdftke = .true.` with `nn_etau = 2`
    * This is because of a bug fix (the results were incorrect in the trunk)
  * `ln_hpg_djc = .true.`
    * This is because of refactoring (machine epsilon is applied in a different way, to preserve results for different `nn_hls`)

==== Untested code

* Code in `DIA/diaptr.F90` for the `uocetr_vsum_cumul` diagnostic (`ptr_ci_2d` subroutine)
  * XIOS hangs when trying to output this diagnostic
* Code in `DYN/dynldf_iso.F90` that requires `ln_sco = .true.`
  * Specifically, this requires `ln_dynldf_hor .AND. ln_traldf_iso`
* Code that requires `ln_rnf_depth = .true.`
  * I was unable to produce the required input file
* Code that requires a wave model
  * Specifically, code requiring one or more of `ln_stcor = .true.`, `ln_vortex_force = .true.`, `ln_wave = .true.` and `ln_sdw = .true.`
* Code that requires a sea ice model
  * i.e. anything not covered by ORCA2_ICE_PISCES
* Code that requires a coupled surface boundary
  * Specifically, code requiring `key_oasis3` or `ln_cpl`
* The trends diagnostics were not enabled

== Review

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#review)]]
}}}

''...''