New URL for NEMO forge!   http://forge.nemo-ocean.eu

Since March 2022, along with the NEMO 4.2 release, code development has moved to a self-hosted GitLab.
This forge is now archived and remains online for reference.
2021WP/HPC-02_Daley_Tiling – NEMO

Version 4 (modified by hadcv, 3 years ago)

--

Name and subject of the action


The PI is responsible for closely following the progress of the action, and especially for contacting the NEMO project manager if the delay on the preview (or review) is longer than the expected two weeks.

  1. Summary
  2. Preview
  3. Tests
  4. Review

Summary

Action Implement 2D tiling (continuation of 2020 work)
PI(S) Daley Calvert
Digest Implement 2D tiling in DYN and ZDF code
Dependencies Cleanup of lbc_lnk calls (wiki:2021WP/HPC-03_Mele_Comm_Cleanup)
Branch source:/NEMO/branches/2021/dev_r14273_HPC-02_Daley_Tiling
Previewer(s) Italo Epico
Reviewer(s) Italo Epico
Ticket #2600

Description

Further implement tiling for code appearing in stp and stpmlf

Implementation

Branch

dev_r14273_HPC-02_Daley_Tiling@14787

Changes to tiling framework

New DO loop macros

Each tile is effectively a subdomain with the same structure as the full processor domain, i.e. it has an internal part (with i indices ntsi:ntei and j indices ntsj:ntej) and a halo. The example below demonstrates that operations performed on the halo points of one tile will affect the internal part of adjacent tiles.

This is quite a common issue. In fact, this will occur whenever a DO loop does work on halo points for a full-sized array (i.e. it is the size of the full domain, rather than just the tile) that is persistent in memory (i.e. it is declared at the module level or as allocatable with the SAVE attribute). Therefore, the existing workaround must be replaced by something better, in order to avoid many code changes.
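As a hedged illustration of the problem (the array zwrk and the loop body are hypothetical, not taken from the NEMO sources), consider a persistent, full-sized array updated by a DO loop whose bounds include the halo:

```fortran
! Hypothetical example: zwrk is NOT in the NEMO sources
REAL(wp), SAVE, DIMENSION(jpi,jpj) :: zwrk   ! full-sized, persistent array

DO_2D( 1, 1, 1, 1 )                          ! bounds include a 1-point halo
   zwrk(ji,jj) = zwrk(ji,jj) + 1._wp
END_2D
```

The halo points of one tile are internal points of its neighbours, so when the neighbouring tiles execute the same loop these points are incremented a second time.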

A new set of DO loop macros has been added as one solution to this problem.

The aim is to avoid repeating operations on particular points by adjusting the DO loop bounds, which has the advantage of reducing unnecessary calculations when using the tiling. This is achieved by keeping track of which tiles have been completed (the l_tilefin module variable in domtile.F90) and using the stencil of the DO loop to work out which points have already been processed:

#define DO_2D_OVR(L, R, B, T) DO_2D(L-(L+R)*nthl, R-(R+L)*nthr, B-(B+T)*nthb, T-(T+B)*ntht)

Here, nthl/nthr/nthb/ntht are equal to 1 if work on the adjacent left/right/bottom/top tiles is finished, otherwise they are equal to 0. If there is no adjacent tile (i.e. the tile is at the edge of the domain), then the corresponding integer is again equal to 0.
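As a worked example (assuming the usual convention that DO_2D(L, R, B, T) loops over ji = ntsi-L, ntei+R and jj = ntsj-B, ntej+T), the bound arithmetic behaves as follows:

```fortran
! DO_2D_OVR( 1, 1, 1, 1 ) with no finished neighbours
! (nthl = nthr = nthb = ntht = 0) reduces to DO_2D( 1, 1, 1, 1 ):
!    ji = ntsi-1, ntei+1   ! the full 1-point halo is included
!
! If the left tile is finished (nthl = 1), the left bound becomes
! L - (L+R)*nthl = 1 - 2 = -1, i.e. DO_2D( -1, 1, 1, 1 ):
!    ji = ntsi+1, ntei+1   ! ntsi-1 and ntsi are skipped, because the
!                          ! right-halo pass of the left tile (whose
!                          ! ntei+1 equals this tile's ntsi) has
!                          ! already updated them
```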

As a general rule (although it is not always necessary), this new DO loop macro must be used whenever:

  • The bounds of the DO loop include the halo (i.e. are greater than 0)
  • The DO loop contains an assignment to an array that is persistent in memory (e.g. the state variables ts, uu, vv, etc.)

As such, they have been implemented widely in this development and have replaced previous workarounds in:

  • TRA/trabbl.F90
  • TRA/tranpc.F90
  • TRA/traqsr.F90
  • TRA/trasbc.F90
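As a hedged before/after sketch of such a change (the loop body and the array ztrd are hypothetical, not copied from these files):

```fortran
! Before: a persistent state array is updated over the halo, so points
! shared with a finished neighbouring tile are incremented twice
DO_2D( 1, 1, 1, 1 )
   ts(ji,jj,1,jp_tem,Kaa) = ts(ji,jj,1,jp_tem,Kaa) + ztrd(ji,jj)
END_2D

! After: the OVR macro trims the bounds on sides where the adjacent
! tile has already been processed
DO_2D_OVR( 1, 1, 1, 1 )
   ts(ji,jj,1,jp_tem,Kaa) = ts(ji,jj,1,jp_tem,Kaa) + ztrd(ji,jj)
END_2D
```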

Restructuring of core domtile.F90 routines

Several new routines have been added to domtile.F90 in order to implement the new DO loop macros, to make the tiling implementation tidier and easier to use, and to add warnings/errors.

  • dom_tile - sets the currently active tile
  • dom_tile_init - contains initialisation steps moved out of dom_tile
  • dom_tile_start - declares the start of a tiled code region
  • dom_tile_stop - declares the end of a tiled code region

The following variables have been added (see here for a list of the pre-existing tiling variables):

  • INTEGER, PUBLIC :: nthl, nthr, nthb, ntht - modifiers on bounds in the new DO loop macros (see above)
  • LOGICAL, PUBLIC :: l_istiled - whether tiling is currently active
    • This replaces instances of ntile /= 0 and ntile == 0 throughout the code
  • LOGICAL, ALLOCATABLE, DIMENSION(:) :: l_tilefin - whether a specific tile has been processed (size nijtile)

Below is an example of how the tiling routines are now used, as well as the action taken by each routine:

IF( ln_tile ) CALL dom_tile_start                                          ! Set l_istiled = .true.

DO jtile = 1, nijtile 
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = jtile )    ! Set ntile = ktile
                                                                           ! Set ntsi/ntei/ntsj/ntej for the given tile (ktile)
                                                                           ! Set nthl/nthr/nthb/ntht for the given tile (ktile)
                                                                           ! Set l_tilefin = .true. for the previous tile 

   ! Tiled code

END DO
IF( ln_tile ) CALL dom_tile_stop                                           ! Set ntile = 0
                                                                           ! Set l_istiled = .false.
                                                                           ! Set ntsi/ntei/ntsj/ntej for the full domain (equal to Nis0/Nie0/Njs0/Nje0)
                                                                           ! Set nthl/nthr/nthb/ntht for the full domain (equal to 0)
                                                                           ! Set l_tilefin(:) = .false.

In addition, the new tiling routines include a "pause/resume" functionality, which is activated by the ldhold = .TRUE. argument. This replaces the existing workaround with a tidier implementation. Below is an example of how this is used, in the context of the above example:

IF( ln_tile ) CALL dom_tile_start

DO jtile = 1, nijtile 
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile = jtile )

   ! Tiled code

   IF( ln_tile ) CALL dom_tile_stop( ldhold=.TRUE. )                       ! Set l_istiled = .false.
                                                                           ! Set ntsi/ntei/ntsj/ntej for the full domain (equal to Nis0/Nie0/Njs0/Nje0)
                                                                           ! Set nthl/nthr/nthb/ntht for the full domain (equal to 0)
   ! Untiled code

   IF( ln_tile ) CALL dom_tile_start( ldhold=.TRUE. )                      ! Set l_istiled = .true.
                                                                           ! Set ntsi/ntei/ntsj/ntej for the currently active tile (ntile)
                                                                           ! Set nthl/nthr/nthb/ntht for the currently active tile (ntile)
END DO
IF( ln_tile ) CALL dom_tile_stop                                           

XIOS tiling support

Support for tiled data has been added to the XIOS trunk at r2131.

These changes to the Fortran interface have been implemented in iom.F90:

  • New XIOS interface routine arguments added to the iom_set_domain_attr and set_grid routines
  • ntile argument added to the xios_send_field calls in the 2D/3D/4D routines used by the iom_put interface
    • Data is only passed to XIOS if the data array is:
      • the size of the tile, OR
      • the size of the full domain AND tiling is not active, OR
      • the size of the full domain AND the current tile is the final tile
    • This is necessary because not all iom_put calls pass tile-sized data when tiling is active (e.g. global arrays, which are full sized and must be calculated gradually by each tile)
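A hedged sketch of the send condition (this is schematic; the names ll_send and pfield and the form of the size test are assumptions, not the actual iom.F90 code):

```fortran
! Schematic condition for forwarding data to XIOS from iom_put:
ll_send =      is_tile( pfield )    &   ! data is tile-sized
   &      .OR. .NOT. l_istiled      &   ! full domain, tiling inactive
   &      .OR. ntile == nijtile         ! full domain, final tile
IF( ll_send )   CALL xios_send_field( cdname, pfield )
```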

Additionally, it was necessary to expand the is_tile interface in DOM/domutl.F90 to include sp and dp versions.

The NEMO workarounds required to use iom_put with tiling (e.g. in trabbc.F90) have been removed.

This development therefore requires the use of the XIOS trunk at r2131.

Tiling code coverage

DYN and ZDF coverage

Tiling coverage in stp (step.F90) and stpmlf (stpmlf.F90) has been expanded to include DYN and ZDF code.

  • All routines in the DYN 'block' of code except ssh_nxt, wzv, wAimp and dyn_spg
  • All schemes in zdf_phy except zdf_osm

Expanded TRA coverage

Tiling can now be used with most of the TRA code, in particular the bilaplacian lateral diffusion operator (ln_traldf_blp = .true.) and all advection schemes except the FCT scheme. Following the extended haloes development, most of the lbc_lnk calls affecting the tiled code have been removed in the nn_hls = 2 case. As a result, the tiling workarounds required to bypass lbc_lnk calls (e.g. in traldf.F90) have mostly been removed.

This development therefore requires the use of nn_hls = 2 when tiling is enabled with ln_tile = .true.

Untiled code

The tiling implementation is currently focussed on the standard (step.F90) and QCO (stpmlf.F90) code.

Code relating to the loop fusion (key_loop_fusion) and RK3 scheme (key_RK3) has not been tiled.

Other changes of note

Removed workarounds

  • TRA/traadv.F90 - updated to reflect that the FCT scheme is now the only scheme that cannot be used with tiling
  • TRA/traadv.F90 - changed the input array dimension declarations for several routines called from tra_adv
    • This was to remove arguments causing "copy-in" to occur, e.g. CALL tra_mle_trp( kt, nit000, zuu(A2D(nn_hls),:), ...
    • The "copy-in" for dia_ptr is still present

Removal of tiling

  • ASM/asminc.F90 - removed tiling for code activated by ln_asmdin = .true.
    • This code appears to be called only during initialisation, which is not within the tiled code region

Refactoring

  • DYN/dynhpg.F90 - parts of hpg_djc have been refactored to give consistent results for different nn_hls
    • Harmonic averages have been refactored to use the machine epsilon coefficient (zep) in a similar way to other code in NEMO
    • This changes the results with respect to the trunk
! Old
cffu = 2._wp * zdrhox(ji-1,jj,jk) * zdrhox(ji,jj,jk)
IF( cffu > zep ) THEN
   zdrho_i(ji,jj,jk) = cffu / ( zdrhox(ji-1,jj,jk) + zdrhox(ji,jj,jk) )
ELSE
   zdrho_i(ji,jj,jk) = 0._wp
ENDIF

! New
cffu = MAX( 2._wp * zdrhox(ji-1,jj,jk) * zdrhox(ji,jj,jk), 0._wp )
z1_cff = zdrhox(ji-1,jj,jk) + zdrhox(ji,jj,jk)
zdrho_i(ji,jj,jk) = cffu / SIGN( MAX( ABS(z1_cff), zep ), z1_cff )
  • DYN/dynhpg.F90 - hpg_prj has been refactored to implement the tiling and to improve readability
  • DYN/dynldf_iso.F90 - several working arrays have been declared as automatic local arrays instead of allocatable module arrays
    • This was seen as a tidier implementation, and there was no need for these arrays to persist in memory
  • DYN/wet_dry.F90 - added a ctl_warn when ln_wd_il = .true.
    • This controls several code blocks throughout dyn_hpg.F90
    • Tiling has not been tested with this option, but it is apparently due to be deprecated in favor of the less intrusive ln_wd_dl option, so there are no plans to test it
  • ZDF/zdfphy.F90 - a read-only copy of avm_k (avm_k_n) is saved and used by zdf_sh2 when using tiling
    • The closure schemes (zdf_tke etc) will update avm_k, which is used for the calculation of zsh2 by zdf_sh2 with a stencil that includes halo points. When using tiling, the calculation of zsh2 will therefore include avm_k values that have been updated (and are therefore valid for the next timestep) by adjacent tiles, changing the results
    • To preserve results when using tiling, a read-only copy of the "now" avm_k is saved for use in the calculation of zsh2
  • ZDF/zdfevd.F90 - zavt_evd/zavm_evd have been declared as allocatable arrays instead of automatic arrays
    • p_avt/p_avm are updated on halo points when using nn_hls > 1. When using tiling, p_avt/p_avm will therefore already have been partially updated by adjacent tiles, since the halo of a tile corresponds to internal points on adjacent tiles. zavt_evd/zavm_evd then evaluate to zero on these points, changing the results
    • To preserve results when using tiling, the diagnostic data is not sent for each tile, but is instead stored for the full domain and sent once after all tiles are complete
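A hedged sketch of the avm_k pattern described above (the call sequence is schematic; only the avm_k_n name comes from the text):

```fortran
! Schematic only, not the actual zdfphy.F90 code:
avm_k_n(:,:,:) = avm_k(:,:,:)    ! save the "now" avm_k before any tile
                                 ! runs a closure scheme
DO jtile = 1, nijtile
   ! ... zdf_sh2 reads avm_k_n, so halo values already advanced to the
   !     next timestep by finished adjacent tiles do not enter zsh2
   ! ... zdf_tke etc. update avm_k in place
END DO
```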

Bug fixes

  • ZDF/zdfmxl.F90 - the calculation of hmld has been moved into a new routine zdf_mxl_turb
    • This was done to address a bug in zdftke.F90 that caused the results to change when using tiling and nn_etau = 2
    • When using nn_etau = 2, zdf_tke_init calls zdf_mxl to initialise nmln, which depends on rn2b. However, rn2b has not yet been initialised at this point, so the nn_etau = 2 option was not restartable and tiling changed the results. Furthermore, the diagnostics calculated by zdf_mxl were not correct for the first timestep
    • To address these issues, the calculation of hmld was moved into a new routine zdf_mxl_turb. zdf_mxl is now called before zdf_sh2 in zdfphy.F90, while zdf_mxl_turb is called where zdf_mxl was previously called. Additionally, zdf_mxl is no longer called by zdf_tke_init
    • This bug fix changes the results with respect to the trunk when using nn_etau = 2
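A hedged sketch of the reordering described above (argument lists and the surrounding calls are schematic, not the actual zdfphy.F90 code):

```fortran
! Schematic call order in zdf_phy after the fix:
CALL zdf_mxl( kt, Kmm )         ! nmln etc., now computed before zdf_sh2,
                                ! once rn2b is available
! ... CALL zdf_sh2, then the closure schemes (zdf_tke no longer calls
!     zdf_mxl during initialisation)
CALL zdf_mxl_turb( kt, Kmm )    ! hmld, computed where zdf_mxl used to be
```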

Outstanding issues

  • The new DO loop macros result in line lengths that easily exceed 132 characters
    • This is easily overcome with the appropriate compiler options, but is not permitted by the NEMO coding standard
  • In `DYN/dynhpg.F90` there is an lbc_lnk call that I think can be removed, but I'm not sure whether Francesca still requires it to resolve her issue with hpg_djc
    • If it is still required, a workaround will be needed to disable the tiling when using hpg_djc
  • Tiling changes some results
    • DIA/diaptr.F90 diagnostics slightly change when using tiling in a few tests
      • I suspect this is due to the additional integration step over tiles
      • The diagnostics change only very slightly, and only for a single point in the Indian Ocean basin where there are few zonal points. I therefore don't think it is a major issue, but I note it here for future investigation
    • TRA/trabbl.F90 results change when using tiling and nn_bbl_adv > 0
      • This is because the order in which the tendency components are added can be changed by tiling
      • This is a known issue from the 2020 development

Documentation updates


...

Preview


...

Tests


...

Review


...

Attachments (1)
