Version 21 (modified by techene, 6 months ago) (diff)

Star coordinate faster implementation

Change the way to deal with the vertical scale factors in NEMO in order to save parallel processing time for z-star coordinate. This modification can be activated through a cpp key : key_qco.

Last edition: 11/02/20 13:25:16 by techene

The PI is responsible to closely follow the progress of the action, and especially to contact NEMO project manager if the delay on preview (or review) are longer than the 2 weeks expected.

  1. Summary
  2. Preview
  3. Tests
  4. Review

Summary

Action optimisation of the vertical scale factor e3 computation
PI(S) Techene, Madec
Digest compute e3 on the fly from e3_0(:,:,:,Ktl) * ( 1 + ssh(:,:,Ktl) / h_0( :,: ) * mask( :,:,: ) instead of storing e3t/u/v/w/f…
Dependencies If any
Branch source:/NEMO/branches/2020/dev_r12377_KERNEL-06_techene_e3
Branch source:/NEMO/branches/2020/dev_r13327_KERNEL-06_2_techene_e3
Previewer(s) Madec, Chanut, Masson
Reviewer(s) Madec
Ticket #2385 #2527 #2555

Description

In z* vertical configuration, NEMO r12377 uses memory to store and update vertical scale factors e3[P] where P = {t-, u-, v-, w-, f-, uw-, vw-} points at "before", "now" and "after" time steps. This means memory storage 6 x 4D + 1 x 3D tables, memory acces and CPU time for updating 3D scale factors. The code modification consists in computing scale factor e3[P] on the fly using each time it is needed with formula e3[P](Kt) = e3[P]_0 * (1 + r3[P](Kt) * mask[P]) where r3P(Kt) (= ssh[P](Kt) / h_0) is a 2D table computed from ssh update at P = {u-,v-,f-} points accordingly with ssh update along a step. This change is only applied in case key_qco is activated.

Because we reduce the number of tables reached in memory we have a better chance to keep using fast RAM memory. Because we no longer compute 3D interpolation but 2D instead algorithm complexity is smaller and use less CPU time. Both make computation about 10% faster whatever the domain size (tested between 10x10 to 100x100 points per computation node). when cutting communications.

This branche also comes with improvements from KERNEL-07 such as symmetric diffusion tensor #2527 implemented in dynldf_lap_blp used controled by nn_dynldf_typ namelist parameter. It contains dynvor correction for using ln_dynvor_msk and a proper fix for using ENS and ENE with partial steps described in #2555. Finally a new shallow water test case has been added.

Implementation

Describe flow chart of the changes in the code.
List the Fortran modules and subroutines to be created/edited/deleted.
Detailed list of new variables to be defined (including namelists),
give for each the chosen name and description wrt coding rules.

KERNEL-06's version 1 [pre mid-merge 2020] : /NEMO/branches/2020/dev_r12377_KERNEL-06_techene_e3

starting point

In NEMO r12377 scale factors (e3*) at u-v-w-uw-vw-f-points are interpolated from e3t at Kbb, Kmm, and Kaa. The module in charge of scale factor management is src/OCE/DOM/domvvl.F90. domvvl.F90 interpolation routine is called :

  • at initialisation or restart at u-v-w-uw-vw-f-points
    nemogcm 
    --> nemo_init
       --> dom_init
          --> dom_vvl_init
             --> dom_vvl_rst
             --> dom_vvl_zgr
                --> dom_vvl_interpol 
    
  • at each time step for "after" scale factor at u-v-points each time ssh[Naa] is computed
    nemogcm 
    --> stp
       --> dom_vvl_sf_nxt
          --> e3t[Kaa]
          --> dom_vvl_interpol
    
  • at each time step for "now" scale factor at u-v-points e3t is directly filtered
    nemogcm 
    --> stp
       --> tra_atf
          --> e3t[Kmm]
       --> dyn_atf
          --> e3t[Kmm]
          --> dom_vvl_interpol
    
  • at each time step after time swapping for "now" at f-point and "before" scale factor both at w-uw-vw-points
    nemogcm 
    --> stp
       --> dom_vvl_sf_update
          --> dom_vvl_interpol
    

developments

In version 1 of source:/NEMO/branches/2020/dev_r12377_KERNEL-06_techene_e3 we implement changes progressively and validate step by step regarding GYRE_PISCES test case. We remove all the vvl routines usage thus we replace e3t/u/v/w/uw/vw at 3 time step + e3f (19) 3d tables storage and twice e3u/v "after" + e3u/v "now" + e3f/w/uw/vw "now" + e3w/uw/vw "before" (13) 3d tables interpolation at each step and e3u/v/f/w/uw/vw "now" + e3u/v/w/uw/vw "before" (11) 3d tables interpolation at initialisation or restart by r3t/u/v at 3 time steps and r3f i.e. (10) 2d table storage and on the fly light computation. For backward compatibility we introduce a cpp key key_qco in order to isolate new scale factor implementation from former vvl version. qco stand for for "quasi eulerian coordinate".

  • new variables added in dom_oce and domain
    r3P with P = [t,u,v,f] are 2d in space ssh/hP0
    
  • new module in DOM : dom_qco
    -- dom_qco_r3c ! compute ssh/ht0 at t-point and interpolate ssh/h.0 at u-v-f-points from ssh
    -- dom_qco_zgr ! set ssh/ht0 at t-u-v-f-points for Kmm and at t-u-v-points for Kbb
       --> dom_qco_r3c[Kmm] at t-u-v-f-points
       --> dom_qco_r3c[Kbb] at t-u-v  -points
    
  • new substitute in DOM : domzgr_substitute When key_qco is active e3. is no longer a variable but an expression. Each time e3. appears in a routine an include domzgr_substitute in the module enables to replace it by a (e3._0 ( 1 + r3. ) * mask.) like expression.
    #   define  e3t(i,j,k,t)   (e3t_0(i,j,k)*(1._wp+r3t(i,j,t)*tmask(i,j,k)))
    #   define  e3u(i,j,k,t)   (e3u_0(i,j,k)*(1._wp+r3u(i,j,t)*umask(i,j,k)))
    #   define  e3v(i,j,k,t)   (e3v_0(i,j,k)*(1._wp+r3v(i,j,t)*vmask(i,j,k)))
    #   define  e3f(i,j,k)     (e3f_0(i,j,k)*(1._wp+r3f(i,j)*fmask(i,j,k)))
    #   define  e3w(i,j,k,t)   (e3w_0(i,j,k)*(1._wp+r3t(i,j,t)))
    #   define  e3uw(i,j,k,t)  (e3uw_0(i,j,k)*(1._wp+r3u(i,j,t)))
    #   define  e3vw(i,j,k,t)  (e3vw_0(i,j,k)*(1._wp+r3v(i,j,t)))
    
  • finally we extended the e3. modification to h. r1_h. and gde..
    #   define  ht(i,j)        (ht_0(i,j)+ssh(i,j,Kmm))
    #   define  hu(i,j,t)      (hu_0(i,j)*(1._wp+r3u(i,j,t)))
    #   define  hv(i,j,t)      (hv_0(i,j)*(1._wp+r3v(i,j,t)))
    #   define  r1_hu(i,j,t)   (r1_hu_0(i,j)/(1._wp+r3u(i,j,t)))
    #   define  r1_hv(i,j,t)   (r1_hv_0(i,j)/(1._wp+r3v(i,j,t)))
    #   define  gdept(i,j,k,t) (gdept_0(i,j,k)*(1._wp+r3t(i,j,t)))
    #   define  gdepw(i,j,k,t) (gdepw_0(i,j,k)*(1._wp+r3t(i,j,t)))
    #   define  gde3w(i,j,k)   (gdept_0(i,j,k)*(1._wp+r3t(i,j,Kmm))-ssh(i,j,Kmm))
    

When this key_qco is not activated NEMO should be exactly the same as the trunk.

Important points :

  • e3. expression involves tables of distinct dimension then e3.(:,:,: ) call fails it may be necessary to introduce temporary variables (same for water height expression)
  • "e3. =" is no longer possible
  • e3t/u/v/f modifications did not introduce any difference in the results, e3w modification does because both approaches vvl and qco do not take into account of the bottom level in the same way
  • in GYRE e3w_0 are not the half sum of e3u_0 so the way it is implemented in the reference version is not convinient
  • e3. substitution makes lines longer than 136 character this may be a problem for compilers (most have been checked but not all)
  • ssh filtering has been displaced upper in order to provide filtered r3P in TOP asselin filtering
  • when key_qco is not active we pass SETTE and this version r13167 has been delivered for mid-merge party ! Some silly allocating memory bugs found and a not that silly bug in implicit mode triggering for SPITZ12 configuration (Dt instead of 2Dt required).

KERNEL-06's version 2 [post mid-merge 2020] : dev_r13327_KERNEL-06_2_techene_e3

In this branch we debug with respect to sette tests :

  • non linear version of GYRE_PISCES shows restartability issue due to

BUG 1 : dom_qco_init.F90 dom_init is called before dyn_adv_init, then default value of ln_dynadv_vec is at false. r3P coefficients computation is not the same if we use flux of vector form. GYRE_PISCES runs in vector form so at the restart r3P are initialised with their flux form instead of their vector from. RESOLUTION : cpp key added key_qcoTest_FluxForm performance tests have to be done if performances are not relevant the flux from qco switch will be removed. Note that a switch as been added in dynspg_ts to compute surface average consistently with qco flux form. En gros soit on fait la distinction a chaque interpolation lineaire de la ssh soit pas du tout, il faut juste être consistent. Si les performances flux form (il y a moins de calcul) ne sont pas significatives, on pourra abandonner le calcul sinon il faudra ajouter une option de restart. Routines from domqco.F90 and dynapg_ts.F90 havs to be modified.

BUG 2.1 : in dynatf_qco.F90 barotrope uub and vvb are compute with the un-filtrered scale factor BUG 2.2 : the way to compute uub and vvb with e3 from ssh/h0 is not exactly the same as the implementation of e3 filtered RESOLUTION : use filtered r3P_f and be carefull the computation has to be done exactly as in domzgr_substitute i.e. including mask and parenthesis otherwise the restartability fails in dynatf_qco.F90.

BUG 3: when time splitting is not active ssh/h0 at f-points after the dynamic was not computed RESOLUTION : an extra dom_qco_r3c after dyn_spg_ts in case ln_dynspg_ts = F in stpMLF.F90

  • ORCA2_ICE_PISCES

WARNING in debug mode restartability fails it is the case even for the trunk @ 13327 need to be confirmed running tests on the current trunk when it will run…

  • SPITZ12 OK
  • AMM12 OK

Documentation updates

Using previous parts, define the main changes to be done in the NEMO literature (manuals, guide, web pages, …).

Preview

Since the preview step must be completed before the PI starts the coding, the previewer(s) answers are expected to be completed within the two weeks after the PI has sent the request to the previewer(s).
Then an iterative process should take place between PI and previewer(s) in order to find a consensus

Possible bottlenecks:

  • the methodology
  • the flowchart and list of routines to be changed
  • the new list of variables wrt coding rules
  • the summary of updates in literature

Once an agreement has been reached, preview is ended and the PI can start the development into his branch.

Eventually, all the dom_vvl_interpol call are removed, each time e3 is called we use a substitute to replace e3 by e3_0 (1 + ssh / h_0). For backward compatibility a cpp key manages the use of the new version vs. the old version. We will duplicate modules such as step and domvvl into stepLF and domQE (QE stands for Quasi Eulerian) and create a subtitute module.

Integrated in mid merge trunk.

List the Fortran modules and subroutines to be created. substitute.F90

Step 1 : Check the error for e3t, e3w between the current way to compute e3 at T-, W-point and the proposed way to compute e3 at T-, W-point.

  • prints added with no change in the results

Step 2 : First we change only the core routine in domvvl which should be changed into domQE.

  • add new variables, duplicate step into steplf and domvvl into domQE
  • change interpolation routines into scaling routines in domQE

Step 3 : Then we change the Asselin filtering routine indeed because water forcing are applied locally.

  • change Asselin routines (maybe not required since e3 scale with vertical with JC modif)

Step 4 : Finally we remove the interpol routine in the whole code

  • remove interpolating routine in all the code (AGRIF, OFF,…)
  • use a SUBSTITUTE when there are e3 CALL
  • make some changes in step and domQE to have the whole thing consistent

Tests

Once the development is done, the PI should complete the tests section below and after ask the reviewers to start their review.

This part should contain the detailed results of SETTE tests (restartability and reproducibility for each of the reference configuration) and detailed results of restartability and reproducibility when the option is activated on specified configurations used for this test

Regular checks:

  • Can this change be shown to produce expected impact (option activated)?
  • Can this change be shown to have a null impact (option not activated)?
  • Results of the required bit comparability tests been run: are there no differences when activating the development?
  • If some differences appear, is reason for the change valid/understood?
  • If some differences appear, is the impact as expected on model configurations?
  • Is this change expected to preserve all diagnostics?
  • If no, is reason for the change valid/understood?
  • Are there significant changes in run time/memory?

We want to track and maybe explain the differences observed at every steps. Reference set up : For that we produce a reference data set with the trunk -r 12377 using the GYRE_PISCES configuration where top cpp_key has been removed. We run it on 120 time steps. The drag coefficient is zero. We XIOS output an averaged field every 5 days.

Step 1 : We print MAXVAL of error between both way to compute the vertical scale factors at each time step, note that we cancelled forcing (in the r12377 revision it should not change anything since water forcings such as run off and emp scale with the vertical).

error between proposed and former way to compute vertical scale factors at time kt = 1, 120, 85 e3t (1) 0.0000000000000000 3999.6591076268369 4.54747350886E-013 e3t (2) 5.68434188608E-014 5.11590769747E-013 4.54747350886E-013 e3w 4.64477238892E-007 6.13657050507E-006 5.27333801869E-006 gde3w 1.81898940354E-012 2.72848410531E-012 2.72848410531E-012

QUESTION : Why do we have such an error on the e3w scale factors ? It is not consistent with machine accuracy error. It seems to be related to the e3w_0 computation. How do we compute e3w_0 ? OK SOLVED ! THIS IS A KIND OF ERROR IN THE CODE !!! DUE TO THE FACT THAT E3W_0(jk) != 0.5 * ( E3T_0(jk) + E3T_0(jk-1) )…

Step 2 : Change the code in domvvl turn into domqe. Duplicate step.F90 into steplf.F90 and call domqe routines inside.

We observe small errors but not errors at the truncature level as expected with the curent trunk version. This is due to the differences spotted above. WE CAN NO LONGER USE THE TRUNK PRODUCTION AS A REFERENCE…

Review

A successful review is needed to schedule the merge of this development into the future NEMO release during next Merge Party (usually in November).

Assessments:

  • Is the proposed methodology now implemented?
  • Are the code changes in agreement with the flowchart defined at preview step?
  • Are the code changes in agreement with list of routines and variables as proposed at preview step?
    If, not, are the discrepancies acceptable?
  • Is the in-line documentation accurate and sufficient?
  • Do the code changes comply with NEMO coding standards?
  • Is the development documented with sufficient details for others to understand the impact of the change?
  • Is the project literature (manual, guide, web, …) now updated or completed following the proposed summary in preview section?

Finding:

Is the review fully successful? If not, please indicate what is still missing


Once review is successful, the development must be scheduled for merge during next Merge Party Meeting.