Version 3 (modified by frrh, 4 years ago)

Setting up and testing optimisations for #1821

Branch Creation

Here's how I created the branch:

  • Stripped out svn keywords. (I'm surprised we need to do this at all, since we're creating a branch of a branch which has already been stripped. Only a subset of files is affected, mostly AGRIF related.)
  • Merged in the optimisations from our MEDUSA-based branch and added further changes in the same style, which should benefit GO6 more generally. Revisions r7581:r7602 show the actual code changes made.

Testing

Initial testing in a copy of the GO6 standard job.

  • Took a copy of standard suite u-ah494/trunk@29996 to u-aj380Optim to act as the control run.
  • Created a working copy of this (u-aj380Optim) to act as the test run.
  • Replaced branches/UKMO/dev_r5518_GO6_package@7573 in working copy with branches/UKMO/dev_r5518_optim_GO6_alloc@7602
  • Ran both jobs for a 10-day NRUN with NEMO timings activated. Repeated both runs to try to iron out any random variations in run time on the XC40.
  • Checking solver.stat at 10 days, the two runs bit compare. (I've not done a rigorous comparison of restart files.)
  • Comparing NEMO timer output, the optimised run appears to be ~1-2% faster overall in both total elapsed and total CPU time (about 1.8% in the totals below), though these differences lie well within the run-to-run variability, which can be 10% or more.
  • Here's some output:
  • Control run:
    Elapsed Time (s)  CPU Time (s)
           593483.386   584478.452
     
    Averaged timing on all processors :
    -----------------------------------
    Section             Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap Max elap(%)  Min elap(%)  Freq
    sbc_ice_cice        0.1751989E+03   14.17           174.41      14.32        1.00         14.50     13.76     640.00
    tra_adv_tvd         0.7552396E+02    6.11            75.33       6.19        1.00          7.42      5.27     640.00
    tra_nxt             0.6302550E+02    5.10            62.89       5.17        1.00          7.65      2.14     640.00
    tra_ldf_iso         0.5789208E+02    4.68            57.77       4.74        1.00          5.08      4.06     640.00
    sol_pcg             0.5625327E+02    4.55            56.08       4.61        1.00          4.55      4.55     640.00
    dia_wri             0.4800178E+02    3.88            47.84       3.93        1.00          9.16      0.04     640.00
    tra_bbc             0.4565193E+02    3.69            45.39       3.73        0.99         11.38      0.09     640.00
    nonosc              0.4531063E+02    3.66            45.23       3.71        1.00          4.11      3.44    1280.00
    zps_hde             0.3959416E+02    3.20            39.35       3.23        0.99          7.05      0.08    1281.00
    ldf_slp             0.3823219E+02    3.09            38.15       3.13        1.00          3.25      2.98     640.00
    
    
  • Test Run:
    Total timing (sum) :
     --------------------
    Elapsed Time (s)  CPU Time (s)
           583015.673   573647.291
     
    Averaged timing on all processors :
    -----------------------------------
    Section             Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap Max elap(%)  Min elap(%)  Freq
    sbc_ice_cice        0.1728438E+03   14.23           172.01      14.39        1.00         14.56     13.80     640.00
    tra_adv_tvd         0.7375908E+02    6.07            73.52       6.15        1.00          7.29      5.22     640.00
    tra_nxt             0.6276407E+02    5.17            62.63       5.24        1.00          7.68      2.33     640.00
    tra_ldf_iso         0.5655850E+02    4.66            56.48       4.73        1.00          4.86      3.43     640.00
    sol_pcg             0.4990999E+02    4.11            49.77       4.16        1.00          4.11      4.11     640.00
    dia_wri             0.4792860E+02    3.95            47.80       4.00        1.00          9.32      0.05     640.00
    tra_bbc             0.4585938E+02    3.78            45.57       3.81        0.99         11.72      0.10     640.00
    zps_hde             0.3905950E+02    3.22            38.80       3.25        0.99          7.08      0.09    1281.00
    ldf_slp             0.3817614E+02    3.14            38.10       3.19        1.00          3.30      3.03     640.00
    nonosc              0.3657865E+02    3.01            36.47       3.05        1.00          3.48      2.77    1280.00
    
    
    

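We should take direct comparisons with a pinch of salt since timings vary from run to run, but we do seem to see a consistent reduction in the elapsed time of sol_pcg (56.3 s to 49.9 s in the runs shown) and possibly tra_ldf_iso (57.9 s to 56.6 s). In the runs shown, nonosc also drops, from 45.3 s to 36.6 s.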
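The per-section comparison can be scripted rather than read off by eye. The following Python sketch is illustrative only: `parse_elapsed` and `percent_change` are hypothetical helper names, not part of NEMO or its tooling, and the rows are the section names and averaged elapsed times copied from the two tables above.

```python
from typing import Dict

# Section name and averaged elapsed time (s), copied from the tables above.
CONTROL = """\
sbc_ice_cice        0.1751989E+03
tra_adv_tvd         0.7552396E+02
tra_nxt             0.6302550E+02
tra_ldf_iso         0.5789208E+02
sol_pcg             0.5625327E+02
dia_wri             0.4800178E+02
tra_bbc             0.4565193E+02
nonosc              0.4531063E+02
zps_hde             0.3959416E+02
ldf_slp             0.3823219E+02
"""

TEST = """\
sbc_ice_cice        0.1728438E+03
tra_adv_tvd         0.7375908E+02
tra_nxt             0.6276407E+02
tra_ldf_iso         0.5655850E+02
sol_pcg             0.4990999E+02
dia_wri             0.4792860E+02
tra_bbc             0.4585938E+02
zps_hde             0.3905950E+02
ldf_slp             0.3817614E+02
nonosc              0.3657865E+02
"""


def parse_elapsed(table: str) -> Dict[str, float]:
    """Map section name -> averaged elapsed time (s) from timing rows."""
    times = {}
    for line in table.splitlines():
        fields = line.split()
        # Skip blank lines, separators and rows that do not start with a name.
        if len(fields) < 2 or not fields[0][0].isalpha():
            continue
        try:
            times[fields[0]] = float(fields[1])
        except ValueError:
            continue  # header line such as "Section  Elap. ..."
    return times


def percent_change(control: Dict[str, float], test: Dict[str, float]) -> Dict[str, float]:
    """Percentage change in elapsed time, test relative to control."""
    return {
        name: 100.0 * (test[name] - control[name]) / control[name]
        for name in control
        if name in test
    }


changes = percent_change(parse_elapsed(CONTROL), parse_elapsed(TEST))
# Print the largest reductions first.
for name, change in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{name:<14s} {change:+7.2f}%")
```

On these single runs the largest reductions are in nonosc (roughly 19%) and sol_pcg (roughly 11%). Given the run-to-run noise noted above, the same comparison would need repeating over several pairs of runs before reading much into the smaller differences.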

There's certainly no suggestion of adverse effects in any of these tests.
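The solver.stat check mentioned in the testing steps could be automated along the following lines. This is a minimal sketch using Python's standard `filecmp` module; the function name `bit_identical` is hypothetical and the two file paths are supplied by the user. Note that it compares the whole file byte for byte, which is stricter than inspecting only the final time step.

```python
import filecmp
import sys


def bit_identical(path_a, path_b):
    """Return True if the two files have identical contents, byte for byte."""
    # shallow=False forces a comparison of file contents rather than
    # relying on the os.stat() signatures (size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are whatever the two runs produced):
    #   python compare_stat.py <control solver.stat> <test solver.stat>
    same = bit_identical(sys.argv[1], sys.argv[2])
    print("bit comparison" if same else "files differ")
    sys.exit(0 if same else 1)
```

The same function could be applied to restart files, although a byte-level comparison of those may report differences that come from file metadata rather than from the model fields, so that would need a field-by-field comparison tool instead.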

So, as far as GO6 is concerned, the branch would seem viable.

Ideally we need to test in a coupled model with MEDUSA too. What do we have? I have u-ai927 and variants thereof. A 5-day run of this with NEMO timings switched on shows:

Elapsed Time (s)  CPU Time (s)
        38158.114    35083.969
 
Averaged timing on all processors :
-----------------------------------
Section             Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap Max elap(%)  Min elap(%)  Freq
sbc_cpl_rcv         0.6109458E+02   14.41            60.73      15.58        0.99         14.41     14.40     160.00
tra_adv_muscl       0.5152139E+02   12.15            51.43      13.19        1.00         12.20     12.10     160.00
tra_ldf_iso         0.4024424E+02    9.49            40.18      10.31        1.00          9.96      9.08     320.00
trc_sms             0.2209001E+02    5.21            22.02       5.65        1.00          8.75      2.00     160.00
trc_stp             0.1792136E+02    4.23            16.09       4.13        0.90          5.52      2.92     160.00
sbc_ice_cice        0.1746102E+02    4.12            17.21       4.41        0.99          4.15      4.06     160.00
trc_sbc             0.1728718E+02    4.08            17.19       4.41        0.99          7.72      0.04     160.00
trc_init            0.1427096E+02    3.37             5.46       1.40        0.38          3.51      3.13       1.00
istate_init         0.7695191E+01    1.81             3.94       1.01        0.51          1.84      1.76       1.00
tra_zdf_imp         0.7475653E+01    1.76             7.46       1.91        1.00          1.87      1.49     320.00
trc_nxt             0.6103307E+01    1.44             6.09       1.56        1.00          1.85      1.20     160.00
trc_dta             0.4928072E+01    1.16             0.97       0.25        0.20          1.19      1.12       6.00
tra_ldf             0.4769230E+01    1.12             4.76       1.22        1.00          1.14      1.11     160.00

This indicates that the coupled model is badly load balanced: the biggest single cost is in sbc_cpl_rcv, presumably waiting for the atmosphere to catch up. That means that any optimisations to the ocean won't show up in total model elapsed time unless we rearrange the PE balance so that the ocean, rather than the atmosphere, becomes the limiting component.
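The load-balance argument can be illustrated with a toy model. The sketch below is not NEMO or coupler code. It assumes that the two components run concurrently, that the slower one sets the elapsed time, and that all of the time in sbc_cpl_rcv is idle waiting. The 10% ocean speed-up is a hypothetical figure chosen purely for illustration; the sbc_cpl_rcv numbers come from the table above.

```python
# Toy model, not NEMO code: two concurrently running components that
# synchronise at coupling exchanges, so the slower one sets the pace.

def coupled_elapsed(ocean_busy_s, atmos_busy_s):
    """Elapsed time when each component waits for the other at exchanges."""
    return max(ocean_busy_s, atmos_busy_s)


# Values from the sbc_cpl_rcv row of the table above.
cpl_rcv_s = 0.6109458E+02      # averaged elapsed time in sbc_cpl_rcv (s)
cpl_rcv_share = 14.41 / 100.0  # its share of the averaged elapsed time

total_s = cpl_rcv_s / cpl_rcv_share  # implied averaged ocean elapsed time
ocean_busy_s = total_s - cpl_rcv_s   # assumption: sbc_cpl_rcv is pure waiting
atmos_busy_s = total_s               # assumption: the atmosphere sets the pace

before = coupled_elapsed(ocean_busy_s, atmos_busy_s)
# Hypothetical 10% speed-up of the ocean's own work.
after = coupled_elapsed(0.9 * ocean_busy_s, atmos_busy_s)

print(f"implied ocean elapsed time : {total_s:8.2f} s")
print(f"ocean busy time            : {ocean_busy_s:8.2f} s")
print(f"coupled elapsed, before    : {before:8.2f} s")
print(f"coupled elapsed, after     : {after:8.2f} s")
```

Under these assumptions the coupled elapsed time is unchanged by the ocean speed-up; the saved time simply reappears as extra waiting in sbc_cpl_rcv.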

We also note that tra_adv_muscl is the second-highest cost. This is a MEDUSA-only routine and is one of the things we've optimised, so we would expect to see some improvement there.
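As a rough guide to how much improvement to expect, Amdahl's law bounds the overall saving by the routine's share of the run time. The sketch below applies it to the 12.15% elapsed-time share of tra_adv_muscl from the table above. The speed-up factors are hypothetical, since no measured value for the optimised routine is available yet, and `fraction_saved` is an illustrative helper name.

```python
def fraction_saved(share, speedup):
    """Fraction of total time saved when a part taking `share` of the
    total runs `speedup` times faster (Amdahl's law)."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie between 0 and 1")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return share * (1.0 - 1.0 / speedup)


TRA_ADV_MUSCL_SHARE = 12.15 / 100.0  # Elap. Time(%) from the table above

# Hypothetical speed-up factors, for illustration only.
for speedup in (1.1, 1.5, 2.0):
    saved = 100.0 * fraction_saved(TRA_ADV_MUSCL_SHARE, speedup)
    print(f"tra_adv_muscl x{speedup:.1f} faster -> {saved:.2f}% of ocean elapsed time saved")
```

Even an arbitrarily large speed-up of this one routine cannot save more than 12.15% of the ocean's elapsed time, and that share itself includes the time spent waiting in sbc_cpl_rcv.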

However, we can't use this suite directly, since it uses an old version of the MEDUSA branch which clashes with up-to-date versions of the GO6 branch. Are there any jobs around which use up-to-date versions of both the GO6 and MEDUSA branches? We can't test things without that.