Setting up and testing optimisations for #1821

== Branch Creation ==

'''Here's how I created the branch:'''
 * Created branch [log:branches/UKMO/dev_r5518_optim_GO6_alloc branches/UKMO/dev_r5518_optim_GO6_alloc] from the head of the package branch (r7573 at the time of writing).
 * Stripped out svn keywords. (I'm surprised we need to do this at all, since we're creating a branch of a branch which has already been stripped; only a subset of files is affected, mostly AGRIF related.)
 * Merged in the optimisations from our MEDUSA-based branch and added further changes in the same style which should benefit GO6 more generally. Revisions r7581:r7602 show the actual code changes made.

== Testing ==

'''Initial testing in a copy of the GO6 standard job.'''
 * Took a copy of the standard suite u-ah494/trunk@29996 to act as the control run.
 * Created a working copy of this, u-aj380Optim, to act as the test run.
 * Replaced branches/UKMO/dev_r5518_GO6_package@7573 in the working copy with branches/UKMO/dev_r5518_optim_GO6_alloc@7602.
 * Ran both jobs for a 10 day NRUN with NEMO timings activated, and repeated both runs to try to iron out any random variation in run time on the XC40.
 * Checking solver.stat at 10 days, we have bit comparison. (I've not done a rigorous comparison of restart files.)
 * Comparing NEMO timer output, there are indications that the optimised run may be ~1-2% faster overall on total elapsed and total CPU time, though these differences lie well within the run-to-run variability on the XC40, which can be 10% or more.
 * Here's some of the output.

Control run:
{{{
                   Elapsed Time (s)   CPU Time (s)
                         593483.386     584478.452

 Averaged timing on all processors :
 -----------------------------------
 Section          Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap  Max elap(%)  Min elap(%)     Freq
 sbc_ice_cice     0.1751989E+03          14.17       174.41        14.32      1.00        14.50        13.76   640.00
 tra_adv_tvd      0.7552396E+02           6.11        75.33         6.19      1.00         7.42         5.27   640.00
 tra_nxt          0.6302550E+02           5.10        62.89         5.17      1.00         7.65         2.14   640.00
 tra_ldf_iso      0.5789208E+02           4.68        57.77         4.74      1.00         5.08         4.06   640.00
 sol_pcg          0.5625327E+02           4.55        56.08         4.61      1.00         4.55         4.55   640.00
 dia_wri          0.4800178E+02           3.88        47.84         3.93      1.00         9.16         0.04   640.00
 tra_bbc          0.4565193E+02           3.69        45.39         3.73      0.99        11.38         0.09   640.00
 nonosc           0.4531063E+02           3.66        45.23         3.71      1.00         4.11         3.44  1280.00
 zps_hde          0.3959416E+02           3.20        39.35         3.23      0.99         7.05         0.08  1281.00
 ldf_slp          0.3823219E+02           3.09        38.15         3.13      1.00         3.25         2.98   640.00
}}}

Test run:
{{{
 Total timing (sum) :
 --------------------
                   Elapsed Time (s)   CPU Time (s)
                         583015.673     573647.291

 Averaged timing on all processors :
 -----------------------------------
 Section          Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap  Max elap(%)  Min elap(%)     Freq
 sbc_ice_cice     0.1728438E+03          14.23       172.01        14.39      1.00        14.56        13.80   640.00
 tra_adv_tvd      0.7375908E+02           6.07        73.52         6.15      1.00         7.29         5.22   640.00
 tra_nxt          0.6276407E+02           5.17        62.63         5.24      1.00         7.68         2.33   640.00
 tra_ldf_iso      0.5655850E+02           4.66        56.48         4.73      1.00         4.86         3.43   640.00
 sol_pcg          0.4990999E+02           4.11        49.77         4.16      1.00         4.11         4.11   640.00
 dia_wri          0.4792860E+02           3.95        47.80         4.00      1.00         9.32         0.05   640.00
 tra_bbc          0.4585938E+02           3.78        45.57         3.81      0.99        11.72         0.10   640.00
 zps_hde          0.3905950E+02           3.22        38.80         3.25      0.99         7.08         0.09  1281.00
 ldf_slp          0.3817614E+02           3.14        38.10         3.19      1.00         3.30         3.03   640.00
 nonosc           0.3657865E+02           3.01        36.47         3.05      1.00         3.48         2.77  1280.00
}}}

We should take direct comparisons with a pinch of salt since these figures vary from run to run, but we do seem to see a consistent reduction in the elapsed time of sol_pcg and possibly tra_ldf_iso. There's certainly no suggestion of adverse effects in any of these tests, so as far as GO6 is concerned the branch looks viable. (A small parsing helper, sketched below, makes this sort of comparison easier to repeat across runs.)
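Since individual timings on the XC40 vary noticeably from run to run, it helps to be able to redo this comparison quickly for each repeated run. Below is a minimal sketch of my own (not part of the branch) of a Python helper that pulls the per-section elapsed times out of two NEMO timing reports and prints the test/control ratio. It assumes the "Averaged timing on all processors" layout shown above; the file names passed on the command line are illustrative.

{{{
#!python
"""Rough comparison of per-section elapsed times between two NEMO timing reports.

A sketch only: it assumes each report contains an 'Averaged timing on all
processors' table laid out as in the output above (section name followed by
eight numeric columns).
"""
import sys


def read_sections(fname):
    """Return {section name: average elapsed seconds} parsed from a timing report."""
    timings = {}
    in_table = False
    with open(fname) as f:
        for line in f:
            fields = line.split()
            if fields[:1] == ["Section"]:
                # Header line of the averaged-timing table
                in_table = True
                continue
            if in_table and len(fields) == 9:
                try:
                    # Column 2 is 'Elap. Time(s)', e.g. 0.1751989E+03
                    timings[fields[0]] = float(fields[1])
                except ValueError:
                    pass  # separator or other non-data line
    return timings


def main(control_file, test_file):
    control = read_sections(control_file)
    test = read_sections(test_file)
    print("%-20s %14s %14s %8s" % ("Section", "control (s)", "test (s)", "ratio"))
    for name in sorted(control, key=control.get, reverse=True):
        if name in test and control[name] > 0.0:
            print("%-20s %14.3f %14.3f %8.2f"
                  % (name, control[name], test[name], test[name] / control[name]))


if __name__ == "__main__":
    # e.g. python compare_timing.py control/timing.output test/timing.output
    main(sys.argv[1], sys.argv[2])
}}}

A ratio below 1.0 means the section was faster in the test run; sol_pcg and tra_ldf_iso are the ones to watch given the figures above.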
'''Testing with MEDUSA.'''

Ideally we also need to test in a coupled model with MEDUSA. What do we have? I have u-ai927 and variants thereof. A 5 day run of this with NEMO timings switched on shows:

{{{
                   Elapsed Time (s)   CPU Time (s)
                          38158.114      35083.969

 Averaged timing on all processors :
 -----------------------------------
 Section          Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap  Max elap(%)  Min elap(%)     Freq
 sbc_cpl_rcv      0.6109458E+02          14.41        60.73        15.58      0.99        14.41        14.40   160.00
 tra_adv_muscl    0.5152139E+02          12.15        51.43        13.19      1.00        12.20        12.10   160.00
 tra_ldf_iso      0.4024424E+02           9.49        40.18        10.31      1.00         9.96         9.08   320.00
 trc_sms          0.2209001E+02           5.21        22.02         5.65      1.00         8.75         2.00   160.00
 trc_stp          0.1792136E+02           4.23        16.09         4.13      0.90         5.52         2.92   160.00
 sbc_ice_cice     0.1746102E+02           4.12        17.21         4.41      0.99         4.15         4.06   160.00
 trc_sbc          0.1728718E+02           4.08        17.19         4.41      0.99         7.72         0.04   160.00
 trc_init         0.1427096E+02           3.37         5.46         1.40      0.38         3.51         3.13     1.00
 istate_init      0.7695191E+01           1.81         3.94         1.01      0.51         1.84         1.76     1.00
 tra_zdf_imp      0.7475653E+01           1.76         7.46         1.91      1.00         1.87         1.49   320.00
 trc_nxt          0.6103307E+01           1.44         6.09         1.56      1.00         1.85         1.20   160.00
 trc_dta          0.4928072E+01           1.16         0.97         0.25      0.20         1.19         1.12     6.00
 tra_ldf          0.4769230E+01           1.12         4.76         1.22      1.00         1.14         1.11   160.00
}}}

This indicates that the coupled model is badly load balanced: the biggest single cost is sbc_cpl_rcv, presumably waiting for the atmosphere to catch up. That means that any optimisation of the ocean won't show up in the total model elapsed time unless we rearrange the PE balance to make the ocean faster than the atmosphere. We also note that tra_adv_muscl is the second highest cost; this is a MEDUSA-only routine and is one of the things we've optimised, so we would expect to see some improvement there. However, we can't use this suite directly since it uses an old version of the MEDUSA branch which clashes with up-to-date versions of the GO6 branch.

Are there any jobs around which use up-to-date versions of both the GO6 and MEDUSA branches? We can't test this properly without one. There doesn't seem to be anything coupled which fits the bill, but Marc has u-aj369, an ocean-only run with MEDUSA using reasonably up-to-date versions of the MEDUSA and GO6 branches. So I took working copies of that - u-aj369control and u-aj369optalloc - and set them up for 10 day runs with the NEMO timer switched on and our optimisation branch applied in u-aj369optalloc.

Here are the results. The first thing to check is that both runs give the same answers: tkdiff of solver.stat indicates bit comparison at 10 days.
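As an aside, this check is easy to script so it can be run routinely; here is a trivial sketch of my own (not part of the suite) which simply confirms that the two solver.stat files are byte-for-byte identical. The paths on the command line are illustrative.

{{{
#!python
"""Scripted version of the tkdiff check: confirm two runs produced
byte-identical solver.stat files (e.g. from u-aj369control and u-aj369optalloc)."""
import filecmp
import sys

control, test = sys.argv[1], sys.argv[2]

if filecmp.cmp(control, test, shallow=False):
    print("solver.stat identical - bit comparison holds")
else:
    print("solver.stat files DIFFER")
    sys.exit(1)
}}}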
Control run:
{{{
 Total timing (sum) :
 --------------------
                   Elapsed Time (s)   CPU Time (s)
                          84009.892      72252.383

 Averaged timing on all processors :
 -----------------------------------
 Section          Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap  Max elap(%)  Min elap(%)     Freq
 istate_init      0.2690187E+02          18.44        16.12        12.85      0.60        18.55        18.11     1.00
 tra_adv_muscl    0.1861153E+02          12.76        18.55        14.79      1.00        12.87        11.91   320.00
 trc_stp          0.8253362E+01           5.66         6.04         4.82      0.73         6.83         4.07   320.00
 sbc_ice_cice     0.8216911E+01           5.63         8.07         6.44      0.98         6.94         5.39   320.00
 tra_ldf_iso      0.7819895E+01           5.36         7.81         6.23      1.00         5.42         3.63   640.00
 trc_sms          0.7681783E+01           5.27         7.65         6.10      1.00         9.13         1.96   320.00
 sbc              0.6609725E+01           4.53         4.82         3.84      0.73         8.93         3.17   320.00
 sol_pcg          0.5979100E+01           4.10         5.95         4.74      0.99         4.10         3.22   320.00
 trc_sbc          0.5684169E+01           3.90         5.65         4.50      0.99         7.42         0.41   320.00
 cice_sbc_init    0.3166722E+01           2.17         2.82         2.25      0.89         2.19         2.16     1.00
 trc_nxt          0.2791707E+01           1.91         2.78         2.21      1.00         2.11         0.97   320.00
 cice_sbc_out     0.2142690E+01           1.47         2.13         1.70      0.99         1.71         0.16   320.00
 dia_wri          0.1469471E+01           1.01         1.46         1.17      1.00         2.10         0.14   320.00
 ldf_slp          0.1370465E+01           0.94         1.36         1.09      0.99         0.97         0.92   320.00
}}}

Test run:
{{{
 Total timing (sum) :
 --------------------
                   Elapsed Time (s)   CPU Time (s)
                          67852.649      59359.526

 Averaged timing on all processors :
 -----------------------------------
 Section          Elap. Time(s)  Elap. Time(%)  CPU Time(s)  CPU Time(%)  CPU/Elap  Max elap(%)  Min elap(%)     Freq
 tra_adv_muscl    0.1672964E+02          14.20        16.67        16.18      1.00        14.32        13.08   320.00
 sbc_ice_cice     0.8397770E+01           7.13         8.22         7.98      0.98         7.97         6.82   320.00
 trc_sms          0.7469511E+01           6.34         7.43         7.21      0.99        10.11         2.43   320.00
 trc_stp          0.6297795E+01           5.35         5.82         5.65      0.92         8.72         3.81   320.00
 trc_sbc          0.5423700E+01           4.60         5.37         5.21      0.99         9.02         0.28   320.00
 sbc              0.5226975E+01           4.44         3.39         3.29      0.65         9.18         3.56   320.00
 sol_pcg          0.5086075E+01           4.32         5.06         4.91      0.99         4.32         3.23   320.00
 tra_ldf_iso      0.4719996E+01           4.01         4.71         4.57      1.00         4.09         1.83   640.00
 istate_init      0.3775559E+01           3.21         1.30         1.26      0.34         3.35         2.82     1.00
 trc_nxt          0.2831918E+01           2.40         2.82         2.73      0.99         2.59         1.40   320.00
 cice_sbc_init    0.2793284E+01           2.37         2.50         2.42      0.89         2.39         2.36     1.00
 cice_sbc_out     0.2308216E+01           1.96         2.29         2.23      0.99         2.26         1.10   320.00
 dom_vvl_rst      0.1516852E+01           1.29         0.35         0.34      0.23         1.45         0.83     3.00
}}}

So the overall elapsed time for the test run appears to drop by about 20%. That is probably far enough outside the normal XC40 variability to suggest a positive impact. A detailed look at the top routines is harder to interpret, partly because of wild variance in start-up cost (istate_init is the slowest routine in the control run but only a small fraction of that cost in the test run, which is clearly not a result of our optimisations) and partly because the percentage columns are fractions of two quite different totals. Comparing absolute elapsed times instead: tra_adv_muscl drops from about 18.6 s to 16.7 s (~10% faster), sol_pcg from about 6.0 s to 5.1 s (~15% faster), and tra_ldf_iso from about 7.8 s to 4.7 s (~40% faster in both elapsed and CPU time). So as ever it's all pretty hazy, but there's no evidence here to suggest things are any worse, so it should be safe to go ahead with these changes.
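One further check that would firm up the "nothing is worse" conclusion is a field-by-field comparison of the restart files, which hasn't been done rigorously so far (only solver.stat has been compared). Below is a possible sketch of my own, assuming netCDF restart files and the netCDF4/numpy Python modules are available; the file names are illustrative and masked/fill values are not treated specially.

{{{
#!python
"""Field-by-field comparison of two NEMO restart files.

A sketch only: assumes netCDF restarts readable with the netCDF4 module and
reports by name any variable whose values differ between the two files.
"""
import sys

import numpy as np
from netCDF4 import Dataset


def compare_restarts(control_path, test_path):
    """Return True if every variable in the control restart matches the test restart."""
    identical = True
    with Dataset(control_path) as ctl, Dataset(test_path) as tst:
        for name, var in ctl.variables.items():
            if name not in tst.variables:
                print("variable %s missing from test restart" % name)
                identical = False
            elif not np.array_equal(var[:], tst.variables[name][:]):
                print("variable %s differs" % name)
                identical = False
    return identical


if __name__ == "__main__":
    # e.g. python compare_restarts.py control_restart.nc test_restart.nc
    ok = compare_restarts(sys.argv[1], sys.argv[2])
    print("restart files identical" if ok else "restart files differ")
    sys.exit(0 if ok else 1)
}}}

Any differing field would be reported by name, which is a stronger statement than agreement in solver.stat alone.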