2020WP/ENHANCE-10_acc_fix_traqsr (diff) – NEMO

Changes between Version 4 and Version 5 of 2020WP/ENHANCE-10_acc_fix_traqsr


Timestamp:
2020-05-14T16:24:51+02:00
Author:
acc
Comment:

--

  • 2020WP/ENHANCE-10_acc_fix_traqsr

    v4 v5  
    3030The current code is structured thus: 
    3131 
    32 {{{ 
     32{{{#!f 
    3333      CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==! 
    3434         ! 
     
    8484 * rename the zchl3d array to ztmp3d (since it is now used for two purposes) 
    8585 * only allocate ztmp3d to nksr+1; values below this are not used and nksr + 1 is likely << jpk 
    86  * calculate and store the attenuation coefficient look-up table index as soon as the sub-surface chlorophyll value is known. This keeps all LOG operations in one loop. 
    87  
    88 {{{ 
     86 * calculate and store the attenuation coefficient look-up table index as soon as the sub-surface chlorophyll value is known. This keeps all LOG operations in one loop and, in the case of constant chlorophyll, removes the LOG from the loop altogether. 
     87 
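The last bullet above can be sketched as follows. This is a minimal illustration in Python rather than the Fortran of the routine itself. The 0.03 to 10 clamping range, the mapping of 20 table entries per decade with entry 41 at a chlorophyll value of 1, and the function names are all assumptions made for the sketch, not taken from the code shown here.

```python
import math

# Assumed clamping range for the chlorophyll value (hypothetical values).
CHL_MIN, CHL_MAX = 0.03, 10.0


def lut_index(chl):
    """Return a 1-based look-up table index for one chlorophyll value."""
    chl = min(CHL_MAX, max(CHL_MIN, chl))
    # Assumed mapping: 20 entries per decade, entry 41 at chl = 1.
    # floor(x + 0.5) rounds to nearest for the positive values used here.
    return int(math.floor(41.0 + 20.0 * math.log10(chl) + 0.5))


def indices_variable(chl_surface):
    """Variable chlorophyll: one LOG per column, all in a single loop."""
    return [lut_index(c) for c in chl_surface]


def indices_constant(chl_const, n_columns):
    """Constant chlorophyll: the LOG is evaluated once, outside the loop."""
    idx = lut_index(chl_const)
    return [idx] * n_columns
```

Storing the index rather than the chlorophyll value means later loops only need an integer look-up; in the constant case both functions return the same indices, but the second evaluates the logarithm only once.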
     88{{{#!f 
    8989      CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==! 
    9090         ! 
     
    154154 
    155155=== Option 2: Low memory use (retain loop order). 
    156 A compromise solution, which reduces memory use and should perform better is to remove all unnecessary full-depth arrays but maintain loop order by keeping a few 2D arrays.  
    157 {{{ 
     156A compromise solution, which reduces memory use and should perform better, is to remove all unnecessary full-depth arrays but maintain the loop order by keeping a few 2D arrays. The same additional changes listed above are also made. 
     157{{{#!f 
    158158       CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==! 
    159159         ! 
     
    222222Both of these options produce identical results to the original code (based on an ORCA2_ICE_PISCES test using SETTE, which includes variable surface chlorophyll inputs). ln_timing was activated and the CPU time (averaged across all processors) spent in tra_qsr was used as a simple measure of performance. Unfortunately, variations in runtime between successive tests (even with the same code) on the NOC cluster were almost as great as any difference arising from the algorithmic changes. Each test was repeated 6 times with the following results: 
    223223 
    224 || code option    ||  
    225 || original code  ||  0.34      ||  0.34        ||  0.35        ||  0.35        ||  0.34        ||  0.34  ||     
    226 ||  minimum memory option ||  0.36      ||  0.36        ||  0.37        ||  0.36        ||  0.36        ||  0.37  ||  
    227 || low memory option ||  0.35   ||  0.35        ||  0.35        ||  0.36        ||  0.36        ||  0.35  || 
     224|| code option    |||||||||||| CPU seconds spent in tra_qsr || Average || 
     225|| original code  ||  0.34      ||  0.34        ||  0.35        ||  0.35        ||  0.34        ||  0.34  || '''0.3433''' ||     
     226|| minimum memory option ||  0.36      ||  0.36        ||  0.37        ||  0.36        ||  0.36        ||  0.37  || '''0.3633''' ||  
     227|| low memory option ||  0.35   ||  0.35        ||  0.35        ||  0.36        ||  0.36        ||  0.35  || '''0.3533''' || 
    228228 
    229229from which the tentative conclusion is that the minimum memory option consistently performs worst, but the low memory option appears to be a suitable replacement for the original code. More stringent tests are required to confirm this. 
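As a cross-check, the averages in the table above can be re-derived with a short script. This is only a sketch in Python: the timings are copied from the table, while the `summarise` helper and the use of max minus min as a measure of run-to-run spread are choices made for this illustration.

```python
from statistics import mean

# CPU seconds spent in tra_qsr, copied from the table above (6 runs each).
timings = {
    "original code": [0.34, 0.34, 0.35, 0.35, 0.34, 0.34],
    "minimum memory option": [0.36, 0.36, 0.37, 0.36, 0.36, 0.37],
    "low memory option": [0.35, 0.35, 0.35, 0.36, 0.36, 0.35],
}


def summarise(samples):
    """Return the mean and the run-to-run spread (max - min) of one option."""
    return {
        "mean": round(mean(samples), 4),
        "spread": round(max(samples) - min(samples), 2),
    }


summary = {name: summarise(runs) for name, runs in timings.items()}
baseline = summary["original code"]["mean"]

for name, stats in summary.items():
    # Difference between this option's mean and the original code's mean.
    delta = round(stats["mean"] - baseline, 4)
    print(f"{name:22s} mean={stats['mean']:.4f} "
          f"spread={stats['spread']:.2f} delta={delta:+.4f}")
```

With these numbers the spread within each option (0.01 s) is the same size as the gap between the low memory option and the original code, which is why the conclusion can only be tentative.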
    230230 
     231These initial tests were performed using the standard 32-processor SETTE test for ORCA2_ICE_PISCES. To search for a better distinction between the options, further tests were made by varying the number of processors. Tests with 2, 8, 32 and 60 processors were performed (3 for each option at each core count). The following tables show the percentage of CPU time spent in tra_qsr and the rank of the tra_qsr routine in the CPU time-sorted list of routines (a higher rank number means tra_qsr is taking proportionally less of the overall CPU time). In each case the average of the 3 samples is given: 
     232 
     233||||||||  '''% CPU spent in tra_qsr''' || 
     234|| #CPUs || original || min-mem || low-mem || 
     235|| 2 || 1.76 || 1.82 || 1.83 || 
     236|| 8 || 1.38 || 1.48 || 1.46 || 
     237|| 32 || 0.48 || 0.49 || 0.50 || 
     238|| 60 || 0.24 || 0.26 || 0.26 || 
     239 
     240 
     241\\ 
     242||||||||  '''Rank in sorted list of routines by CPU usage ''' || 
     243|| #CPUs || original || min-mem || low-mem || 
     244|| 2 || 14 || 12.67 || 12 || 
     245|| 8 || 16.33 || 15.67 || 15 || 
     246|| 32 || 22.33 || 21.33 || 23.33 || 
     247|| 60 || 26 || 25 || 25 || 
     248 
     249Unfortunately, the message is still mixed. 
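One way to see why the message is mixed is to compare the two new options against the original at each core count. The sketch below, in Python, uses the percentages from the first table above; the helper names are hypothetical, and the differences are at the two-decimal resolution of the table, so they should not be over-interpreted.

```python
# Average % of CPU time spent in tra_qsr, copied from the first table above.
# Each tuple is (original, min-mem, low-mem).
pct_cpu = {
    2: (1.76, 1.82, 1.83),
    8: (1.38, 1.48, 1.46),
    32: (0.48, 0.49, 0.50),
    60: (0.24, 0.26, 0.26),
}


def relative_increase(option, original):
    """Percentage increase of an option's share over the original's share."""
    return 100.0 * (option - original) / original


def cheaper_option(min_mem, low_mem):
    """Name the option with the smaller share of CPU time, or 'tie'."""
    if min_mem < low_mem:
        return "min-mem"
    if low_mem < min_mem:
        return "low-mem"
    return "tie"


comparison = {}
for ncpu, (original, min_mem, low_mem) in pct_cpu.items():
    comparison[ncpu] = {
        "min_mem_pct": relative_increase(min_mem, original),
        "low_mem_pct": relative_increase(low_mem, original),
        "cheaper": cheaper_option(min_mem, low_mem),
    }
    print(ncpu, comparison[ncpu])
```

On these figures neither new option is cheaper than the other at every core count, and both take a slightly larger share of CPU time than the original throughout.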
     250 
     251[[Image()]] 
    231252''...'' 
    232253