= Name and subject of the action

Last edition: '''[[Wikinfo(changed_ts)]]''' by '''[[Wikinfo(changed_by)]]'''

The PI is responsible for closely following the progress of the action, 
and especially for contacting the NEMO project manager if 
the delay in preview (or review) is longer than the expected 2 weeks.

[[PageOutline(2, , inline)]]

== Summary

||=Action       ||ENHANCE-10_acc_fix_traqsr                        ||
||=PI(S)        || acc                                                 ||
||=Digest       || Reduce use of 3D allocatable arrays in RGB light penetration schemes    ||
||=Dependencies || If any                                                ||
||=Branch       || source:/NEMO/branches/{YEAR}/dev_r{REV}_{ACTION_NAME} ||
||=Previewer(s) || Names                                                 ||
||=Reviewer(s)  || Names                                                 ||
||=Ticket       || #XXXX                                                 ||

=== Description

The current implementation of RGB light penetration in traqsr (with either varying or constant chlorophyll) uses six domain-sized, temporary 3D arrays, which can be reduced to a few 2D arrays. The cost of the current implementation is most evident at lower processor counts, where the extra 3D arrays can cause cache misses and memory-bandwidth issues. In an extreme case, traqsr can switch from consuming 2% of run-time to 68% (comparing ORCA025 running on 200 cores vs 48 cores).

A simple redesign of the algorithm should remove this behaviour.

=== Implementation

The current code is structured thus:

{{{
      CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==!
         !
         ALLOCATE( zekb(jpi,jpj)     , zekg(jpi,jpj)     , zekr  (jpi,jpj)     , &
            &      ze0 (jpi,jpj,jpk) , ze1 (jpi,jpj,jpk) , ze2   (jpi,jpj,jpk) , &
            &      ze3 (jpi,jpj,jpk) , zea (jpi,jpj,jpk) , zchl3d(jpi,jpj,jpk)   )
         !
         ! code to set zchl3d(:,:,1:nksr+1)
         !
         !
         zcoef  = ( 1. - rn_abs ) / 3._wp    !* surface equi-partition in R-G-B
         DO_2D_00_00
            ze0(ji,jj,1) = rn_abs * qsr(ji,jj)
            ze1(ji,jj,1) = zcoef  * qsr(ji,jj)
            ze2(ji,jj,1) = zcoef  * qsr(ji,jj)
            ze3(ji,jj,1) = zcoef  * qsr(ji,jj)
            zea(ji,jj,1) =          qsr(ji,jj)
         END_2D
         !
         DO jk = 2, nksr+1                   !* interior equi-partition in R-G-B depending on vertical profile of Chl
            DO_2D_00_00
               zchl = MIN( 10. , MAX( 0.03, zchl3d(ji,jj,jk) ) )
               irgb = NINT( 41 + 20.*LOG10(zchl) + 1.e-15 )
               zekb(ji,jj) = rkrgb(1,irgb)
               zekg(ji,jj) = rkrgb(2,irgb)
               zekr(ji,jj) = rkrgb(3,irgb)
            END_2D

            DO_2D_00_00
               zc0 = ze0(ji,jj,jk-1) * EXP( - e3t(ji,jj,jk-1,Kmm) * xsi0r       )
               zc1 = ze1(ji,jj,jk-1) * EXP( - e3t(ji,jj,jk-1,Kmm) * zekb(ji,jj) )
               zc2 = ze2(ji,jj,jk-1) * EXP( - e3t(ji,jj,jk-1,Kmm) * zekg(ji,jj) )
               zc3 = ze3(ji,jj,jk-1) * EXP( - e3t(ji,jj,jk-1,Kmm) * zekr(ji,jj) )
               ze0(ji,jj,jk) = zc0
               ze1(ji,jj,jk) = zc1
               ze2(ji,jj,jk) = zc2
               ze3(ji,jj,jk) = zc3
               zea(ji,jj,jk) = ( zc0 + zc1 + zc2 + zc3 ) * wmask(ji,jj,jk)
            END_2D
         END DO
         !
         DO_3D_00_00( 1, nksr )
            qsr_hc(ji,jj,jk) = r1_rho0_rcp * ( zea(ji,jj,jk) - zea(ji,jj,jk+1) )
         END_3D
         !
         DEALLOCATE( zekb , zekg , zekr , ze0 , ze1 , ze2 , ze3 , zea , zchl3d )
         !
}}}
Most of the temporary, full-depth arrays are unnecessary because only two vertical levels are required at any one time. In fact, even the zea array is unnecessary, since the zchl3d array can be repurposed once its value has been used.
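
As a rough illustration of the memory involved, the following standalone sketch compares the footprint of the temporaries allocated above (six 3D and three 2D arrays) with that of a single 3D work array. The local domain size (jpi, jpj, jpk) is a hypothetical choice, not a value taken from any NEMO configuration, and double-precision reals are assumed.

{{{
program traqsr_footprint
   ! Sketch only: footprint of the temporary arrays for an assumed local domain.
   implicit none
   integer, parameter :: wp = kind(1.0d0)                  ! assumed working precision
   integer, parameter :: jpi = 120, jpj = 100, jpk = 75    ! hypothetical local domain size
   real(wp) :: zbytes2d, zbytes3d, zmib

   zmib     = 1024._wp * 1024._wp
   ! storage_size returns the size in bits of one array element
   zbytes2d = real(jpi, wp) * real(jpj, wp) * real(storage_size(1._wp) / 8, wp)
   zbytes3d = zbytes2d * real(jpk, wp)

   print '(a,f10.2,a)', 'original (6 x 3D + 3 x 2D): ', &
      &                 ( 6._wp * zbytes3d + 3._wp * zbytes2d ) / zmib, ' MiB'
   print '(a,f10.2,a)', 'single 3D work array      : ', zbytes3d / zmib, ' MiB'
end program traqsr_footprint
}}}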

=== Option 1: Minimum memory usage
By rearranging the loop order so that the vertical loop is innermost, the code can be greatly simplified to an equivalent that uses minimal temporary storage:

{{{
      CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==!
         !
         ALLOCATE( zchl3d(jpi,jpj,jpk)   )
         !
         ! code to set zchl3d(:,:,1:nksr+1)
         !
         !
         !
         zcoef  = ( 1. - rn_abs ) / 3._wp    !* surface equi-partition in R-G-B
         ! store the surface SW radiation;
         ! re-use the surface zchl3d array since the surface chl value is not used
         zchl3d(:,:,1) = qsr(:,:)
         !
         !* interior equi-partition in R-G-B depending on vertical profile of Chl
         DO_2D_00_00
            zc0 = rn_abs * qsr(ji,jj)
            zc1 = zcoef  * qsr(ji,jj)
            zc2 = zc1
            zc3 = zc1
            zc4 = e3t(ji,jj,1,Kmm)
            DO jk = 2, nksr+1
               zchl = MIN( 10. , MAX( 0.03, zchl3d(ji,jj,jk) ) )
               irgb = NINT( 41 + 20.*LOG10(zchl) + 1.e-15 )
               zc0 = zc0 * EXP( - zc4 * xsi0r       )
               zc1 = zc1 * EXP( - zc4 * rkrgb(1,irgb) )
               zc2 = zc2 * EXP( - zc4 * rkrgb(2,irgb) )
               zc3 = zc3 * EXP( - zc4 * rkrgb(3,irgb) )
               zc4 = e3t(ji,jj,jk,Kmm)
               ! store the SW radiation penetrating to this location
               ! re-use the zchl3d array since the chl value at this point will not be needed again
               zchl3d(ji,jj,jk) = ( zc0 + zc1 + zc2 + zc3 ) * wmask(ji,jj,jk)
            END DO
         END_2D
         !
         DO_3D_00_00( 1, nksr )
            qsr_hc(ji,jj,jk) = r1_rho0_rcp * ( zchl3d(ji,jj,jk) - zchl3d(ji,jj,jk+1) )
         END_3D
         !
         DEALLOCATE( zchl3d )
}}}

This is efficient in both code and memory, but it will perform poorly because the array elements are accessed non-contiguously (see performance section below).
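
The access-pattern issue can be illustrated with a standalone sketch that applies the same level-by-level update to a 3D array twice: once with the vertical loop innermost (as in Option 1) and once with it outermost. The array size and the update formula are hypothetical, and the measured difference depends on the compiler, the optimisation level and the cache sizes, so no particular timing is implied.

{{{
program loop_order_stride
   ! Sketch only: compares strided and contiguous traversal of a 3D array.
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: i8 = selected_int_kind(18)
   integer, parameter :: jpi = 200, jpj = 200, jpk = 75   ! hypothetical local domain size
   real(wp), allocatable :: za(:,:,:)
   integer     :: ji, jj, jk
   integer(i8) :: it0, it1, irate

   allocate( za(jpi,jpj,jpk) )

   ! vertical loop innermost: consecutive jk values are jpi*jpj elements apart
   za = 1._wp
   call system_clock( it0, irate )
   do jj = 1, jpj
      do ji = 1, jpi
         do jk = 2, jpk
            za(ji,jj,jk) = 0.5_wp * za(ji,jj,jk-1) + za(ji,jj,jk)
         end do
      end do
   end do
   call system_clock( it1 )
   print *, 'jk innermost: ', real(it1 - it0, wp) / real(irate, wp), ' s, checksum ', sum(za)

   ! vertical loop outermost: ji runs over contiguous memory (column-major order)
   za = 1._wp
   call system_clock( it0, irate )
   do jk = 2, jpk
      do jj = 1, jpj
         do ji = 1, jpi
            za(ji,jj,jk) = 0.5_wp * za(ji,jj,jk-1) + za(ji,jj,jk)
         end do
      end do
   end do
   call system_clock( it1 )
   print *, 'jk outermost: ', real(it1 - it0, wp) / real(irate, wp), ' s, checksum ', sum(za)

   deallocate( za )
end program loop_order_stride
}}}

Both loop nests compute the same values, since each water column depends only on the level above it. The checksum is printed so that the compiler cannot discard the work.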

=== Option 2: Reduce full-depth arrays to single-level arrays where possible
A compromise solution, which reduces memory use and maintains performance, is to remove all unnecessary full-depth arrays but keep the original loop order. 
{{{
      CASE( np_RGB , np_RGBc )         !==  R-G-B fluxes  ==!
         !
         ALLOCATE( zeka(jpi,jpj)       , zekb(jpi,jpj)   ,            &
            &      zekg(jpi,jpj)       , zekr(jpi,jpj)   ,            &
            &      ze0 (jpi,jpj)       , ze1 (jpi,jpj) ,            &
            &      ze2 (jpi,jpj)       , ze3 (jpi,jpj) ,            &
            &      zchl3d(jpi,jpj,jpk)   )
         !
         ! code to set zchl3d(:,:,1:nksr+1)
         !
         !
         zcoef  = ( 1. - rn_abs ) / 3._wp    !* surface equi-partition in R-G-B
         DO_2D_00_00
            ze0(ji,jj) = rn_abs * qsr(ji,jj)
            ze1(ji,jj) = zcoef  * qsr(ji,jj)
            ze2(ji,jj) = zcoef  * qsr(ji,jj)
            ze3(ji,jj) = zcoef  * qsr(ji,jj)
            ! store the surface SW radiation
            ! re-use the surface zchl3d array since the surface chl is not used
            zchl3d(ji,jj,1) =       qsr(ji,jj)
         END_2D
         !
         DO jk = 2, nksr+1                   !* interior equi-partition in R-G-B depending on vertical profile of Chl
            DO_2D_00_00
               zchl = MIN( 10. , MAX( 0.03, zchl3d(ji,jj,jk) ) )
               irgb = NINT( 41 + 20.*LOG10(zchl) + 1.e-15 )
               ze3t = e3t(ji,jj,jk-1,Kmm)
               zeka(ji,jj) = EXP( - ze3t * xsi0r )
               zekb(ji,jj) = EXP( - ze3t * rkrgb(1,irgb) )
               zekg(ji,jj) = EXP( - ze3t * rkrgb(2,irgb) )
               zekr(ji,jj) = EXP( - ze3t * rkrgb(3,irgb) )
            END_2D

            DO_2D_00_00
               ze0(ji,jj) = ze0(ji,jj) * zeka(ji,jj)
               ze1(ji,jj) = ze1(ji,jj) * zekb(ji,jj)
               ze2(ji,jj) = ze2(ji,jj) * zekg(ji,jj)
               ze3(ji,jj) = ze3(ji,jj) * zekr(ji,jj)
               ! store the SW radiation penetrating to this location
               ! re-use the zchl3d array since the chl value at this point will
               ! not be needed again
               zchl3d(ji,jj,jk) = ( ze0(ji,jj) + ze1(ji,jj) + ze2(ji,jj) + ze3(ji,jj) ) * wmask(ji,jj,jk)
            END_2D
         END DO
         !
         DO_3D_00_00( 1, nksr )
            qsr_hc(ji,jj,jk) = r1_rho0_rcp * ( zchl3d(ji,jj,jk) - zchl3d(ji,jj,jk+1) )
         END_3D
         !
         DEALLOCATE( zeka, zekb , zekg , zekr , ze0 , ze1 , ze2 , ze3 , zchl3d )
}}}
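
One way to gain confidence in this reduction is a standalone check on synthetic data, sketched below. It compares the full-depth formulation with a single-level formulation that overwrites the chlorophyll work array level by level. The domain size, the input fields, the values of rn_abs and xsi0r, and the contents of the rkrgb table are all hypothetical stand-ins, not NEMO values. The four bands are held in one array with a band index, and the two 2D loops are merged for brevity.

{{{
program check_rgb_reduction
   ! Sketch only: full-depth vs single-level band arrays on synthetic inputs.
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: jpi = 8, jpj = 6, jpk = 10, nksr = 7          ! hypothetical sizes
   real(wp), parameter :: rn_abs = 0.58_wp, xsi0r = 1._wp / 0.35_wp    ! hypothetical values
   real(wp) :: qsr(jpi,jpj), e3t(jpi,jpj,jpk), wmask(jpi,jpj,jpk)
   real(wp) :: zchl3d(jpi,jpj,jpk), rkrgb(3,61)
   real(wp) :: ze(jpi,jpj,jpk,0:3), zea(jpi,jpj,jpk)    ! original: full-depth band arrays
   real(wp) :: zs(jpi,jpj,0:3), zwork(jpi,jpj,jpk)      ! reduced : single-level band arrays
   real(wp) :: zcoef, zatt(0:3)
   integer  :: ji, jj, jk, jc

   ! synthetic inputs (not physical data)
   call random_number( qsr )    ;   qsr    = 300._wp * qsr
   call random_number( e3t )    ;   e3t    = 1._wp + 9._wp * e3t
   call random_number( zchl3d ) ;   zchl3d = 5._wp * zchl3d
   wmask = 1._wp                ;   wmask(1,1,:) = 0._wp
   do jc = 1, 61                                          ! made-up attenuation table
      rkrgb(1,jc) = 0.02_wp + 0.001_wp * jc
      rkrgb(2,jc) = 0.07_wp + 0.001_wp * jc
      rkrgb(3,jc) = 0.35_wp + 0.001_wp * jc
   end do
   zcoef = ( 1._wp - rn_abs ) / 3._wp

   ! --- original formulation: every band kept at every level
   ze(:,:,1,0) = rn_abs * qsr
   do jc = 1, 3
      ze(:,:,1,jc) = zcoef * qsr
   end do
   zea(:,:,1) = qsr
   do jk = 2, nksr+1
      do jj = 1, jpj
         do ji = 1, jpi
            zatt = attenuation( zchl3d(ji,jj,jk) )
            ze(ji,jj,jk,:) = ze(ji,jj,jk-1,:) * exp( - e3t(ji,jj,jk-1) * zatt )
            zea(ji,jj,jk)  = sum( ze(ji,jj,jk,:) ) * wmask(ji,jj,jk)
         end do
      end do
   end do

   ! --- reduced formulation: bands updated in place, chl work array overwritten
   zwork = zchl3d
   zs(:,:,0) = rn_abs * qsr
   do jc = 1, 3
      zs(:,:,jc) = zcoef * qsr
   end do
   zwork(:,:,1) = qsr                      ! surface chl value is not used
   do jk = 2, nksr+1
      do jj = 1, jpj
         do ji = 1, jpi
            zatt = attenuation( zwork(ji,jj,jk) )
            zs(ji,jj,:)     = zs(ji,jj,:) * exp( - e3t(ji,jj,jk-1) * zatt )
            zwork(ji,jj,jk) = sum( zs(ji,jj,:) ) * wmask(ji,jj,jk)   ! chl no longer needed here
         end do
      end do
   end do

   print *, 'max |original - reduced| = ', &
      &     maxval( abs( zea(:,:,1:nksr+1) - zwork(:,:,1:nksr+1) ) )

contains

   function attenuation( pchl ) result( pk )
      ! attenuation coefficients for the four bands at one grid point
      real(wp), intent(in) :: pchl
      real(wp) :: pk(0:3)
      real(wp) :: zc
      integer  :: irgb
      zc      = min( 10._wp , max( 0.03_wp , pchl ) )
      irgb    = nint( 41 + 20._wp * log10( zc ) + 1.e-15_wp )
      pk(0)   = xsi0r
      pk(1:3) = rkrgb(1:3,irgb)
   end function attenuation

end program check_rgb_reduction
}}}

Because both formulations apply the same operations in the same order at each grid point, the reported difference should be zero or at rounding level.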
''...''

=== Documentation updates

{{{#!box width=55em help
Using previous parts, define the main changes to be done in the NEMO literature 
(manuals, guide, web pages, …).
}}}

''...''

== Preview 

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#preview_)]]
}}}

''...''

== Tests

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#tests)]]
}}}

''...''

== Review

{{{#!box width=50em info
[[Include(wiki:Developers/DevProcess#review)]]
}}}

''...''