Changes between Version 11 and Version 12 of 2020WP/ENHANCE10_acc_fix_traqsr
 Timestamp:
 2020-05-19T11:51:35+02:00

2020WP/ENHANCE10_acc_fix_traqsr
v11 v12
253 253  == Option 2 revisited ==
254 254
255      Following discussions with the previewer, it was decided that the low-memory option is the best approach, but that its slight deterioration in performance relative to the original code may be down to the overzealous replacement of temporary scalars within the second 3D loop. On reflection, there are also opportunities to reduce the number of floating-point operations and load and store instructions within the first 3D loop.
256
257      Although significant variation between identical runs on the NOC cluster means the evidence is not conclusive, this second version of the low-memory option does appear to improve on the original code and is certainly no worse whilst using less storage. Here are the tables with the new results added. Graphs are shown below the code differences.
    255  Following discussions with the previewer, it was decided that the low-memory option is the best approach, but that its slight deterioration in performance relative to the original code may be down to the overzealous replacement of temporary scalars within the second 3D loop. On reflection, there are also opportunities to reduce the number of floating-point operations and load and store instructions within the first 3D loop. Importantly, some expensive operations can also be avoided by performing some of the calculations in log space. Here are the mathematically equivalent alternatives:
    256
    257  {{{#!diff
    258  --- traqsr.F90 2020-05-19 10:28:06.858457146 +0100
    259  +++ LOMEM3/traqsr.F90 2020-05-15 15:44:11.652736539 +0100
    260  @@ -111,7 +111,6 @@
    261    REAL(wp) :: zzc0, zzc1, zzc2, zzc3 !    -      -
    262    REAL(wp) :: zz0 , zz1 , ze3t, zlui !    -      -
    263    REAL(wp) :: zCb, zCmax, zze, zpsi, zpsimax, zdelpsi, zCtot, zCze
    264  - REAL(wp) :: zlogze, zlogCtot, zlogCze
    265    REAL(wp) :: zlogc
    266    REAL(wp), ALLOCATABLE, DIMENSION(:,:) :: ze0, ze1, ze2, ze3
    267    REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: ztrdt, zetot, ztmp3d
    268  @@ -168,21 +167,18 @@
    269    ! Separation in RGB depending of the surface Chl
    270    DO_3D_00_00 ( 1, nksr + 1 )
    271    zchl = MIN( 10. , MAX( 0.03, sf_chl(1)%fnow(ji,jj,1) ) )
    272  + zCze = 1.12 * zchl**0.803
    273  + zCtot = 40.6 * zchl**0.459
    274    zlogc = LOG( zchl )
    275  - zlogCze = 0.113328685307 + 0.803 * zlogc ! log(zCze = 1.12 * zchl**0.803)
    276  - zlogCtot= 3.703768066608 + 0.459 * zlogc ! log(zCtot = 40.6 * zchl**0.459)
    277    !
    278    zCb = 0.768 + zlogc * ( 0.087 - zlogc * ( 0.179 + zlogc * 0.025 ) )
    279    zCmax = 0.299 - zlogc * ( 0.289 - zlogc * 0.579 )
    280    zpsimax = 0.6 - zlogc * ( 0.640 - zlogc * ( 0.021 + zlogc * 0.115 ) )
    281    zdelpsi = 0.710 + zlogc * ( 0.159 + zlogc * 0.021 )
    282    !
    283  - zlogze = 6.34247346942 - 0.746 * zlogCtot ! log(zze = 568.2 * zCtot**(-0.746))
    284  - IF( zlogze > 4.62497281328 ) zlogze = 5.298317366548 - 0.293 * zlogCtot
    285  - ! log(IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293))
    286  - zze = EXP( zlogze )
    287  - zpsi = gdepw(ji,jj,jk,Kmm) / zze
    288  - zCze = EXP( zlogCze )
    289  + zze = 568.2 * zCtot**(-0.746)
    290  + IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293)
    291  + zpsi = gdepw(ji,jj,jk,Kmm) / zze
    292    !
    293    ! NB. make sure zchl value is such that: zchl = MIN( 10. , MAX( 0.03, zchl ) )
    294    zchl = MIN( 10. , MAX( 0.03, zCze * ( zCb + zCmax * EXP( -( (zpsi - zpsimax) / zdelpsi )**2 ) ) ) )
    295  }}}
    296
    297  Despite significant variation between identical runs on the NOC cluster, there is evidence that this second version of the low-memory option improves on the original code, and it is certainly no worse whilst using less storage. Here are the tables with the new results added. Graphs are shown below the code differences.
258 298
259 299  || '''% CPU spent in tra_qsr''' ||
260 300  || #CPUs || original || minmem || lowmem || lowmem v2 ||
261      || 2 || 1.76 || 1.82 || 1.83 || 1.68 ||
262      || 8 || 1.38 || 1.48 || 1.46 || 1.14 ||
263      || 32 || 0.48 || 0.49 || 0.5 || 0.44 ||
264      || 60 || 0.24 || 0.26 || 0.26 || 0.13 ||
    301  || 2 || 1.76 || 1.82 || 1.83 || 1.19 ||
    302  || 8 || 1.38 || 1.48 || 1.46 || 0.56 ||
    303  || 32 || 0.48 || 0.49 || 0.5 || 0.3 ||
    304  || 60 || 0.24 || 0.26 || 0.26 || 0.08 ||
265 305
266 306
  …   …
268 308  || '''Rank in sorted list of routines by CPU usage''' ||
269 309  || #CPUs || original || minmem || lowmem || lowmem v2 ||
270      || 2 || 14 || 12.67 || 12 || 14 ||
271      || 8 || 16.33 || 15.67 || 15 || 17.33 ||
272      || 32 || 22.33 || 21.33 || 23.33 || 23 ||
273      || 60 || 26 || 25 || 25 || 26 ||
    310  || 2 || 14 || 12.67 || 12 || 18.33 ||
    311  || 8 || 16.33 || 15.67 || 15 || 21 ||
    312  || 32 || 22.33 || 21.33 || 23.33 || 27 ||
    313  || 60 || 26 || 25 || 25 || 30 ||
274 314
275 315  Here is the final set of differences between this improved low-memory solution and the original traqsr.F90:
  …   …
277 317  {{{#!diff
278 318  --- ORG/traqsr.F90 2020-05-13 11:37:57.094258396 +0100
279      +++ traqsr.F90 2020-05-15 14:48:00.138206859 +0100
280      @@ -109,12 +109,11 @@
    319  +++ traqsr.F90 2020-05-19 10:28:06.858457146 +0100
    320  @@ -109,12 +109,12 @@
281 321    REAL(wp) :: zchl, zcoef, z1_2 ! local scalars
282 322    REAL(wp) :: zc0 , zc1 , zc2 , zc3 !    -      -
  …   …
289 329  - REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: ze0, ze1, ze2, ze3, zea, ztrdt
290 330  - REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: zetot, zchl3d
    331  + REAL(wp) :: zlogze, zlogCtot, zlogCze
291 332  + REAL(wp) :: zlogc
292 333  + REAL(wp), ALLOCATABLE, DIMENSION(:,:) :: ze0, ze1, ze2, ze3
  …   …
295 336    !
296 337    IF( ln_timing ) CALL timing_start('tra_qsr')
297      @@ -159,77 +158,75 @@
    338  @@ -159,77 +159,78 @@
298 339    !
299 340    CASE( np_RGB , np_RGBc ) !== RGB fluxes ==!
  …   …
311 352  + DO_3D_00_00 ( 1, nksr + 1 )
312 353  + zchl = MIN( 10. , MAX( 0.03, sf_chl(1)%fnow(ji,jj,1) ) )
313      + zCze = 1.12 * zchl**0.803
314      + zCtot = 40.6 * zchl**0.459
315 354  + zlogc = LOG( zchl )
    355  + zlogCze = 0.113328685307 + 0.803 * zlogc ! log(zCze = 1.12 * zchl**0.803)
    356  + zlogCtot= 3.703768066608 + 0.459 * zlogc ! log(zCtot = 40.6 * zchl**0.459)
316 357  + !
317 358  + zCb = 0.768 + zlogc * ( 0.087 - zlogc * ( 0.179 + zlogc * 0.025 ) )
  …   …
320 361  + zdelpsi = 0.710 + zlogc * ( 0.159 + zlogc * 0.021 )
321 362  + !
322      + zze = 568.2 * zCtot**(-0.746)
323      + IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293)
324      + zpsi = gdepw(ji,jj,jk,Kmm) / zze
    363  + zlogze = 6.34247346942 - 0.746 * zlogCtot ! log(zze = 568.2 * zCtot**(-0.746))
    364  + IF( zlogze > 4.62497281328 ) zlogze = 5.298317366548 - 0.293 * zlogCtot
    365  + ! log(IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293))
    366  + zze = EXP( zlogze )
    367  + zpsi = gdepw(ji,jj,jk,Kmm) / zze
    368  + zCze = EXP( zlogCze )
325 369  + !
326 370  + ! NB. make sure zchl value is such that: zchl = MIN( 10. , MAX( 0.03, zchl ) )
  …   …
426 470    !
427 471    CASE( np_2BD ) !== 2-bands fluxes ==!
428        !
429 472  }}}
430 473  [[Image(percent_cpu_qsr.2.png)]]
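The log-space rewrite in the first diff above rests on the identity log(a * x**b) = log(a) + b * log(x), so each power of zchl or zCtot becomes a multiply-add on zlogc or zlogCtot. The following is a minimal stand-alone sketch for checking the constants numerically. It is not part of traqsr.F90: the program name `check_log_space`, the sample count `jpn` and the double-precision definition of `wp` are assumptions made for illustration.

```fortran
program check_log_space
   ! Hypothetical stand-alone check, not taken from traqsr.F90.
   ! Compares the direct (power) form with the log-space form used in the diff.
   implicit none
   integer, parameter :: wp  = selected_real_kind(12, 307)   ! assumed: double precision
   integer, parameter :: jpn = 1000                          ! assumed: number of sample intervals
   real(wp) :: zchl, zlogc
   real(wp) :: zCze, zCtot, zze                 ! direct form
   real(wp) :: zlogCze, zlogCtot, zlogze        ! log-space form
   real(wp) :: zerr, zmaxerr
   integer  :: ji

   zmaxerr = 0._wp
   do ji = 0, jpn
      ! sample chlorophyll geometrically over the clamped range [0.03, 10]
      zchl = 0.03_wp * ( 10._wp / 0.03_wp )**( real(ji, wp) / real(jpn, wp) )
      !
      ! direct form: three or four power operations per point
      zCze  = 1.12_wp  * zchl**0.803_wp
      zCtot = 40.6_wp  * zchl**0.459_wp
      zze   = 568.2_wp * zCtot**(-0.746_wp)
      if( zze > 102._wp ) zze = 200.0_wp * zCtot**(-0.293_wp)
      !
      ! log-space form: one LOG, then multiply-adds (two EXP calls recover zze and zCze)
      zlogc    = log( zchl )
      zlogCze  = 0.113328685307_wp + 0.803_wp * zlogc      ! log(1.12)  + 0.803 * log(zchl)
      zlogCtot = 3.703768066608_wp + 0.459_wp * zlogc      ! log(40.6)  + 0.459 * log(zchl)
      zlogze   = 6.34247346942_wp  - 0.746_wp * zlogCtot   ! log(568.2) - 0.746 * log(zCtot)
      if( zlogze > 4.62497281328_wp ) zlogze = 5.298317366548_wp - 0.293_wp * zlogCtot
      !
      ! largest relative difference between the two forms at this point
      zerr    = max( abs( exp( zlogze ) - zze ) / zze, abs( exp( zlogCze ) - zCze ) / zCze )
      zmaxerr = max( zmaxerr, zerr )
   end do

   print *, 'max relative difference: ', zmaxerr
   if( zmaxerr > 1.e-8_wp ) error stop 'log-space form disagrees with direct form'
end program check_log_space
```

Built with a Fortran 2008 compiler (for example `gfortran check_log_space.f90`), the program prints the largest relative difference found. Any difference should come only from the roughly 12-digit rounding of the logarithmic constants. The exception is a chlorophyll value that falls exactly on the `zze > 102.` switch, where the two forms could take different branches.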