Changes between Version 11 and Version 12 of 2020WP/ENHANCE-10_acc_fix_traqsr
Timestamp: 2020-05-19T11:51:35+02:00 (4 years ago)
2020WP/ENHANCE-10_acc_fix_traqsr
v11 v12
253 253  == Option 2 revisited ==
254 254
255      Following discussions with the previewer, it was decided that the low-memory option should be the best approach but the slight deterioration in performance over the original code may be down to the over-zealous replacement of temporary scalars within the second 3D loop. On reflection there are also opportunities to reduce the number of floating point operations and load and store instructions within the first 3D loop.
256
257      Although significant variation between identical runs on the NOC cluster means the evidence is not conclusive, this second version of the low-memory option does appear to improve on the original code and is certainly no worse whilst using less storage. Here are the tables with the new results added. Graphs are shown below the code differences.
    255  Following discussions with the previewer, it was decided that the low-memory option should be the best approach but the slight deterioration in performance over the original code may be down to the over-zealous replacement of temporary scalars within the second 3D loop. On reflection there are also opportunities to reduce the number of floating point operations and load and store instructions within the first 3D loop. Importantly, there are also opportunities to avoid some expensive operations by performing some calculations in log space. Here are the mathematically equivalent alternatives:
    256
    257  {{{#!diff
    258  --- traqsr.F90 2020-05-19 10:28:06.858457146 +0100
    259  +++ LOMEM3/traqsr.F90 2020-05-15 15:44:11.652736539 +0100
    260  @@ -111,7 +111,6 @@
    261        REAL(wp) :: zzc0, zzc1, zzc2, zzc3 ! - -
    262        REAL(wp) :: zz0 , zz1 , ze3t, zlui ! - -
    263        REAL(wp) :: zCb, zCmax, zze, zpsi, zpsimax, zdelpsi, zCtot, zCze
    264  -     REAL(wp) :: zlogze, zlogCtot, zlogCze
    265        REAL(wp) :: zlogc
    266        REAL(wp), ALLOCATABLE, DIMENSION(:,:) :: ze0, ze1, ze2, ze3
    267        REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: ztrdt, zetot, ztmp3d
    268  @@ -168,21 +167,18 @@
    269           ! Separation in R-G-B depending of the surface Chl
    270           DO_3D_00_00 ( 1, nksr + 1 )
    271              zchl = MIN( 10. , MAX( 0.03, sf_chl(1)%fnow(ji,jj,1) ) )
    272  +           zCze = 1.12 * zchl**0.803
    273  +           zCtot = 40.6 * zchl**0.459
    274              zlogc = LOG( zchl )
    275  -           zlogCze = 0.113328685307 + 0.803 * zlogc ! log(zCze = 1.12 * zchl**0.803)
    276  -           zlogCtot= 3.703768066608 + 0.459 * zlogc ! log(zCtot = 40.6 * zchl**0.459)
    277              !
    278              zCb = 0.768 + zlogc * ( 0.087 - zlogc * ( 0.179 + zlogc * 0.025 ) )
    279              zCmax = 0.299 - zlogc * ( 0.289 - zlogc * 0.579 )
    280              zpsimax = 0.6 - zlogc * ( 0.640 - zlogc * ( 0.021 + zlogc * 0.115 ) )
    281              zdelpsi = 0.710 + zlogc * ( 0.159 + zlogc * 0.021 )
    282              !
    283  -           zlogze = 6.34247346942 - 0.746 * zlogCtot ! log(zze = 568.2 * zCtot**(-0.746))
    284  -           IF( zlogze > 4.62497281328 ) zlogze = 5.298317366548 - 0.293 * zlogCtot
    285  -           ! log(IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293))
    286  -           zze = EXP( zlogze )
    287  -           zpsi = gdepw(ji,jj,jk,Kmm) / zze
    288  -           zCze = EXP( zlogCze )
    289  +           zze = 568.2 * zCtot**(-0.746)
    290  +           IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293)
    291  +           zpsi = gdepw(ji,jj,jk,Kmm) / zze
    292              !
    293              ! NB. make sure zchl value is such that: zchl = MIN( 10. , MAX( 0.03, zchl ) )
    294              zchl = MIN( 10. , MAX( 0.03, zCze * ( zCb + zCmax * EXP( -( (zpsi - zpsimax) / zdelpsi )**2 ) ) ) )
    295  }}}
    296
    297  Despite significant variation between identical runs on the NOC cluster there is evidence that this second version of the low-memory option improves on the original code and is certainly no worse whilst using less storage. Here are the tables with the new results added. Graphs are shown below the code differences.
258 298
259 299  |||||||||| '''% CPU spent in tra_qsr''' ||
260 300  || #CPUs || original || min-mem || low-mem || low-mem v2 ||
261      || 2 || 1.76 || 1.82 || 1.83 || 1.68 ||
262      || 8 || 1.38 || 1.48 || 1.46 || 1.14 ||
263      || 32 || 0.48 || 0.49 || 0.5 || 0.44 ||
264      || 60 || 0.24 || 0.26 || 0.26 || 0.13 ||
    301  || 2 || 1.76 || 1.82 || 1.83 || 1.19 ||
    302  || 8 || 1.38 || 1.48 || 1.46 || 0.56 ||
    303  || 32 || 0.48 || 0.49 || 0.5 || 0.3 ||
    304  || 60 || 0.24 || 0.26 || 0.26 || 0.08 ||
265 305
266 306
  …   …
268 308  |||||||||| '''Rank in sorted list of routines by CPU usage ''' ||
269 309  || #CPUs || original || min-mem || low-mem || low-mem v2 ||
270      || 2 || 14 || 12.67 || 12 || 14 ||
271      || 8 || 16.33 || 15.67 || 15 || 17.33 ||
272      || 32 || 22.33 || 21.33 || 23.33 || 23 ||
273      || 60 || 26 || 25 || 25 || 26 ||
    310  || 2 || 14 || 12.67 || 12 || 18.33 ||
    311  || 8 || 16.33 || 15.67 || 15 || 21 ||
    312  || 32 || 22.33 || 21.33 || 23.33 || 27 ||
    313  || 60 || 26 || 25 || 25 || 30 ||
274 314
275 315  Here is the final set of differences between this improved low-memory solution and the original traqsr.F90:
  …   …
277 317  {{{#!diff
278 318  --- ORG/traqsr.F90 2020-05-13 11:37:57.094258396 +0100
279      +++ traqsr.F90 2020-05-15 14:48:00.138206859 +0100
280      @@ -109,12 +109,11 @@
    319  +++ traqsr.F90 2020-05-19 10:28:06.858457146 +0100
    320  @@ -109,12 +109,12 @@
281 321        REAL(wp) :: zchl, zcoef, z1_2 ! local scalars
282 322        REAL(wp) :: zc0 , zc1 , zc2 , zc3 ! - -
  …   …
289 329  -     REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: ze0, ze1, ze2, ze3, zea, ztrdt
290 330  -     REAL(wp), ALLOCATABLE, DIMENSION(:,:,:) :: zetot, zchl3d
    331  +     REAL(wp) :: zlogze, zlogCtot, zlogCze
291 332  +     REAL(wp) :: zlogc
292 333  +     REAL(wp), ALLOCATABLE, DIMENSION(:,:) :: ze0, ze1, ze2, ze3
  …   …
295 336        !
296 337        IF( ln_timing ) CALL timing_start('tra_qsr')
297      @@ -159,77 +158,75 @@
    338  @@ -159,77 +159,78 @@
298 339        !
299 340        CASE( np_RGB , np_RGBc ) !== R-G-B fluxes ==!
  …   …
311 352  +        DO_3D_00_00 ( 1, nksr + 1 )
312 353  +           zchl = MIN( 10. , MAX( 0.03, sf_chl(1)%fnow(ji,jj,1) ) )
313      +           zCze = 1.12 * zchl**0.803
314      +           zCtot = 40.6 * zchl**0.459
315 354  +           zlogc = LOG( zchl )
    355  +           zlogCze = 0.113328685307 + 0.803 * zlogc ! log(zCze = 1.12 * zchl**0.803)
    356  +           zlogCtot= 3.703768066608 + 0.459 * zlogc ! log(zCtot = 40.6 * zchl**0.459)
316 357  +           !
317 358  +           zCb = 0.768 + zlogc * ( 0.087 - zlogc * ( 0.179 + zlogc * 0.025 ) )
  …   …
320 361  +           zdelpsi = 0.710 + zlogc * ( 0.159 + zlogc * 0.021 )
321 362  +           !
322      +           zze = 568.2 * zCtot**(-0.746)
323      +           IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293)
324      +           zpsi = gdepw(ji,jj,jk,Kmm) / zze
    363  +           zlogze = 6.34247346942 - 0.746 * zlogCtot ! log(zze = 568.2 * zCtot**(-0.746))
    364  +           IF( zlogze > 4.62497281328 ) zlogze = 5.298317366548 - 0.293 * zlogCtot
    365  +           ! log(IF( zze > 102. ) zze = 200.0 * zCtot**(-0.293))
    366  +           zze = EXP( zlogze )
    367  +           zpsi = gdepw(ji,jj,jk,Kmm) / zze
    368  +           zCze = EXP( zlogCze )
325 369  +           !
326 370  +           ! NB. make sure zchl value is such that: zchl = MIN( 10. , MAX( 0.03, zchl ) )
  …   …
426 470        !
427 471        CASE( np_2BD ) !== 2-bands fluxes ==!
428            !
429 472  }}}
430 473  [[Image(percent_cpu_qsr.2.png)]]
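
The log-space rewrite shown in the diffs above relies on hard-coded logarithms of the original coefficients (1.12, 40.6, 568.2, 102 and 200). As a rough sanity check, the sketch below compares the two forms numerically over the clamped chlorophyll range [0.03, 10]. It is standalone Python rather than Fortran and is not part of traqsr.F90; the function names and the sampling scheme are assumptions made for illustration only.

```python
import math

# Coefficients are copied from the diff; function names are illustrative only.

def zze_direct(zchl):
    """zze computed with power functions (the direct form)."""
    zctot = 40.6 * zchl ** 0.459
    zze = 568.2 * zctot ** (-0.746)
    if zze > 102.0:
        zze = 200.0 * zctot ** (-0.293)
    return zze

def zze_logspace(zchl):
    """zze computed in log space, using the hard-coded logarithms."""
    zlogc = math.log(zchl)
    zlogctot = 3.703768066608 + 0.459 * zlogc      # log(40.6 * zchl**0.459)
    zlogze = 6.34247346942 - 0.746 * zlogctot      # log(568.2 * zCtot**(-0.746))
    if zlogze > 4.62497281328:                     # log(102.)
        zlogze = 5.298317366548 - 0.293 * zlogctot # log(200.0 * zCtot**(-0.293))
    return math.exp(zlogze)

def zcze_direct(zchl):
    """zCze computed with a power function."""
    return 1.12 * zchl ** 0.803

def zcze_logspace(zchl):
    """zCze computed in log space."""
    return math.exp(0.113328685307 + 0.803 * math.log(zchl))

def max_rel_diff(n=201):
    """Largest relative difference between the two forms.

    Samples zchl log-uniformly between 0.03 and 10, the range enforced
    by the MIN/MAX clamp in the Fortran code.
    """
    worst = 0.0
    for i in range(n):
        zchl = 0.03 * (10.0 / 0.03) ** (i / (n - 1))
        for direct, logsp in ((zze_direct, zze_logspace),
                              (zcze_direct, zcze_logspace)):
            a, b = direct(zchl), logsp(zchl)
            worst = max(worst, abs(a - b) / abs(a))
    return worst

if __name__ == "__main__":
    print("max relative difference:", max_rel_diff())
```

Any difference reported should be limited by the number of digits kept in the hard-coded logarithms. Both branches of the `zze` test are exercised, since `zze` exceeds 102 at the low end of the chlorophyll range and falls below it at the high end. A sample landing exactly on the branch threshold could pick different branches in the two forms, so this is a spot check rather than a proof of equivalence.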