Changes between Version 6 and Version 7 of 2019WP/ENHANCE-04_AndrewC-reporting
- Timestamp: 2019-01-16T15:26:45+01:00
2019WP/ENHANCE-04_AndrewC-reporting
 v6  v7
  8   8
  9   9  == Summary
 10  10
     11  Investigate ways of improving the code's reporting facilities. Currently, errors away from the lead process are not fully reported and the ln_ctl mechanism produces an overwhelming volume of output for large processor counts. Options are needed to produce more selective output (either by output type or processor range). This is an un-started task from the 2018WP (formerly ROBUST-06_AndrewC-reporting) that has been carried forward to 2019.
 11  12  ''#2167''
 12  13
 13  14  == Preview
 14  15  '''(version 2 following initial preview by Sebastien (see his comments below))'''
 15  16  {{{#!box help
 16  17  [[Include(wiki:Developers/DevProcess#preview_)]]
  …   …
 25  26  mpp.output_XXXX
 26  27  mpp.top.output_XXX <---- correct this to 4-digits for consistency
 27      EMPave.dat_XXXX
     28  EMPave.dat_XXXX <---- These are all identical, suppress writing to write-master only
 28  29  icebergs.stat_XXXX
 29  30  }}}
  …   …
 36  37  3. OCE/stpctl.F90
 37  38  4. TOP/prtctl_trc.F90 TOP/trcini.F90 TOP/trcstp.F90
 38      5. OCE/LBC/mppini.F90
     39  5. OCE/LBC/mppini.F90 OCE/SBC/sbcfwb.F90
 39  40  6. sette/sette.sh
 40  41  }}}
     42
     43  To reiterate, these changes allow run.stat and tracer.stat to be produced even when ln_ctl is .false. They also introduce additional controls: run.stat, tracer.stat and time.step can be updated at integer multiples of the time step rather than at every time step, and some types of output (e.g. layout.dat) can be restricted to a subset of processing regions. There are good arguments for separating out the multiple uses that ln_ctl has grown to support (see Sebastien's comments on its history), but I propose these changes as a quick, pre-release addition.
     44
 41  45  ''' 1. OCE/IOM/in_out_manager.F90 cfgs/SHARED/namelist_ref '''
 42  46
  …   …
 48  52  * l_config Activates use of the settings in the rest of the structure (specifically for when ln_ctl is false)
 49  53  * l_runstat, l_trcstat Activates production of global stats files. Only a single file of each of these is ever produced.
 50      * l_oceout, l_layout, l_EMPave Normal operation is to produce a single version of each of these. If true then a version for each area is produced
     54  * l_oceout, l_layout Normal operation is to produce a single version of each of these. If true then a version for each area is produced
 51  55  * l_mppout, l_mpptop, l_icbstat Suppressed if false, otherwise produce a version for each area.
 52  56  * procmin, procmax, procincr Allow subsetting of areas when producing output in the previous two categories. Default values will ensure all areas report.
     57  * ptimincr Timestep increment for outputting of time step status information less frequently (affects run.stat, tracer.stat and time.step only)
 53  58  }}}
 54  59
  …   …
 56  61  Index: OCE/IOM/in_out_manager.F90
 57  62  ===================================================================
 58      --- OCE/IOM/in_out_manager.F90 (revision 10459)
     63  --- OCE/IOM/in_out_manager.F90 (revision 10530)
 59  64  +++ OCE/IOM/in_out_manager.F90 (working copy)
 60  65  @@ -99,6 +99,27 @@
  …   …
 71  76  + LOGICAL :: l_oceout = .FALSE. !: Produce all ocean.outputs (T) or just one (F)
 72  77  + LOGICAL :: l_layout = .FALSE. !: Produce all layout.dat files (T) or just one (F)
 73      + LOGICAL :: l_EMPave = .FALSE. !: Produce all EMPave.dat files (T) or just one (F) (if active)
 74  78  + LOGICAL :: l_mppout = .FALSE. !: Produce/do not produce mpp.output_XXXX files (T/F)
 75  79  + LOGICAL :: l_mpptop = .FALSE. !: Produce/do not produce mpp.top.output_XXXX files (T/F)
  …   …
 81  85  + INTEGER :: procmax = 1000000 !: Maximum narea to output
 82  86  + INTEGER :: procincr = 1 !: narea increment to output
     87  + INTEGER :: ptimincr = 1 !: timestep increment to output (time.step and run.stat)
 83  88  + END TYPE
 84  89  + TYPE (sn_ctl) :: sn_cfctl !: run control structure for selective output
  …   …
 89  94  Index: cfgs/SHARED/namelist_ref
 90  95  ===================================================================
 91      --- cfgs/SHARED/namelist_ref (revision 10459)
     96  --- cfgs/SHARED/namelist_ref (revision 10530)
 92  97  +++ cfgs/SHARED/namelist_ref (working copy)
 93      @@ -1301,7 +1301,19 @@
     98  @@ -1303,7 +1303,19 @@
 94  99  !-----------------------------------------------------------------------
 95 100  &namctl ! Control prints (default: OFF)
  …   …
102 107  + sn_cfctl%l_oceout = .FALSE. ! that all areas report.
103 108  + sn_cfctl%l_layout = .FALSE. !
104      + sn_cfctl%l_EMPave = .FALSE. !
105 109  + sn_cfctl%l_mppout = .FALSE. !
106 110  + sn_cfctl%l_mpptop = .FALSE. !
  …   …
109 113  + sn_cfctl%procmax = 1000000 ! Maximum area number for reporting [default:1000000]
110 114  + sn_cfctl%procincr = 1 ! Increment for optional subsetting of areas [default:1]
    115  + sn_cfctl%ptimincr = 1 ! Timestep increment for writing time step progress info
111 116  nn_print = 0 ! level of print (0 no extra print)
112 117  nn_ictls = 0 ! start i indice of control sum (use to compare mono versus
113 118  nn_ictle = 0 ! end i indice of control sum multi processor runs
    119
114 120  }}}
115 121
  …   …
121 127  Index: OCE/nemogcm.F90
122 128  ===================================================================
123      --- OCE/nemogcm.F90 (revision 10459)
    129  --- OCE/nemogcm.F90 (revision 10530)
124 130  +++ OCE/nemogcm.F90 (working copy)
125 131  @@ -256,8 +256,8 @@
  …   …
134 140  NAMELIST/namcfg/ ln_read_cfg, cn_domcfg, ln_closea, ln_write_cfg, cn_domcfg_out, ln_use_jattr
135 141  !!----------------------------------------------------------------------
136      @@ -327,6 +327,17 @@
137
    142  @@ -327,6 +327,18 @@
    143
138 144  narea = narea + 1 ! mynode return the rank of proc (0 --> jpnij -1 )
139 145
140 146  + IF( sn_cfctl%l_config ) THEN
141 147  + ! Activate finer control of report outputs
  …   …
146 152  + & CALL nemo_set_cfctl( sn_cfctl, .FALSE., .FALSE. )
147 153  + ELSE
    154  + ! Use ln_ctl to turn on or off all options.
148 155  + CALL nemo_set_cfctl( sn_cfctl, ln_ctl, .TRUE. )
149 156  + ENDIF
  …   …
151 158  lwm = (narea == 1) ! control of output namelists
152 159  lwp = (narea == 1) .OR. ln_ctl ! control of all listing output print
153
154      @@ -489,6 +500,18 @@
    160
    161  @@ -503,6 +515,18 @@
155 162  WRITE(numout,*) '~~~~~~~~'
156 163  WRITE(numout,*) ' Namelist namctl'
  …   …
161 168  + WRITE(numout,*) ' sn_cfctl%l_oceout = ', sn_cfctl%l_oceout
162 169  + WRITE(numout,*) ' sn_cfctl%l_layout = ', sn_cfctl%l_layout
163      + WRITE(numout,*) ' sn_cfctl%l_EMPave = ', sn_cfctl%l_EMPave
164 170  + WRITE(numout,*) ' sn_cfctl%l_mppout = ', sn_cfctl%l_mppout
165 171  + WRITE(numout,*) ' sn_cfctl%l_mpptop = ', sn_cfctl%l_mpptop
  …   …
168 174  + WRITE(numout,*) ' sn_cfctl%procmax = ', sn_cfctl%procmax
169 175  + WRITE(numout,*) ' sn_cfctl%procincr = ', sn_cfctl%procincr
    176  + WRITE(numout,*) ' sn_cfctl%ptimincr = ', sn_cfctl%ptimincr
170 177  WRITE(numout,*) ' level of print nn_print = ', nn_print
171 178  WRITE(numout,*) ' Start i indice for SUM control nn_ictls = ', nn_ictls
172 179  WRITE(numout,*) ' End i indice for SUM control nn_ictle = ', nn_ictle
173      @@ -635,6 +658,35 @@
    180  @@ -649,6 +673,34 @@
174 181  !
175 182  END SUBROUTINE nemo_alloc
176 183
177 184  + SUBROUTINE nemo_set_cfctl(sn_cfctl, setto, for_all )
178 185  + !!----------------------------------------------------------------------
  …   …
198 205  + sn_cfctl%l_oceout = setto
199 206  + sn_cfctl%l_layout = setto
200      + sn_cfctl%l_EMPave = setto
201 207  + sn_cfctl%l_mppout = setto
202 208  + sn_cfctl%l_mpptop = setto
  …   …
215 221
216 222  {{{#!diff
217
218 223  Index: OCE/stpctl.F90
219 224  ===================================================================
220      --- OCE/stpctl.F90 (revision 10459)
    225  --- OCE/stpctl.F90 (revision 10530)
221 226  +++ OCE/stpctl.F90 (working copy)
222      @@ -33,7 +33,7 @@
223      PUBLIC stp_ctl ! routine called by step.F90
224
225      INTEGER :: idrun, idtime, idssh, idu, ids1, ids2, idt1, idt2, idc1, idw1, istatus
226      - LOGICAL :: lsomeoce
227      + LOGICAL :: lsomeoce, lcolruns, lwrtruns
228      !!----------------------------------------------------------------------
229      !! NEMO/OCE 4.0 , NEMO Consortium (2018)
230      !! $Id$
231      @@ -69,6 +69,8 @@
    227  @@ -66,17 +66,22 @@
    228  INTEGER, DIMENSION(3) :: iu, is1, is2 ! min/max loc indices
    229  REAL(wp) :: zzz ! local real
    230  REAL(wp), DIMENSION(9) :: zmax
    231  + LOGICAL :: ll_wrtstp, ll_colruns, ll_wrtruns
232 232  CHARACTER(len=20) :: clname
233 233  !!----------------------------------------------------------------------
234 234  !
235      + IF( kt == nit000 ) lcolruns = ln_ctl .OR. ( sn_cfctl%l_config .AND. sn_cfctl%l_runstat )
236      + IF( kt == nit000 ) lwrtruns = lcolruns .AND. lwm
    235  + ll_wrtstp = ( MOD( kt, sn_cfctl%ptimincr ) == 0 ) .OR. ( kt == nitend )
    236  + ll_colruns = ll_wrtstp .AND. ( ln_ctl .OR. sn_cfctl%l_runstat )
    237  + ll_wrtruns = ll_colruns .AND. lwm
237 238  IF( kt == nit000 .AND. lwp ) THEN
238      WRITE(numout,*)
    239  WRITE(numout,*)
239 240  WRITE(numout,*) 'stp_ctl : time-stepping control'
240      @@ -76,7 +78,7 @@
    241  WRITE(numout,*) '~~~~~~~'
241 242  ! ! open time.step file
242 243  IF( lwm ) CALL ctl_opn( numstp, 'time.step', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, lwp, narea )
243 244  - ! ! open run.stat file
244 245  - IF( ln_ctl .AND. lwm ) THEN
245      + IF( lwrtruns ) THEN
    246  + ! ! open run.stat file(s) at start whatever
    247  + ! ! the value of sn_cfctl%ptimincr
    248  + IF( lwm .AND. ( ln_ctl .OR. sn_cfctl%l_runstat ) ) THEN
246 249  CALL ctl_opn( numrun, 'run.stat', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, lwp, narea )
247      clname = 'run.stat.nc'
    250  clname = 'run.stat.nc'
248 251  IF( .NOT. Agrif_Root() ) clname = TRIM(Agrif_CFixed())//"_"//TRIM(clname)
249      @@ -120,12 +122,12 @@
    252  @@ -98,7 +103,7 @@
    253  ENDIF
    254  IF( kt == nit000 ) lsomeoce = COUNT( ssmask(:,:) == 1._wp ) > 0
    255  !
    256  - IF(lwm) THEN !== current time step ==! ("time.step" file)
    257  + IF(lwm .AND. ll_wrtstp) THEN !== current time step ==! ("time.step" file)
    258  WRITE ( numstp, '(1x, i8)' ) kt
    259  REWIND( numstp )
    260  ENDIF
    261  @@ -120,12 +125,12 @@
250 262  zmax(9) = MAXVAL( Cu_adv(:,:,:) , mask = tmask(:,:,:) == 1._wp ) ! cell Courant no. max
251 263  ENDIF
252 264  !
253 265  - IF( ln_ctl ) THEN
254      + IF( lcolruns ) THEN
    266  + IF( ll_colruns ) THEN
255 267  CALL mpp_max( "stpctl", zmax ) ! max over the global domain
256 268  nstop = NINT( zmax(7) ) ! nstop indicator sheared among all local domains
  …   …
258 270  ! !== run statistics ==! ("run.stat" files)
259 271  - IF( ln_ctl .AND. lwm ) THEN
260      + IF( lwrtruns ) THEN
    272  + IF( ll_wrtruns ) THEN
261 273  WRITE(numrun,9500) kt, zmax(1), zmax(2), -zmax(3), zmax(4)
262 274  istatus = NF90_PUT_VAR( idrun, idssh, (/ zmax(1)/), (/kt/), (/1/) )
  …   …
266 278  ''' 4. TOP/prtctl_trc.F90 TOP/trcini.F90 TOP/trcstp.F90 '''
267 279
268      Changes to control the production of tracer.stat follow similar lines with the introduction of a lltrcstat local logical. Note also changes to prtctl_trc.F90 to make mpp.top.output filenames compatible with other similar filenames (i.e. use I4.4 for area number).
    280  Changes to control the production of tracer.stat follow similar lines with the introduction of an ll_trcstat local logical. Note also changes to prtctl_trc.F90 to make mpp.top.output filenames compatible with other similar filenames (i.e. use I4.4 for area number). This one is a little off message because the control for this TOP output is read in OCE/nemogcm.F90. I think this makes sense, and moving it to the TOP namelist just for the sake of keeping OCE and TOP fully independent seems unnecessary; TBD.
269 281  {{{#!diff
270 282  Index: TOP/prtctl_trc.F90
  …   …
309 321  Index: TOP/trcstp.F90
310 322  ===================================================================
311      --- TOP/trcstp.F90 (revision 10459)
    323  --- TOP/trcstp.F90 (revision 10530)
312 324  +++ TOP/trcstp.F90 (working copy)
313      @@ -31,6 +31,7 @@
314      PUBLIC trc_stp ! called by step
315
316      LOGICAL :: llnew ! ???
317      + LOGICAL :: lltrcstat ! ???
318      REAL(wp) :: rdt_sampl ! ???
319      INTEGER :: nb_rec_per_day, ktdcy ! ???
320      REAL(wp) :: rsecfst, rseclast ! ???
321      @@ -67,6 +68,7 @@
    325  @@ -56,6 +56,7 @@
    326  !
    327  INTEGER :: jk, jn ! dummy loop indices
    328  REAL(wp):: ztrai ! local scalar
    329  + LOGICAL :: ll_trcstat ! local logical
    330  CHARACTER (len=25) :: charout !
    331  !!-------------------------------------------------------------------
    332  !
    333  @@ -67,6 +68,8 @@
322 334  r2dttrc = 2. * rdttrc ! = 2 rdttrc (leapfrog)
323 335  ENDIF
324 336  !
325      + IF( kt == nittrc000 ) lltrcstat = ln_ctl .OR. (sn_cfctl%l_config .AND. sn_cfctl%l_trcstat)
    337  + ll_trcstat = ( ln_ctl .OR. sn_cfctl%l_trcstat ) .AND. &
    338  + & ( ( MOD( kt, sn_cfctl%ptimincr ) == 0 ) .OR. ( kt == nitend ) )
326 339  IF( kt == nittrc000 .AND. lk_trdmxl_trc ) CALL trd_mxl_trc_init ! trends: Mixed-layer
327 340  !
328 341  IF( .NOT.ln_linssh ) THEN ! update ocean volume due to ssh temporal evolution
329      @@ -108,7 +110,7 @@
    342  @@ -108,7 +111,7 @@
330 343  !
331 344  ENDIF
332 345  !
333 346  - IF (ln_ctl ) THEN
334      + IF (lltrcstat) THEN
    347  + IF (ll_trcstat) THEN
335 348  ztrai = 0._wp ! content of all tracers
336 349  DO jn = 1, jptra
  …   …
340 353  ''' 5. OCE/LBC/mppini.F90 '''
341 354
342      Control over the creation of multiple layout.dat files is easy to implement and so has been done as an example of how output in this category will be handled in future. For this category, the area subsetting can be used to restrict which areas produce files. The standard layout.dat file, produced by narea = 1 is always produced.
    355  Control over the creation of multiple layout.dat files is easy to implement and so has been done as an example of how output in this category will be handled in future. For this category, the area subsetting can be used to restrict which areas produce files. The standard layout.dat file, written by narea = 1, is always produced. Also, OCE/SBC/sbcfwb.F90 has been fixed so that only one EMPave.dat file is ever read or produced. Ultimately, only one processor should read this file and MPI_BCAST the values to all others as a more scalable solution (for namelists too?).
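As an illustration of the area subsetting described above, the fragment below is a minimal sketch of a namelist_cfg override. The field names are those added to namelist_ref in this change, and ln_ctl is assumed to live in the same namctl block; the values are purely illustrative assumptions. Exactly which areas end up reporting depends on the test inside nemo_set_cfctl, which is not reproduced in full on this page.

{{{
&namctl            ! selective reporting (illustrative values only)
   ln_ctl             = .FALSE.   ! leave the blanket control prints off
   sn_cfctl%l_config  = .TRUE.    ! activate the sn_cfctl settings below
   sn_cfctl%l_runstat = .TRUE.    ! still produce the single run.stat
   sn_cfctl%l_layout  = .TRUE.    ! layout.dat for each selected area
   sn_cfctl%procmin   = 1         ! first area allowed to report
   sn_cfctl%procmax   = 32        ! last area allowed to report
   sn_cfctl%procincr  = 4         ! stride through the area range
   sn_cfctl%ptimincr  = 10        ! refresh run.stat/time.step every 10 steps (and at nitend)
/
}}}

Fields not listed keep their namelist_ref defaults, and the narea = 1 layout.dat is written regardless of the subsetting.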
343 356
344 357  {{{#!diff
  …   …
399 412  DEALLOCATE(iin, ijn, ii_nono, ii_noea, ii_noso, ii_nowe, &
400 413  & iimppt, ijmppt, ibondi, ibondj, ipproc, ipolj, &
    414
    415
    416  Index: OCE/SBC/sbcfwb.F90
    417  ===================================================================
    418  --- OCE/SBC/sbcfwb.F90 (revision 10530)
    419  +++ OCE/SBC/sbcfwb.F90 (working copy)
    420  @@ -143,7 +143,7 @@
    421  qns(:,:) = qns(:,:) - zcoef * sst_m(:,:) * tmask(:,:,1) ! account for change to the heat budget due to fw correction
    422  ENDIF
    423  !
    424  - IF( kt == nitend .AND. lwp ) THEN ! save fwfold value in a file
    425  + IF( kt == nitend .AND. lwm ) THEN ! save fwfold value in a file (only one required)
    426  CALL ctl_opn( inum, 'EMPave.dat', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, .FALSE., narea )
    427  WRITE( inum, "(24X,I8,2ES24.16)" ) nyear, a_fwb_b, a_fwb
    428  CLOSE( inum )
401 429  }}}
402 430
  …   …
423 451  set_namelist namelist_cfg ln_cdgw .true.
424 452  }}}
    453
    454  == Sebastien's comments on the first draft, many of which have already been incorporated into version 2 above: ==
    455  Maybe you know this story, but in case you don’t know… :-)
    456
    457  1) A long time ago - in a galaxy far far away - ... ln_ctl was there (since ever?, at least in rev3) and was a control print of the trend over the first core.
    458
    459  2) At rev258 https://forge.ipsl.jussieu.fr/nemo/changeset/258 the concept was modified and extended to be able to debug mpi reproducibility.
    460  prtctl was added but, to my knowledge, never documented.
    461  The idea is to compare the mean value of every trend in the code over (1) the local mpi domain and (2) the same domain but in a 1 core simulation (or n core as soon as the domain you want to test is included in your mpi subdomain). You compare your mpp.output_xxxx and mono.output_xxxx files and the first difference tells you which core and which trend is the first to diverge.
    462
    463  3) Next, as prtctl was dealing with mpp.output_xxxx and mono.output_xxxx files, we again extended the use of ln_ctl to control the creation of any file through the definition of lwp: https://forge.ipsl.jussieu.fr/nemo/changeset/1579
    464  This was a bad idea as, with 1 variable (ln_ctl), we control 2 different concepts: (1) mpp debug, which requires the production of mpp.output_xxxx and mono.output_xxxx files, and (2) the control of all other output files (ocean.output, run.stat, layout.dat…)
    465
    466  4) For performance reasons, with Eric, we wanted to be able to remove the globsum in stpctl. At some point we thought that we should introduce a new namelist variable, something like in_fast or ln_prod or ln_debug or nn_debug, which would allow us to reduce the printed information and use faster code (for example without globsum in stpctl).
    467  After some discussions, we thought that our idea was not clear enough to be introduced in the code before the release in December. We therefore decided to postpone this and simply use ln_ctl to switch on/off the globsum (because if ln_ctl = T, you don’t care about performance so you can do globsum in stpctl). Once you switch off the globsum, the data in run.stat are useless, so we also switch off run.stat files.
    468
    469
    470  So today, ln_ctl is mixing different functionalities coming from these 4 layers of development.
    471  I think we should take advantage of your development to clean up this mess!
    472
    473  One proposition (to be discussed) could be to split ln_ctl into 3 functionalities:
    474  - mpp debug associated with prtctl. That could be renamed, for example to prttrd or trdprt for "trend print". Or mppdbg for mpp debug? This part requires the creation of the mpp.output_xxxx and mono.output_xxxx files which, to me, should be controlled only by whether prtctl is activated
    475  - something (a logical, an integer, a structure) related to the balance between verbosity-control-debug and performance. This would control/replace/be linked with the creation of run.stat files, nn_print, ln_timing, maybe also layout.dat
    476  - use your sn_ctl to control other files
    477
    478
    479  Other minor points:
    480  - EMPave.dat: I had a quick look but it seems that this file is the same for each core (contains only the year and global mean). So we should not offer the possibility to create EMPave.dat_xxxx files. Either all cores read the same file (as we do for the namelist) or only core 0 reads it and uses mpi_broadcast to send the information to all other cores (as, I think, we should also do for the namelists). '''First part done, will consider use of mpi_bcast in 2019'''
    481  - We have files created by oce, si3 and top. Maybe it is strange if a part of OCE controls the creation of top files like tracer.stat, no? '''To be discussed '''
    482  - instead of having to test sn_cfctl%l_config everywhere, as for example in "sn_cfctl%l_config .AND. sn_cfctl%l_trcstat", I would, in nemo_set_cfctl, do sn_cfctl%l_trcstat = setto .AND. sn_cfctl%l_config ''' Actually the test of l_config isn't needed at the lower levels, removed '''
    483  - there is also the time.step file ''' Not sure we ever want to switch this off but it can now be updated less frequently according to ptimincr '''
    484  - do we really need the sn_ctl structure? Why not a simple list of logicals and integers? ''' To be discussed '''
    485  - I would like to “promote” the use of ln_timing, except in production mode, so people become used to looking at this file to see the main bottlenecks in computational cost and communications… ''' Think this is a separate issue unless multiple files are produced? '''
425 486  == Tests
426 487