Changes between Version 6 and Version 7 of 2019WP/ENHANCE-04_AndrewC-reporting


Timestamp:
2019-01-16T15:26:45+01:00 (15 months ago)
Author:
acc
Comment:

  • 2019WP/ENHANCE-04_AndrewC-reporting

    v6 v7  
    88 
    99== Summary 
     10 
    1011Investigate ways of improving the code's reporting facilities. Currently, errors away from the lead process are not fully reported and the ln_ctl mechanism produces an overwhelming volume of output for large processor counts. Options are needed to produce more selective output (either by output type or processor range). This is an un-started task from the 2018WP (formerly ROBUST-06_AndrewC-reporting) that has been carried forward to 2019. 
    1112''#2167'' 
    1213 
    1314== Preview  
    14  
     15'''(Version 2, following an initial preview by Sebastien; see his comments below)''' 
    1516{{{#!box help 
    1617[[Include(wiki:Developers/DevProcess#preview_)]] 
     
    2526mpp.output_XXXX 
    2627mpp.top.output_XXX     <---- correct this to 4-digits for consistency 
    27 EMPave.dat_XXXX 
     28EMPave.dat_XXXX        <---- These are all identical, suppress writing to write-master only 
    2829icebergs.stat_XXXX 
    2930}}} 
     
    36373. OCE/stpctl.F90 
    37384. TOP/prtctl_trc.F90 TOP/trcini.F90 TOP/trcstp.F90 
    38 5. OCE/LBC/mppini.F90 
     395. OCE/LBC/mppini.F90  OCE/SBC/sbcfwb.F90 
    39406. sette/sette.sh 
    4041}}} 
     42 
     43To reiterate, these changes allow run.stat and tracer.stat to be produced even when ln_ctl is false. They also introduce additional controls: run.stat, tracer.stat and time.step can be updated at integer multiples of the time step rather than every time step, and the production of some types of output (e.g. layout.dat) can be restricted to a subset of processing regions. There are good arguments for separating out the multiple uses that ln_ctl has grown to support (see Sebastien's comments on its history), but I propose these changes as a quick, pre-release addition. 
     44 
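As a rough illustration of the time step increment described above, the stand-alone sketch below lists the time steps at which the status files would be refreshed. It is not NEMO code: the program name, the run length and the increment value are assumptions chosen for the example, and only the logical expression mirrors the `ll_wrtstp` test in the stpctl.F90 diff further down.

```fortran
PROGRAM ptimincr_sketch
   ! Stand-alone illustration, not part of NEMO.
   ! Lists the time steps at which time.step, run.stat and tracer.stat
   ! would be refreshed for a given increment.
   IMPLICIT NONE
   INTEGER, PARAMETER ::   nit000   = 1    ! first time step (hypothetical run)
   INTEGER, PARAMETER ::   nitend   = 23   ! last time step  (hypothetical run)
   INTEGER, PARAMETER ::   ptimincr = 5    ! stands in for sn_cfctl%ptimincr
   INTEGER ::   kt                         ! time step index
   LOGICAL ::   ll_wrtstp                  ! write status at this step?
   !
   DO kt = nit000, nitend
      ! write on multiples of ptimincr and always on the final step
      ll_wrtstp = ( MOD( kt, ptimincr ) == 0 ) .OR. ( kt == nitend )
      IF( ll_wrtstp )   WRITE(*,'(a,i4)') 'status written at kt = ', kt
   END DO
END PROGRAM ptimincr_sketch
```

With these assumed values the status is written at steps 5, 10, 15, 20 and 23, so the final state is always recorded even when nitend is not a multiple of the increment.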
    4145''' 1. OCE/IOM/in_out_manager.F90 cfgs/SHARED/namelist_ref ''' 
    4246 
     
    4852* l_config                        Activates use of the settings in the rest of the structure (specifically for when ln_ctl is false) 
    4953* l_runstat, l_trcstat            Activates production of global stats files. Only a single file of each of these is ever produced. 
    50 * l_oceout, l_layout, l_EMPave    Normal operation is to produce a single version of each of these. If true then a version for each area is produced 
     54* l_oceout, l_layout              Normal operation is to produce a single version of each of these. If true then a version for each area is produced 
    5155* l_mppout, l_mpptop, l_icbstat   Suppressed if false, otherwise produce a version for each area. 
    5256* procmin, procmax, procincr      Allow subsetting of areas when producing output in the previous two categories. Default values will ensure all areas report. 
     57* ptimincr                        Timestep increment for outputting of time step status information less frequently (affects run.stat, tracer.stat and time.step only) 
    5358}}} 
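For example, a configuration namelist might combine these settings as sketched below. This is only an illustrative fragment: the values are assumptions chosen for the example, only entries that appear in the diffs or the list above are used, and the exact rule by which procmin, procmax and procincr select areas should be checked against the code.

```fortran
&namctl        !   Control prints (illustrative values only)
   ln_ctl             = .false.   ! leave the all-or-nothing switch off
   sn_cfctl%l_config  = .true.    ! use the individual settings below
   sn_cfctl%l_runstat = .true.    ! produce the single global run.stat
   sn_cfctl%l_trcstat = .false.   ! no tracer.stat
   sn_cfctl%l_oceout  = .false.   ! single ocean.output only
   sn_cfctl%l_layout  = .true.    ! layout.dat from each reporting area
   sn_cfctl%l_mppout  = .false.   ! no mpp.output_XXXX files
   sn_cfctl%l_mpptop  = .false.   ! no mpp.top.output_XXXX files
   sn_cfctl%procmin   = 1         ! first area allowed to report
   sn_cfctl%procmax   = 64        ! last area allowed to report
   sn_cfctl%procincr  = 16        ! report from a subset of areas in that range
   sn_cfctl%ptimincr  = 10        ! refresh status files every 10 time steps
/
```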
    5459 
     
    5661Index: OCE/IOM/in_out_manager.F90 
    5762=================================================================== 
    58 --- OCE/IOM/in_out_manager.F90  (revision 10459) 
     63--- OCE/IOM/in_out_manager.F90  (revision 10530) 
    5964+++ OCE/IOM/in_out_manager.F90  (working copy) 
    6065@@ -99,6 +99,27 @@ 
     
    7176+      LOGICAL :: l_oceout  = .FALSE.  !: Produce all ocean.outputs    (T) or just one (F) 
    7277+      LOGICAL :: l_layout  = .FALSE.  !: Produce all layout.dat files (T) or just one (F) 
    73 +      LOGICAL :: l_EMPave  = .FALSE.  !: Produce all EMPave.dat files (T) or just one (F) (if active) 
    7478+      LOGICAL :: l_mppout  = .FALSE.  !: Produce/do not produce mpp.output_XXXX files (T/F) 
    7579+      LOGICAL :: l_mpptop  = .FALSE.  !: Produce/do not produce mpp.top.output_XXXX files (T/F) 
     
    8185+      INTEGER :: procmax   = 1000000  !: Maximum narea to output 
    8286+      INTEGER :: procincr  = 1        !: narea increment to output 
     87+      INTEGER :: ptimincr  = 1        !: timestep increment to output (time.step and run.stat) 
    8388+   END TYPE 
    8489+   TYPE (sn_ctl) :: sn_cfctl     !: run control structure for selective output 
     
    8994Index: cfgs/SHARED/namelist_ref 
    9095=================================================================== 
    91 --- cfgs/SHARED/namelist_ref    (revision 10459) 
     96--- cfgs/SHARED/namelist_ref    (revision 10530) 
    9297+++ cfgs/SHARED/namelist_ref    (working copy) 
    93 @@ -1301,7 +1301,19 @@ 
     98@@ -1303,7 +1303,19 @@ 
    9499 !----------------------------------------------------------------------- 
    95100 &namctl        !   Control prints                                       (default: OFF) 
     
    102107+       sn_cfctl%l_oceout  = .FALSE. ! that  all areas report. 
    103108+       sn_cfctl%l_layout  = .FALSE. ! 
    104 +       sn_cfctl%l_EMPave  = .FALSE. ! 
    105109+       sn_cfctl%l_mppout  = .FALSE. ! 
    106110+       sn_cfctl%l_mpptop  = .FALSE. ! 
     
    109113+       sn_cfctl%procmax   = 1000000 ! Maximum area number for reporting [default:1000000] 
    110114+       sn_cfctl%procincr  = 1       ! Increment for optional subsetting of areas [default:1] 
     115+       sn_cfctl%ptimincr  = 1       ! Timestep increment for writing time step progress info 
    111116    nn_print    =    0      !  level of print (0 no extra print) 
    112117    nn_ictls    =    0      !  start i indice of control sum (use to compare mono versus 
    113118    nn_ictle    =    0      !  end   i indice of control sum        multi processor runs 
     119 
    114120}}} 
    115121 
     
    121127Index: OCE/nemogcm.F90 
    122128=================================================================== 
    123 --- OCE/nemogcm.F90     (revision 10459) 
     129--- OCE/nemogcm.F90     (revision 10530) 
    124130+++ OCE/nemogcm.F90     (working copy) 
    125131@@ -256,8 +256,8 @@ 
     
    134140       NAMELIST/namcfg/ ln_read_cfg, cn_domcfg, ln_closea, ln_write_cfg, cn_domcfg_out, ln_use_jattr 
    135141       !!---------------------------------------------------------------------- 
    136 @@ -327,6 +327,17 @@ 
    137  
     142@@ -327,6 +327,18 @@ 
     143  
    138144       narea = narea + 1                                     ! mynode return the rank of proc (0 --> jpnij -1 ) 
    139  
     145  
    140146+      IF( sn_cfctl%l_config ) THEN 
    141147+         ! Activate finer control of report outputs 
     
    146152+           &   CALL nemo_set_cfctl( sn_cfctl, .FALSE., .FALSE. ) 
    147153+      ELSE 
     154+         ! Use ln_ctl to turn on or off all options. 
    148155+         CALL nemo_set_cfctl( sn_cfctl, ln_ctl, .TRUE. ) 
    149156+      ENDIF 
     
    151158       lwm = (narea == 1)                                    ! control of output namelists 
    152159       lwp = (narea == 1) .OR. ln_ctl                        ! control of all listing output print 
    153  
    154 @@ -489,6 +500,18 @@ 
     160  
     161@@ -503,6 +515,18 @@ 
    155162          WRITE(numout,*) '~~~~~~~~' 
    156163          WRITE(numout,*) '   Namelist namctl' 
     
    161168+         WRITE(numout,*) '                              sn_cfctl%l_oceout  = ', sn_cfctl%l_oceout 
    162169+         WRITE(numout,*) '                              sn_cfctl%l_layout  = ', sn_cfctl%l_layout 
    163 +         WRITE(numout,*) '                              sn_cfctl%l_EMPave  = ', sn_cfctl%l_EMPave 
    164170+         WRITE(numout,*) '                              sn_cfctl%l_mppout  = ', sn_cfctl%l_mppout 
    165171+         WRITE(numout,*) '                              sn_cfctl%l_mpptop  = ', sn_cfctl%l_mpptop 
     
    168174+         WRITE(numout,*) '                              sn_cfctl%procmax   = ', sn_cfctl%procmax   
    169175+         WRITE(numout,*) '                              sn_cfctl%procincr  = ', sn_cfctl%procincr  
     176+         WRITE(numout,*) '                              sn_cfctl%ptimincr  = ', sn_cfctl%ptimincr  
    170177          WRITE(numout,*) '      level of print                  nn_print   = ', nn_print 
    171178          WRITE(numout,*) '      Start i indice for SUM control  nn_ictls   = ', nn_ictls 
    172179          WRITE(numout,*) '      End i indice for SUM control    nn_ictle   = ', nn_ictle 
    173 @@ -635,6 +658,35 @@ 
     180@@ -649,6 +673,34 @@ 
    174181       ! 
    175182    END SUBROUTINE nemo_alloc 
    176  
     183  
    177184+   SUBROUTINE nemo_set_cfctl(sn_cfctl, setto, for_all ) 
    178185+      !!---------------------------------------------------------------------- 
     
    198205+      sn_cfctl%l_oceout  = setto 
    199206+      sn_cfctl%l_layout  = setto 
    200 +      sn_cfctl%l_EMPave  = setto 
    201207+      sn_cfctl%l_mppout  = setto 
    202208+      sn_cfctl%l_mpptop  = setto 
     
    215221 
    216222{{{#!diff 
    217   
    218223Index: OCE/stpctl.F90 
    219224=================================================================== 
    220 --- OCE/stpctl.F90      (revision 10459) 
     225--- OCE/stpctl.F90      (revision 10530) 
    221226+++ OCE/stpctl.F90      (working copy) 
    222 @@ -33,7 +33,7 @@ 
    223     PUBLIC stp_ctl           ! routine called by step.F90 
    224  
    225     INTEGER  ::   idrun, idtime, idssh, idu, ids1, ids2, idt1, idt2, idc1, idw1, istatus 
    226 -   LOGICAL  ::   lsomeoce 
    227 +   LOGICAL  ::   lsomeoce, lcolruns, lwrtruns 
    228     !!---------------------------------------------------------------------- 
    229     !! NEMO/OCE 4.0 , NEMO Consortium (2018) 
    230     !! $Id$ 
    231 @@ -69,6 +69,8 @@ 
     227@@ -66,17 +66,22 @@ 
     228       INTEGER, DIMENSION(3)  ::   iu, is1, is2        ! min/max loc indices 
     229       REAL(wp)               ::   zzz                 ! local real  
     230       REAL(wp), DIMENSION(9) ::   zmax 
     231+      LOGICAL                ::   ll_wrtstp, ll_colruns, ll_wrtruns 
    232232       CHARACTER(len=20) :: clname 
    233233       !!---------------------------------------------------------------------- 
    234234       ! 
    235 +      IF( kt == nit000 )   lcolruns = ln_ctl .OR. ( sn_cfctl%l_config .AND. sn_cfctl%l_runstat ) 
    236 +      IF( kt == nit000 )   lwrtruns = lcolruns .AND. lwm 
     235+      ll_wrtstp  = ( MOD( kt, sn_cfctl%ptimincr ) == 0 ) .OR. ( kt == nitend ) 
     236+      ll_colruns = ll_wrtstp .AND. ( ln_ctl .OR. sn_cfctl%l_runstat ) 
     237+      ll_wrtruns = ll_colruns .AND. lwm 
    237238       IF( kt == nit000 .AND. lwp ) THEN 
    238           WRITE(numout,*)  
     239          WRITE(numout,*) 
    239240          WRITE(numout,*) 'stp_ctl : time-stepping control' 
    240 @@ -76,7 +78,7 @@ 
     241          WRITE(numout,*) '~~~~~~~' 
    241242          !                                ! open time.step file 
    242243          IF( lwm ) CALL ctl_opn( numstp, 'time.step', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, lwp, narea ) 
    243           !                                ! open run.stat file 
     244-         !                                ! open run.stat file 
    244245-         IF( ln_ctl .AND. lwm ) THEN 
    245 +         IF( lwrtruns ) THEN 
     246+         !                                ! open run.stat file(s) at start whatever 
     247+         !                                ! the value of sn_cfctl%ptimincr 
     248+         IF( lwm .AND. ( ln_ctl .OR. sn_cfctl%l_runstat ) ) THEN 
    246249             CALL ctl_opn( numrun, 'run.stat', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, lwp, narea ) 
    247              clname = 'run.stat.nc'  
     250             clname = 'run.stat.nc' 
    248251             IF( .NOT. Agrif_Root() )   clname = TRIM(Agrif_CFixed())//"_"//TRIM(clname) 
    249 @@ -120,12 +122,12 @@ 
     252@@ -98,7 +103,7 @@ 
     253       ENDIF 
     254       IF( kt == nit000 )   lsomeoce = COUNT( ssmask(:,:) == 1._wp ) > 0 
     255       ! 
     256-      IF(lwm) THEN                        !==  current time step  ==!   ("time.step" file) 
     257+      IF(lwm .AND. ll_wrtstp) THEN        !==  current time step  ==!   ("time.step" file) 
     258          WRITE ( numstp, '(1x, i8)' )   kt 
     259          REWIND( numstp ) 
     260       ENDIF 
     261@@ -120,12 +125,12 @@ 
    250262          zmax(9) = MAXVAL(   Cu_adv(:,:,:)   , mask = tmask(:,:,:) == 1._wp ) !       cell Courant no. max 
    251263       ENDIF 
    252264       ! 
    253265-      IF( ln_ctl ) THEN 
    254 +      IF( lcolruns ) THEN 
     266+      IF( ll_colruns ) THEN 
    255267          CALL mpp_max( "stpctl", zmax )          ! max over the global domain 
    256268          nstop = NINT( zmax(7) )                 ! nstop indicator sheared among all local domains 
     
    258270       !                                   !==  run statistics  ==!   ("run.stat" files) 
    259271-      IF( ln_ctl .AND. lwm ) THEN 
    260 +      IF( lwrtruns ) THEN 
     272+      IF( ll_wrtruns ) THEN 
    261273          WRITE(numrun,9500) kt, zmax(1), zmax(2), -zmax(3), zmax(4) 
    262274          istatus = NF90_PUT_VAR( idrun, idssh, (/ zmax(1)/), (/kt/), (/1/) ) 
     
    266278''' 4. TOP/prtctl_trc.F90 TOP/trcini.F90 TOP/trcstp.F90 ''' 
    267279 
    268 Changes to control the production of tracer.stat follow similar lines with the introduction of a lltrcstat local logical. Note also changes to prtctl_trc.F90 to make mpp.top.output filenames compatible  with other similar filenames (i.e. use I4.4 for area number). 
     280Changes to control the production of tracer.stat follow similar lines, with the introduction of a ll_trcstat local logical. Note also the changes to prtctl_trc.F90 that make mpp.top.output filenames consistent with other similar filenames (i.e. use I4.4 for the area number). This one is slightly off message, because the control for this TOP output is read in OCE/nemogcm.F90. I think this makes sense; moving it to the TOP namelist just to keep OCE and TOP fully independent seems unnecessary. TBD. 
    269281{{{#!diff 
    270282Index: TOP/prtctl_trc.F90 
     
    309321Index: TOP/trcstp.F90 
    310322=================================================================== 
    311 --- TOP/trcstp.F90      (revision 10459) 
     323--- TOP/trcstp.F90      (revision 10530) 
    312324+++ TOP/trcstp.F90      (working copy) 
    313 @@ -31,6 +31,7 @@ 
    314     PUBLIC   trc_stp    ! called by step 
    315  
    316     LOGICAL  ::   llnew                   ! ??? 
    317 +   LOGICAL  ::   lltrcstat               ! ??? 
    318     REAL(wp) ::   rdt_sampl               ! ??? 
    319     INTEGER  ::   nb_rec_per_day, ktdcy   ! ??? 
    320     REAL(wp) ::   rsecfst, rseclast       ! ??? 
    321 @@ -67,6 +68,7 @@ 
     325@@ -56,6 +56,7 @@ 
     326       ! 
     327       INTEGER ::   jk, jn   ! dummy loop indices 
     328       REAL(wp)::   ztrai    ! local scalar 
     329+      LOGICAL ::   ll_trcstat ! local logical 
     330       CHARACTER (len=25) ::   charout   ! 
     331       !!------------------------------------------------------------------- 
     332       ! 
     333@@ -67,6 +68,8 @@ 
    322334          r2dttrc = 2. * rdttrc       ! = 2 rdttrc (leapfrog) 
    323335       ENDIF 
    324336       ! 
    325 +      IF( kt == nittrc000 )  lltrcstat = ln_ctl .OR. (sn_cfctl%l_config .AND. sn_cfctl%l_trcstat) 
     337+      ll_trcstat  = ( ln_ctl .OR. sn_cfctl%l_trcstat ) .AND. & 
     338+     &              ( ( MOD( kt, sn_cfctl%ptimincr ) == 0 ) .OR. ( kt == nitend ) ) 
    326339       IF( kt == nittrc000 .AND. lk_trdmxl_trc )  CALL trd_mxl_trc_init    ! trends: Mixed-layer 
    327340       ! 
    328341       IF( .NOT.ln_linssh ) THEN                                           ! update ocean volume due to ssh temporal evolution 
    329 @@ -108,7 +110,7 @@ 
     342@@ -108,7 +111,7 @@ 
    330343          ! 
    331344       ENDIF 
    332345       ! 
    333346-      IF (ln_ctl ) THEN 
    334 +      IF (lltrcstat) THEN 
     347+      IF (ll_trcstat) THEN 
    335348          ztrai = 0._wp                                                   !  content of all tracers 
    336349          DO jn = 1, jptra 
     
    340353''' 5. OCE/LBC/mppini.F90 ''' 
    341354 
    342 Control over the creation of multiple layout.dat files is easy to implement and so has been done as an example of how output in this category will be handled in future. For this category, the area subsetting can be used to restrict which areas produce files. The standard layout.dat file, produced by narea = 1 is always produced. 
     355Control over the creation of multiple layout.dat files is easy to implement and so has been done as an example of how output in this category will be handled in future. For this category, the area subsetting can be used to restrict which areas produce files. The standard layout.dat file, written by narea = 1, is always produced. Also, OCE/SBC/sbcfwb.F90 has been fixed so that only one EMPave.dat file is ever read or produced. Ultimately, only one processor should read this file and MPI_BCAST the values to all others as a more scalable solution (for namelists too?). 
    343356 
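The area subsetting can be pictured with the stand-alone sketch below. It is not the mppini.F90 implementation: the program and the helper function `area_reports` are hypothetical, the parameter values are made up, and the selection rule (areas between procmin and procmax, stepping by procincr from procmin) is an assumption about how the three settings combine.

```fortran
PROGRAM area_subset_sketch
   ! Stand-alone illustration, not part of NEMO.
   ! Shows which areas would write a layout.dat file under an ASSUMED
   ! subsetting rule based on procmin, procmax and procincr.
   IMPLICIT NONE
   INTEGER, PARAMETER ::   jpnij    = 12       ! number of areas (hypothetical)
   INTEGER, PARAMETER ::   procmin  = 3        ! stands in for sn_cfctl%procmin
   INTEGER, PARAMETER ::   procmax  = 10       ! stands in for sn_cfctl%procmax
   INTEGER, PARAMETER ::   procincr = 3        ! stands in for sn_cfctl%procincr
   LOGICAL, PARAMETER ::   l_layout = .TRUE.   ! stands in for sn_cfctl%l_layout
   INTEGER ::   narea
   !
   DO narea = 1, jpnij
      IF( area_reports( narea ) )   WRITE(*,'(a,i4)') 'layout.dat written by area ', narea
   END DO
CONTAINS
   LOGICAL FUNCTION area_reports( karea )
      ! hypothetical helper: .TRUE. if area karea should produce its file
      INTEGER, INTENT(in) ::   karea
      LOGICAL ::   ll_insub
      ! assumed rule: inside [procmin, procmax] and on the procincr stride
      ll_insub = karea >= procmin .AND. karea <= procmax   &
         &       .AND. MOD( karea - procmin, procincr ) == 0
      ! area 1 always writes the standard file
      area_reports = ( karea == 1 ) .OR. ( l_layout .AND. ll_insub )
   END FUNCTION area_reports
END PROGRAM area_subset_sketch
```

With these assumed values, areas 1, 3, 6 and 9 report: area 1 because the standard file is always produced, the others because they fall on the stride within the requested range.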
    344357{{{#!diff 
     
    399412       DEALLOCATE(iin, ijn, ii_nono, ii_noea, ii_noso, ii_nowe,    & 
    400413          &       iimppt, ijmppt, ibondi, ibondj, ipproc, ipolj,   & 
     414 
     415 
     416Index: OCE/SBC/sbcfwb.F90 
     417=================================================================== 
     418--- OCE/SBC/sbcfwb.F90  (revision 10530) 
     419+++ OCE/SBC/sbcfwb.F90  (working copy) 
     420@@ -143,7 +143,7 @@ 
     421             qns(:,:) = qns(:,:) - zcoef * sst_m(:,:) * tmask(:,:,1) ! account for change to the heat budget due to fw correction 
     422          ENDIF 
     423          ! 
     424-         IF( kt == nitend .AND. lwp ) THEN            ! save fwfold value in a file 
     425+         IF( kt == nitend .AND. lwm ) THEN            ! save fwfold value in a file (only one required) 
     426             CALL ctl_opn( inum, 'EMPave.dat', 'REPLACE', 'FORMATTED', 'SEQUENTIAL', -1, numout, .FALSE., narea ) 
     427             WRITE( inum, "(24X,I8,2ES24.16)" ) nyear, a_fwb_b, a_fwb 
     428             CLOSE( inum ) 
    401429}}} 
    402430 
     
    423451     set_namelist namelist_cfg ln_cdgw .true. 
    424452}}} 
     453 
     454== Sebastien's comments on the first draft, many of which have already been incorporated into version 2 above: == 
     455Maybe you know this story, but in case you don’t know… :-) 
     456 
     4571) A long time ago - in a galaxy far far away - ... ln_ctl was there (since forever? at least in rev3) and was a control print of the trend over the first core. 
     458 
     4592) At rev258 https://forge.ipsl.jussieu.fr/nemo/changeset/258 the concept was modified and extended to be able to debug mpi reproducibility.  
     460prtctl was added but, to my knowledge, never documented.  
     461The idea is to compare the mean value of every trend in the code over (1) the local mpi domain and (2) the same domain but in a 1 core simulation (or n cores, as long as the domain you want to test is included in your mpi subdomain). You compare your mpp.output_xxxx and mono.output_xxxx files and the first difference tells you which core and which trend is the first to diverge. 
     462 
     4633) Next, as prtctl was dealing with mpp.output_xxxx and mono.output_xxxx files, we again extended the use of ln_ctl to control the creation of any file through the definition of lwp: https://forge.ipsl.jussieu.fr/nemo/changeset/1579 
     464This was a bad idea as, with 1 variable (ln_ctl), we control 2 different concepts: (1) mpp debug, which requires the production of mpp.output_xxxx and mono.output_xxxx files, and (2) the control of all other output files (ocean.output, run.stat, layout.dat…) 
     465 
     4664) For performance reasons, Eric and I wanted to be able to remove the globsum in stpctl. At some point we thought that we should introduce a new namelist variable, something like in_fast or ln_prod or ln_debug or nn_debug, which would allow us to reduce the printed information and use faster code (for example without globsum in stpctl).  
     467After some discussions, we thought that our idea was not clear enough to be introduced in the code before the release in December. We therefore decided to postpone this and simply use ln_ctl to switch on/off the globsum (because if ln_ctl = T, you don’t care about performance, so you can do the globsum in stpctl). Once you switch off the globsum, the data in run.stat are useless, so we also switch off the run.stat files. 
     468 
     469 
     470So today, ln_ctl is mixing different functionalities coming from these 4 layers of development. 
     471I think we should take advantage of your development to clean-up this mess! 
     472 
     473One proposal (to be discussed) could be to split ln_ctl into 3 functionalities:  
     474 - mpp debug associated with prtctl. That could be renamed, for example to prttrd or trdprt for "trend print". Or mppdbg for mpp debug? This part requires the creation of the mpp.output_xxxx and mono.output_xxxx files which, to me, should be controlled only by whether or not prtctl is activated 
     475 - something (a logical, an integer, a structure) related to the balance between verbosity-control-debug and performance. This would control/replace/be linked with the creation of run.stat files, nn_print, ln_timing maybe also layout.dat 
     476 - use your sn_ctl to control other files  
     477 
     478 
     479Other minor points: 
     480 - EMPave.dat: I had a quick look but it seems that this file is the same for each core (contains only the year and global mean). So we should not offer the possibility to create EMPave.dat_xxxx files. Either all cores read the same file (as we do for the namelist) or only core 0 reads it and uses mpi_broadcast to send the information to all other cores (as, I think, we should also do for the namelists). '''First part done, will consider use of mpi_bcast in 2019''' 
     481 - We have files created by oce, si3 and top. Isn't it strange for a part of OCE to control the creation of top files like tracer.stat? '''To be discussed ''' 
     482 - instead of having to test sn_cfctl%l_config everywhere, as for example in "sn_cfctl%l_config .AND. sn_cfctl%l_trcstat", I would, in nemo_set_cfctl, do sn_cfctl%l_trcstat = setto .AND. sn_cfctl%l_config ''' Actually the test of l_config isn't needed at the lower levels, removed ''' 
     483 - there is also the time.step file ''' Not sure we ever want to switch this off but it can now be updated less frequently according to ptimincr ''' 
     484 - do we really need the sn_ctl structure? Why not a simple list of logicals and integers? ''' To be discussed ''' 
     485 - I would like to “promote” the use of ln_timing, except in production mode, so people get used to looking at this file to see the main bottlenecks in computational cost and communications… ''' Think this is a separate issue unless multiple files are produced? ''' 
    425486== Tests 
    426487