Version 35 (modified by frrh, 4 years ago)

Bugs in NEMO-CICE-MEDUSA

Testing on the Met Office Cray XC40, using working copy u-aj777RHdebug and its variants, we start by trying a 2-day (2x1-day cycles) set-up with array bounds checking. Using -Rb for both NEMO and CICE code compilation, we find:

  • TOP_SRC/MEDUSA/sms_medusa.F90 doesn't compile - ierr is accessed out of bounds (OOB); it needs to be declared with a dimension of 8.
  • CICE generates:
      lib-4961 : WARNING 
      Subscript -330 is out of range for dimension 2 for array
      'array_g' at line 2206 in file 'ice_gather_scatter.F90' with bounds 1:332.
    
     ARRAY_G's 2nd index is OOB in:
    
                  msg_buffer(i,j) = ARRAY_G(this_block%i_glob(i)+nghost,&
                                          this_block%j_glob(j)+nghost)
    
    Heaven knows why we have a negative number here!?

This doesn't cause a failure, although subsequent attempts to alter the code did cause failures. It seems failure depends on where the memory page you're on sits relative to the address you're trying to access.

  • We then get a failure in trc_nam_medusa.F90:
      lib-4213 : UNRECOVERABLE library error 
      A pointer or allocatable array in an I/O list has not been associated
      or allocated.
    
       Encountered during a namelist WRITE to unit 27
       Fortran unit 27 is connected to a sequential formatted text file:
         "output.namelist.pis"
    

This error message doesn't lead us to any particular statement or line number, which is a bit rubbish… the runtime must know, but it doesn't bother to tell us.

  • trcnam_medusa.F90 has a section where it initialises variables from the natbio namelist. However, it initialises jdms_input twice:

      jdms_input = 0
      jdms_input = 3

    Why? jdms_model is not initialised at all - is the 2nd occurrence supposed to refer to that? jq10 is not initialised either. Some variables are also declared twice in natbio, e.g. vsed and xhr.

  • Writing of natbio causes the above error, suggesting something in that namelist is unset.

Skipping that, we get a similar error writing natroam! Skip that too, and natopt seems to be OK - but it's the only one of the three namelists that is. The model then goes on to complete (and completes a 2nd 1-day cycle OK).

  • The code also refers to a namelist named "nammeddia", but we have no such namelist. Our namelists refer to something called "nammedia" (only one "d"). Presumably that's a typo. JP says this is not currently used in our configurations (if it were, it would crash looking for a missing namelist!)
  • Checking job.err, the only warning we have is the one about the ARRAY_G reference in CICE. This is present in both the NRUN and the CRUN (why wouldn't it be?)

So we have a number of things to do:

  1. Correct ierr dimension to 8 in sms_medusa.F90
  2. Remove duplicate variable declarations in natbio
  3. Ensure missing fields are given default values in natbio
  4. Replace the 2nd occurrence of jdms_input with jdms_model, presumably
  5. Investigate why the namelist writes fail.

Checking with JP, he confirms that jdms_model is what is meant to be used in the above instead of the 2nd occurrence of jdms_input, with a default of 3, and jq10 should be given a default of 1.5. I've applied these to my branch and JP has also applied them to the main MEDUSA_stable branch. Ditto the ierr dimensioning fix. So that should only leave the issues with printing natbio and natroam.

Further investigation into why printing natbio contents fails reveals that the FRIVER_DEP array is unallocated at the time of the attempted write. natroam seems to have similar issues with other fields. It seems that these arrays are only initialised some time after the writing of the namelist contents (if at all) in sms_medusa_alloc. The various arrays involved seem to be dimensioned by jpi, jpj and jpk, so it seems doubtful that these would ever be given values by the namelist input file! They'd have to be massive (over a million different values at ORCA1 resolution). So is there really any point in having these as part of the namelist definition?

The fields apparently unallocated in natroam at the time of output appear to be:

 F3_PH = , 
 F3_H2CO3 = , 
 F3_HCO3 = , 
 F3_CO3 = , 
 F3_OMCAL = , 
 F3_OMARG = , 
 F2_CCD_CAL = , 
 F2_CCD_ARG = 

So is it important to have any of these in the namelists given that it seems highly unlikely they'll ever be given values from them?

FRIVER_DEP is actually set in trc_ini_medusa. It's unprotected, and therefore anything supplied via a namelist would be irrelevant.

F3_PH seems to be initialised in trcini_medusa, set in trcbio_medusa and read in via iom_get in trcrst.F90. The same goes for all the other F3 and F2 arrays listed above. So I don't understand why they're listed in namelists if they don't get their values from there.

Contacted JP to see if we can remove these things from namelists. He'll get back to me.

CICE OOB error

The CICE OOB error arises from trying to access a negative J index in the global array in ice_gather_scatter. That negative index (-331) arises from ice_blocks where, for tripole and tripoleT grids, it looks to see if j_global(j,n) > ny_global and, if so, does:

     j_global(j,n) = -j_global(j,n)

That seems crazy… how can giving it a negative index, which is clearly OOB, help?

The fact that it doesn't crash seems like pot luck. However, trying to point at alternative locations within bounds fails: it either causes the model to diverge (why? this is supposed to be a halo row) or crashes if I get the index wrong and send it OOB with a POSITIVE value. I.e. there are two completely different behaviours when OOB access occurs - a negative index generates a warning and the code carries on, but a positive index causes a crash. JCR says it's the luck of the draw depending on where the OOB pointer refers: if you stay within a page you're OK, but if not you crash.

So the fact that our original code runs at all is pure blind luck! This all looks very unsafe, but I can't find a way to keep it in bounds when setting up the indexes without altering results…! The only way I can do it is by introducing a check in scatter_global_ext (which is the variant issuing the error) to explicitly exclude negative J indices. That works OK, but it seems a bit of a kludge.
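The idea behind that check can be sketched in a few lines of Python (an illustrative model only - names and the flat-list layout are mine, and the real code works on CICE blocks with Fortran 1-based indexing): rows whose global J index has been negated by the tripole setup are skipped instead of being used to index the global array.

```python
def gather_rows(array_g, j_glob):
    """Copy rows of a global array into a message buffer, skipping
    halo rows that the tripole grid marks with a negative J index.

    array_g : list of rows (1-based J index, as in Fortran)
    j_glob  : global J index for each local row; negative = tripole halo
    """
    buffer = []
    for j in j_glob:
        if j < 0:
            # tripole marker: index is out of bounds, so never touch array_g
            buffer.append(None)
        else:
            buffer.append(array_g[j - 1])  # convert 1-based index to 0-based
    return buffer

print(gather_rows(["row1", "row2", "row3"], [1, -2, 3]))  # -> ['row1', None, 'row3']
```

The point is simply that the negative index is a flag, not a location, so the safe thing is to exclude it before the array access rather than rely on the page-boundary lottery described above.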

Coupled models

Repeating these tests with a coupled model set-up. MS provides u-aj599 as the latest job. This features JP's MEDUSA_Stable branch at revision 7709, which contains SOME OF the fixes identified above. However, it does not address the seemingly needless inclusion of unallocated arrays in namelists.

So my WC is u-ak282debug. This immediately fails on compilation with another MEDUSA-related OOB issue in the dimensioning of ierr in oce.F90. However, although this is related to MEDUSA, the relevant code actually lives in the GO6 package branch. So I've got to create a bug-fix branch for that, as well as a new one for the up-to-date MEDUSA branch.

So my branches are:

  • branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2
  • branches/UKMO/dev_r5518_GO6_package_fixes.

We then have:

  • Failure in sms_medusa_alloc - ierr should be dimensioned as 8. JP said he was going to apply this to his branch but appears not to have done. So I apply it again to branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2

  • We then fail at run time because of the unallocated arrays in the namelist writes. Let's fix that by removing those arrays from the namelists, since I can see no reason to leave them there. (They do not get values from the namelist input, and if the output works, the output.namelist.pis file will be so huge that nobody is going to trawl through it!)

  • Also remove duplicate vars from natbio.
  • Interestingly, I'm not seeing a report of the CICE OOB error/warning… even though I'm using 12 PEs N-S - exactly the same as before in the non-coupled case. This might suggest it's a field-dependent thing, relating only to reading in/initialising some input field which is not employed in the coupled case!
  • This ran OK (1-day!)
  • So we now know for sure that removing those arrays from namelists is a good thing to do.

Coupled model CRUNs

We compare a 2day NRUN with a 2x1-day CRUN in u-ak282debug and u-ak282debugCRUN.

Inspecting the solver.stat file reveals that NEMO seems to diverge on the first TS immediately after the restart, i.e. TS 33! Why? We have other models (e.g. the stand-alone NEMO-CICE-MEDUSA) which don't.

So this could arise from the coupling fields coming from the atmos at TS0, but the atmos dumps are identical at day 1, and checking the most likely contenders (PCO2 and dust) shows they're fine at the start of day 2.

Comparing the ocean.output files at the start of day 2 leads us to fishy-looking things related to TOP/MEDUSA, since we have a message ' New shortwave to sample for TOP at time kt = ' occurring at different timesteps in each run. In the NRUN case this occurs at TS 36 and in the CRUN case at TS 38! This leads us to trcstp.F90, which contains a flaky-looking bit of code:

     iseclast = nsec_year + nsec1jan000
     llnew   = ( iseclast - isecfst )  > INT( rdt_sampl )   !   new shortwave to store

     IF( kt /= nittrc000 .AND. llnew ) THEN
          IF( lwp ) WRITE(numout,*) ' New shortwave to sample for TOP at time kt = ', kt, &
             &                      ' time = ', (iseclast+rdt*nn_dttrc/2.)/3600.,'hours '
          isecfst = iseclast
          DO jn = 1, nb_rec_per_days - 1
             qsr_arr(:,:,jn) = qsr_arr(:,:,jn+1)
          ENDDO
          qsr_arr (:,:,nb_rec_per_days) = qsr(:,:)
          qsr_mean(:,:                ) = SUM( qsr_arr(:,:,:), 3 ) / nb_rec_per_days
      ENDIF

That can never give the same results for NRUNS and CRUNS! This condition arises at the 6th timestep after a restart and every 5 timesteps thereafter. So in the restart case it occurs at TS=38, 43, 48, 53 etc while in the NRUN case is arises at TS=36, 41, 46, 51 etc.
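As an aside, the shift-and-average itself (the last four lines of the block above) is doing something simple; here is a Python sketch using a flat list in place of the 2-D qsr_arr fields (the names mirror the Fortran, the list layout is mine):

```python
def update_qsr_mean(qsr_arr, qsr_new):
    """Drop the oldest shortwave sample, append the newest, and recompute
    the running mean over the whole buffer - mirroring the qsr_arr /
    qsr_mean update in trcstp.F90, with scalars standing in for fields."""
    qsr_arr = qsr_arr[1:] + [qsr_new]          # shift buffer, append new sample
    qsr_mean = sum(qsr_arr) / len(qsr_arr)     # mean over nb_rec_per_days samples
    return qsr_arr, qsr_mean

print(update_qsr_mean([1.0, 2.0, 3.0, 4.0], 5.0))  # -> ([2.0, 3.0, 4.0, 5.0], 3.5)
```

So the arithmetic is unremarkable; the problem is entirely in WHEN the update is triggered, as described above.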

This comes from the vn3.6 NEMO_stable branch via the GO6 branch, and is not specific to the MEDUSA branch. But the logic looks completely inconsistent.

The description says "In coupled mode, the sampling is done at every coupling frequency", which is nothing like what happens at all, because with a 45-minute TS and a coupling frequency of 3 hours, which we have in our model:

  • The first sample is taken at TS 6 after a start, i.e. 4.5 hours
  • subsequent samples occur every 3.75 hours, not the 3 hours they should!

I think this should be sampling consistently at TS = 0, 4, 8… 32… (though 32 shouldn't matter in the case of day 1 of the CRUN), or something like that.

iseclast changes as things progress and is identical in the two runs.
isecfst changes as things progress but is different, due to the offset at the times it is updated!

nsec_year = 20834550
nsec1jan000 = 0
rdt_sampl = 10800
nb_rec_per_days = 8

So what does this all mean?

Well, llnew is true if iseclast - isecfst > 10800.

I think that statement should use >= for starters: by that logic, if iseclast is 1010800 and isecfst is 1000000, then we're at a whole coupling interval since the previous TRUE condition and should be doing a new calculation - but we don't, because 1010800 - 1000000 = 10800, which is not greater than 10800, it's equal to it!

Try that, but it's not worth pursuing much further, since the things calculated here seem to be for diagnostic purposes only… except they're also written to the TOP restart file as qsr_mean! It doesn't look like this is important for model evolution, but I'd better check.
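The drift can be reproduced with a small Python model of the trigger condition (a sketch under assumptions: 45-minute timestep so rdt = 2700 s, rdt_sampl = 10800 s, and isecfst seeded at the first timestep of each run; the real code works in seconds-of-year):

```python
def sample_steps(first_ts, last_ts, strict=True, rdt=2700, rdt_sampl=10800):
    """Return the timesteps at which a 'new shortwave' sample is taken.

    strict=True  mimics the original test: (iseclast - isecfst) >  rdt_sampl
    strict=False mimics the proposed fix:  (iseclast - isecfst) >= rdt_sampl
    """
    isecfst = first_ts * rdt          # seeded at the first step of this run
    steps = []
    for kt in range(first_ts + 1, last_ts + 1):
        iseclast = kt * rdt
        diff = iseclast - isecfst
        trigger = diff > rdt_sampl if strict else diff >= rdt_sampl
        if trigger:
            steps.append(kt)
            isecfst = iseclast        # remember when we last sampled
    return steps

# Original '>' logic: the NRUN and a CRUN restarted at TS 33 drift apart
print(sample_steps(1, 48))                 # [6, 11, ..., 36, 41, 46] - every 5 TS
print(sample_steps(33, 48))                # [38, 43, 48]

# '>=' logic: samples every 4 TS (exactly 3 hours) and the runs realign
print(sample_steps(1, 48, strict=False))   # [5, 9, ..., 37, 41, 45]
print(sample_steps(33, 48, strict=False))  # [37, 41, 45]
```

With >, the sampling interval is 5 timesteps (3.75 hours) and the sample times depend on where the run started; with >=, it is 4 timesteps (exactly 3 hours) and a restart at TS 33 falls back into step with the NRUN.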

Checking with AY and JP, they say that qsr_mean IS VERY important, but that the value which appears in the TOP restart is not the one which comes from here (so why is this still in the code?) - it is calculated in trc_avg_medusa. This has me confused, because the NEMO base code version is definitely being called… is it really getting the value from elsewhere? And when it's calculated elsewhere, does it use the same mangled logic as the NEMO base code?

We also find that switching ln_ctl on causes a failure with a reference to a unit number of -1. Bugs aplenty! This does not happen in the non-coupled version, so is it something to do with MEDUSA coupling-specific code?

We need to bear in mind that lwm is only ever set to true on PE0 and is used to control many of the namelist prints (though not all, it seems - e.g. output.namelist.pis).

It looks like this bug comes from my fix to prevent the expensive global tracer sums - I'm still writing to numstr based on lwp, when I should be restricting this write to PE zero. I could do that with lwm (although it's not meant for that), or with (narea == 1), or with (numstr > 0)! These flags are so confusing.

Fixed this and notified MS.

Now back to the qsr_mean thing.

AY says the qsr_mean calculated in trcstp is not the one which goes into the TOP restart file.

Unfortunately it very definitely IS.

I've just proved it by overriding the value written by stuffing it full of a fixed value (15.0 as it happens). When I look in the restart_trc files qsr_mean contains nothing but 15.0 !

So that nails that for starters - the value appearing in the dump is entirely dependent on the restart frequency (not the frequency of dump writing but the frequency of stopping and restarting.)

Now we need to see if this actually makes a difference to the rest of the model evolution or if this is just a stand-alone effect. So comparing a 2x1-day run with qsr_mean set to 15.0 and a 2x1-day run with it set to 95.0, we compare restart_trc files at day 2.

The only differences are in the qsr_mean values, which are uniformly 15 in one file and 95 in the other. All other fields are identical. So this really is a pointless thing to have in the restarts, and it just confuses matters by being non-reproducible. Ideally we'd ditch it, especially as the NEMO trunk has subsequently completely rewritten this routine, including correcting the setting of llnew.

Extensive, excruciatingly long-winded and painstaking use of glob_sum in NEMO leads us to the fact that tsa diverges after the call to sbc_tsc.

Examining contributing things in sbc_tsc leads us to sfx (salt flux due to melting/freezing) and qns (non solar heat flux).

These in turn seem to lead us to sbcice_cice, which seems to set them up. That in turn suggests CICE or CICE-NEMO coupling is implicated.

Now the XCS has crashed everything, so while waiting for that to be sorted out, I look at the differences in CICE branches between the coupled and stand-alone cases - and there are several! Far too many for comfort or to ignore:

Coupled model CICE branches:

branches/dev/frbe/vn5.1.2_sal_dep_frz_fix@1126
branches/dev/frrh/vn5.1.2_tripole_fix@1237
branches/dev/frsy/vn5.1.2_netcdf4_output@1072
branches/dev/hadax/r1205_SIMIP_GSI8@1273
branches/dev/hadci/vn5.1.2_land_suppression@1122
branches/dev/hadjs/vn5.1.2_avoid_NEMO_units@1059
branches/dev/hadjs/vn5.1.2_fix_mpond_init_bug@1065
branches/dev/hadjs/vn5.1.2_sea_ice_salinity@1061
branches/dev/mandrej/vn5.1.2_mpi_update@1243
branches/test/hadjs/vn5.1.2_dragio_change@1152

Non coupled model CICE branches:

branches/dev/frbe/vn5.1.2_GSI7_CCSM3_meltponds@1114
branches/dev/frbe/vn5.1.2_sal_dep_frz_fix@1126
branches/dev/frrh/vn5.1.2_tripole_fix@1237
branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070
branches/dev/frsy/vn5.1.2_netcdf4_output@1072
branches/dev/hadax/vn5.1.2_GSI8@1141
branches/dev/hadci/vn5.1.2_land_suppression@1122
branches/dev/hadjs/vn5.1.2_avoid_NEMO_units@1059
branches/dev/hadjs/vn5.1.2_fix_mpond_init_bug@1065
branches/dev/hadjs/vn5.1.2_sea_ice_salinity@1061

So the coupled case has extra branches:

branches/dev/hadax/r1205_SIMIP_GSI8@1273
branches/dev/mandrej/vn5.1.2_mpi_update@1243
branches/test/hadjs/vn5.1.2_dragio_change@1152

While it does not contain:

branches/dev/frbe/vn5.1.2_GSI7_CCSM3_meltponds@1114
branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070
branches/dev/hadax/vn5.1.2_GSI8@1141

WHY??? Why are things so radically different?

Well, the most obvious thing is the discrepancy between branches/dev/hadax/r1205_SIMIP_GSI8@1273 and branches/dev/hadax/vn5.1.2_GSI8@1141. What's the difference here?