Bugs in NEMO-CICE-MEDUSA

Testing on the Met Office CrayXC40, using working copy u-aj777RHdebug and its variants we start by trying a 2-day (2x1-day cycles) set up with array bounds testing.
Using -Rb for both NEMO and CICE code compilation we find:

   * TOP_SRC/MEDUSA/sms_medusa.F90 doesn't compile - ierr is OOB - it needs to be declared with a dimension of 8. 


   * CICE generates:
{{{
  lib-4961 : WARNING 
  Subscript -330 is out of range for dimension 2 for array
  'array_g' at line 2206 in file 'ice_gather_scatter.F90' with bounds 1:332.

   ARRAY_G 2nd index is OOB in;

              msg_buffer(i,j) = ARRAY_G(this_block%i_glob(i)+nghost,&
                                      this_block%j_glob(j)+nghost)
}}}
   Heaven knows why we have a negative number here!?

   This doesn't cause a failure. although subsequent attempts to alter the code did cause failures. It seems failure is dependent on the location of the page of memory you're on versus where you're trying to access. 

   *  We then get a failure in trc_nam_medusa.F90:
{{{
  lib-4213 : UNRECOVERABLE library error 
  A pointer or allocatable array in an I/O list has not been associated
  or allocated.

   Encountered during a namelist WRITE to unit 27
   Fortran unit 27 is connected to a sequential formatted text file:
     "output.namelist.pis"
}}}

   This error message doesn't lead us to any particular statement or line number which is a bit rubbish... the compiler must know but it doesn't bother to tell us. 

   *  trcnam_medusa.F90 has a section where it's initialising variables from the 
natbio namelist. However it initialises jdms_input twice thus...
      jdms_input  = 0
      jdms_input  = 3

   Why?

   jdms_model is not initialised at all - is the 2nd occurrence supposed to refer to that?

   jq10 is not initialised.

   Some variables are declared twice in natbio. e.g. vsed, xhr 

   *  Writing of natbio causes the above error. Suggesting something in that namelist is unset. 
      Skipping that, we get a similar error writing natroam!
      Skip that and natopt seems to be OK but it's the only one of the three namelists that is. 
      The model then goes on to complete (and completes a 2nd 1-day cycle OK). 

   * The code also refers to a namelist named "nammeddia", but we have no such namelist. Our namelists refer to something called "nammedia" (only one "d") Presumably that's a typo. JP says this is not currently used in our configurations (if it was, it would crash looking for a missing namelist!)  


   * Checking job.err, the only warning we have is the one about the ARRAY_G reference in CICE. 
     This is present in both the NRUN and teh CRUN (why wouldn't it be?)


So we have a number of things to do:

   1. Correct ierr dimension to 8 in sms_medusa.F90
   1. Remove duplicate variable declarations in natbio
   1. Ensure missing fields are given default values in natbio
   1. Replace the 2nd occurrence of jdms_input with jdms_model, presumably
   1. Investigate why the namelist writes fail.

Checking with JP, he confirms that jdms_model is what is meant to be used in the above instead of the 2nd occurrence of jdms_input, with a default of 3, and jq10 should be given a default of 1.5. I've applied these to my branch and JP has also applied them to the main MEDUSA_stable branch. Ditto the ierr dimensioning fix. So that should only leave the issues with printing natbio and natroam. 

Further investigation into why printing natbio contents fails reveals that the FRIVER_DEP array is unallocated at time of the attempted write.  natroam seems to have similar issues with other fields. 
It seems that these arrays are only initialised some time after the writing of the namelist contents (if at all) in sms_medusa_alloc. The various arrays involved seem to be dimensioned by jpi, jpj and jpk. 
so it seems doubtful that these would ever be given values by the namelist input file! They'd hvae to be massive (over a million different values in an ORCA1 res). So is there really any point in having these
as part of the namelist definition? 

The fields apparently unallocated in natroam at the time of output appear to be:
{{{
 F3_PH = , 
 F3_H2CO3 = , 
 F3_HCO3 = , 
 F3_CO3 = , 
 F3_OMCAL = , 
 F3_OMARG = , 
 F2_CCD_CAL = , 
 F2_CCD_ARG = 
}}}

So is it important to have any of these in the namelists given that it seems highly unlikely they'll ever be given values from them?

FRIVER_DEP is actually set in trc_ini_medusa. It's unprotected and therefor if anything was supplied vioa a namelist it would be irrelevant. 

F3_PH seems to be initialised in trcini_medusa and set in trcbio_medusa and read in via iom_bet in trcrst.F90.
The same goes for all the other F3 and F2 arrays listed above. SO I don't understand why they're listed in namelists if they don't get their values from there. 

Contacted JP to see if we can remove these things from namelists. He'll get back to me. 


'''CICE OOB error'''

The CICE OOB error arises from trying to access a negative J index in the global array in ice_gather_scatter. That negative index (-331) arises from ice_blocks where for tripole and tripoleT  it looks to see if j_global(j,n) > ny_global and if it is it does:
{{{
     j_global(j,n) = -j_global(j,n)
}}}

That seems crazy.... how can giving it a negative index which is clearly OOB help?

The fact that it doesn't crash seems like pot luck. However trying to point at alternative locations within bounds fails because it either causes the model to diverge (why... this is supposed to be a halo row)
or crashes if I get the index wrong and send it OOB with a POSITIVE value. i.e. there are two completely different behaviours when OOB is encountered - a negative index generates a warning and the code carries on but a positive index causes a crash. JCR says it's the luck of the draw depending on where the OOB pointer refers - if you stay within a page, you're OK but if not you crash. 

So the fact that our original code runs at all is pure blind luck!
This all looks very unsafe but I can't find a way to keep it in bounds when setting up the indexes w/o altering results....!
The only way I can do it is by introducing a check in scatter_global_ext (which is the variant issuing the error) to explicitly exclude negative J indices. That works OK but it seems a bit of a cludge.  



== Coupled models ==

Repeating these tests with a coupled model set up. MS provides u-aj599 as the latest job. This features JP's MEDUSA_Stable branch at revision 7709 which contains '''SOME OF''' the fixes identified above. 
However it does not address the issue with the seemingly needless inclusion of unallocated arrays in namelists.

So my WC is u-ak282debug. This immediately fails on compilation with another medusa related OOB issue in the dimensioning of ierr in oce.F90. However although this is related to MEDUSA the relevant code actually lives in the GO6 package branch. So I've got to create a bug fix branch for that as well as a new one for the up-to-date MEDUSA branch. 

So my branches are: 
   * branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2
   * branches/UKMO/dev_r5518_GO6_package_fixes.

We then have:
 
  * Failure in sms_medusa_alloc - ierr should be dimensioned as 8. JP said he was going to apply this to his branch bus appears not to have done. So I apply it again to branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2
    

  * We then fail at run time because of the unallocated arrays in the namelist writes. Lets fix that by removing those arrays from the namelists since I can see no reason to leave them there.
   (They do not get values from the namelist input and if the output works, the output.namelist.pis file will be so huge that nobody is going to trawl through it!) 
  
  * Also remove duplicate vars from natbio.

  * Interestingly I'm not seeing a report of the CICE OOB error/warning.... even though I'm using 12 PE's N-S - exactly the same as before in the non coupled case. This might suggest it's a field dependent thing relating only to reading in/initialising some input field which is not employed in the coupled case! 

  * This ran OK (1-day!)

  * So we now know for sure that removing those arrays from namelists is a good thing to do. 
  


== Coupled model CRUNs ==

We compare a 2day NRUN with a 2x1-day CRUN in u-ak282debug and u-ak282debugCRUN. 

Inspecting the solver.stat file, this reveals that NMEO seems to diverge on the first TS immediately after the restart i.e. TS 33!
Why?
We have other models (e.g. the stand-alone NEMO-CICE-MEDUSA which don't.

So this could arise from the coupling fields coming from the atmos at TS0, but the atmos dumps are identical at day 1 and checking the most likely contenders (PCO2 and dust) shows they're fine at teh start of day 2. 

Comparing the ocean.output files at the start of day 2 leads us to fishy looking things related to TOP/MEDUSA since we have a message
' New shortwave to sample for TOP at time kt = ' occurring at different timesteps for each run. In the NRUN case this occurs at TS 36 and in the CRUN case it occurs at TS 38!  
This leads us to trcstp.F90 which contains a flaky looking 
bit of code:
{{{
     iseclast = nsec_year + nsec1jan000
     llnew   = ( iseclast - isecfst )  > INT( rdt_sampl )   !   new shortwave to store

     IF( kt /= nittrc000 .AND. llnew ) THEN
          IF( lwp ) WRITE(numout,*) ' New shortwave to sample for TOP at time kt = ', kt, &
             &                      ' time = ', (iseclast+rdt*nn_dttrc/2.)/3600.,'hours '
          isecfst = iseclast
          DO jn = 1, nb_rec_per_days - 1
             qsr_arr(:,:,jn) = qsr_arr(:,:,jn+1)
          ENDDO
          qsr_arr (:,:,nb_rec_per_days) = qsr(:,:)
          qsr_mean(:,:                ) = SUM( qsr_arr(:,:,:), 3 ) / nb_rec_per_days
      ENDIF
}}}
 
That can never give the same results for NRUNS and CRUNS!
This condition arises at the 6th timestep after a restart and every 5 timesteps thereafter. 
So in the restart case it occurs at TS=38, 43, 48, 53 etc while in the NRUN case is arises at TS=36, 41, 46, 51 etc.

This comes from the vn3.6 NEMO_stable branch via the GO6 branch, and is not specific to the MEDUSA branch. But the logic looks completely inconsistent. 

The description says "In coupled mode, the sampling is done at every coupling frequency", which is nothing like what happens at all because, 
with a 45 minute TS and a coupling frequency of 3 hours which we have in our model:
 
   * The first sample is taken at TS 6 after a start i.e. 4.5 hours
   * subsequent samples occur at 3.75 hours not the 3 hours it should do!

I think this should be sampling, consistently at TS = 0, 4, 8.... 32.... (though 32 shouldn't matter in the case of day 1 of the CRUN) or something like that. 


iseclast changes as things progress and is identical in the two runs
isecfst changes as things progress but is different due to the offset at the times it is updated!
nsec_year = 20834550
nsec1jan000 = 0
rdt_sampl = 10800
nb_rec_per_day = 8

So what does this all mean?



Well llnew is true if iseclast - isecfst >  10800

I think that statement should use >= for starters because by that logic if iseclast is 1010800 and isecfst 1000000 then we're at a whole coupling frequency since the previous TRUE condition and should be doing a new calculation - but we don't because (1010800-1000000) = 10800 which is not greater than 10800, it's equal to it!

Try that, but not worth pursuing further since it seems that the things calculated here are for diagnostic purposes only..... except they're also written to the TOP restart file as qsr_mean!
It doesn't look like this is important for model evolution, but I'd better check. 

Checking with AY and JP, they say that qsr_mean IS VERY important but that the value which appears in the TOP restart is not the one which comes from here (so why is this still in the code?)
but is calculated in trc_avg_medusa. This has me confused because the NEMO base code version is definitely being called... is it really getting the value from elewhere?
And when it's calculated elsewhere does it use the same mangled logic as the NEMO base code?

We also find that that switching ln_ctl on causes failure with a reference to a unit number of -1. Bugs aplenty!
This does not happen in the non coupled version, so something to do with MEDUSA coupling specific code?

We need to bear in mind that lwm is only ever set to true on PE0 and is used to control may of the namelist prints (though not all it seems (e.g. output.namelist.pis). 

It looks like this bug comes from my fix to prevent the expensive global tracer sums - I'm still writing to  numstr based on lwp when I should be restricting this write to pe zero
which I could do with lwm, although it's not meant for that, or (if narea==1) or (if numstr > 0) !
These flags are so confusing. 

Fixed this and notified MS. 

Now back to the qsr_mean thing. 

AY says the qsr_mean calculated in trcstp is not the one which goes into the TOP restart file. 

Unfortunately it very definitely IS. 

I've just proved it by overriding the value written by stuffing it full of a fixed value (15.0 as it happens). When I look in the restart_trc files 
qsr_mean contains nothing but 15.0 !

So that nails that for starters - the value appearing in the dump is entirely dependent on the restart frequency (not the frequency of dump writing 
but the frequency of stopping and restarting.)

Now we need to see if this actually makes a difference to the rest of the model evolution or if this is just a stand-alone effect.
So comparing a 2x1-day run with qsr_mean set to 15.0 and a 2x1-day run with it set to 95.0, we compare restart_trc files at day 2. 

The only differences are in the qsr_mean values which are uniformly 15 in one file and 95 in the other. All other fields are identical. So this really is a pointless thing to have in the restarts and is just confusing matters by being no reproducible. Ideally we'd ditch it, especially as the NEMO trunk has subsequently completely rewritten this routine, including correcting the setting of llnew. 

Extensive excruciatingly long winded and painstaking use of glob_sub in NEMO leads us to the fact that tsa diverges after the call to sbc_tsc. 

Examining contributing things in sbc_tsc leads us to sfx (salt flux due to melting/freezing) and qns (non solar heat flux). 

These in turn seems to lead us to sbcice_cice which seems to set these up. That in turn suggests CICE or CICE-NEMO coupling is implicated. 

|Now the XCS has crashed everything, so while waiting for that to be sorted out I look at the differences in CICE branches between the coupled and stand alone cases - and there are several! Far too many for comfort or to ignore:

Coupled model CICE branches;
{{{

branches/dev/frbe/vn5.1.2_sal_dep_frz_fix@1126
branches/dev/frrh/vn5.1.2_tripole_fix@1237

branches/dev/frsy/vn5.1.2_netcdf4_output@1072
branches/dev/hadax/r1205_SIMIP_GSI8@1273
branches/dev/hadci/vn5.1.2_land_suppression@1122
branches/dev/hadjs/vn5.1.2_avoid_NEMO_units@1059
branches/dev/hadjs/vn5.1.2_fix_mpond_init_bug@1065
branches/dev/hadjs/vn5.1.2_sea_ice_salinity@1061
branches/dev/mandrej/vn5.1.2_mpi_update@1243
branches/test/hadjs/vn5.1.2_dragio_change@1152

}}}  
 
Non coupled model CICE branches:
{{{
branches/dev/frbe/vn5.1.2_GSI7_CCSM3_meltponds@1114
branches/dev/frbe/vn5.1.2_sal_dep_frz_fix@1126
branches/dev/frrh/vn5.1.2_tripole_fix@1237
branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070
branches/dev/frsy/vn5.1.2_netcdf4_output@1072
branches/dev/hadax/vn5.1.2_GSI8@1141
branches/dev/hadci/vn5.1.2_land_suppression@1122
branches/dev/hadjs/vn5.1.2_avoid_NEMO_units@1059
branches/dev/hadjs/vn5.1.2_fix_mpond_init_bug@1065
branches/dev/hadjs/vn5.1.2_sea_ice_salinity@1061
}}}
 

So the coupled case has extra branches:
{{{
branches/dev/hadax/r1205_SIMIP_GSI8@1273
branches/dev/mandrej/vn5.1.2_mpi_update@1243
branches/test/hadjs/vn5.1.2_dragio_change@1152
}}}

WHile it does not contain
{{{
branches/dev/frbe/vn5.1.2_GSI7_CCSM3_meltponds@1114
branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070
branches/dev/hadax/vn5.1.2_GSI8@1141
}}}

WHY ???
Why are things different?

Well most obvious thing is the discrepancy between branches/dev/hadax/r1205_SIMIP_GSI8@1273 and branches/dev/hadax/vn5.1.2_GSI8@1141. 
What's the difference here?

branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070 in the stand-alone model seems fair enough. 

Alex says these things should be equivalent but the r1205_SIMIP_GSI8 branch is just the same as vn5.1.2_GSI8 with extra diagnostics in it for CMIP6 purposes.

So I asked why we have 2 separate versions of the same branch and if the r1205_SIMIP_GSI8 one can be safely used in the stand-alone model. He says yes! So why aren't we doing that!? We need to get all this as seamless as we can, not have different branches when we don't need them. He says I SHOULD be able to use this in the stand alone model w/o any issues and get the same results! May try that just to be sure. 

We have branches/dev/frbe/vn5.1.2_GSI7_CCSM3_meltponds@1114 in the stand-alone model only... why?
I cant see from the code that it's something which only applies to stand-alone cases, but maybe it is?  It doesn't seem to be doing anything relevant to restartability. Ed says this shouldn't have any effect in the coupled model but that it could be included safely (need to confirm that in practice). I'd be inclined to have everything from the stand-alone case included regardless (in so far as we are able... there will be different foring related stuff for stand alone, but frankly it should ALL be written so that the various stuff is controllable at run time) so that at least we all work from a common code base
from GO6-GSI8 upwards (I'm not suggesting coupled stuff should be cascaded back down, but there should be no reason why it shouldn't if it's written properly!) 

More global sums lead us eventually to fresh_ai (fresh water flux to ocean) and fsalt_ai (salt flux to ocean) coming out of CICE in cice2nemo! 
These are declared in ice_flux and set in coupling_prep as:
{{{
      ! Store grid box mean albedos and fluxes before scaling by aice
      !----------------------------------------------------------------

            alvdf_ai  (i,j,iblk) = alvdf  (i,j,iblk)
            alidf_ai  (i,j,iblk) = alidf  (i,j,iblk)
            alvdr_ai  (i,j,iblk) = alvdr  (i,j,iblk)
            alidr_ai  (i,j,iblk) = alidr  (i,j,iblk)
            fresh_ai  (i,j,iblk) = fresh  (i,j,iblk)
            fsalt_ai  (i,j,iblk) = fsalt  (i,j,iblk)
            fhocn_ai  (i,j,iblk) = fhocn  (i,j,iblk)
            fswthru_ai(i,j,iblk) = fswthru(i,j,iblk)

}}}

So our actual CICE variables seem to be the nightmarishly named "fresh" ... try searching the code for that! and fsalt which should at least be slightly easier to find!

Ed says that Mireks fix branch ought to be included in everything routinely, regardless of whether it's coupled or not! But its already in the coupled set up but not in the stand-alone!
 
Try adding branches/dev/frsy/vn5.1.2_fix_GSI6_constants_forced_ocean_ice@1070 to coupled job - we don't expect any differences.... and we don't get any!

Back to global sums and write statements in CICE.

Trying to apply thee through my aray bounds fix working copy at /data/local/frrh/cice_br/cice_trunk_vn512_debug causes clashes with branches/dev/mandrej/vn5.1.2_mpi_update@1243
This is because Mirek's branch attempts to solve the OOB error in a different way and clashes with my method. So:
1) Why wasn't this in the stand-alone set-up?
2) Does it really solve the OOB issues I had in the stand-alone?
3) Does it preserve answers?

For now use a WC of Mireks branch for debugging. 
 

So after 2 days more glacial progress we arrive at zqsn and zqin diverging after the call to temperature_changes in thermo_vertical. 
Attempts to put in more write statements just before the call to temperature_changes results in a clash with CICE/branches/dev/hadax/r1205_SIMIP_GSI8. 
So now we're in exactly the same situation I had when trying to debug NEMO before the package branch was created, where I an unable to effectively debug things
due to the extensive list of CICE branches. We have to have a package branch. This cannot be put off any longer.

SO now I'm forced to create my own package branch - that in itself is a huge pain because  r1205_SIMIP_GSI8  is a branch of a branch so I cannot merge that directly into my debug branch, I have to first get hold of its ancestor   CICE/branches/dev/hadax/vn5.1.2_GSI8  ! 

Do this (but don't) commit anything!

Well that was short lived.... I can't merge the two stages of r1205_SIMIP_GSI8 without committing something because the 2nd stage (BOB) of that branch does not share a common ancestor!

SO basically I've got to create a brand new branch.
Then merge in these two stages and commit, then merge all the debugging stuff from my branch somehow! That will take ages to unpick.   
 

OK so I have a partial package branch which i can use for debugging, for now.
CICE/branches/dev/frrh/vn5.1.2_GSI8_pkg_debug with my write statements added via a WC at /data/local/frrh/cice_br/vn5.1.2_GSI8_pkg_debug

Another day of write statements leads us further and further towards variables named iswabsn, iswabs and iswab. This seems to me to be the first thing which diverges. 
But I can't find the point at which it diverges or is even set on the initial TS of the restart. 

Close examination reveals that this variable only has a non zero value on the very first timestep after a restart. Why? It has a sister variable Swabsn, Sswabs,Sswab which never seems to get anything but zeros. I really don't understand that. 

Talking to Alex, he says this is expected and that Iswab SHOULDN'T affect model evolution IN COUPLED cases because it's not used. Really? So  why does it have a non zero value after a restart, where does that value come from and is it really so benign?

If that's true then I've been chasing a red herring for the last 2 days solid and that will lead us back to my CICE A2 differences in zqsn and zqin and fcondbot.which points at the temperature_changes routine. in the bl99 module.   

Well setting it to zero is a fat lot of good... it causes the model to crash! How can that be if it has no bearing on model evolution?

Getting Alex to look at some of my output he notes that my later differences in zqsn and zqin are higher in the restart case which would be consistent with a non zero contribution from iswabs. 

So why am I crashing when I zero it?

He also says that iswabs is dependent on fswint. I have apparently consistent values of fswint throughout the model - it's zero. Which may explain why iswabs ends up as zero on subsequent timesteps but if it's initialised in a different way then we have junk or at least unwanted values.

This takes us back yet again to ice_shortwave and the line
   Iswabs(i,j,k) = fswpen(i,j) * (trantop(i,j)-tranbot(i,j))
in routine absorbed_solar. 

So fswpen, trantop and tranbot definitely contribute to the initial value. Print these out and at the same time initialise Iswabs to zero here and see what happens. 

It crashes again!!!

Why does zeroing a supposedly unimportant field which can legitimately have zero values and does have zero values after the 1st TS cause a crash during initialisation/1st TS?

What the heck is going on?

Well it's because there's an iterative process in temperature_changes which then fails to converge on several PEs if iswabs is zero!

Alex says that to get round this we also need to be zeroing fswint (or preventing it from getting non zero values. Doing that just by commenting out the setting of fswint and iswabs in absobed_solar allows things to run OK and the CICE fields seem generally OK for the first timestep after the restart but we get back to NEMO now and it all diverges again at   "RSRH H tsa temp" in NEMOGCM/NEMO/OPA_SRC/step.F90.

That's immediately after the lbc_lnk I added which comes immediately after the call to tra_qsr ( penetrative solar radiation qsr)
This leads us to OPA_SRC/TRA/traqsr.F90. 

Looking at that there's a bit in there headed      
   IF( lk_qsr_bio .AND. ln_qsr_bio ) THEN      !  bio-model fluxes  : all vertical coordinates  !

Which brings MEDUSA back into play. 

Stephen will test my hack in a GC3 model w/o medusa to see if it solves the general case. 

It won't! I've now traced this back to qsr and qns which come out of cice2nemo so CICE is still not right even after fixing the iswabs thing. What's more, 
on the restart we have qns as zero whereas in the NRUN it has non-zero numbers. qsr is just different!
What a mess. 

Further examionation reveals that the qns zero thing is just because it's starting up. Once the term hyas been added in at 
{{{
      IF (ksbc == jp_purecpl) THEN
         qsr(:,:)= qsr_tot(:,:)
         qns(:,:)= qns_tot(:,:)
      ELSE
         qns(:,:)= qns(:,:)-sprecip(:,:)*Lfresh*(1.0-fr_i(:,:))
      ENDIF
}}}

it bit compares again so it really does seem to be just qsr that's still at fault. 

Further examination reveals that the extra contribution coming from CICE in field fswthru_ai, via ztmp1 is always zero in the NRUN (except on TS 1 when the bug occurs) but not ZERO in the CRUN. 

OK so we explicitly comment out the setting of fswthru in ice_shortwave.F90. 
That seems to sort out the qsr value coming from CICE! (Alex thinks there's some root cause in a variable further back in cice which getting set but which if zero would prevent us having to bodge this.)

However, now we're diverging further on in NEMO at H8 ! Sheesh!. But this indicates any of bdy_tra_dmp, tra_adv (my money's on this), tra_kpp, tra_ldf or dia_ptr. Narrow it down...

I was right about tra_adv this then leads us to zwn, then wn coming out of routine wzv. That then shows differences in fse3t_a and hdivn. 

The first of these seems like it may lead back to CICE yet again... more writes and global sums.   

This leads to domvvl and z_scale in the setting of fse3t_a.

z_scale depends on ssha, sshb, ht_0, ssmask etc. 

What's different in these?

Now we arrive at the computation of ssha in ssh_nxt (sshwzv.F90). But printing global sums of all pertinent input arrays shows no differences except in the initial value of ssha but that's irrelevant since it's completely overwritten in this routine. Which really seems to leave use with the following:
{{{
      IF( neuler == 0 .AND. kt == nit000 )   z2dt = rdt
}}}

But does that really come into play? I thought euler was set to 1 on the first TS after our restart so this condition shouldn't arise???

Add more writes to make certain. Also write out all other pertinent scalars. 

OK well all inputs to this routine are identical (or should be overwritten) and the scalars are indeed OK and teh neuler condition is false as expected. But what we've missed is a call 
to "div_cur" lurking in amongst the apparently benign code. This hides a multitude of horrific possibilities since it does calculations of its own and potentially calls at least 4 other 
subroutines which can modify hdivn with a myriad of other arrays and scalars which are completely invisible. So as usual, having spent days narrowing things down to one or two particular arrays
the whole thing just explodes in our face again with dozens of other potential sources of trouble.  

Next we arrive at sbc_rnf in SBC/sbcrnf.F90 where it seems our rnf_b diverges. 

There's a suspicious looking bit of code in here which does;

{{{
      IF( kt /= nit000 ) THEN                      !          Swap of forcing fields          !
         !                                         ! ---------------------------------------- !
         rnf_b    (:,:  ) = rnf    (:,:  )               ! Swap the ocean forcing fields except at nit000
         rnf_tsc_b(:,:,:) = rnf_tsc(:,:,:)               ! where before fields are set at the end of the routine
         !
      ENDIF
}}}

And later on we have...

{{{

      IF( kt == nit000 ) THEN                          !   set the forcing field at nit000 - 1    !
         !                                             ! ---------------------------------------- !
         IF( ln_rstart .AND.    &                               !* Restart: read in restart file
            & iom_varid( numror, 'rnf_b', ldstop = .FALSE. ) > 0 ) THEN
            IF(lwp) WRITE(numout,*) '          nit000-1 runoff forcing fields red in the restart file'
            CALL iom_get( numror, jpdom_autoglo, 'rnf_b', rnf_b )     ! before runoff
            CALL iom_get( numror, jpdom_autoglo, 'rnf_hc_b', rnf_tsc_b(:,:,jp_tem) )   ! before heat content of runoff
            CALL iom_get( numror, jpdom_autoglo, 'rnf_sc_b', rnf_tsc_b(:,:,jp_sal) )   ! before salinity content of runoff
         ELSE                                                   !* no restart: set from nit000 values
            IF(lwp) WRITE(numout,*) '          nit000-1 runoff forcing fields set to nit000'
            rnf_b    (:,:  ) = rnf    (:,:  )
            rnf_tsc_b(:,:,:) = rnf_tsc(:,:,:)
         ENDIF
      ENDIF
}}}

So it means on every TS except a restart it's setting rnf_b = rnf and rnf_tsc_b = rnf_tsc. 
A restart is the only occasion when these things emerge from sbc_rnf with different values. The rest of the time the _b and standard values are identical. Talking to Tim he agrees this is fishy. He says that the reading of rnf_b from the restart is legitimate because we need the value from the last TS, but our rnf value should be different. So it's the CRUN at TS 33 which is at fault. In fact all non-restart timesteps will have incorrectly identical values for rnf and rnf_b. 

We need to ensure that this swapping is done before the new rnf value is obtained (from the coupling), not after!

This appears to have been in code for a very long time. Tim confirms that it was present in GC2! 

The only reason we're able to identify this (and the CICE restart bug) now is because we now have a restartable atmosphere. Until we had that (some time in the last year) we would never 
have been able to spot these sorts of bugs, w/o some sort of dummy interface. 
     
I've tried knocking up a fix by adding the rnf_b = rnf assignment to sbcmod.F90 with all the other similar TS related shuffling and protecting it with IF (ln_rnf) and protecting the original shuffle in the wrong place as far as coupling is concerned with IF (.NOT.ln_rnf). That works well and we finally have solver.stat agreeing at 2 days. 
However Tim has an even simpler solution which also seems fine: branches/UKMO/dev_r5518_GO6_package_fix_rnf@7782

We need to get this reviewed and included asap. Tim has created a separate ticket #1866 since this is a general NEMO issue, not just something peculiar to MO systems. 

Someone (probably I) also needs to come up with a viable solution to the CICE problem. 

With all these hacks in place we can finally see that although NEMO and CICE NRUN v CRUNS agree at 2 days, the CO2 field coming from MEDUSA to be used in coupling is diverging on day 2. 
But it doesn't affect evolution.... why? MS confirms that in this set up the CO2 field is effectively only a diagnostic, but it will be employed as a prognostic in other runs. 

So we need to get that right! 

Well we already know that MEDUSA diverges in stand alone runs too so it'll be easier to debug in that context probably. 

In fact MEDUSA developers really ought to be best placed to do that.  

I also ought to be updating things to be using the vary latest GO6 and MEDUSA branches since things have moved on a lot even in the ~3 weeks since I started this. 


Poking into the MEDUSA code with glob_sum on key variables leads us to CO2FLUX_out_cpl then zn_co2_flx then fgco2 which diverges on TS 34. This takes us into the massive trcbio_medusa where calculations are done in an enormous loop on a cel by cell basis - so we don't gobal arrays to examine and the numbers of fields involved renders it impossible to examine each scalar. So I have to introduce special new arrays to store scalars of interest and sum them outside the loop for comparison. This leads to the call to mocsy_interface at around line 1800. Examining all declared "input" fields shows diffs in zpho, zdic and zalk. 

The get values from:
{{{
               zpho = max(0.,trn(ji,jj,jk,jpdin)) / 16.0 !! phosphate via DIN and Redfiel              
               zdic = max(0.,trn(ji,jj,jk,jpdic)) !! dissolved inorganic carbon
               zalk = max(0.,trn(ji,jj,jk,jpalk)) !! alkalinity
}}}


So the trn array is implicated - but interesting only some fields in it. 

Examining the zalk trail....

This then leads to btra(jpalk) which leads to fa_prod and fa_cons.

fa_prod leads to f_benout_ca(ji,jj) and fthk and f_riv_alk(ji,jj) and friver_dep(jk,jmbathy)

   * f_benout_ca(ji,jj) leads to xsedca * zn_sed_ca(ji,jj)

fa_cons leads to ftempca

Check trn contents at the start, middle and end of this routine. 

We find trn for jpdin (zpho), jpdic and jpalk are identical atthe start and end of TS33 but are different at teh start of TS34. 

So something happens elsewhere to affect these? Best way to find this might be to search on jpalk.  That's useless because jpalk only exists in that routine and is a PARAMETER set to jp_1m + 14. Brilliant!
Similarly jpdin = jp_1m + 7 and jpdic = jp_1m + 13.