Version 23 (modified by frrh, 8 years ago) (diff) |
---|
Bugs in NEMO-CICE-MEDUSA
Testing on the Met Office CrayXC40, using working copy u-aj777RHdebug and its variants we start by trying a 2-day (2x1-day cycles) set up with array bounds testing. Using -Rb for both NEMO and CICE code compilation we find:
- TOP_SRC/MEDUSA/sms_medusa.F90 doesn't compile - ierr is OOB - it needs to be declared with a dimension of 8.
- CICE generates:
lib-4961 : WARNING Subscript -330 is out of range for dimension 2 for array 'array_g' at line 2206 in file 'ice_gather_scatter.F90' with bounds 1:332. ARRAY_G 2nd index is OOB in; msg_buffer(i,j) = ARRAY_G(this_block%i_glob(i)+nghost,& this_block%j_glob(j)+nghost)
Heaven knows why we have a negative number here!?
This doesn't cause a failure because its an OOB read. I suspect it would cause a failure if it was an OOB write.
- We then get a failure in trc_nam_medusa.F90:
lib-4213 : UNRECOVERABLE library error A pointer or allocatable array in an I/O list has not been associated or allocated. Encountered during a namelist WRITE to unit 27 Fortran unit 27 is connected to a sequential formatted text file: "output.namelist.pis"
This error message doesn't lead us to any particular statement or line number which is a bit rubbish... the compiler must know but it doesn't bother to tell us.
- trcnam_medusa.F90 has a section where it's initialising variables from the
natbio namelist. However it initialises jdms_input twice thus...
jdms_input = 0 jdms_input = 3
Why?
jdms_model is not initialised at all - is the 2nd occurrence supposed to refer to that?
jq10 is not initialised.
Some variables are declared twice in natbio. e.g. vsed, xhr
- Writing of natbio causes the above error. Suggesting something in that namelist is unset.
Skipping that, we get a similar error writing natroam! Skip that and natopt seems to be OK but it's the only one of the three namelists that is. The model then goes on to complete (and completes a 2nd 1-day cycle OK).
- The code also refers to a namelist named "nammeddia", but we have no such namelist. Our namelists refer to something called "nammedia" (only one "d") Presumably that's a typo. JP says this is not currently used in our configurations (if it was, it would crash looking for a missing namelist!)
- Checking job.err, the only warning we have is the one about the ARRAY_G reference in CICE. This is present in both the NRUN and teh CRUN (why wouldn't it be?)
So we have a number of things to do:
- Correct ierr dimension to 8 in sms_medusa.F90
- Remove duplicate variable declarations in natbio
- Ensure missing fields are given default values in natbio
- Replace the 2nd occurrence of jdms_input with jdms_model, presumably
- Investigate why the namelist writes fail.
Checking with JP, he confirms that jdms_model is what is meant to be used in the above instead of the 2nd occurrence of jdms_input, with a default of 3, and jq10 should be given a default of 1.5. I've applied these to my branch and JP has also applied them to the main MEDUSA_stable branch. Ditto the ierr dimensioning fix. So that should only leave the issues with printing natbio and natroam.
Further investigation into why printing natbio contents fails reveals that the FRIVER_DEP array is unallocated at time of the attempted write. natroam seems to have similar issues with other fields. It seems that these arrays are only initialised some time after the writing of the namelist contents (if at all) in sms_medusa_alloc. The various arrays involved seem to be dimensioned by jpi, jpj and jpk. so it seems doubtful that these would ever be given values by the namelist input file! They'd hvae to be massive (over a million different values in an ORCA1 res). So is there really any point in having these as part of the namelist definition?
The fields apparently unallocated in natroam at the time of output appear to be:
F3_PH = , F3_H2CO3 = , F3_HCO3 = , F3_CO3 = , F3_OMCAL = , F3_OMARG = , F2_CCD_CAL = , F2_CCD_ARG =
So is it important to have any of these in the namelists given that it seems highly unlikely they'll ever be given values from them?
FRIVER_DEP is actually set in trc_ini_medusa. It's unprotected and therefor if anything was supplied vioa a namelist it would be irrelevant.
F3_PH seems to be initialised in trcini_medusa and set in trcbio_medusa and read in via iom_bet in trcrst.F90. The same goes for all the other F3 and F2 arrays listed above. SO I don't understand why they're listed in namelists if they don't get their values from there.
Contacted JP to see if we can remove these things from namelists. He'll get back to me.
CICE OOB error
The CICE OOB error arises from trying to access a negative J index in the global array in ice_gather_scatter. That negative index (-331) arises from ice_blocks where for tripole and tripoleT it looks to see if j_global(j,n) > ny_global and if it is it does:
j_global(j,n) = -j_global(j,n)
That seems crazy.... how can giving it a negative index which is clearly OOB help?
The fact that it doesn't crash seems like pot luck. However trying to point at alternative locations within bounds fails because it either causes the model to diverge (why... this is supposed to be a halo row) or crashes if I get the index wrong and send it OOB with a POSITIVE value. i.e. there are two completely different behaviours when OOB is encountered - a negative index generates a warning and the code carries on but a positive index causes a crash. JCR says it's the luck of the draw depending on where the OOB pointer refers - if you stay within a page, you're OK but if not you crash.
So the fact that our original code runs at all is pure blind luck! This all looks very unsafe but I can't find a way to keep it in bounds when setting up the indexes w/o altering results....! The only way I can do it is by introducing a check in scatter_global_ext (which is the variant issuing the error) to explicitly exclude negative J indices. That works OK but it seems a bit of a cludge.
Coupled models
Repeating these tests with a coupled model set up. MS provides u-aj599 as the latest job. This features JP's MEDUSA_Stable branch at revision 7709 which contains SOME OF the fixes identified above. However it does not address the issue with the seemingly needless inclusion of unallocated arrays in namelists.
So my WC is u-ak282debug. This immediately fails on compilation with another medusa related OOB issue in the dimensioning of ierr in oce.F90. However although this is related to MEDUSA the relevant code actually lives in the GO6 package branch. So I've got to create a bug fix branch for that as well as a new one for the up-to-date MEDUSA branch.
So my branches are:
- branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2
- branches/UKMO/dev_r5518_GO6_package_fixes.
We then have:
- Failure in sms_medusa_alloc - ierr should be dimensioned as 8. JP said he was going to apply this to his branch bus appears not to have done. So I apply it again to branches/UKMO/dev_r5518_MEDUSA_stable_fixes_part2
- We then fail at run time because of the unallocated arrays in the namelist writes. Lets fix that by removing those arrays from the namelists since I can see no reason to leave them there. (They do not get values from the namelist input and if the output works, the output.namelist.pis file will be so huge that nobody is going to trawl through it!)
- Also remove duplicate vars from natbio.
- Interestingly I'm not seeing a report of the CICE OOB error/warning.... even though I'm using 12 PE's N-S - exactly the same as before in the non coupled case. This might suggest it's a field dependent thing relating only to reading in/initialising some input field which is not employed in the coupled case!
- This ran OK (1-day!)
- So we now know for sure that removing those arrays from namelists is a good thing to do.
Coupled model CRUNs
We compare a 2day NRUN with a 2x1-day CRUN in u-ak282debug and u-ak282debugCRUN.
Inspecting the solver.stat file, this reveals that NMEO seems to diverge on the first TS immediately after the restart i.e. TS 33! Why? We have other models (e.g. the stand-alone NEMO-CICE-MEDUSA which don't.
So this could arise from the coupling fields coming from the atmos at TS0, but the atmos dumps are identical at day 1 and checking the most likely contenders (PCO2 and dust) shows they're fine at teh start of day 2.
Comparing the ocean.output files at the start of day 2 leads us to fishy looking things related to TOP/MEDUSA since we have a message ' New shortwave to sample for TOP at time kt = ' occurring at different timesteps for each run. In the NRUN case this occurs at TS 36 and in the CRUN case it occurs at TS 38! This leads us to trcstp.F90 which contains a flaky looking bit of code:
iseclast = nsec_year + nsec1jan000 llnew = ( iseclast - isecfst ) > INT( rdt_sampl ) ! new shortwave to store IF( kt /= nittrc000 .AND. llnew ) THEN IF( lwp ) WRITE(numout,*) ' New shortwave to sample for TOP at time kt = ', kt, & & ' time = ', (iseclast+rdt*nn_dttrc/2.)/3600.,'hours ' isecfst = iseclast DO jn = 1, nb_rec_per_days - 1 qsr_arr(:,:,jn) = qsr_arr(:,:,jn+1) ENDDO qsr_arr (:,:,nb_rec_per_days) = qsr(:,:) qsr_mean(:,: ) = SUM( qsr_arr(:,:,:), 3 ) / nb_rec_per_days ENDIF
That can never give the same results for NRUNS and CRUNS! This condition arises at the 6th timestep after a restart and every 5 timesteps thereafter. So in the restart case it occurs at TS=38, 43, 48, 53 etc while in the NRUN case is arises at TS=36, 41, 46, 51 etc.
This comes from the vn3.6 NEMO_stable branch via the GO6 branch, and is not specific to the MEDUSA branch. But the logic looks completely inconsistent.
The description says "In coupled mode, the sampling is done at every coupling frequency", which is nothing like what happens at all because, with a 45 minute TS and a coupling frequency of 3 hours which we have in our model:
- The first sample is taken at TS 6 after a start i.e. 4.5 hours
- subsequent samples occur at 3.75 hours not the 3 hours it should do!
I think this should be sampling, consistently at TS = 0, 4, 8.... 32.... (though 32 shouldn't matter in the case of day 1 of the CRUN) or something like that.
iseclast changes as things progress and is identical in the two runs isecfst changes as things progress but is different due to the offset at the times it is updated! nsec_year = 20834550 nsec1jan000 = 0 rdt_sampl = 10800 nb_rec_per_day = 8
So what does this all mean?
Well isecfst is only set on TS=1. So it never changes and it's value is always the number of seconds of the timestep of TS=1 relative to some reference date.
So llnew is true if iseclast - isecfst > 10800
I think that statement should use >= for starters.
Try that, but not worth pursuing further since it seems that the things calculated here are for diagnostic purposes only.