#1727 closed Bug (fixed)
NEMO-CICE continuation runs do not give the same results as normal runs over the same period.
Reported by: | frrh | Owned by: | frrh |
---|---|---|---|
Priority: | low | Milestone: | 2015 release-3.6 |
Component: | OCE | Version: | v3.6 |
Severity: | | Keywords: | 2015 OPA v3.6 |
Cc: | | | |
Description
Context
NEMO-CICE stand-alone models have been observed to produce different results depending on whether they are run as complete end-to-end runs (NRUNs) for the entire run length or in chunks, restarting at regular intervals (CRUNs).
Furthermore, two NRUNs of identical length have been observed to give different results depending on the processor configuration in use.
Analysis
It seems to me that the behaviour exhibited is likely to be due to two different artefacts of the same bug.
The point at which results diverge is completely unpredictable since it appears to be data (and PE configuration) dependent.
Fix
https://forge.ipsl.jussieu.fr/nemo/browser/branches/UKMO/dev_r5518_restart_fix
Commit History (0)
(No commits)
Change History (9)
comment:1 Changed 8 years ago by frrh
comment:2 Changed 8 years ago by frrh
- Owner changed from NEMO team to frrh
comment:3 Changed 8 years ago by frrh
Protracted debugging work suggests that the array tsb needs a call to lbc_lnk somewhere near the start of the main loop in step.F90.
It appears to be immaterial whether this call comes before or after the call to SBC, which is interesting in itself because it suggests the problem is not necessarily linked to the presence of CICE; it could be a generic NEMO issue.
I find that with such a call in place I can achieve complete bit comparison between a 6x6 PE 10-day NRUN, a 12x9 PE 10-day NRUN and a 6x6 PE 10 x 1-day CRUN.
I would suggest this fix be checked in other configurations, including suitably long NEMO-only (i.e. non-CICE) models in suitably stressful, different PE decompositions (i.e. not merely using simple multiples of the original PE count in each direction for comparison).
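To illustrate why a missing halo update can make results depend on the PE decomposition, here is a minimal toy sketch in Python. It is not NEMO code: the 1D periodic domain, the three-point stencil and all function names (`split`, `exchange_halos`, `run`, etc.) are assumptions made for illustration, with `exchange_halos` standing in loosely for the role lbc_lnk plays.

```python
def step_global(field):
    """One smoothing step on the undivided periodic domain."""
    n = len(field)
    return [(field[(i - 1) % n] + field[i] + field[(i + 1) % n]) / 3.0
            for i in range(n)]


def split(field, nparts):
    """Split into equal subdomains, each padded with one halo cell per side."""
    size = len(field) // nparts
    return [[0.0] + field[p * size:(p + 1) * size] + [0.0]
            for p in range(nparts)]


def exchange_halos(parts):
    """Toy halo update: copy each neighbour's edge interior value into the halo."""
    n = len(parts)
    for p in range(n):
        parts[p][0] = parts[(p - 1) % n][-2]   # left halo
        parts[p][-1] = parts[(p + 1) % n][1]   # right halo


def step_parts(parts):
    """Apply the same stencil to the interior of every subdomain (halos untouched)."""
    return [[sub[0]]
            + [(sub[i - 1] + sub[i] + sub[i + 1]) / 3.0
               for i in range(1, len(sub) - 1)]
            + [sub[-1]]
            for sub in parts]


def gather(parts):
    """Reassemble the global field from subdomain interiors."""
    return [v for sub in parts for v in sub[1:-1]]


def run(field, nparts, nsteps, refresh_halos):
    """Run nsteps on nparts subdomains, optionally refreshing halos every step."""
    parts = split(field, nparts)
    exchange_halos(parts)            # halos are valid at the very start
    for _ in range(nsteps):
        if refresh_halos:
            exchange_halos(parts)
        parts = step_parts(parts)
    return gather(parts)


field = [0.0, 1.0, 4.0, 2.0, 2.0, 4.0, 1.0, 0.0, 1.0, 4.0, 2.0, 2.0]
reference = step_global(step_global(field))
```

With the halo refreshed every step, `run(field, 2, 2, True)` and `run(field, 3, 2, True)` both reproduce `reference` exactly; with `refresh_halos=False` the second step reads stale halo values, so the answer differs from `reference` and differs between the two decompositions.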
comment:4 Changed 8 years ago by frrh
Further to this, while the eORCA1 case seems to restart reproducibly with this fix included, the eORCA025 NEMO-CICE case does not. Investigations with that reveal that the differences emanate from the CICE treatment of stress fields at the North fold which, unlike the eORCA1 case, is a T-fold and therefore requires extra operations on the stress fields. The critical thing seems to be that the fields StrocnxT and StrocnyT require a vector N-fold update before being passed to the NEMO component.
A fix for this is available on the Met Office internal CICE repository, documented under ticket 75 of that system.
As a side issue, it is noted that in both eORCA1 and eORCA025, neither StrocnxT nor StrocnyT can be visualised in ncview due to the presence of NaNs over blocks of the Antarctic land mass. This appears to be an artefact of the NEMO grid definitions failing to prescribe meaningful latitudes and longitudes at these points.
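As a rough illustration of what a "vector" fold update means, the Python sketch below overwrites the top row of a small 2D field with a mirrored copy of a row beneath it, changing sign for vector components. The function name `north_fold_update`, the choice of source row and the mirror indexing are simplifying assumptions; this is not the index arithmetic used by NEMO or CICE, and it is not the fix referred to above.

```python
def north_fold_update(field, sign):
    """Overwrite the top row with the mirrored row two below it, times `sign`.

    field: list of rows (south to north), each a list of floats.
    sign:  +1.0 for scalar fields, -1.0 for vector components, whose
           direction reverses when crossing the fold.
    """
    nj = len(field)
    ni = len(field[0])
    source = field[nj - 3]
    field[nj - 1] = [sign * source[ni - 1 - i] for i in range(ni)]
    return field


# Hypothetical 3-row stress component: after the update the top row is the
# bottom row reversed and negated.
stress_x = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0],
            [9.0, 9.0, 9.0]]
north_fold_update(stress_x, -1.0)
```

The point of the sketch is only that a scalar-style update (`sign=+1.0`) applied to a stress component would leave the fold row with the wrong sign, which is one way inconsistent values could reach the ocean component.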
comment:5 Changed 8 years ago by nicolasmartin
- Keywords 2015 nemo_v3_6* added
comment:6 Changed 7 years ago by frrh
- Resolution set to fixed
- Status changed from new to closed
Closing ticket since work not aimed at NEMO trunk. Full documentation of CRUN reproducibility and relevant solutions is held on MO GMED trac system.
comment:7 Changed 7 years ago by nemo
- Keywords release-3.6* added; nemo_v3_6* removed
comment:8 Changed 7 years ago by nemo
- Keywords release-3.6* removed
comment:9 Changed 3 years ago by nemo
- Keywords OPA v3.6 added
Extensive tests using eORCA1 NEMO-CICE jobs show that a 10 day NRUN compared with an equivalent CRUN comprising 10 x 1 day resubmissions begins to diverge around day 7.
Furthermore, comparing a 6x6 PE NRUN versus a 12x9 NRUN shows differences around day 4.
This suggests parallelisation issues to me.
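A check of this kind can be sketched in Python as follows. It is a minimal illustration, not the tooling used for these tests: each run is assumed to be available as a list of per-day fields (lists of floats), and the names `bits` and `first_divergence` are hypothetical. Comparing raw bit patterns rather than values means that NaNs and signed zeros are also compared exactly.

```python
import struct


def bits(value):
    """Raw 64-bit pattern of a double, so NaN and -0.0 compare exactly."""
    return struct.pack("<d", value)


def first_divergence(run_a, run_b):
    """Return the first day (1-based) whose fields differ bitwise, else None."""
    for day, (fa, fb) in enumerate(zip(run_a, run_b), start=1):
        if len(fa) != len(fb) or any(bits(x) != bits(y)
                                     for x, y in zip(fa, fb)):
            return day
    return None


# Hypothetical three-day outputs differing by one unit in the last place on day 3.
nrun = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
crun = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0 + 2.0 ** -50]]
day = first_divergence(nrun, crun)
```

Here `day` is 3, the first day on which any value differs in even a single bit.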
Discrepancies were originally unearthed in eORCA025 configurations and tests with that suggested that switching off CICE or reducing optimisation solved the problem.
In my view, since the problem is data dependent, changing optimisation or excluding CICE will change the numbers involved and by definition will change the point of divergence. So I remain to be convinced that either of the above solves anything. I believe they simply move (delay) the onset of the problem.
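One generic reason why changing optimisation level or processor layout alters the numbers is that floating-point addition is not associative. The Python sketch below shows the effect with a hypothetical helper `grouped_sum` that adds values chunk by chunk, loosely mimicking per-PE partial sums; it illustrates the mechanism only and is not a diagnosis of this ticket.

```python
def ordered_sum(values):
    """Plain left-to-right accumulation, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


def grouped_sum(chunks):
    """Sum each chunk separately, then add the partial sums in order."""
    return ordered_sum([ordered_sum(chunk) for chunk in chunks])


one_chunk = grouped_sum([[0.1, 0.2, 0.3]])       # (0.1 + 0.2) + 0.3
two_chunks = grouped_sum([[0.1], [0.2, 0.3]])    # 0.1 + (0.2 + 0.3)
```

The two results are 0.6000000000000001 and 0.6 respectively. A difference of this size can then be amplified by a nonlinear model, which would be consistent with such changes moving the point of divergence rather than removing it.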