Opened 3 years ago
Closed 3 years ago
#2418 closed Bug (fixed)
model does not stop properly in stpctl
Reported by: | smasson | Owned by: | systeam |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | MULTIPLE | Version: | trunk |
Severity: | minor | Keywords: | |
Cc: |
Description
Context
If an error is detected in stpctl, whether the model stops properly depends on the namelist settings chosen for sn_cfctl.
Analysis
For example, with sn_cfctl%l_glochk = .true. and all other sn_cfctl% switches set to .false., the model does not stop because of a deadlock. The variable lsomeoce is also a potential source of deadlock.
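For illustration, the failing combination described above might look like this in the namctl namelist group. This is only a sketch: it lists just the sn_cfctl members named in this ticket, and any other members of the structure (not shown) are assumed to be .false. as well.

```fortran
&namctl            ! sketch only: other namctl entries are omitted
   sn_cfctl%l_glochk  = .true.    ! the one switch left on in the failing case
   sn_cfctl%l_allon   = .false.
   sn_cfctl%l_config  = .false.
   sn_cfctl%l_runstat = .false.
/
```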
Fix
Large debugging/cleaning of stpctl is needed...
Commit History (23)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
13136 | smasson | 2020-06-22T08:29:44+02:00 | trunk: fix maxval values on land subdomains for stpctl, see #2418 |
13115 | smasson | 2020-06-16T20:58:06+02:00 | trunk: fix potential deadlock, see #2418 |
13011 | smasson | 2020-06-03T09:56:28+02:00 | trunk: make sure error messages are visible, see #2418 |
12935 | smasson | 2020-05-15T14:15:31+02:00 | delete #2418 branch |
12933 | smasson | 2020-05-15T10:06:25+02:00 | trunk: merge back r12581_ticket2418 branch into the trunk, see #2418 |
12932 | smasson | 2020-05-15T10:01:11+02:00 | r12581_ticket2418: update sette version in svn externals definition, see #2418 |
12931 | smasson | 2020-05-15T09:59:05+02:00 | sette: suppress set_namelist for sn_cfctl%l_config, see #2418 |
12930 | smasson | 2020-05-15T09:51:24+02:00 | r12581_ticket2418: update with trunk @12929, see #2418 |
12858 | smasson | 2020-05-03T11:04:27+02:00 | r12581_ticket2418: bugfix not seen on X64_IRENE, see #2418 |
12856 | smasson | 2020-05-02T11:55:39+02:00 | r12581_ticket2418: stupid bugfix following [12855], see #2418 |
12855 | smasson | 2020-05-01T19:09:33+02:00 | r12581_ticket2418: add check for Infinity, see #2418 |
12853 | smasson | 2020-05-01T18:56:02+02:00 | r12581_ticket2418: merge with trunk@12852, see #2418 |
12846 | smasson | 2020-05-01T14:07:29+02:00 | r12581_ticket2418: merge with trunk@12845, see #2418 |
12844 | smasson | 2020-05-01T12:57:50+02:00 | r12581_ticket2418: merge with trunk@12843, see #2418 |
12840 | smasson | 2020-05-01T10:58:58+02:00 | r12581_ticket2418: improve stpctl error messages and release the max of 9999 MPI tasks in files names, see #2418 |
12835 | smasson | 2020-04-30T08:55:37+02:00 | r12581_ticket2418: suppress l_allon and l_config namelist parameters, see #2418 |
12718 | smasson | 2020-04-08T17:21:05+02:00 | r12581_ticket2418: bugfix for C1D and STATION_ASF, see #2418 |
12685 | smasson | 2020-04-06T11:52:15+02:00 | r12581_ticket2418: end cleaning, see #2418 |
12684 | smasson | 2020-04-05T18:47:37+02:00 | r12581_ticket2418: additional cleaning, see #2418 |
12655 | smasson | 2020-04-03T11:35:09+02:00 | r12581_ticket2418: merge with trunk@12654, see #2418 |
12623 | smasson | 2020-03-28T08:38:26+01:00 | r12581_ticket2418: merge with trunk@12622, see #2418 |
12593 | smasson | 2020-03-24T16:52:17+01:00 | r12581_ticket2418, first commit see #2418 |
12582 | smasson | 2020-03-21T11:58:26+01:00 | r12581_ticket2418: create branch from trunk@12581, see #2418 |
Change History (36)
comment:1 Changed 3 years ago by smasson
comment:2 Changed 3 years ago by smasson
In 12593:
comment:3 Changed 3 years ago by smasson
This is quite a large commit for a bugfix, so I used a branch to share it before merging it back into the trunk.
One of the complexities of this ticket comes from the duplication of the "same" routines in different directories. Maintaining and synchronizing these directories is not always done properly, especially when these configurations/test cases are not tested by sette...
Here are the main points of this commit
- fix the reported bug
- report the number of the process on which the error is found
- add a "CALL SLEEP (60)" in urgent and imperative stop to make sure that all processes have time to write their error messages and their abort file
- some minor bugfixes: arguments order in one call to dia_wri_state
- some minor optimisations: use of llmsk, do not look for min/max if not needed etc...
- general cleaning of stpctl (with more comments to follow which process is doing what)
- synchronization of nemogcm, step and stpctl in OCE, SAS, C1D. version of OFF seems to be OK.
- add error tests in SAS/stpctl
- partly rewriting stpctl to minimize the differences between stpctl in OCE and SAS.
- remove sn_cfctl%l_glochk, which had no real use (or I missed it)
Here are the points still missing from this commit
- c1d and STATION_ASF are not in the sette tests and must therefore be tested (and included in the sette tests, I guess). One must check that these configurations still work AND that they stop properly when an error has to be detected. This last test must be done by forcing an error in step, just before calling stp_ctl
- I would like to remove sn_cfctl%l_allon and sn_cfctl%l_config, which I do not find very useful
- the test on ice temperature has been moved from -100 to -101 as errors were detected in ICE_AGRIF (could not get sette reproducibility). This error in ICE_AGRIF should be solved independently of this ticket
- we should try to limit the number of subroutines in STATION_ASF/MY_SRC to make it sustainable...
This branch passes all sette tests and gives the same results as trunk@12563:
-bash-4.2$ ./sette_rpt.sh
Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12582 ( last change @ r12582 )
SETTE validation report generated for :
NEMO/branches/2020/r12581_ticket2418 @ r12582+ (last changed revision)
on X64_IRENE arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 12582+
WGYRE_PISCES_ST tracer.stat restartability passed : 12582+
WORCA2_ICE_PISCES_ST run.stat restartability passed : 12582+
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 12582+
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 12582+
WAMM12_ST run.stat restartability passed : 12582+
WORCA2_SAS_ICE_ST run.stat restartability passed : 12582+
WAGRIF_DEMO_ST run.stat restartability passed : 12582+
WSPITZ12_ST run.stat restartability passed : 12582+
WISOMIP_ST run.stat restartability passed : 12582+
WOVERFLOW_ST run.stat restartability passed : 12582+
WLOCK_EXCHANGE_ST run.stat restartability passed : 12582+
WVORTEX_ST run.stat restartability passed : 12582+
WICE_AGRIF_ST run.stat restartability passed : 12582+
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 12582+
WGYRE_PISCES_ST tracer.stat reproducibility passed : 12582+
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 12582+
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 12582+
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 12582+
WAMM12_ST run.stat reproducibility passed : 12582+
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 12582+
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 12582+
WAGRIF_DEMO_ST run.stat reproducibility passed : 12582+
WSPITZ12_ST run.stat reproducibility passed : 12582+
WISOMIP_ST run.stat reproducibility passed : 12582+
WVORTEX_ST run.stat reproducibility passed : 12582+
WICE_AGRIF_ST run.stat reproducibility passed : 12582+
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 12582+ 12582+
!----result comparison check----!
check result differences between :
VALID directory : /ccc/scratch/cont005/ra0542/massons/r12581_ticket2418/NEMO_VALIDATION at rev 12582+
and
REFERENCE directory : /ccc/scratch/cont005/ra0542/massons/trunk/NEMO_VALIDATION at rev 12563
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:4 Changed 3 years ago by smasson
In 12623:
comment:5 Changed 3 years ago by smasson
In 12655:
comment:6 Changed 3 years ago by smasson
In 12684:
comment:7 Changed 3 years ago by smasson
This branch passes all sette tests with GCC and gives the same results as trunk@12650:
Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12683 ( last change @ r12655 )
SETTE validation report generated for :
NEMO/branches/2020/r12581_ticket2418 @ r12655+ (last changed revision)
on X64_IRENE_GCC arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 12655+
WGYRE_PISCES_ST tracer.stat restartability passed : 12655+
WORCA2_ICE_PISCES_ST run.stat restartability passed : 12655+
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 12655+
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 12655+
WAMM12_ST run.stat restartability passed : 12655+
WORCA2_SAS_ICE_ST run.stat restartability passed : 12655+
WAGRIF_DEMO_ST run.stat restartability passed : 12655+
WSPITZ12_ST run.stat restartability passed : 12655+
WISOMIP_ST run.stat restartability passed : 12655+
WOVERFLOW_ST run.stat restartability passed : 12655+
WLOCK_EXCHANGE_ST run.stat restartability passed : 12655+
WVORTEX_ST run.stat restartability passed : 12655+
WICE_AGRIF_ST run.stat restartability passed : 12655+
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 12655+
WGYRE_PISCES_ST tracer.stat reproducibility passed : 12655+
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 12655+
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 12655+
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 12655+
WAMM12_ST run.stat reproducibility passed : 12655+
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 12655+
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 12655+
WAGRIF_DEMO_ST run.stat reproducibility passed : 12655+
WSPITZ12_ST run.stat reproducibility passed : 12655+
WISOMIP_ST run.stat reproducibility passed : 12655+
WVORTEX_ST run.stat reproducibility passed : 12655+
WICE_AGRIF_ST run.stat reproducibility passed : 12655+
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 12655+ 12655+
!----result comparison check----!
check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12655+
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12650
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:8 Changed 3 years ago by smasson
In 12685:
comment:9 Changed 3 years ago by smasson
In 12718:
comment:10 Changed 3 years ago by smasson
In 12835:
comment:11 Changed 3 years ago by smasson
In 12840:
comment:12 Changed 3 years ago by smasson
This version passes all the sette tests and gives the same results as the trunk at rev 12650.
Note that, to use sette, I had to remove all the following lines
set_namelist namelist_cfg sn_cfctl%l_config .true.
in sette_reference-configurations.sh and sette_test-cases.sh, as sn_cfctl%l_config has been removed from the namelist
-bash-4.2$ ./sette_rpt.sh
Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12840 ( last change @ r12840 )
SETTE validation report generated for :
NEMO/branches/2020/r12581_ticket2418 @ r12840 (last changed revision)
on X64_IRENE arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 12840
WGYRE_PISCES_ST tracer.stat restartability passed : 12840
WORCA2_ICE_PISCES_ST run.stat restartability passed : 12840
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 12840
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 12840
WAMM12_ST run.stat restartability passed : 12840
WORCA2_SAS_ICE_ST run.stat restartability passed : 12840
WAGRIF_DEMO_ST run.stat restartability passed : 12840
WSPITZ12_ST run.stat restartability passed : 12840
WISOMIP_ST run.stat restartability passed : 12840
WOVERFLOW_ST run.stat restartability passed : 12840
WLOCK_EXCHANGE_ST run.stat restartability passed : 12840
WVORTEX_ST run.stat restartability passed : 12840
WICE_AGRIF_ST run.stat restartability passed : 12840
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 12840
WGYRE_PISCES_ST tracer.stat reproducibility passed : 12840
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 12840
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 12840
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 12840
WAMM12_ST run.stat reproducibility passed : 12840
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 12840
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 12840
WAGRIF_DEMO_ST run.stat reproducibility passed : 12840
WSPITZ12_ST run.stat reproducibility passed : 12840
WISOMIP_ST run.stat reproducibility passed : 12840
WVORTEX_ST run.stat reproducibility passed : 12840
WICE_AGRIF_ST run.stat reproducibility passed : 12840
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 12840 12840
!----result comparison check----!
check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12840
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12650
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:13 Changed 3 years ago by smasson
In 12844:
comment:14 Changed 3 years ago by smasson
In 12846:
comment:15 Changed 3 years ago by smasson
In 12853:
comment:16 Changed 3 years ago by smasson
In 12855:
comment:17 Changed 3 years ago by smasson
In 12856:
comment:18 Changed 3 years ago by smasson
This version gives the same results as trunk@12852:
Current code is : NEMO/branches/2020/r12581_ticket2418 @ r12856 ( last change @ r12856 )
SETTE validation report generated for :
NEMO/branches/2020/r12581_ticket2418 @ r12856 (last changed revision)
on X64_JEANZAY arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 12856
WGYRE_PISCES_ST tracer.stat restartability passed : 12856
WORCA2_ICE_PISCES_ST run.stat restartability passed : 12856
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 12856
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 12856
WAMM12_ST run.stat restartability passed : 12856
WORCA2_SAS_ICE_ST run.stat restartability passed : 12856
WAGRIF_DEMO_ST run.stat restartability passed : 12856
WSPITZ12_ST run.stat restartability passed : 12856
WISOMIP_ST run.stat restartability passed : 12856
WOVERFLOW_ST run.stat restartability passed : 12856
WLOCK_EXCHANGE_ST run.stat restartability passed : 12856
WVORTEX_ST run.stat restartability passed : 12856
WICE_AGRIF_ST run.stat restartability passed : 12856
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 12856
WGYRE_PISCES_ST tracer.stat reproducibility passed : 12856
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 12856
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 12856
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 12856
WAMM12_ST run.stat reproducibility passed : 12856
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 12856
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 12856
WAGRIF_DEMO_ST run.stat reproducibility passed : 12856
WSPITZ12_ST run.stat reproducibility passed : 12856
WISOMIP_ST run.stat reproducibility passed : 12856
WVORTEX_ST run.stat reproducibility passed : 12856
WICE_AGRIF_ST run.stat reproducibility passed : 12856
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 12856 12856
!----result comparison check----!
check result differences between :
VALID directory : /gpfsscratch/rech/fqx/reee217/r12581_ticket2418/NEMO_VALIDATION at rev 12856
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_VALIDATION/trunk at rev 12852
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:19 Changed 3 years ago by smasson
I think this branch is ready to be merged back into the trunk.
However, there are still 2 points to discuss:
- this branch needs a modification of sette as sn_cfctl%l_config has been removed from the namelist
- the documentation should also be modified because of the modifications done in namctl. It should be easy as we just have to delete some text. But as the compilation of the documentation is not working in the trunk, maybe it would be better to modify the documentation only once its compilation problem has been fixed...
comment:20 Changed 3 years ago by smasson
In 12858:
comment:21 Changed 3 years ago by smasson
Strange behavior of the WRITE command...
I tested the following program with different compilers:
PROGRAM tst
   CHARACTER(10) char
   WRITE(char,*) 'aaa'
   PRINT *,char
   WRITE(char,*) 'bbb ', TRIM(char)
   PRINT *,char
END PROGRAM tst
ifort 16.0.4 20160811 and ifort 18.0.5 20180823 print
aaa
bbb aaa
which was what I expected...
But ifort 19.0.5.281 20190815, gcc/4.8.5, gcc/9.1.0 and pgi/20.1 will print
aaa
bbb bbb
So it looks like the syntax I used was not good. In any case, it was not a good idea...
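The compiler dependence most likely comes from reading char in the output list while the same variable is being written as the internal file; as far as I understand, the Fortran standard does not allow an output item to be associated with the internal file of the same statement. A minimal sketch of a portable variant copies the string into a temporary first. The names tst_safe, cbuf and ctmp are hypothetical and are not taken from the NEMO sources.

```fortran
PROGRAM tst_safe
   IMPLICIT NONE
   CHARACTER(LEN=10) :: cbuf   ! string used as internal file (hypothetical name)
   CHARACTER(LEN=10) :: ctmp   ! temporary copy (hypothetical name)
   WRITE(cbuf,*) 'aaa'
   PRINT *, cbuf
   ctmp = cbuf                        ! take a copy before cbuf is overwritten
   WRITE(cbuf,*) 'bbb ', TRIM(ctmp)   ! the output list no longer refers to cbuf
   PRINT *, cbuf
END PROGRAM tst_safe
```

With the copy in place, the second WRITE only reads ctmp, so the result should no longer depend on how a compiler orders the evaluation of the output list and the filling of the internal file.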
comment:22 Changed 3 years ago by smasson
In 12930:
comment:23 Changed 3 years ago by smasson
In 12931:
comment:24 Changed 3 years ago by smasson
In 12932:
comment:25 Changed 3 years ago by smasson
In 12933:
comment:26 Changed 3 years ago by smasson
- Resolution set to fixed
- Status changed from new to closed
fixed in [12933].
[12933] passes all the sette tests and gives the same results as trunk@12925
[reee217@jean-zay3: sette]$ ./sette_rpt.sh
Current code is : NEMO/trunk @ r12933 ( last change @ r12933 )
SETTE validation report generated for :
NEMO/trunk @ r12933 (last changed revision)
on X64_JEANZAY arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 12933
WGYRE_PISCES_ST tracer.stat restartability passed : 12933
WORCA2_ICE_PISCES_ST run.stat restartability passed : 12933
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 12933
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 12933
WAMM12_ST run.stat restartability passed : 12933
WORCA2_SAS_ICE_ST run.stat restartability passed : 12933
WAGRIF_DEMO_ST run.stat restartability passed : 12933
WSPITZ12_ST run.stat restartability passed : 12933
WISOMIP_ST run.stat restartability passed : 12933
WOVERFLOW_ST run.stat restartability passed : 12933
WLOCK_EXCHANGE_ST run.stat restartability passed : 12933
WVORTEX_ST run.stat restartability passed : 12933
WICE_AGRIF_ST run.stat restartability passed : 12933
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 12933
WGYRE_PISCES_ST tracer.stat reproducibility passed : 12933
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 12933
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 12933
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 12933
WAMM12_ST run.stat reproducibility passed : 12933
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 12933
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 12933
WAGRIF_DEMO_ST run.stat reproducibility passed : 12933
WSPITZ12_ST run.stat reproducibility passed : 12933
WISOMIP_ST run.stat reproducibility passed : 12933
WVORTEX_ST run.stat reproducibility passed : 12933
WICE_AGRIF_ST run.stat reproducibility passed : 12933
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 12933 12933
!----result comparison check----!
check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12933
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:27 Changed 3 years ago by smasson
In 12935:
comment:28 Changed 3 years ago by smasson
- Resolution fixed deleted
- Status changed from closed to reopened
The solution that was coded is not good as it is machine/compiler dependent...
In ctl_stop, when all processes detecting an error write to the same ocean.output file, there is a chance that some error messages are erased/overwritten. This is especially the case when 'STOP' is not the first argument of ctl_stop. In this case, many other things can potentially be written to the ocean.output file, which can erase some error messages...
Solution :
- each error message is written to its own ocean.output_xxxx file. If this file is not open when entering ctl_stop, we create it.
- if 'STOP' is the first argument of ctl_stop: each process detecting an error also adds the following message to the ocean.output file:
==>>> Look for "E R R O R" messages in all existing *ocean.output* files
This message can be erased/overwritten by any of the processes detecting an error, but we know that the last process writing in ocean.output will be a process writing this message. So we will get at least 1 line with the message!
- if 'STOP' is not the first argument of ctl_stop, we know that process 0 will be the last to write in the ocean.output file, at the end of nemo_gcm, when nstop > 0 is tested. The above error message is then written in ocean.output only by process 0.
comment:29 Changed 3 years ago by smasson
In 13011:
comment:30 Changed 3 years ago by smasson
- Resolution set to fixed
- Status changed from reopened to closed
fixed in [13011]
[13012] passes all sette tests and gives the same results as trunk@12925
Current code is : NEMO/trunk @ r13012 ( last change @ r13012 )
SETTE validation report generated for :
NEMO/trunk @ r13012 (last changed revision)
on X64_JEANZAY arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 13012
WGYRE_PISCES_ST tracer.stat restartability passed : 13012
WORCA2_ICE_PISCES_ST run.stat restartability passed : 13012
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 13012
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 13012
WAMM12_ST run.stat restartability passed : 13012
WORCA2_SAS_ICE_ST run.stat restartability passed : 13012
WAGRIF_DEMO_ST run.stat restartability passed : 13012
WSPITZ12_ST run.stat restartability passed : 13012
WISOMIP_ST run.stat restartability passed : 13012
WOVERFLOW_ST run.stat restartability passed : 13012
WLOCK_EXCHANGE_ST run.stat restartability passed : 13012
WVORTEX_ST run.stat restartability passed : 13012
WICE_AGRIF_ST run.stat restartability passed : 13012
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 13012
WGYRE_PISCES_ST tracer.stat reproducibility passed : 13012
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 13012
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 13012
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 13012
WAMM12_ST run.stat reproducibility passed : 13012
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 13012
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 13012
WAGRIF_DEMO_ST run.stat reproducibility passed : 13012
WSPITZ12_ST run.stat reproducibility passed : 13012
WISOMIP_ST run.stat reproducibility passed : 13012
WVORTEX_ST run.stat reproducibility passed : 13012
WICE_AGRIF_ST run.stat reproducibility passed : 13012
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 13012 13012
!----result comparison check----!
check result differences between :
VALID directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 13012
and
REFERENCE directory : /gpfswork/rech/fqx/reee217/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:31 Changed 3 years ago by smasson
- Resolution fixed deleted
- Status changed from closed to reopened
In stpctl, if
- nstop > 0 when entering the routine
- and we don't do collective communication
- and no other errors are found in the tests on min/max values
=> We won't call ctl_stop and, once exiting stpctl, some processes will have nstop > 0, others won't.
This creates an MPI deadlock.
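The situation can be sketched with a small MPI program. This is a generic illustration and not the change made in the NEMO sources: the program name and the choice of rank 0 as the failing process are hypothetical. When no collective communication is available to share nstop, one option is for the process that sees nstop > 0 to call MPI_Abort, which asks the MPI runtime to terminate the whole job without needing the other processes to cooperate.

```fortran
PROGRAM tst_nstop
   USE mpi
   IMPLICIT NONE
   INTEGER :: ierr, irank, nstop
   CALL MPI_Init(ierr)
   CALL MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
   nstop = 0
   IF( irank == 0 )   nstop = 1       ! pretend that only one process detected an error
   ! Without the test below, rank 0 would leave the time loop while the others
   ! keep waiting for it in their next communication: a deadlock.
   IF( nstop > 0 ) THEN
      WRITE(*,*) 'process ', irank, ' stops the run, nstop = ', nstop
      CALL MPI_Abort(MPI_COMM_WORLD, 1, ierr)   ! ask the runtime to end the whole job
   ENDIF
   CALL MPI_Barrier(MPI_COMM_WORLD, ierr)       ! stands in for the next exchange between processes
   CALL MPI_Finalize(ierr)
END PROGRAM tst_nstop
```

The alternative, when collective communication is allowed, would be to agree on a global value of nstop before leaving the routine, so that every process takes the same branch.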
comment:32 Changed 3 years ago by smasson
In 13115:
comment:33 Changed 3 years ago by smasson
- Resolution set to fixed
- Status changed from reopened to closed
fixed in [13115]
[13115] passes all sette tests and gives the same results as trunk@12925
-bash-4.2$ ./sette_rpt.sh
Current code is : NEMO/trunk @ r13115 ( last change @ r13115 )
SETTE validation report generated for :
NEMO/trunk @ r13115 (last changed revision)
on X64_IRENE arch file
!!---------------1st pass------------------!!
!----restart----!
WGYRE_PISCES_ST run.stat restartability passed : 13115
WGYRE_PISCES_ST tracer.stat restartability passed : 13115
WORCA2_ICE_PISCES_ST run.stat restartability passed : 13115
WORCA2_ICE_PISCES_ST tracer.stat restartability passed : 13115
WORCA2_OFF_PISCES_ST tracer.stat restartability passed : 13115
WAMM12_ST run.stat restartability passed : 13115
WORCA2_SAS_ICE_ST run.stat restartability passed : 13115
WAGRIF_DEMO_ST run.stat restartability passed : 13115
WSPITZ12_ST run.stat restartability passed : 13115
WISOMIP_ST run.stat restartability passed : 13115
WOVERFLOW_ST run.stat restartability passed : 13115
WLOCK_EXCHANGE_ST run.stat restartability passed : 13115
WVORTEX_ST run.stat restartability passed : 13115
WICE_AGRIF_ST run.stat restartability passed : 13115
!----repro----!
WGYRE_PISCES_ST run.stat reproducibility passed : 13115
WGYRE_PISCES_ST tracer.stat reproducibility passed : 13115
WORCA2_ICE_PISCES_ST run.stat reproducibility passed : 13115
WORCA2_ICE_PISCES_ST tracer.stat reproducibility passed : 13115
WORCA2_OFF_PISCES_ST tracer.stat reproducibility passed : 13115
WAMM12_ST run.stat reproducibility passed : 13115
WORCA2_SAS_ICE_ST run.stat reproducibility passed : 13115
WORCA2_ICE_OBS_ST run.stat reproducibility passed : 13115
WAGRIF_DEMO_ST run.stat reproducibility passed : 13115
WSPITZ12_ST run.stat reproducibility passed : 13115
WISOMIP_ST run.stat reproducibility passed : 13115
WVORTEX_ST run.stat reproducibility passed : 13115
WICE_AGRIF_ST run.stat reproducibility passed : 13115
!----agrif check----!
ORCA2 AGRIF vs ORCA2 NOAGRIF run.stat unchanged - passed : 13115 13115
!----result comparison check----!
check result differences between :
VALID directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 13115
and
REFERENCE directory : /ccc/work/cont005/ra0542/massons/NEMO_ALL_VALIDATIONS/trunk/NEMO_VALIDATION at rev 12925
WGYRE_PISCES_ST run.stat files are identical
WGYRE_PISCES_ST tracer.stat files are identical
WORCA2_ICE_PISCES_ST run.stat files are identical
WORCA2_ICE_PISCES_ST tracer.stat files are identical
WORCA2_OFF_PISCES_ST tracer.stat files are identical
WAMM12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WORCA2_SAS_ICE_ST run.stat files are identical
WAGRIF_DEMO_ST run.stat files are identical
WSPITZ12_ST run.stat files are identical
WISOMIP_ST run.stat files are identical
WVORTEX_ST run.stat files are identical
WICE_AGRIF_ST run.stat files are identical
comment:34 Changed 3 years ago by smasson
- Resolution fixed deleted
- Status changed from closed to reopened
MAXVAL with a mask can return -HUGE on land-only processes.
When sn_cfctl%l_runstat = F, these values make the infinity test evaluate to true:
ABS( zmax(1) + zmax(2) + zmax(3) ) > HUGE(1._wp)
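The effect can be reproduced in isolation with a small program. This is a sketch under assumptions: the array zfld, the program name and the definition of wp as double precision are hypothetical, while llmsk and zmax follow the names used in this ticket. For an all-false mask, MAXVAL returns the negative number of largest magnitude, and the sum of three such values typically overflows under IEEE arithmetic.

```fortran
PROGRAM tst_maxval
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND(12,307)   ! double precision (assumed)
   REAL(wp) :: zfld(4,4)      ! hypothetical field on the local subdomain
   LOGICAL  :: llmsk(4,4)     ! .true. on ocean points
   REAL(wp) :: zmax(3)
   zfld(:,:)  = 1._wp
   llmsk(:,:) = .FALSE.       ! land-only subdomain: no ocean point at all
   zmax(:) = MAXVAL( zfld, mask = llmsk )   ! empty selection, so the result is -HUGE(1._wp)
   PRINT *, 'zmax(1) = ', zmax(1)
   PRINT *, 'infinity test: ', ABS( zmax(1) + zmax(2) + zmax(3) ) > HUGE(1._wp)
   ! One possible guard (an assumption, not necessarily the committed fix):
   ! only apply the test where the mask selects at least one point.
   IF( ANY( llmsk ) ) THEN
      PRINT *, 'guarded test: ', ABS( zmax(1) + zmax(2) + zmax(3) ) > HUGE(1._wp)
   ELSE
      PRINT *, 'land-only subdomain: infinity test skipped'
   ENDIF
END PROGRAM tst_maxval
```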
comment:35 Changed 3 years ago by smasson
In 13136:
comment:36 Changed 3 years ago by smasson
- Resolution set to fixed
- Status changed from reopened to closed
In 12582: