Opened 10 years ago
Closed 9 years ago
#1195 closed Bug (fixed)
dev_MERGE_2013: subscript out of array bounds in mpp_lbc_nfd_2d in ORCA2+ln_nogather=T
Reported by: | cbricaud | Owned by: | epico |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | OCE | Version: | v3.6 |
Severity: | | Keywords: | |
Cc: | | | |
Description
Array subscript out of bounds in CASE 'U' for npolj = 3 or 4,
at line 694:
pt2dl(ji,ijpj) = psgn * pt2dr(iju,ijpjm1-1)
with the standard ORCA2_LIM configuration and ln_nnogather=T.
Without the compiler option for array bounds checking, NEMO terminates like this:
===>>> : E R R O R
===========
stpctl: the zonal velocity is larger than 20 m/s
======
kt= 3 max abs(U): 159.3 , i j k: 133 145 29
output of last fields in numwso
===>>> : E R R O R
===========
stp_ctl : NEGATIVE sea surface salinity
=======
kt= 3 min SSS: -219.2 , i j: 134 145
output of last fields in numwso
===>>> : E R R O R
I am currently looking for the change between the two revisions that makes it crash.
Commit History (2)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
4671 | epico | 2014-06-17T17:00:51+02:00 | bug fix in north fold optimization when land-processes are removed. see ticket #1195 |
4645 | epico | 2014-05-20T17:17:03+02:00 | bug fixes for the north-fold optimization. see ticket #1195 |
Attachments (2)
Change History (12)
comment:1 Changed 9 years ago by epico
- Owner changed from NEMO team to epico
- Status changed from new to assigned
comment:2 in reply to: ↑ description Changed 9 years ago by epico
comment:3 Changed 9 years ago by OriolTP
I found the same bug when setting ln_nnogather to true.
I applied the patch and it now works.
comment:4 Changed 9 years ago by epico
The problem has been fixed and committed to the trunk at rev. 4645.
The fixes have been tested on the ORCA2 configuration, with and without land processes, for the following decompositions: 8x8 and 16x8.
However, the simulation freezes with ORCA2 using 30 procs with a decomposition of 8 x 4.
Since that problem also arises when the ln_nnogather flag is false (i.e. without the north-fold optimisation), a new ticket will be opened.
comment:5 Changed 9 years ago by epico
- Resolution set to fixed
- Status changed from assigned to closed
comment:6 Changed 9 years ago by molines
Hello All !
Unfortunately, I am not so positive about the fix for this ticket. With ORCA025, running the trunk version (rev. 4645) with elimination of land processors (e.g. 17 x 10 for jpnij = 145), there are still errors when compiled with bounds checking on. So I tried to track the error a bit further and looked at the lbcnfd.F90 routine. To start with, I think that the mpp_lbc_nfd_2d/3d routines do not take the elimination of land processors into account. As an example, from mpp_lbc_nfd_2d:
    ....
    SELECT CASE ( npolj )
    !
    CASE ( 3 , 4 )                        ! *  North fold  T-point pivot
       !
       SELECT CASE ( cd_type )
       CASE ( 'T' , 'W' )                 ! T-, W-point
          IF (narea .ne. (jpnij - jpni + 1)) THEN
             startloop = 1
          ELSE
             startloop = 2
          ENDIF
    ...
It seems to me that the test on narea is written assuming that all processes are used, in which case the statement picks out the leftmost domain of the northern row of subdomains. Further on there are similar tests (narea == jpnij) for the rightmost domain of the northern row of subdomains.
I was wondering whether those tests shouldn't be replaced by a test on nbondi (-1 or 1). I tried that, but then I ran into other problems, always linked to some index being 0 in arrays (e.g. pt2dr) that start at 1. This routine is now quite complex and tricky, and I think it cannot be used safely when elimination of land processors is active (which by itself may reduce the computational burden of high-resolution configurations by 30%!).
This is therefore a major problem for running high-resolution ORCA configurations.
Regards,
Jean-Marc Molines
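To make the concern about the narea test concrete, here is a small Python sketch (not NEMO code). It evaluates the expression jpnij - jpni + 1 with and without land-processor elimination, using figures quoted elsewhere in this ticket (jpni=17, jpnj=10, jpnij=145, northern row reported as starting at domain 133). The function name is hypothetical and only serves the illustration.

```python
def leftmost_north_domain_assumed(jpnij, jpni):
    """Domain number singled out by the test `narea == jpnij - jpni + 1`."""
    return jpnij - jpni + 1


jpni, jpnj = 17, 10

# All processes kept: jpnij = jpni * jpnj, so the expression gives the first
# domain of the last row of a full jpni x jpnj grid.
full = leftmost_north_domain_assumed(jpni * jpnj, jpni)

# Land-only processes removed: jpnij = 145 (value quoted in this ticket).
reduced = leftmost_north_domain_assumed(145, jpni)

# First domain of the northern row as reported in this ticket.
reported_first_north_domain = 133

print(full)                                    # 154
print(reduced)                                 # 129
print(reduced == reported_first_north_domain)  # False
```

Under these assumptions the test would single out domain 129 rather than the reported leftmost northern domain 133, which is consistent with the suspicion that the test presumes no process has been eliminated.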
comment:7 Changed 9 years ago by charris
I've not done any testing with the latest versions of the trunk, but just a reminder that at 3.4 I found that activating elimination of land processors changed results in an ORCA025 configuration (see #1163). Presumably this could well be related to issues being discussed here.
comment:8 Changed 9 years ago by epico
- Resolution fixed deleted
- Status changed from closed to reopened
The ticket is reopened because the proposed fix does not fully solve the problem with array indexes in the north fold optimization when land-processes are removed.
Changed 9 years ago by epico
Changed 9 years ago by epico
comment:9 Changed 9 years ago by epico
From Jean-Marc's email:
Hi Andrew and Italo,
Thank you for looking at this tricky (and rather tedious) problem.
Just speaking of ORCA025, the domain decomposition that I am using for tests (by far not optimal) is jpni=17, jpnj=10 for jpnij=145.
The northernmost row of procs looks like this (sorry for the wide format). It runs from domain 133 (rank 132) to domain 145 (rank 144). Domains marked with X are land-only and eliminated. The connection lines show isendto(1) and isendto(2), if not 0, but this is not the main point.
Looking at domain 145, you see that it is not connected to any other proc in the N-S direction. In fact, for this domain (145) isendto(1:3) = 0, which incidentally produces a bounds-check error when addressing nimppt(isendto(1)) in lbcnfd.F90. The next figure shows a close-up of domains 144 and 145 together with a tilted representation of the matching left domains.
We see that it is normal that 145 does not require a north communication. I agree that this is a particular case. But this case may happen in other parts of the north boundary, in particular in the Canadian Archipelago where there are 'inland fingers' of water, and it is likely to happen more frequently at higher resolution with more processors. So this case must be addressed.
In an attempt to fix this, at the very beginning of the mpp_lbc_nfd routine, I simply return if isendto(1) = 0 (by the way, only isendto(1) is used in this routine).
This fixes the problem of nimppt(isendto(1)), but then I run into another problem when, for instance,
iju= jpiglo - ji - nimpp - nimppt(isendto(1)) + 3
becomes 0 and causes a fault at
pt2dl(ji,ijpj) = psgn * pt2dr(iju,ijpjm1-1)
At this point I thought that using nbondi instead of a test on narea would fix the problem (as described in the ticket), but unfortunately it does not. Going further, I tracked the points that gave iju=0 and found it odd that they are not even the last points of the i-loop.
For instance, it happens for 2 domains of the northern row:
domain 144
nimpp = 1189
isendto(1)= 133
nimppt(133) = 171
at
ji = 85
endloop = 86
nlci = 86
nbondi = 0
jpiglo=1442
domain 138
nimpp = 596
isendto(1)= 139
nimppt(139) = 766
at
ji = 83
endloop = 86
nlci = 87
nbondi = 1
jpiglo=1442
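As a quick check of the two cases listed above, the following Python sketch (not NEMO code) plugs the reported values into the index expression quoted in this ticket. The function and dictionary names are hypothetical; only the numbers and the formula come from the ticket.

```python
def iju(jpiglo, ji, nimpp, nimppt_dest):
    """Index expression quoted in the ticket:
    iju = jpiglo - ji - nimpp - nimppt(isendto(1)) + 3
    """
    return jpiglo - ji - nimpp - nimppt_dest + 3


# Values copied from the two cases listed above.
cases = {
    144: dict(jpiglo=1442, ji=85, nimpp=1189, nimppt_dest=171),
    138: dict(jpiglo=1442, ji=83, nimpp=596, nimppt_dest=766),
}

results = {domain: iju(**values) for domain, values in cases.items()}

for domain, value in results.items():
    # The ticket describes pt2dr as an array starting at 1, so 0 is out of bounds.
    print(domain, value, "out of bounds" if value < 1 else "ok")
```

Both cases evaluate to iju = 0, matching the reported fault on pt2dr(iju,ijpjm1-1).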
So, here I am! I do not know whether these comments are useful for tackling the problem, but in any case I can easily test any modification or give you more relevant information.
comment:10 Changed 9 years ago by epico
- Resolution set to fixed
- Status changed from reopened to closed
The bug has been fixed and committed to the trunk at revision 4671.
Replying to cbricaud:
Dear Clement,
we found the bug. Before updating the svn and closing the ticket, we wish to be sure our patch is correct.
The problem is in the mpp_lbc_north_3d routine. There, before the folding algorithm is applied (through the call to lbc_nfd), a communication phase takes place among the processes in the north region. Depending on the ln_nnogather flag, this communication happens either through a collective call or through point-to-point communication. After this step the folding algorithm is invoked through one of two routines: lbc_nfd when ln_nnogather is false (optimization deactivated), and mpp_lbc_nfd when the optimized code is used.
During the merge, namely in dev_LOCEAN_CMCC_INGV_2013, a further invocation of lbc_nfd (the non-optimized version) was added at the end of the routine.
Let us look at the code.
The last call to lbc_nfd (which appears after the IF statement) is not correct when ln_nnogather is true, since the ztab array is not initialized.
This call should be removed.
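The control flow described above can be sketched roughly as follows. This is a Python stand-in, not the Fortran source: the function bodies, the north_fold wrapper, and the extra_call switch are all hypothetical, and the uninitialised ztab array is modelled as None so that reading it raises an error (in Fortran it would silently hold arbitrary values instead).

```python
calls = []  # records which fold routine ran, for inspection


def lbc_nfd(ztab):
    # Stand-in for the non-optimised fold; it needs the gathered buffer.
    if ztab is None:
        raise ValueError("ztab was never filled")
    calls.append("lbc_nfd")


def mpp_lbc_nfd():
    # Stand-in for the optimised fold; it works on point-to-point buffers.
    calls.append("mpp_lbc_nfd")


def north_fold(ln_nnogather, extra_call):
    ztab = None                  # models an uninitialised array
    if ln_nnogather:
        mpp_lbc_nfd()            # optimised path: ztab is not filled
    else:
        ztab = [0.0]             # collective path: ztab is gathered
        lbc_nfd(ztab)
    if extra_call:
        lbc_nfd(ztab)            # the trailing call described above


north_fold(ln_nnogather=True, extra_call=False)   # fine once the call is removed
try:
    north_fold(ln_nnogather=True, extra_call=True)
except ValueError as err:
    print("trailing call fails:", err)
```

In this sketch the trailing call is harmless only when the collective path has filled ztab, which is why removing it matters for the optimised path.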
We have commented out those lines and verified the change using the ORCA2_LIM configuration with the ln_nnogather flag activated and deactivated, and with and without land processes. We compared solver.stat and tracer.stat and obtained the same results.
Moreover, we noticed that some modifications have been made in lbc_nfd. Those modifications should also be applied to the mpp_lbc_nfd routine.
Could you check the patch and give us confirmation before we commit?