Opened 7 years ago

Closed 6 years ago

#1195 closed Bug (fixed)

dev_MERGE_2013: subscript out of array bounds in mpp_lbc_nfd_2d in ORCA2+ln_nogather=T

Reported by: cbricaud
Owned by: epico
Priority: low
Milestone:
Component: OCE
Version: release-3.6
Severity:
Keywords:
Cc:

Description

An out-of-bounds subscript occurs in CASE U for npolj = 3 or 4, at line 694:

pt2dl(ji,ijpj) = psgn * pt2dr(iju,ijpjm1-1)

This happens with the standard ORCA2_LIM configuration and ln_nnogather=T.

Without the compiler option for array bounds checking, NEMO terminates as follows:

===>>> : E R R O R

===========

stpctl: the zonal velocity is larger than 20 m/s
======

kt= 3 max abs(U): 159.3 , i j k: 133 145 29

output of last fields in numwso

===>>> : E R R O R

===========

stp_ctl : NEGATIVE sea surface salinity
=======
kt= 3 min SSS: -219.2 , i j: 134 145

output of last fields in numwso

===>>> : E R R O R

I am currently looking for the change between the two revisions that causes the crash.

Commit History (2)

Changeset  Author  Time  ChangeLog
4671  epico  2014-06-17T17:00:51+02:00  bug fix in north fold optimization when land-processes are removed. see ticket #1195
4645  epico  2014-05-20T17:17:03+02:00  bug fixes for the north-fold optimization. see ticket #1195

Attachments (2)

northFoldProblem2.png (372.5 KB) - added by epico 6 years ago.
northFoldProblem.png (76.4 KB) - added by epico 6 years ago.


Change History (12)

comment:1 Changed 6 years ago by epico

  • Owner changed from NEMO team to epico
  • Status changed from new to assigned

comment:2 in reply to: ↑ description Changed 6 years ago by epico

Replying to cbricaud:


Dear Clement

We found the bug. Before committing to SVN and closing the ticket, we want to be sure our patch is correct.

The problem is in the mpp_lbc_north_3d routine. Before the folding algorithm is applied, the processes in the northern region exchange data, either through a collective call or through point-to-point communication, depending on the ln_nnogather flag. The folding algorithm is then invoked through one of two routines: lbc_nfd when ln_nnogather is false (optimization deactivated), and mpp_lbc_nfd when it is true (optimized code).

During the merge, specifically in dev_LOCEAN_CMCC_INGV_2013, a further call to lbc_nfd (the non-optimized version) was added at the end of the routine.

Here is the code:

      IF ( l_north_nogather ) THEN
         !
....
         CALL mpp_lbc_nfd( ztabl, ztabr, cd_type, psgn )   ! North fold boundary condition
         !
         DO jk = 1, jpk
            DO jj = nlcj-ijpj+1, nlcj             ! Scatter back to pt3d
               ij = jj - nlcj + ijpj
               DO ji= 1, nlci
                  pt3d(ji,jj,jk) = ztabl(ji,ij,jk)
               END DO
            END DO
         END DO
      ELSE
.....   
         CALL lbc_nfd( ztab, cd_type, psgn )   ! North fold boundary condition
         !
         DO jk = 1, jpk
            DO jj = nlcj-ijpj+1, nlcj             ! Scatter back to pt3d
               ij = jj - nlcj + ijpj
               DO ji= 1, nlci
                  pt3d(ji,jj,jk) = ztab(ji+nimpp-1,ij,jk)
               END DO
            END DO
         END DO
         !
      ENDIF
 
      CALL lbc_nfd( ztab, cd_type, psgn )   ! North fold boundary condition
      !
      DO jk = 1, jpk
         DO jj = nlcj-ijpj+1, nlcj             ! Scatter back to pt3d
            ij = jj - nlcj + ijpj
            DO ji= 1, nlci
               pt3d(ji,jj,jk) = ztab(ji+nimpp-1,ij,jk)
            END DO
         END DO
      END DO

The final call to lbc_nfd (the one after the IF block) is incorrect when ln_nnogather is true, because the ztab array is not initialized in that case.
This call should be removed.
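
As a toy illustration of why the trailing call is harmful, here is a Python sketch of the control flow only. It is not NEMO code: the function name, the NaN stand-in for the uninitialized ztab, and the sign-flip "fold" are all assumptions made for the example.

```python
import math


def north_fold_scatter(values, nogather, trailing_call):
    """Toy model of the branch structure shown above (hypothetical, not NEMO)."""
    # NaN stands in for whatever an uninitialized work array happens to hold.
    ztab = [math.nan] * len(values)
    if nogather:
        # Optimized branch: only ztabl is filled, ztab is never touched.
        ztabl = [-v for v in values]  # pretend fold: a plain sign flip
        pt = list(ztabl)
    else:
        # Non-optimized branch: ztab is filled and scattered back.
        ztab = [-v for v in values]
        pt = list(ztab)
    if trailing_call:
        # Unconditional scatter from ztab after the IF block.
        pt = list(ztab)
    return pt


with_bug = north_fold_scatter([1.0, 2.0], nogather=True, trailing_call=True)
without_bug = north_fold_scatter([1.0, 2.0], nogather=True, trailing_call=False)
print(with_bug, without_bug)
```

In this sketch the optimized branch produces the expected values only when the trailing scatter is skipped; with it, the result is overwritten by the contents of the untouched work array.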

We commented out those lines and verified the change using the ORCA2_LIM configuration, with the ln_nnogather flag activated and deactivated, and with and without land processes. We compared solver.stat and tracer.stat and obtained the same results.

Moreover, we noticed that some modifications have been added to lbc_nfd. Those modifications should also be made in the mpp_lbc_nfd routine.

Could you check the patch and confirm before we commit?

comment:3 Changed 6 years ago by OriolTP

I found the same bug when setting ln_nnogather to true.
I applied the patch and now it works.

comment:4 Changed 6 years ago by epico

The problem has been fixed and committed in the trunk rev. 4645.

The fixes have been tested on the ORCA2 configuration with and without land-processes for the following decompositions: 8x8 and 16x8.

However, the simulation freezes with ORCA2 using 30 procs with a decomposition of 8 x 4.

Since the problem also arises when the ln_nnogather flag is false (i.e. without the north-fold optimisation), a new ticket will be opened.

comment:5 Changed 6 years ago by epico

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:6 Changed 6 years ago by molines

Hello All !

Unfortunately, I am not so positive about the fix for this ticket. With ORCA025, running the trunk version (rev. 4645) with elimination of land processors (e.g. jpni x jpnj = 17 x 10 with jpnij = 145), there are still errors when compiled with bounds checking on. So I tried to track the error a bit further and looked at lbcnfd.F90. To start with, I think that the mpp_lbc_nfd_2d/3d routines do not take the elimination of land processors into account. As an example, from mpp_lbc_nfd_2d:

....
         SELECT CASE ( npolj )
         !
         CASE ( 3 , 4 )                        ! *  North fold  T-point pivot
            !
            SELECT CASE ( cd_type )
            CASE ( 'T' , 'W' )                         ! T-, W-point
               IF (narea .ne. (jpnij - jpni + 1)) THEN
                 startloop = 1
               ELSE
                 startloop = 2
               ENDIF
...

It seems to me that the test on narea is written assuming that all processes are used, and that the statement is meant to catch the leftmost domain of the northern row of subdomains. Further on, there are similar tests (narea == jpnij) for the rightmost domain of the northern row of subdomains.
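
A quick arithmetic check of this point, as a Python sketch rather than NEMO code. It uses the decomposition values quoted elsewhere in this ticket (jpni=17, jpnij=145, northern row running from domain 133 to domain 145); the helper name is hypothetical.

```python
def narea_test_target(jpni: int, jpnij: int) -> int:
    """Domain number singled out by the test narea == jpnij - jpni + 1."""
    return jpnij - jpni + 1


# Values quoted in this ticket for the ORCA025 test decomposition.
jpni, jpnij = 17, 145
north_row = list(range(133, 146))  # domains 133..145 form the northern row

target = narea_test_target(jpni, jpnij)
matches = [narea for narea in north_row if narea == target]

print(f"test singles out domain {target}")
print(f"northern-row domains matching the test: {matches}")
```

Under these assumptions the test points at a domain number below the start of the northern row, so no northern-row domain would be treated as the leftmost one, which is consistent with the formula only holding when no land processors are eliminated.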

I was wondering whether those tests shouldn't be replaced by a test on nbondi (-1 or 1). I tried that, but then ran into other problems, always linked to some index being 0 in arrays (e.g. pt2dr) that start at 1. This routine is now quite complex and tricky, and I think it cannot be used safely when elimination of land processors is active (which by itself may reduce the computational burden of high-resolution configurations by 30%!).

This is therefore a major problem for running ORCA high resolution configurations.

Regards,

Jean-Marc Molines

comment:7 Changed 6 years ago by charris

I've not done any testing with the latest versions of the trunk, but just a reminder that at 3.4 I found that activating elimination of land processors changed results in an ORCA025 configuration (see #1163). Presumably this could well be related to issues being discussed here.

comment:8 Changed 6 years ago by epico

  • Resolution fixed deleted
  • Status changed from closed to reopened

The ticket is reopened because the proposed fix does not fully solve the problem with array indexes in the north fold optimization when land-processes are removed.

Changed 6 years ago by epico

Changed 6 years ago by epico

comment:9 Changed 6 years ago by epico

From Jean-Marc's email:

Hi Andrew and Italo

Thank you for looking at this tricky (and rather tedious) problem.

Just speaking of ORCA025 with the domain decomposition that I am using for tests (far from optimal): jpni=17, jpnj=10 for jpnij=145.
The northernmost row of procs looks like this (sorry for the wide format). It runs from domain 133 (rank 132) to domain 145 (rank 144). Domains marked with X are land-only and eliminated. The connection lines show isendto(1) and isendto(2) where they are non-zero, but this is not the main point.


Looking at domain 145, you see that it is not connected to any other proc in the N-S direction. In fact, for this domain (145) isendto(1:3) = 0, which incidentally produces a bounds-check error when addressing nimppt(isendto(1)) in lbcnfd.F90. The next figure shows a close-up of domains 144 and 145 together with a tilted representation of the matching left domains.


We see that it is normal that 145 does not require a north communication. I agree that this is a particular case. But this case may happen in other parts of the north boundary, in particular in the Canadian Archipelago where there are kinds of 'inland fingers' of water, and it is likely to happen more frequently at higher resolution with more processors. So this case must be addressed.

In an attempt to fix this, at the very beginning of the mpp_lbc_nfd routine, I just return if isendto(1) = 0 (incidentally, only isendto(1) is used in this routine).

This fixes the problem of nimppt(isendto(1)), but then I run into another problem when, for instance,

 iju= jpiglo - ji - nimpp - nimppt(isendto(1)) + 3  

becomes 0 and causes a fault in

pt2dl(ji,ijpj) = psgn * pt2dr(iju,ijpjm1-1) 

At this point I thought that using nbondi instead of a test on narea would fix the problem (as described in the ticket), but unfortunately not. Going further, I tracked the points that gave iju=0 and found it odd that they are not even the last points of the i-loop.
For instance, it happens for two domains of the northern row:

domain 144

nimpp = 1189
isendto(1)= 133
nimppt(133) = 171
at

ji = 85
endloop = 86
nlci = 86
nbondi = 0

jpiglo=1442

domain 138

nimpp = 596
isendto(1)= 139
nimppt(139) = 766
at

ji = 83
endloop = 86
nlci = 87
nbondi = 1

jpiglo=1442
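
The two cases above can be checked by hand against the iju formula. The following Python sketch simply evaluates that expression with the listed values; the function and variable names are hypothetical, and it is an arithmetic check, not NEMO code.

```python
def iju_index(jpiglo: int, ji: int, nimpp: int, nimppt_send: int) -> int:
    """Evaluate iju = jpiglo - ji - nimpp - nimppt(isendto(1)) + 3."""
    return jpiglo - ji - nimpp - nimppt_send + 3


def last_valid_ji(jpiglo: int, nimpp: int, nimppt_send: int) -> int:
    """Largest ji for which iju >= 1, i.e. still a valid 1-based index."""
    return jpiglo - nimpp - nimppt_send + 2


# Values reported above for the two failing domains.
cases = {
    144: dict(jpiglo=1442, ji=85, nimpp=1189, nimppt_send=171),
    138: dict(jpiglo=1442, ji=83, nimpp=596, nimppt_send=766),
}

iju_values = {dom: iju_index(**c) for dom, c in cases.items()}
ji_limits = {
    dom: last_valid_ji(c["jpiglo"], c["nimpp"], c["nimppt_send"])
    for dom, c in cases.items()
}

for dom in cases:
    print(f"domain {dom}: iju = {iju_values[dom]}, last valid ji = {ji_limits[dom]}")
```

With these inputs iju evaluates to 0 in both domains, and the largest ji that still gives iju >= 1 lies below the reported endloop of 86, which matches the observation that the fault appears before the end of the i-loop.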

So, here I am! I do not know whether these comments are useful for tackling the problem, but in any case I can easily test any modification or give you more relevant information.

comment:10 Changed 6 years ago by epico

  • Resolution set to fixed
  • Status changed from reopened to closed

The bug has been fixed and committed to the trunk at revision 4671.
