#1588 (ICB fails on large core counts) – NEMO

Opened 9 years ago

Closed 8 years ago

#1588 closed Bug (fixed)

ICB fails on large core counts

Reported by: timgraham
Owned by: nemo
Priority: low
Milestone:
Component: OCE
Version: v3.6
Severity:
Keywords: ICB
Cc:

Description

When the extended ORCA025 is run on large core counts (in this case 1216) it fails with the following errors in standard error:

Rank 1406 [Mon Aug 17 16:32:10 2015] [c3-0c0s8n0] Fatal error in PMPI_Isend: Invalid rank, error stack:
PMPI_Isend(158): MPI_Isend(buf=0x7ffffffee4c0, count=2, MPI_DOUBLE_PRECISION, dest=12774, tag=21, comm=0x84000005, request=0xab58f00) failed
PMPI_Isend(104): Invalid rank has value 12774 but must be nonnegative and less than 1440

On 1184 cores it runs OK. It may be significant that jpj=1208 in this model.

I've traced the error to the iceberg code. Looking at the iceberg.stat_1406 file, the destination processor is incorrect:

north fold destination procs  
2*1435,  36*1434
north fold destination proclist  
12775,  39*-1

Commit History (2)

Changeset  Author   Time                       Change log

6814       mathiot  2016-07-21T11:04:56+02:00
           correction related to ticket #1588 (ICB fails on large core counts) (trunk)

6812       mathiot  2016-07-20T20:58:26+02:00
           correction related to ticket #1588 (ICB fails on large core counts)

Change History (6)

comment:1 Changed 9 years ago by timgraham

I was mistaken in my report above. The failure occurs on 1440 cores (jpni=40, jpnj=36).

This is caused by the output of these lbc_lnk calls:

      CALL lbc_lnk( src_calving_hflx, 'T', 1._wp )
      CALL lbc_lnk( src_calving     , 'T', 1._wp )

Later in the subroutine the code uses values from src_calving_hflx(:,jpj/2). In this configuration jpj=38, but rows 19:38 (jpj/2=19) are all zero after the call to lbc_lnk. This then leads to an out-of-bounds array index, which most of the time causes the error seen above.

In configurations where jpj is odd, only rows jpj/2 + 1 to jpj are zero, and these configurations all work.
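
One way to see this directly (a hypothetical diagnostic, not something from the ticket; it assumes placement where src_calving_hflx, jpj, numicb and wp are in scope, e.g. in icb_init just after the exchanges) is to report which j-rows are left entirely at zero:

      DO jj = 1, jpj
         ! flag any j-row of src_calving_hflx left entirely at zero by lbc_lnk
         IF( MAXVAL( ABS( src_calving_hflx(:,jj) ) ) == 0._wp )   &
            &   WRITE(numicb,*) 'src_calving_hflx row ', jj, ' is all zero'
      END DO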

Is this a bug in lbc_lnk for the North fold on T pivots or is this a mistake in the logic in ICB?

comment:2 Changed 9 years ago by mocavero

Does the error occur when you activate the ln_nnogather namelist key? This is just to check whether the bug was introduced with the north fold optimization.

comment:3 Changed 9 years ago by timgraham

I've just tried running with ln_nnogather=True and the model did run successfully, but I think this may just be chance (it occasionally ran successfully before). I say this because row jpj/2 of the src_calving_hflx array still contains only zeros, leading to an out-of-bounds index in this loop, since both nicbdj and nicbej are -1:

         DO ji = nicbdi, nicbei
            WRITE(numicb,*) 'ji=',ji
            ii = nicbflddest(ji)
            IF( ii .GT. 0 ) THEN     ! Needed because land suppression can mean
                                     ! that unused points are not set in edge haloes
               DO jn = 1, jpni
                  ! work along array until we find an empty slot
                  WRITE(numicb,*) 'jn=',jn,'ii=',ii
                  IF( nicbfldproc(jn) == -1 ) THEN
                     nicbfldproc(jn) = ii
                     EXIT                             !!gm EXIT should be avoided: use DO WHILE expression instead
                  ENDIF
                  ! before we find an empty slot, we may find processor number is already here so we exit
                  IF( nicbfldproc(jn) == ii ) EXIT
                  WRITE(numicb,*) nicbfldproc(jn)
               END DO
            ENDIF
         END DO

By chance it didn't crash the model this time, but on another occasion it might (I should probably recompile with array bounds checking).

comment:4 Changed 8 years ago by mathiot

In eORCA12 with 3435 cores, NEMO gets stuck in the WHILE loop in icbclv.F90. One of the reasons is that the value of berg_grid%stored_ice(-1,jj,jn) is HUGE (out of bounds due to the index -1). The -1 comes from an error in icbini.F90 when NEMO tries to find the halo indices for ICB (nicbdi/j and nicbei/j) in specific configurations. It is not the same error message as Tim's, but I suspect it is the same bug (both nicbdj and nicbej are -1).
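
As an illustration only (not part of the ticket or of the eventual fix), a defensive check after the halo-index search in icb_init would turn this silent hang into an explicit failure; ctl_stop is NEMO's standard abort routine:

      ! hypothetical guard: abort if the iceberg halo indices were not found
      IF( nicbdi == -1 .OR. nicbei == -1 .OR. nicbdj == -1 .OR. nicbej == -1 )   &
         &   CALL ctl_stop( 'icb_init: iceberg halo indices not found' )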

In my configuration, all cores along the eastern boundary of the eORCA12 grid get stuck. For example, domain 1299 has a halo size of 35 on the right side (more than half the domain):

ncdump -h mesh_mask_1299.nc | grep DOMAIN
		:DOMAIN_number_total = 3435 ;
		:DOMAIN_number = 1299 ;
		:DOMAIN_dimensions_ids = 1, 2 ;
		:DOMAIN_size_global = 4322, 3606 ;
		:DOMAIN_size_local = 67, 45 ;
		:DOMAIN_position_first = 4291, 1506 ;
		:DOMAIN_position_last = 4357, 1550 ;
		:DOMAIN_halo_size_start = 1, 1 ;
		:DOMAIN_halo_size_end = 35, 1 ;
		:DOMAIN_type = "BOX" ;

So the test at jpi/2 falls in the halo instead of in the middle of the "active" domain. Consequently, the loops and tests that find the halo indices for the iceberg model fail, and nicbdj and nicbej end up set to -1:

cat icebergs.stat_1299
 processor  1300
 jpi, jpj    67,  45
 nldi, nlei  2,  32
 nldj, nlej  2,  44
 berg i interior  2,  31
 berg j interior  2*-1
 berg left        4291.
 berg right       2.
 central j line:
 i processor
 1299,  30*1300,  1235,  35*0
 i point
 15274291,  15274292,  ... ,  15274321,  15270002,  35*0
 central i line:
 j processor
 45*0
 j point
 45*0
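
The numbers above make the failure mode concrete: for this domain jpi is 67, so the old centre column jpi/2 = 33 lies beyond the last interior point nlei = 32 and samples only unused halo points, which is why the "central i line" lists are all zero. A minimal standalone sketch of that arithmetic (nlci = 33 is an assumption here: the used width, interior plus a one-point halo):

PROGRAM centre_line_check
   IMPLICIT NONE
   INTEGER :: jpi, nlei, nlci
   LOGICAL :: ll_old, ll_new
   jpi  = 67                 ! full local array width, including unused points
   nlei = 32                 ! last interior i-point of domain 1299
   nlci = 33                 ! assumed used width (interior + one halo point)
   ll_old = ( jpi /2 <= nlei )    ! old sampling column: jpi/2  = 33 -> in the halo
   ll_new = ( nlci/2 <= nlei )    ! new sampling column: nlci/2 = 16 -> in the interior
   PRINT *, 'jpi/2  inside active region: ', ll_old   ! prints F
   PRINT *, 'nlci/2 inside active region: ', ll_new   ! prints T
END PROGRAM centre_line_check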

So I suggest testing the row at nlcj/2 instead of jpj/2 and the column at nlci/2 instead of jpi/2 in icbini.F90. With this change the model seems to run fine, and the new iceberg stat is:

cat icebergs.stat_1299
 processor  1300
 jpi, jpj    67,  45
 nldi, nlei  2,  32
 nldj, nlej  2,  44
 berg i interior  2,  31
 berg j interior  2,  44
 berg left        4291.
 berg right       2.
 central j line:
 i processor
 1299,  30*1300,  1235,  35*0
 i point
 15274291,  15274292, ... ,  15274321,  15270002,  35*0
 central i line:
 j processor
 1234,  43*1300,  1365
 j point
 15064306,  15074306,  ... ,  15494306,  15504306

Now the berg interior indexes and the i/j processor list are correct.

Can someone confirm that this fix does not break anything in the ICB code?

I think this bug also affects the trunk.

comment:5 Changed 8 years ago by acc

Yes, this fix would seem to be necessary in some extreme cases with a large number of unused ghost rows or columns. The ICB module has its own boundary exchange mechanism because it exchanges linked lists rather than 2- or 3-dimensional arrays. In the set-up for this, in icbini.F90, the logic assumes that checking rank ids along the jpi/2 and jpj/2 centre-lines is sufficient to identify interior points; this is ultimately used to identify communicating neighbours. It fails when the unused ghost rows or columns occupy more than half of the processor domain. Using nlci/2 and nlcj/2 everywhere in place of jpi/2 and jpj/2 respectively (in icb_init) should rectify this, since the active region is always left and bottom justified within the jpi x jpj area. The change should have no detrimental effect for working configurations.
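
As a sketch only (the actual changes are in revisions 6812 and 6814; the local names iic and ijc here are hypothetical), the replacement amounts to taking the scan lines from the used local domain rather than from the full array, which is what guarantees they hit interior points:

      ! centre lines used in icb_init to locate the iceberg halo indices
      ! (schematic; not the actual diff)
      iic = nlci / 2      ! column scanned for the j-extent; was  iic = jpi / 2
      ijc = nlcj / 2      ! row    scanned for the i-extent; was  ijc = jpj / 2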

comment:6 Changed 8 years ago by mathiot

  • Resolution set to fixed
  • Status changed from new to closed

Correction submitted in revisions 6812 (3.6_STABLE) and 6814 (trunk).
