Opened 9 years ago
Closed 8 years ago
#1588 closed Bug (fixed)
ICB fails on large core counts
Reported by: | timgraham | Owned by: | nemo
---|---|---|---
Priority: | low | Milestone: |
Component: | OCE | Version: | v3.6
Severity: | | Keywords: | ICB
Cc: | | |
Description
When the extended ORCA025 is run on large core counts (in this case 1216) it fails with the following errors in standard error:
Rank 1406 [Mon Aug 17 16:32:10 2015] [c3-0c0s8n0] Fatal error in PMPI_Isend: Invalid rank, error stack:
PMPI_Isend(158): MPI_Isend(buf=0x7ffffffee4c0, count=2, MPI_DOUBLE_PRECISION, dest=12774, tag=21, comm=0x84000005, request=0xab58f00) failed
PMPI_Isend(104): Invalid rank has value 12774 but must be nonnegative and less than 1440
Running on 1184 cores it runs ok. It may be significant that jpj=1208 in this model.
I've traced the error to the iceberg code and, looking at the iceberg.stat_1406 file, the destination processor is incorrect:
north fold destination procs      2*1435, 36*1434
north fold destination proclist   12775, 39*-1
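For what it is worth, the bad MPI destination and the corrupt proclist entry are consistent with each other if one assumes NEMO's usual 1-based processor numbering (narea = MPI rank + 1); that assumption is mine, not stated in the ticket. A minimal sketch:

PROGRAM invalid_rank_sketch
   ! Illustration only, assuming narea = MPI rank + 1; the numbers are taken
   ! from the error message and the iceberg.stat_1406 extract above.
   IMPLICIT NONE
   INTEGER, PARAMETER :: icomm_size = 1440       ! ranks in the communicator
   INTEGER, PARAMETER :: iproclist  = 12775      ! corrupt north fold proclist entry
   INTEGER :: idest
   idest = iproclist - 1                         ! 12774, the dest passed to MPI_Isend
   IF( idest < 0 .OR. idest >= icomm_size )   WRITE(*,*) 'invalid rank: ', idest
END PROGRAM invalid_rank_sketch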
Commit History (2)
Changeset | Author | Time | ChangeLog
---|---|---|---
6814 | mathiot | 2016-07-21T11:04:56+02:00 | correction related to ticket #1588 (ICB fails on large core counts) (trunk)
6812 | mathiot | 2016-07-20T20:58:26+02:00 |
Change History (6)
comment:1 Changed 9 years ago by timgraham
I was mistaken in my report above. The failure occurs on 1440 cores (jpni=40, jpnj=36).
This is a result of the output from the lbc_lnk calls: later in the subroutine the code uses values from src_calving_hflx(:,jpj/2). In this configuration jpj=38, but rows 19:38 (jpj/2=19) are all zero after the call to lbc_lnk. Later in the code this leads to an out-of-bounds array index and most of the time causes the error seen above.
In configurations where jpj is odd, only rows jpj/2 + 1 to jpj are zero, and these all work.
Is this a bug in lbc_lnk for the North fold on T pivots or is this a mistake in the logic in ICB?
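The even/odd jpj behaviour can be checked with a few lines of arithmetic. A minimal sketch, using jpj = 38 from above and an illustrative odd value (jpj = 39, not from this configuration):

PROGRAM jpj_parity_check
   ! Illustration only: with an even jpj the sampled row jpj/2 lies inside the
   ! band of rows reported as zero after the north-fold exchange (jpj/2 .. jpj);
   ! with an odd jpj the zero band starts at jpj/2 + 1 and the sampled row is safe.
   IMPLICIT NONE
   INTEGER :: jpj, jrow, jzero_start
   DO jpj = 38, 39                        ! even case from the ticket, then an odd case
      jrow = jpj/2                        ! row sampled by the ICB set-up code
      IF( MOD(jpj,2) == 0 ) THEN
         jzero_start = jpj/2              ! reported zero rows: jpj/2 .. jpj
      ELSE
         jzero_start = jpj/2 + 1          ! reported zero rows: jpj/2+1 .. jpj
      ENDIF
      WRITE(*,'(A,I3,A,I3,A,L1)') 'jpj =', jpj, '   sampled row =', jrow,   &
         &                        '   sampled row zero? ', ( jrow >= jzero_start )
   END DO
END PROGRAM jpj_parity_check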
comment:2 Changed 9 years ago by mocavero
Does the error occur when you activate the ln_nnogather namelist key? This is just to understand whether the bug has been introduced with the north fold optimization.
comment:3 Changed 9 years ago by timgraham
I've just tried running with ln_nnogather=True and the model did run successfully, but I think this may just be chance (it occasionally ran successfully before). I say this because row jpj/2 of the src_calving_hflx array is still all zeros, leading to an out-of-bounds index in the loop below, as both nicbdj and nicbej are -1:
DO ji = nicbdi, nicbei
   WRITE(numicb,*) 'ji=',ji
   ii = nicbflddest(ji)
   IF( ii .GT. 0 ) THEN        ! Needed because land suppression can mean
                               ! that unused points are not set in edge haloes
      DO jn = 1, jpni
         ! work along array until we find an empty slot
         WRITE(numicb,*) 'jn=',jn,'ii=',ii
         IF( nicbfldproc(jn) == -1 ) THEN
            nicbfldproc(jn) = ii
            EXIT                !!gm EXIT should be avoided: use DO WHILE expression instead
         ENDIF
         ! before we find an empty slot, we may find processor number is already here so we exit
         IF( nicbfldproc(jn) == ii ) EXIT
         WRITE(numicb,*) nicbfldproc(jn)
      END DO
   ENDIF
END DO
By chance it didn't crash the model this time but on another occasion it may do (I should probably recompile with array bounds checking).
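As an aside, a hypothetical guard of the following kind (not present in the code; the names nicbdi/nicbei/nicbdj/nicbej, numicb and ctl_stop are existing NEMO identifiers) could trap the unset indices during initialisation rather than leaving the out-of-bounds access to appear later. This is only a sketch:

! Hypothetical sanity check, shown only to illustrate the failure mode:
! if the centre-line scans never found the interior bounds they are still -1,
! so stop cleanly instead of indexing arrays with them further on.
IF( nicbdi == -1 .OR. nicbei == -1 .OR. nicbdj == -1 .OR. nicbej == -1 ) THEN
   WRITE(numicb,*) 'icb_init: interior bounds not found: ',   &
      &            nicbdi, nicbei, nicbdj, nicbej
   CALL ctl_stop( 'icb_init: nicbdi/nicbei/nicbdj/nicbej not set' )
ENDIF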
comment:4 Changed 9 years ago by mathiot
In eORCA12 with 3435 cores, NEMO gets stuck in the WHILE loop in icbclv.F90. One of the reasons is that berg_grid%stored_ice(-1,jj,jn) is HUGE (out of bounds due to the index -1). The -1 comes from an error in icbini.F90 when NEMO tries to find the halo indices for ICB (nicbdi/j and nicbei/j) in specific configurations. It is not the same error message as Tim's, but I suspect it is the same bug (both nicbdj and nicbej = -1).
In my configuration, all cores along the East boundary of the eORCA12 grid get stuck. For example, domain 1299 has a halo size of 35 on the right side (more than half the domain):
ncdump -h mesh_mask_1299.nc | grep DOMAIN
 :DOMAIN_number_total = 3435 ;
 :DOMAIN_number = 1299 ;
 :DOMAIN_dimensions_ids = 1, 2 ;
 :DOMAIN_size_global = 4322, 3606 ;
 :DOMAIN_size_local = 67, 45 ;
 :DOMAIN_position_first = 4291, 1506 ;
 :DOMAIN_position_last = 4357, 1550 ;
 :DOMAIN_halo_size_start = 1, 1 ;
 :DOMAIN_halo_size_end = 35, 1 ;
 :DOMAIN_type = "BOX" ;
So the test at jpi/2 falls on the halo instead of in the middle of the "active" domain. Consequently, the loops and tests used to find the halo indices for the iceberg model fail, and nicbdj and nicbej are set to -1:
cat icebergs.stat_1299
 processor        1300
 jpi, jpj         67, 45
 nldi, nlei       2, 32
 nldj, nlej       2, 44
 berg i interior  2, 31
 berg j interior  2*-1
 berg left        4291.
 berg right       2.
 central j line:
   i processor    1299, 30*1300, 1235, 35*0
   i point        15274291, 15274292, ... , 15274321, 15270002, 35*0
 central i line:
   j processor    45*0
   j point        45*0
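To make the arithmetic explicit, a minimal standalone sketch (not NEMO code) using only the numbers from the two dumps above; nlci is assumed here to be nlei + 1, i.e. the interior plus a one-point halo:

PROGRAM centre_line_check
   ! Numbers taken from mesh_mask_1299 / icebergs.stat_1299 above; the nlci
   ! value is an assumption (nlei + 1) made only for this illustration.
   IMPLICIT NONE
   INTEGER, PARAMETER :: jpi = 67, nldi = 2, nlei = 32
   INTEGER, PARAMETER :: nlci = nlei + 1
   WRITE(*,'(A,I3,A,L1)') ' jpi/2  = ', jpi/2 , '   interior? ', ( jpi/2  >= nldi .AND. jpi/2  <= nlei )
   WRITE(*,'(A,I3,A,L1)') ' nlci/2 = ', nlci/2, '   interior? ', ( nlci/2 >= nldi .AND. nlci/2 <= nlei )
END PROGRAM centre_line_check

jpi/2 = 33 is outside the interior (nlei = 32) and falls in the 35-point dead halo on the right, while nlci/2 = 16 is a genuine interior column.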
So I suggest testing the row at nlcj/2 instead of jpj/2 and the column at nlci/2 instead of jpi/2 in icbini.F90. With this change, the model seems to run fine and the new iceberg stat is:
cat icebergs.stat_1299
 processor        1300
 jpi, jpj         67, 45
 nldi, nlei       2, 32
 nldj, nlej       2, 44
 berg i interior  2, 31
 berg j interior  2, 44
 berg left        4291.
 berg right       2.
 central j line:
   i processor    1299, 30*1300, 1235, 35*0
   i point        15274291, 15274292, ... , 15274321, 15270002, 35*0
 central i line:
   j processor    1234, 43*1300, 1365
   j point        15064306, 15074306, ... , 15494306, 15504306
Now the berg interior indexes and the i/j processor list are correct.
Can someone confirm that this fix does not break anything in the ICB code?
I think this bug also affects the trunk.
comment:5 Changed 8 years ago by acc
Yes, this fix would seem to be necessary in some extreme cases with a large number of unused ghost rows or columns. The ICB module has its own boundary exchange mechanism because it exchanges linked lists rather than 2- or 3-dimensional arrays. In the set-up for this, in icbini.F90, the logic assumes that checking rank ids along the jpi/2 and jpj/2 centre lines is sufficient to identify interior points, and this is ultimately used to identify communicating neighbours. This fails in cases where the unused ghost rows or columns occupy more than half of the processor domain. Using nlci/2 and nlcj/2 everywhere as replacements for jpi/2 and jpj/2 respectively (in icb_init) should rectify this, since the active region is always left and bottom justified within the jpi x jpj area. The change should have no detrimental effect on working configurations.
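To illustrate the left/bottom justification point, here is a self-contained sketch (not the actual icb_init code) that scans a toy processor-number array shaped like domain 1299 from comment:4; nlci = nlei + 1 is an assumption and iproc is a stand-in for whatever array the real code inspects:

PROGRAM halo_scan_sketch
   ! Schematic sketch, not the actual icbini.F90 code. The interior of the toy
   ! array holds this processor's number; the 35 unused halo columns on the
   ! right hold 0, as in the icebergs.stat_1299 dump above.
   IMPLICIT NONE
   INTEGER, PARAMETER :: jpi = 67, jpj = 45, nldi = 2, nlei = 32, nldj = 2, nlej = 44
   INTEGER, PARAMETER :: nlci = nlei + 1         ! assumed: interior plus a 1-point halo
   INTEGER, PARAMETER :: narea = 1300
   INTEGER :: iproc(jpi,jpj), icol, jj, idj, iej, jt
   iproc = 0
   iproc(nldi:nlei, nldj:nlej) = narea           ! points owned by this processor
   DO jt = 1, 2
      icol = MERGE( jpi/2, nlci/2, jt == 1 )     ! current choice, then proposed choice
      idj = -1   ;   iej = -1
      DO jj = 1, jpj                             ! the centre-line scan for the j bounds
         IF( iproc(icol,jj) == narea ) THEN
            IF( idj == -1 )   idj = jj           ! first row owned on this column
            iej = jj                             ! last row owned on this column
         ENDIF
      END DO
      WRITE(*,'(A,I3,A,2I4)') ' test column ', icol, '  ->  berg j interior ', idj, iej
   END DO
END PROGRAM halo_scan_sketch

Scanning column jpi/2 = 33 leaves both bounds at -1 (the "2*-1" seen in the first stat), while column nlci/2 = 16 returns 2 and 44, matching the corrected stat. Because the active region is left and bottom justified, nlci/2 and nlcj/2 always land inside it for any reasonable domain size.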
comment:6 Changed 8 years ago by mathiot
- Resolution set to fixed
- Status changed from new to closed
Correction submitted in revisions 6812 (3.6_STABLE) and 6814 (trunk).