#1057 closed Bug (fixed)
Bug in mppini_2.h90 which can result in communication deadlock with some partitioning (mainly evident at high processor counts)
Reported by: | acc | Owned by: | acc |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | OCE | Version: | v3.4 |
Severity: | | Keywords: | |
Cc: | | | |
Description
There appears to be a small error in mppini_2.h90 which results in the wrong northern neighbour being identified for the northernmost row of processors. This is a slightly redundant calculation anyway because the north-fold communications are dealt with separately and do not rely on the identified northern neighbour (nono). However, the northern neighbour is used to set the nbondj value, which determines whether a region communicates just to the north, both north and south, just to the south, or neither way. At very high processor counts it is possible to end up with regions on the jpnj-1 row which send to the north but whose northern neighbour has been assigned an nbondj value of 2 (neither way). This results in deadlock at the first lbc_lnk call (usually in iom_get called by hgr_read), with the jpnj-1 row processor waiting for a message that is never sent.
The error (TBC) appears to be in this block of code:
    ipolj(ii,ij) = 0
    IF( jperio == 3 .OR. jperio == 4 ) THEN
       ijm1 = jpni*(jpnj-1)
       imil = ijm1+(jpni+1)/2
       IF( jarea > ijm1 ) ipolj(ii,ij) = 3
       IF( MOD(jpni,2) == 1 .AND. jarea == imil ) ipolj(ii,ij) = 4
       IF( ipolj(ii,ij) == 3 ) iono(ii,ij) = jpni*jpnj-jarea+ijm1
    ENDIF
which applies a north-fold condition to identify the northern neighbour. I believe the error is that the iono values should be MPI process numbers, not the narea values as calculated. The iono array is referenced later during the elimination of land-only regions:
    DO jarea = 1, jpni*jpnj
       iproc = jarea-1
       ii = 1 + MOD(jarea-1,jpni)
       ij = 1 + (jarea-1)/jpni
       IF( ipproc(ii,ij) == -1 .AND. iono(ii,ij) >= 0   &
          .AND. iono(ii,ij) <= jpni*jpnj-1 ) THEN
          iino = 1 + MOD(iono(ii,ij),jpni)
          ijno = 1 + (iono(ii,ij))/jpni
          IF( ibondj(iino,ijno) == 1 ) ibondj(iino,ijno)=2
          IF( ibondj(iino,ijno) == 0 ) ibondj(iino,ijno) = -1
       ENDIF
and the mis-identification can lead to the problem described. The occurrence is rare (e.g. 1 process out of 9014 resulting from a 110x120 partitioning of ORCA_R12) but catastrophic and difficult to trace.
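Two numbering conventions are in play here, and they differ by one: jarea counts areas from 1, while the land-elimination loop decodes iono as a zero-based index (compare iproc = jarea-1). The following is a minimal Python sketch of both decodes, not NEMO code. The helper names and the 4x3 decomposition are hypothetical and chosen only for illustration.

```python
def area_to_grid(jarea, jpni):
    """1-based area number -> 1-based (ii, ij), as in the loop header."""
    return 1 + (jarea - 1) % jpni, 1 + (jarea - 1) // jpni


def index_to_grid(iono, jpni):
    """Zero-based neighbour index -> 1-based (iino, ijno), as in the IF block."""
    return 1 + iono % jpni, 1 + iono // jpni


# Hypothetical 4x3 decomposition: areas 1..12, zero-based indices 0..11.
jpni, jpnj = 4, 3
for jarea in (1, 4, 5, 12):
    as_area = area_to_grid(jarea, jpni)
    as_index = index_to_grid(jarea - 1, jpni)      # same area, zero-based
    misread = index_to_grid(jarea, jpni)           # area number read as an index
    print(jarea, as_area, as_index, misread)
```

In this sketch an area number and its zero-based index decode to the same grid position, whereas an area number fed directly into the zero-based decode lands one column further on (and, for the last area, one row beyond the grid). Which convention iono actually holds is therefore the crux of the diagnosis.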
Fortunately, if this diagnosis is correct, the solution is trivial: simply replace
IF( ipolj(ii,ij) == 3 ) iono(ii,ij) = jpni*jpnj-jarea+ijm1
with
IF( ipolj(ii,ij) == 3 ) iono(ii,ij) = jpni*jpnj-jarea+ijm1 - 1
Tests of this hypothesis are currently queued.
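The arithmetic behind this hypothesis can be checked by hand on a small decomposition. The sketch below is a Python transliteration, not NEMO code: it evaluates the north-fold expression with and without the extra -1, then decodes the result the same way the land-elimination loop decodes iono. It assumes iono is meant to be a zero-based index, models only the ipolj == 3 case, and uses an even jpni so that the ipolj == 4 pivot case does not arise. The function names and the offset parameter are hypothetical.

```python
def rank_to_grid(rank, jpni):
    """Decode a zero-based area index into 1-based (column, row)."""
    return 1 + rank % jpni, 1 + rank // jpni


def fold_neighbours(jpni, jpnj, offset=0):
    """Map each top-row column ii to its decoded north-fold neighbour.

    The neighbour index is jpni*jpnj - jarea + ijm1 (+ offset), where
    offset=0 is the original expression and offset=-1 the proposed one.
    """
    ijm1 = jpni * (jpnj - 1)
    result = {}
    for jarea in range(ijm1 + 1, jpni * jpnj + 1):   # areas on the top row
        ii = jarea - ijm1                            # column of this area
        iono = jpni * jpnj - jarea + ijm1 + offset
        result[ii] = rank_to_grid(iono, jpni)
    return result


print(fold_neighbours(4, 3))             # original expression
print(fold_neighbours(4, 3, offset=-1))  # with the proposed -1
```

On this 4x3 example the unmodified expression pairs column ii with column jpni-ii+1 on the same (top) row. The -1 variant shifts every neighbour by one column and sends column 4 to row 2, which appears consistent with the reassessment recorded in comment:3.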
Commit History (2)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
3819 | acc | 2013-02-21T11:31:10+01:00 | Branch dev_v3_4_STABLE_2012. #1057. Correct mppini_2.h90 logic concerning the northern neighbour across the north-fold |
3818 | acc | 2013-02-21T11:30:14+01:00 | Branch dev_MERGE_2012. #1057. Correct mppini_2.h90 logic concerning the northern neighbour across the north-fold |
Change History (3)
comment:1 Changed 11 years ago by acc
The proposed solution has partially resolved the issue but a different pair of north-fold neighbours is now dead-locking. I think there is also a fault in the logic of the second code block, namely the IF block that adjusts ibondj for the northern neighbour of a land-only area.
I read this as "If you are a land-only area with an active north-neighbour, then disable the southward communication on the north-neighbour". However, on the jpnj row the north-neighbour communicates via its north interface across the north-fold. For this row, it is the northward communication that needs to be disabled. Tests are currently queued to try this solution, where idir is a local integer.
comment:2 Changed 11 years ago by acc
- Resolution set to fixed
- Status changed from new to closed
Successfully tested with various processor counts up to 12,000 cores. No recurrence of the dead-locking problem. Changes submitted to both dev_MERGE_2012 and dev_v3_4_STABLE_2012 branches. Closing ticket.
comment:3 Changed 10 years ago by acc
The problem of deadlocking has been revisited at version 3.6 (#1324). It looks as if the original conclusion regarding the iono values at the northern row was incorrect. The -1 offsets are not required since the calculation was already returning the correct MPI rank values.
The second part of this solution regarding the ibondj settings is probably sufficient, in itself, to resolve the issue and this is the preliminary conclusion of #1324.
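The ibondj adjustment discussed in this ticket can be sketched in a few lines. The Python below is a hedged illustration, not the committed Fortran. The bond codes (-1 north only, 0 both, 1 south only, 2 neither) are inferred from the description and from the two IF tests in the land-elimination loop. The ticket names idir but does not show how it is used, so treating it as a sign that selects the direction to disable is an assumption, as are both function names.

```python
# Bond codes inferred from the ticket text (an assumption, not read from NEMO).
NORTH_ONLY, BOTH, SOUTH_ONLY, NEITHER = -1, 0, 1, 2


def disable_link(bond, idir):
    """Return the bond code after switching off one direction.

    idir = +1 switches off southward communication, matching the two IF
    tests quoted in the ticket; idir = -1 switches off northward
    communication. Using idir as a sign is hypothetical.
    """
    if bond == idir:      # only the direction being disabled was active
        return NEITHER
    if bond == BOTH:      # keep the opposite direction
        return -idir
    return bond           # that direction was already off


def neighbour_idir(ij, ijno, jpnj):
    """Hypothetical rule: across the north-fold both areas sit on row jpnj,
    so the neighbour reaches this area through its northern interface."""
    return -1 if (ij == jpnj and ijno == jpnj) else 1


# A land-only area on row 2 of 3 with an active neighbour on row 3:
print(disable_link(SOUTH_ONLY, neighbour_idir(2, 3, 3)))
# A land-only area on the top row whose fold neighbour is also on the top row:
print(disable_link(SOUTH_ONLY, neighbour_idir(3, 3, 3)))
```

Under these assumptions the interior case reproduces the original behaviour (the neighbour's southward link is switched off), while across the north-fold the neighbour's southward link is left untouched, which is the change in behaviour that comment:1 argues for.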