Opened 6 years ago
Closed 6 years ago
#2213 closed Defect (fixed)
Model freeze when using BDY in some particular cases
Reported by: | molines | Owned by: | systeam |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | BDY | Version: | trunk |
Severity: | minor | Keywords: | bdy, bdyini mpp_lnk_bdy_xxx |
Cc: | smasson@… |
Description
Context
When using a configuration with open boundaries, the model freezes at the first time step under certain circumstances. This behaviour is sensitive to the domain decomposition.
Analysis
After some tedious debugging in bdyini.F90 (in the routine bdy_segs, which is very busy, by the way!), I found that a deadlock appears when the end of a boundary segment lies exactly on the first point of a halo between 2 adjacent processors (for instance, in the I direction [W and E], if the ending point is at nlci-1 (W), corresponding to 1 (E)). In this particular case, an east communication is triggered without the corresponding west communication.
The critical piece of code is (in bdyini.F90):
940 ! check if point has to be sent
941 ii = idx_bdy(ib_bdy)%nbi(icount,igrd)
942 ij = idx_bdy(ib_bdy)%nbj(icount,igrd)
943 if((com_east .ne. 1) .and. (ii == (nlci-1)) .and. (nbondi .le. 0)) then
944    com_east = 1
945 elseif((com_west .ne. 1) .and. (ii == 2) .and. (nbondi .ge. 0) .and. (nbondi .ne. 2)) then
946    com_west = 1
947 endif
948 if((com_south .ne. 1) .and. (ij == 2) .and. (nbondj .ge. 0) .and. (nbondj .ne. 2)) then
949    com_south = 1
950 elseif((com_north .ne. 1) .and. (ij == (nlcj-1)) .and. (nbondj .le. 0)) then
951    com_north = 1
952 endif
Suppose your ending point is at ii=nlci-1 (W), hence ii=1 (E). Lines 943-944 will set com_east=1 (for W), while lines 945-946 will leave com_west=0 (for E), because ii=1 does not satisfy the ii == 2 test.
The problem will certainly be the same in the N-S direction (lines 948-949 and lines 950-951).
The problem was fixed successfully (in my case of a structured BDY) by changing line 943 to:
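To make the W/E mismatch concrete, here is a small Python model of the I-direction test at lines 943-947. It is only a sketch, not NEMO code: the helper names (comm_flags_i, seen_from_east), the value nlci = 10, the choice nbondi = -1 for the western subdomain and +1 for the eastern one, and the index shift between neighbours are assumptions chosen so that nlci-1 (W) corresponds to 1 (E), as described above.

```python
# Toy model of the I-direction test at lines 943-947 of bdyini.F90.
# Assumptions (not taken from the NEMO sources): nbondi = -1 on the western
# subdomain, +1 on the eastern one, and local index nlci-1 (W) == 1 (E).

def comm_flags_i(bdy_ii, nlci, nbondi):
    """Return (com_east, com_west) for a list of local i-indices."""
    com_east = 0
    com_west = 0
    for ii in bdy_ii:
        if com_east != 1 and ii == nlci - 1 and nbondi <= 0:
            com_east = 1
        elif com_west != 1 and ii == 2 and nbondi >= 0 and nbondi != 2:
            com_west = 1
    return com_east, com_west


def seen_from_east(bdy_ii_west, nlci):
    """Local i-indices of the same points in the eastern neighbour."""
    shift = nlci - 2  # maps nlci-1 (W) to 1 (E) and nlci (W) to 2 (E)
    return [ii - shift for ii in bdy_ii_west if ii - shift >= 1]


nlci = 10
segment_west = list(range(5, nlci))            # segment ends exactly at nlci-1
segment_east = seen_from_east(segment_west, nlci)

com_east_w, _ = comm_flags_i(segment_west, nlci, nbondi=-1)
_, com_west_e = comm_flags_i(segment_east, nlci, nbondi=1)

# W wants to talk to its east neighbour, E does not expect anything from the west.
print(segment_east, com_east_w, com_west_e)
```

In this model the western subdomain ends up with com_east = 1 while the eastern one keeps com_west = 0, which is the unmatched communication described in the analysis.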
943 if((com_east .ne. 1) .and. (ii == (nlci)) .and. (nbondi .le. 0)) then
Of course, the corresponding tests for com_west_b and com_east_b must be adapted accordingly (for instance):
962 if((com_west_b .ne. 1) .and. (ii == (nlcit(nowe+1)))) then
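As a sanity check of the modified test on line 943, the following Python sketch sweeps the position of the segment end and compares the flag set on the western subdomain with the one set on its eastern neighbour. It is a simplified model under stated assumptions, not the NEMO implementation: the names flags_i and consistent, the value nlci = 10, the nbondi values (-1 for W, +1 for E) and the index shift between neighbours are all hypothetical, and the com_west_b / com_east_b tests are not modelled.

```python
# Toy comparison of the original test (ii == nlci-1) and the modified one
# (ii == nlci) for the east communication flag.
# Assumptions (not from the NEMO sources): nbondi = -1 on W, +1 on E, and
# local index nlci-1 (W) corresponds to 1 (E).

def flags_i(bdy_ii, nbondi, east_col):
    """Return (com_east, com_west); east_col is the column tested for com_east."""
    com_east = 0
    com_west = 0
    for ii in bdy_ii:
        if com_east != 1 and ii == east_col and nbondi <= 0:
            com_east = 1
        elif com_west != 1 and ii == 2 and nbondi >= 0 and nbondi != 2:
            com_west = 1
    return com_east, com_west


def consistent(seg_end, nlci, east_col):
    """True if W's com_east matches E's com_west for a segment ending at seg_end."""
    west = list(range(3, seg_end + 1))
    shift = nlci - 2
    east = [ii - shift for ii in west if ii - shift >= 1]
    com_east_w, _ = flags_i(west, -1, east_col)
    _, com_west_e = flags_i(east, 1, east_col)
    return com_east_w == com_west_e


nlci = 10
ends = range(3, nlci + 1)
original = {end: consistent(end, nlci, nlci - 1) for end in ends}
modified = {end: consistent(end, nlci, nlci) for end in ends}

print("original mismatches:", [e for e, ok in original.items() if not ok])
print("modified mismatches:", [e for e, ok in modified.items() if not ok])
```

In this model the original test is inconsistent only when the segment ends at nlci-1, and the modified test gives matching flags for every end position that was tried.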
This problem was not present in 3.6, although bdy_ini was the same. This is because the change is in lib_mpp.F90 (or now mpp_bdy_generic.h90). In 3.6, send-rcv messages were triggered by both (e.g.) nbondi_bdy AND nbondi_bdy_b. Now, send-rcv are triggered by nbondi_bdy ONLY, and nbondi_bdy_b is used just for putting the exchanged values at the right place.
Recommendation
The bug was activated by the 'simplification' in ROUTINE_BDY, but I think that the fix must be in bdyini. On the other hand, the simplification was done to avoid a communication, and I personally have doubts about the impact of such a change on performance.
Commit History (3)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
10630 | smasson | 2019-02-04T17:09:57+01:00 | v4.0: bugfix in mpp for bdy, back to v3.6, see #2213, #2224, #2225 |
10629 | smasson | 2019-02-04T17:07:39+01:00 | trunk: bugfix in mpp for bdy, back to v3.6, see #2213, #2224, #2225 |
10537 | smasson | 2019-01-16T21:41:21+01:00 | trunk: bugfix in bdyini, see #2213 |
Change History (7)
comment:1 Changed 6 years ago by smasson
comment:2 Changed 6 years ago by smasson
Jean-Marc, following your recommendations, I made some corrections in [10537]. Could you tell me if it works, so that I can close the ticket?
comment:3 Changed 6 years ago by smasson
- Cc smasson@… added
comment:4 Changed 6 years ago by smasson
In 10629:
comment:5 Changed 6 years ago by smasson
In 10630:
comment:6 Changed 6 years ago by smasson
see discussion in #2224
I close the ticket
comment:7 Changed 6 years ago by smasson
- Resolution set to fixed
- Status changed from new to closed
In 10537: