Opened 8 years ago

Closed 7 years ago

#90 closed defect (fixed)

MPI deadlock in XIOS

Reported by: mcastril Owned by: ymipsl
Priority: major Component: XIOS
Version: 1.0 Keywords: XIOS impi deadlock


We are experiencing a recurring issue with XIOS 1.0. It appeared when using NEMO 3.6 stable with more than 2600 cores, and it seemed to be solved by switching to the Intel 16 compiler and IMPI 5. However, after updating to the current NEMO 3.6 stable, the problem reappears with 1920 or more cores. I don't really see how the NEMO revision change could affect this, but there it is.

The problem is just in this line of client.cpp:

MPI_Send(buff,buffer.count(),MPI_CHAR,serverLeader,1,CXios::globalComm) ;

Meanwhile, server.cpp is calling MPI_Iprobe continuously in order to receive all the MPI_Send messages.

What we have observed is that with a high number of cores, around 80-100 of them get stuck at the MPI_Send, causing the run to hang and never complete. The fact that with a certain number of cores the issue appears about 80% of the time, but not always, made us think it could be related to the IMPI implementation.

Attachments (1)

log.out (382.0 KB) - added by mcastril 8 years ago.


Change History (5)

comment:1 Changed 8 years ago by ymipsl

Hi Miguel,

Sorry for my late reply, I was on holiday last week.
Thank you for reporting this. It is a serious problem, and it is the first time in a long while that such a problem has been reported in the XIOS communication protocol. On our computers (which generally do not use the Intel MPI library) we cannot reproduce it, even at large core counts (more than 10,000). The fact that the problem occurs with more than about 1000 cores, and on a computer we cannot access, means it will probably be hard to solve. To succeed we will probably need your help running tests on the BSC computer, so I need to know whether you agree to spend a significant part of your time on this issue.

So we must determine whether the problem comes from a failure in the XIOS transfer protocol or from the MPI library.
Could you tell me more about the problem: which NEMO configuration, how many client and server cores, and whether the problem occurs at startup or long after the beginning? Do you have a workaround (changing the number of cores, the number of servers, the memory)?
If you set the info level parameter to 100, could you send me the whole XIOS output (xios_client_*.out / xios_server_*.out)?



Changed 8 years ago by mcastril

comment:2 Changed 8 years ago by mcastril

Don't worry Yann,

I suspected that the problem was related to the IMPI implementation, and now that you say you are not experiencing it, I am almost sure. One fact that points that way is that the problem does not always appear: for example, we can have problems running experiments on 1600 to 3000 cores, and then have successful runs using 4096 or 8192 processes.

I have done a lot of debugging, including extra messages, and I can confirm that the problem is in the XIOS initialization, when the clients send their messages to the server master, which is waiting for them to get the context information.

I think the problem is related to the fact that this MPI_Send is issued at the same time by thousands of processes to a single one, which is doing MPI_Iprobe to retrieve the messages. Putting debug messages before and after the send, and on the server side in the event loop, we can see that some clients get stuck in the MPI_Send while the server keeps iterating, doing MPI_Iprobe and finding no more messages.

We are using ORCA025, with 1024 to 10000 NEMO processes and 4 to 16 XIOS processes. But I think that lowering or increasing the number of XIOS processes makes no difference, because the problem appears when the messages are sent to the XIOS leader during initialization.

Our nodes have 16 cores and 32 GB of memory each. I have seen experiments working using 7168 NEMO processes, and having

As I suspected that the problem came from resource contention, I tried a workaround to test that idea: putting a sleep(rank%16) before the MPI_Send, and the problem disappears...

So it seems that it is related to IMPI having problems managing that number of simultaneous messages...

By putting these messages in the client part:

printf("Client:Rank:%d:HostName:%s:Register:MPI_Send:Before\n", rank, hostName);

error_code = MPI_Send(buff,buffer.count(),MPI_CHAR,serverLeader,1,CXios::globalComm) ;

printf("Client:Rank:%d:HostName:%s:Register:MPI_Send:After:error_code:%d\n", rank, hostName, error_code);

And these on the server side:

if (c%100000==0) printf("%d:Server:listenContext:MPI_Iprobe:Before:flag:%d\n", c, flag);

MPI_Iprobe(MPI_ANY_SOURCE,1,CXios::globalComm, &flag, &status) ;

if (c%100000==0 || flag) printf("%d:Server:listenContext:MPI_Iprobe:After:flag:%d:sender:%d\n", c, flag, status.MPI_SOURCE);

The result is the logfile that I attach. If you count the MPI_Send:Before and MPI_Send:After lines, you will see that some After's are missing while MPI_Iprobe is still running.


comment:3 Changed 8 years ago by mcastril

Fortunately, the problem seems to be fixed.

After performing several tests with a mockup reproducing the same communication pattern, and seeing that the problem did not appear with Open MPI, we found that using the DAPL UD protocol solved the issue. So the fix is to set: export I_MPI_DAPL_UD=on


comment:4 Changed 7 years ago by ymipsl

  • Resolution set to fixed
  • Status changed from new to closed

OK, thanks Miguel.
