Timestamp:
06/27/18 15:18:29
Author:
yushan
Message:

report updated. Todo: add performance results

File:
1 edited

  • XIOS/dev/branch_openmp/Note/rapport ESIWACE.tex

    r1551 r1552  
    66\usepackage{amsmath} 
    77\usepackage{url} 
     8\usepackage{verbatim} 
    89 
    910% Title Page 
     
    2930writing on the server side can be completely overlapped by computations on the client side. 
    3031 
    31 \begin{figure}[h] 
     32\begin{figure}[ht] 
    3233\includegraphics[scale=0.4]{Charge1.png} 
    3334\includegraphics[scale=0.4]{Charge2.png} 
     
    5556This topic comes along with the configuration of the climate models. Taking LMDZ as an example, it is designed with a 2-level parallelization scheme. To be more specific, LMDZ uses the domain decomposition method in which each sub-domain is associated with one MPI process. Inside the sub-domain, the model also uses OpenMP directives to accelerate the computation. We can imagine that each sub-domain is divided into sub-sub-domains, each managed by a thread. 
    5657 
    57 \begin{figure}[h] 
     58\begin{figure}[ht] 
    5859\centering 
    5960\includegraphics[scale=0.5]{domain.pdf} 
     
    6364As we know, each sub-domain, or in other words each MPI process, is an XIOS client. The data exchange between clients and XIOS servers is handled by MPI communications. In order to write an output field, all threads must gather their data to the master thread, which acts as the MPI process in order to call the MPI routines. There are two disadvantages to this method: first, we have to spend time gathering information to the master thread, which not only increases the memory use but also implies an OpenMP barrier; second, while the master thread calls the MPI routine, the other threads stay idle, which is a waste of computing resources. What we want to obtain with the thread-friendly XIOS is that all threads can act like MPI processes: they can call the MPI routines directly, with no waste of memory or computing resources, as shown in Figure \ref{fig:omp}. 
    6465 
    65 \begin{figure}[h!] 
     66\begin{figure}[ht] 
    6667\centering 
    6768\includegraphics[scale=0.6]{omp.pdf} 
     
    7374 
    7475 
    75 In this project, we choose to implement the interface to handle the threads. To do so, we introduce the MPI\_endpoint which is a concept proposed in the last MPI Forum and several papers has already discussed the importance of such idea and have introduced the framework of the MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown by Figure \ref{fig:scheme}. Threads of an MPI process is associated with a unique rank (global endpoint rank) and an endpoint communicator. They also have a local rank (rank inside the MPI process) which is very similar to the \verb|OMP_thread_num| rank.  
    76  
    77 \begin{figure}[h!] 
     76In this project, we choose to implement an interface to handle the threads. To do so, we introduce the MPI\_endpoint, which is a 
     77concept proposed in the last MPI Forum; several papers have already discussed the importance of such an idea and have introduced the 
     78framework of the MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown in Figure \ref{fig:scheme}. In 
     79the MPI\_endpoint environment, each OpenMP thread will be associated with a unique rank (global endpoint rank), an endpoint communicator, 
     80and a local rank (rank inside the MPI process) which is very similar to the OpenMP thread number given by \verb|omp_get_thread_num()|. The global endpoint rank will replace the 
     81role of the classic MPI rank and will be used in MPI communication calls. 
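As a minimal illustration of this association (this is a sketch, not the XIOS implementation, and it assumes for simplicity that every MPI process spawns the same number of OpenMP threads), the global endpoint rank and the local rank of a thread can be derived as follows:
\begin{verbatim}
// Sketch: deriving the global endpoint rank and the local rank of each
// OpenMP thread, assuming a uniform number of threads per MPI process.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int mpi_rank, mpi_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    #pragma omp parallel
    {
        int local_rank  = omp_get_thread_num();   // rank inside the MPI process
        int num_threads = omp_get_num_threads();
        // The global endpoint rank plays the role of the classic MPI rank.
        int ep_rank = mpi_rank * num_threads + local_rank;
        int ep_size = mpi_size * num_threads;
        std::printf("MPI %d, thread %d -> endpoint %d of %d\n",
                    mpi_rank, local_rank, ep_rank, ep_size);
    }

    MPI_Finalize();
    return 0;
}
\end{verbatim}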
     82 
     83 
     84\begin{figure}[ht] 
    7885\begin{center} 
    7986\includegraphics[scale=0.4]{scheme.png}  
     
    8592%XIOS is a library dedicated to IO management of climate code. It has a client-server pattern in which clients are in charge of computations and servers manage the reading and writing of files. The communication between clients and servers are handled by MPI. However, some of the climate models (\textit{e.g.} LMDZ) nowadays use an hybrid programming policy. Within a shared memory node, OpenMP directives are used to manage message exchanges. In such configuration, XIOS can not take full advantages of the computing resources to maximize the performance. This is because XIOS can only work with MPI processes. Before each call of XIOS routines, threads of one MPI process must gather their information to the master thread who works as an MPI process. After the call, the master thread distributes the updated information among its slave threads. As result, all slave threads have to wait while the master thread calls the XIOS routines. This introduce extra synchronization into the model and leads to not optimized performance. Aware of this situation, we need to develop a new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as they were processes. To do so, we introduce the MPI endpoints.   
    8693 
    87  
    88 The MPI\_endpoints interface we implemented lies on top of an existing MPI Implementation. It consists of wrappers to all MPI functions used in XIOS.  
    89  
    90 will be re-implemented in order to cope with OpenMP threads. The idea is that, in the MPI endpoints environment, each OpenMP thread will be  
    91 associated with a unique rank and with an endpoint communicator. This rank (EP rank) will replace the role of the classic MPI rank and will  
    92 be used in MPI communications. In order to successfully execute an MPI communication, for example \verb|MPI_Send|, we know already which  
    93 endpoints to be the receiver but not sufficient. We also need to know which MPI process should be involved in such communication. To  
    94 identify the MPI rank, we added a ``map'' in the EP communicator in which the relation of all EP and MPI ranks can be easily obtained. 
    95  
    96  
    97 In XIOS, we used the ``probe'' technique to search for arrived messages and then performing the receive action. The principle is  
    98 that sender processes execute the send operations as usual. However, to minimize the time spent on waiting incoming messages, the receiver  
    99 processe performs in the first place the \verb|MPI_Probe| function to check if a message destinated to it has been published. If yes, the  
    100 process execute in the second place the \verb|MPI_Recv| to receive the message. In this situation, if we introduce the threads, problems  
    101 occur. The reason why the ``probe'' method is not suitable is that messages destinated to one certain process can be probed by any of  
    102 its threads. Thus the message can be received by the wrong thread which gives errors. 
    103  
    104 To solve this problem, we introduce the ``matching-probe'' technique. The idea of the method is that each process is equiped with a local  
    105 incoming message queue. All incoming message will be probed, sorted, and then stored in this queue according to their destination rank.  
    106 Every time we call an MPI function, we firstly call the \verb|MPI_Mprobe| function to get the handle to  
    107 the incoming message. Then, we identify the destination thread rank and store the message handle inside the local queue of the target  
    108 thread. After this, we perform the usual ``probe'' technique upon the local incoming message queue. In this way, we can assure the messages  
    109 to be received by the right thread. 
    110  
    111 Another issue remains in this technique: how to identify the receiver's rank? The solution is to use the tag argument. In the MPI  
    112 environment, a tag is an integer ranging from 0 to $2^{31}$. We can explore the large range of the tag to store in it information about the  
    113 source and destination thread ranks. We choose to limite the first 15 bits for the tag used in the classic MPI communication, the next 8  
    114 bits to the sender's thread rank, and the last 8 bits to the receiver's thread rank. In such way, with an extra analysis of the EP tag, we  
    115 can identify the ranks of the sender and the receiver in any P2P communication. As results, we a thread probes a message, it knows  
    116 exactly in which local queue should store the probed message.  
    117  
    118  
    119 With the global rank map, tag extension, and the matching-probe techniques, we are able to use any P2P communication in the endpoint  
    120 environment. For the collective communications, we perform a step-by-step execution and no special technique is required. The most  
    121 representative functions is the collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|. A step-by-step execution consists of  
    122 3 steps (not necessarily in this order): arrangement of the source data, execution of the MPI function by all  
    123 master/root threads, distribution or arrangement of the data among threads.  
    124  
    125 For example, if we want to perform a broadcast operation, 2 steps are needed. Firstly, the root thread, along with the master threads of  
    126 other processes, perform the classic \verb|MPI_Bcast| operation. Secondly, the root thread, and the master threads send data to threads  
    127 sharing the same process via local memory transfer. In another example for illustrating the \verb|MPI_Gather| function, we also need 2  
    128 steps. First of all, data is gathered from slave threads to the master thread or the root thread. Next, the master thread and the root  
    129 thread execute the \verb|MPI_Gather| operation of complete the communication. Other collective calls such as \verb|MPI_Scan|,  
    130 \verb|MPI_Reduce|, \verb|MPI_Scatter| \textit{etc} follow the same principle of step-by-step execution. 
    131  
    132  
      94Another important aspect of the MPI\_endpoint interface is that each endpoint has knowledge of the ranks of the other endpoints in the 
      95same communicator. This knowledge is necessary because, when executing an MPI communication, for example a point-to-point exchange, we need 
      96to know not only the endpoint ranks of the sender/receiver threads, but also the local thread ranks of the sender/receiver threads and the MPI ranks 
      97of the sender/receiver processes. This ranking information is implemented inside a map object included in the endpoint communicator 
      98class. 
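The following sketch illustrates one possible layout of this ranking information (class and member names are illustrative and do not correspond to the actual XIOS classes): for every global endpoint rank, the map stores the local thread rank together with the MPI rank of the parent process.
\begin{verbatim}
// Illustrative sketch of an endpoint communicator holding a rank map;
// names are hypothetical, not taken from the XIOS source code.
#include <utility>
#include <vector>

struct EpCommunicator
{
    int ep_rank;      // global endpoint rank of this thread
    int ep_size;      // total number of endpoints in the communicator
    int local_rank;   // thread rank inside the parent MPI process
    int mpi_rank;     // rank of the parent MPI process

    // rank_map[ep] = { local thread rank, MPI rank of the parent process }
    std::vector<std::pair<int, int>> rank_map;

    // Resolve the information needed for a P2P call towards endpoint 'dest'.
    int local_rank_of(int dest) const { return rank_map[dest].first;  }
    int mpi_rank_of(int dest)   const { return rank_map[dest].second; }
};
\end{verbatim}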
     99 
     100 
     101 
     102 
     103%\newpage 
     104 
     105%The MPI\_endpoint interface we implemented lies on top of an existing MPI Implementation. It consists of wrappers to all MPI functions  
     106%used in XIOS.  
     107 
     108 
      109In XIOS, we use the ``probe'' technique to search for arrived messages and then perform the receiving action. The principle is 
      110that the sender process executes the send operation as usual. However, to minimize the time spent waiting for incoming messages, the receiver 
      111process first calls the \verb|MPI_Probe| function to check whether a message destined to it has been published. If yes, the 
      112process then executes \verb|MPI_Recv| to receive the message. If not, the receiver process can carry on with other tasks, 
      113or repeat the \verb|MPI_Probe| and \verb|MPI_Recv| actions if the required message is needed immediately. This technique works well in 
      114the current version of XIOS. However, if we introduce threads into this mechanism, problems can occur: an incoming message is labeled only by 
      115the tag and the receiver's MPI rank. Because threads within a process share the MPI rank, and a probed message remains available in the 
      116message queue, a data race can occur and the message can be received by the wrong thread. 
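For reference, a minimal sketch of this classic probe-and-receive pattern on the receiver side is given below (the non-blocking variant \verb|MPI_Iprobe| is used here so that the receiver can carry on with other tasks, as described above). When several threads of one process run such a loop concurrently, any of them may probe, and then receive, a message intended for another thread.
\begin{verbatim}
// Sketch of the classic probe/receive pattern (receiver side, plain MPI).
#include <mpi.h>
#include <vector>

void probe_and_receive(MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;
    // Check whether a message addressed to this MPI rank has arrived,
    // without receiving it yet.
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag)
    {
        int count;
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        std::vector<double> buffer(count);
        // Receive the message that was just probed.
        MPI_Recv(buffer.data(), count, MPI_DOUBLE,
                 status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        // ... process buffer ...
    }
    // else: carry on with other tasks and probe again later.
}
\end{verbatim}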
     117 
     118 
      119To solve this problem, we introduce the ``matching-probe'' technique. The idea of the method is that each thread is equipped with a local 
      120incoming message queue. Each time a thread calls an MPI function, for example \verb|MPI_Recv|, it first calls the \verb|MPI_Mprobe| 
      121function to query the MPI incoming message queue with any tag and from any source. Once a message is probed, the thread obtains the handle to the 
      122incoming message and this specific message is removed from the MPI message queue. Then, the thread identifies the message 
      123to get the destination thread's local rank and stores the message handle in the local queue of the target thread. The thread repeats these 
      124steps until the MPI incoming message queue is empty. Then the thread performs the usual ``probe'' technique on its local incoming 
      125message queue to check whether the required message is available. If yes, it performs the \verb|MPI_Recv| operation. With this ``matching-probe'' 
      126technique, we can ensure that a message is probed only once and is received by the right receiver. 
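A sketch of this mechanism is given below. The data structures and helper names are illustrative (the actual implementation in the endpoint interface differs); the key point is that the matching probe (\verb|MPI_Mprobe|/\verb|MPI_Improbe|) removes the probed message from the MPI queue and returns a handle, so a message can be matched only once and is later received with \verb|MPI_Mrecv| by the thread owning the corresponding local queue.
\begin{verbatim}
// Sketch of the matching-probe technique; structures are illustrative.
#include <mpi.h>
#include <deque>
#include <vector>

struct PendingMessage
{
    MPI_Message handle;   // handle returned by the matching probe
    MPI_Status  status;   // carries the source and the extended tag
};

// One incoming queue per thread of the process (must be sized to the
// number of threads and protected against concurrent access in practice).
static std::vector<std::deque<PendingMessage>> local_queue;

// Decode the receiver thread's local rank from the extended tag
// (hypothetical helper, see the tag layout discussed below).
static int receiver_local_rank(int mpi_tag) { return mpi_tag & 0xFF; }

// Drain the MPI incoming queue and dispatch the handles to the right thread.
void poll_mpi_queue(MPI_Comm comm)
{
    int flag = 1;
    while (flag)
    {
        PendingMessage msg;
        // On success, the message is dequeued from the MPI incoming queue
        // and can only be received through its handle.
        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm,
                    &flag, &msg.handle, &msg.status);
        if (flag)
            local_queue[receiver_local_rank(msg.status.MPI_TAG)].push_back(msg);
    }
}

// Later, the target thread receives from its own local queue
// (assumes a matching handle has already been stored there).
void receive_from_local_queue(int my_local_rank, double *buf, int count)
{
    PendingMessage msg = local_queue[my_local_rank].front();
    local_queue[my_local_rank].pop_front();
    MPI_Mrecv(buf, count, MPI_DOUBLE, &msg.handle, MPI_STATUS_IGNORE);
}
\end{verbatim}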
     127  
     128 
     129 
      130Another issue that needs to be clarified with this technique is: how to identify the receiver's rank? The solution is to 
      131use the tag argument. In the MPI environment, a tag is an integer ranging from 0 up to an implementation-dependent maximum, typically $2^{31}-1$. We can exploit 
      132the large range of the tag to store in it information about the source and destination thread ranks. In our endpoint interface, we 
      133choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8 bits to store the sender thread's local 
      134rank, and the last 8 bits to store the receiver thread's local rank (\textit{c.f.} Figure \ref{fig:tag}). In this way, with an extra 
      135analysis of the tag, we can identify the local ranks of the sender and the receiver in any P2P communication. As a result, when a thread 
      136probes a message, it knows exactly in which local queue the probed message should be stored. 
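A possible encoding consistent with this layout (and with the example of Figure \ref{fig:sendrecv} below) is sketched here; the exact bit positions used in the library may differ:
\begin{verbatim}
// Sketch of the tag extension: 15 bits of user tag, 8 bits for the sender
// thread's local rank, 8 bits for the receiver thread's local rank.
#include <cassert>

inline int encode_tag(int user_tag, int src_local, int dest_local)
{
    assert(user_tag  < (1 << 15));                      // fits in 15 bits
    assert(src_local < (1 << 8) && dest_local < (1 << 8));
    return (user_tag << 16) | (src_local << 8) | dest_local;
}

inline int user_tag_of(int mpi_tag)   { return  mpi_tag >> 16;        }
inline int src_local_of(int mpi_tag)  { return (mpi_tag >> 8) & 0xFF; }
inline int dest_local_of(int mpi_tag) { return  mpi_tag & 0xFF;       }
\end{verbatim}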
     137 
     138\begin{figure}[ht] 
     139 \centering 
     140 \includegraphics[scale=0.4]{tag.png} 
      141 \caption{Structure of the extended tag used in the endpoint interface: 15 bits of user tag, 8 bits for the sender thread's local rank, and 8 bits for the receiver thread's local rank.}\label{fig:tag} 
     142\end{figure} 
     143 
      144In Figure \ref{fig:tag}, Tag contains the user-defined value for a given communication. MPI\_tag is computed in the endpoint interface 
      145with the help of the rank map and is used in the actual MPI calls. 
     146 
     147\begin{figure}[ht] 
     148\centering 
     149 \includegraphics[scale = 0.4]{sendrecv.png} 
      150\caption{This figure shows the classic pattern of a P2P communication with the endpoint interface. Thread/endpoint rank 0 sends a message 
      151to thread/endpoint rank 3 with tag=1. The underlying MPI function called by the sender is in fact a send to MPI rank 1 
      152with tag=65537. From the receiver's point of view, endpoint 3 is actually receiving a message from MPI rank 0 with 
      153tag=65537.} 
     154\label{fig:sendrecv} 
     155\end{figure} 
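As a worked example of this caption (under our reading of the tag layout, and assuming two endpoints per MPI process, so that endpoint 3 is the second thread of MPI process 1), the extended tag is recovered as
\begin{equation*}
\mathrm{MPI\_tag} = \underbrace{1}_{\text{user tag}} \times 2^{16}
                  + \underbrace{0}_{\text{sender local rank}} \times 2^{8}
                  + \underbrace{1}_{\text{receiver local rank}} = 65537.
\end{equation*}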
     156 
     157 
     158 
     159 
      160With the rank map, the tag extension, and the matching-probe technique, we are now able to perform any P2P communication in the endpoint 
      161environment. For the collective communications, we apply a step-by-step execution pattern and no special technique is required. A 
      162step-by-step execution pattern consists of 3 steps (not necessarily in this order, and not all steps are always needed): arrangement of the source 
      163data, execution of the MPI function by all master/root threads, and distribution or arrangement of the resulting data among threads. 
     164 
     165%The most representative functions of the collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|. 
     166 
      167For example, if we want to perform a broadcast operation, only 2 steps are needed (\textit{c.f.} Figure \ref{fig:bcast}). Firstly, the root 
      168thread, along with the master threads of the other processes, performs the classic \verb|MPI_Bcast| operation. Secondly, the root thread and the 
      169master threads send the data to the other threads of their process via local memory transfer. 
     170 
     171\begin{figure}[ht] 
     172\centering 
     173\includegraphics[scale=0.3]{bcast.png}  
      174\caption{Step-by-step execution of a broadcast operation in the endpoint interface.} 
     175\label{fig:bcast} 
     176\end{figure} 
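A minimal sketch of such a two-step broadcast is shown below (illustrative only, not the XIOS code; it assumes that the root endpoint is the master thread of the root process, that the function is called by all threads inside a parallel region, and that MPI was initialized with at least \verb|MPI_THREAD_FUNNELED|):
\begin{verbatim}
// Sketch of a step-by-step endpoint broadcast: only the master thread of
// each process calls the real MPI_Bcast, then the data is copied to the
// other threads of the process via local memory transfer.
#include <mpi.h>
#include <omp.h>
#include <cstring>

void ep_bcast_sketch(double *buffer, int count, int root_mpi_rank, MPI_Comm comm)
{
    static double *shared = nullptr;   // pointer shared by the threads

    #pragma omp master
    {
        shared = buffer;               // expose the master thread's buffer
        // Step 1: classic MPI_Bcast between the master threads only.
        MPI_Bcast(buffer, count, MPI_DOUBLE, root_mpi_rank, comm);
    }
    #pragma omp barrier                // wait until the data has arrived

    // Step 2: local memory transfer from the master thread to the others.
    if (omp_get_thread_num() != 0)
        std::memcpy(buffer, shared, count * sizeof(double));
    #pragma omp barrier                // make sure all copies are finished
}
\end{verbatim}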
     177 
      178Figure \ref{fig:allreduce} illustrates how the \verb|MPI_Allreduce| function is performed in the endpoint interface. First of all, we 
      179perform an intra-process ``allreduce'' operation: source data is reduced from the slave threads to the master thread via local memory transfer. 
      180Next, all master threads call the classic \verb|MPI_Allreduce| routine. Finally, each master thread sends the updated reduced data to its 
      181slaves via local memory transfer. 
     182 
     183\begin{figure}[ht] 
     184\centering 
     185\includegraphics[scale=0.3]{allreduce.png}  
      186\caption{Step-by-step execution of the \texttt{MPI\_Allreduce} operation in the endpoint interface.} 
     187\label{fig:allreduce} 
     188\end{figure} 
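A corresponding sketch for a sum of one \verb|double| per thread is given below (again illustrative only and called by all threads inside a parallel region; the real routine handles arbitrary buffers and reduction operators):
\begin{verbatim}
// Sketch of a step-by-step endpoint MPI_Allreduce (sum of one double):
// intra-process reduction to the master thread, classic MPI_Allreduce
// between the master threads, then redistribution to the slave threads.
#include <mpi.h>

double ep_allreduce_sum_sketch(double my_value, MPI_Comm comm)
{
    static double node_sum = 0.0;   // value shared by the threads of a process
    double result;

    // Step 1: intra-process reduction via local memory transfer.
    #pragma omp single
    node_sum = 0.0;                 // implicit barrier at the end of 'single'
    #pragma omp atomic
    node_sum += my_value;
    #pragma omp barrier

    // Step 2: classic MPI_Allreduce between the master threads only.
    #pragma omp master
    MPI_Allreduce(MPI_IN_PLACE, &node_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    #pragma omp barrier

    // Step 3: every thread reads the reduced value (local memory transfer).
    result = node_sum;
    #pragma omp barrier             // keep node_sum valid until all have read it
    return result;
}
\end{verbatim}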
     189 
      190Other MPI routines, such as \verb|MPI_Wait|, \verb|MPI_Intercomm_create|, \textit{etc.}, are described in the technical report of the 
      191endpoint interface. 
     192 
     193\section{The multi-threaded XIOS and performance results} 
     194 
      195The development of the endpoint interface for the thread-friendly XIOS library took about one year and a half. The main difficulty is the 
      196co-existence of MPI processes and OpenMP threads. All MPI classes must be redefined in the endpoint interface, along with all the routines. 
      197The development is now available on the forge server: \url{http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/dev/branch_openmp}. A 
      198technical report is also available in which one can find more detail about how the endpoints work and how the routines are implemented 
      199\cite{ep:2018}. We must note that the thread-friendly XIOS library is still in the optimization phase. It will be released as a 
      200stable version in the future. 
     201 
      202All the functionalities of XIOS are preserved in its thread-friendly version. Single-threaded code can work successfully with the new 
      203version of XIOS. For multi-threaded models, some modifications are needed in order to work with the multi-threaded XIOS library. Details can 
      204be found in our technical report \cite{ep:2018}. 
     205 
      206Even though the multi-threaded XIOS library is not fully accomplished and further optimization is ongoing, we have already run some tests 
      207to see the potential of the endpoint framework. We take LMDZ as the target model and have tested it with several workloads. 
     208 
     209\begin{comment} 
    133210\section{Performance of LMDZ using EP\_XIOS} 
    134211 
     
    141218simulation duration settings: 1 day, 5 days, 15 days, and 31 days.  
    142219 
    143 \begin{figure}[h] 
     220\begin{figure}[ht] 
    144221 \centering 
    145222 \includegraphics[scale = 0.6]{LMDZ_perf.png} 
     
    167244histmth with daily output 
    168245 
    169 \section{Perspectives of EP\_XIOS} 
     246\end{comment} 
     247 
     248 
      249\section{Future work for XIOS} 
    170250 
    171251 