Changeset 1552 for XIOS/dev/branch_openmp/Note/rapport ESIWACE.tex
- Timestamp: 06/27/18 15:18:29
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}

% Title Page
…
writing on the server side can be completely overlapped by computations on the client side.

\begin{figure}[ht]
\includegraphics[scale=0.4]{Charge1.png}
\includegraphics[scale=0.4]{Charge2.png}
…

This topic comes along with the configuration of the climate models. Take LMDZ as an example: it is designed with a 2-level parallelization scheme. To be more specific, LMDZ uses the domain decomposition method, in which each sub-domain is associated with one MPI process. Inside the sub-domain, the model also uses OpenMP directives to accelerate the computation. We can imagine that the sub-domain is divided into sub-sub-domains, each managed by a thread.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.5]{domain.pdf}
…

As we know, each sub-domain, or in other words each MPI process, is a XIOS client. The data exchange between clients and XIOS servers is handled by MPI communications. In order to write an output field, all threads must gather their data to the master thread, which acts as the MPI process in order to call MPI routines. There are two disadvantages to this method: first, we have to spend time gathering information to the master thread, which not only increases the memory use but also implies an OpenMP barrier; second, while the master thread calls the MPI routine, the other threads sit idle, wasting computing resources. What we want to obtain with the thread-friendly XIOS is that all threads can act like MPI processes: they can call the MPI routines directly, so that there is no waste of memory or computing resources, as shown in Figure \ref{fig:omp}.
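The gather-to-master pattern described above can be sketched as follows. This is a purely illustrative simulation, not XIOS code: \texttt{std::thread} stands in for the OpenMP threads, and returning the gathered buffer stands in for the single MPI call made by the master thread. The function name \texttt{gather\_to\_master} is hypothetical.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Each thread owns a sub-sub-domain; only the master thread may "call MPI",
// so every field must first be funneled into one buffer that it owns.
std::vector<int> gather_to_master(int nthreads, int chunk) {
    std::vector<int> master_buf(nthreads * chunk);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&master_buf, t, chunk] {
            for (int i = 0; i < chunk; ++i)      // copy this thread's slice
                master_buf[t * chunk + i] = t;   // into the master's buffer
        });
    for (auto& th : pool) th.join();  // implicit barrier: slaves are done
    return master_buf;                // and idle while the master "calls MPI"
}
```

The `join` loop makes the cost visible: every slave thread must finish and then wait before the master can proceed, which is exactly the synchronization overhead the endpoint approach removes.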
\begin{figure}[ht]
\centering
\includegraphics[scale=0.6]{omp.pdf}
…

In this project, we choose to implement an interface to handle the threads. To do so, we introduce MPI\_endpoint, a concept proposed at the last MPI Forum; several papers have already discussed the importance of this idea and introduced the framework of the MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown in Figure \ref{fig:scheme}. In the MPI\_endpoint environment, each OpenMP thread is associated with a unique rank (the global endpoint rank), an endpoint communicator, and a local rank (the rank inside the MPI process), which is very similar to the \verb|OMP_thread_num|. The global endpoint rank replaces the role of the classic MPI rank and is used in MPI communication calls.

\begin{figure}[ht]
\begin{center}
\includegraphics[scale=0.4]{scheme.png}
…

%XIOS is a library dedicated to IO management of climate codes. It has a client-server pattern in which clients are in charge of computations and servers manage the reading and writing of files. The communication between clients and servers is handled by MPI.
However, some of the climate models (\textit{e.g.} LMDZ) nowadays use a hybrid programming policy: within a shared-memory node, OpenMP directives are used alongside MPI. In such a configuration, XIOS cannot take full advantage of the computing resources to maximize performance, because XIOS can only work with MPI processes. Before each call to XIOS routines, the threads of one MPI process must gather their information to the master thread, which acts as the MPI process. After the call, the master thread distributes the updated information among its slave threads. As a result, all slave threads have to wait while the master thread calls the XIOS routines. This introduces extra synchronization into the model and leads to sub-optimal performance. Aware of this situation, we need to develop a new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as if they were processes. To do so, we introduce the MPI endpoints.
Another important aspect of the MPI\_endpoint interface is that each endpoint has knowledge of the ranks of the other endpoints in the same communicator.
This knowledge is necessary because, when executing an MPI communication, for example a point-to-point exchange, we need to know not only the ranks of the sender/receiver threads, but also the thread numbers of the sender/receiver threads and the MPI ranks of the sender/receiver processes. This ranking information is implemented inside a map object included in the endpoint communicator class.

%\newpage

%The MPI\_endpoint interface we implemented lies on top of an existing MPI Implementation. It consists of wrappers to all MPI functions
%used in XIOS.

In XIOS, we use the ``probe'' technique to search for arrived messages and then perform the receive action. The principle is that the sender process executes the send operation as usual. However, to minimize the time spent waiting for incoming messages, the receiver process first calls the \verb|MPI_Probe| function to check whether a message destined to it has been published. If so, the process then calls \verb|MPI_Recv| to receive the message. If not, the receiver process can carry on with other tasks, or repeat the \verb|MPI_Probe| and \verb|MPI_Recv| actions if the required message is needed immediately. This technique works well in the current version of XIOS. However, if we introduce threads into this mechanism, problems can occur: an incoming message is labeled only by the tag and the receiver's MPI rank. Because threads within a process share the MPI rank, and a probed message remains available in the message queue, this can lead to a data race, and the message can be received by the wrong thread.

To solve this problem, we introduce the ``matching-probe'' technique. The idea of this method is that each thread is equipped with a local incoming message queue.
Each time a thread calls an MPI function, for example \verb|MPI_Recv|, it first calls the \verb|MPI_Mprobe| function to query the MPI incoming message queue with any tag and from any source. Once a message is probed, the thread gets the handle to the incoming message and this specific message is erased from the MPI message queue. Then, the thread identifies the message, obtains the destination thread's local rank, and stores the message handle in the local queue of the target thread. The thread repeats these steps until the MPI incoming message queue is empty. Then the thread performs the usual ``probe'' technique on its local incoming message queue to check whether the required message is available. If it is, it performs the \verb|MPI_Recv| operation. With this ``matching-probe'' technique, we can ensure that a message is probed only once and is received by the right receiver.

Another issue that needs to be clarified with this technique is how to identify the receiver's rank. The solution is to use the tag argument. In the MPI environment, a tag is an integer ranging from 0 to at most $2^{31}-1$, depending on the implementation. We can exploit the large range of the tag to store in it information about the source and destination thread ranks. In our endpoint interface, we choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8 bits to store the sender thread's local rank, and the last 8 bits to store the receiver thread's local rank (\textit{c.f.} Figure \ref{fig:tag}). In this way, with an extra analysis of the tag, we can identify the local ranks of the sender and the receiver in any P2P communication. As a result, when a thread probes a message, it knows exactly in which local queue it should store the probed message.
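The tag layout above can be sketched with a few bit operations. This is an illustrative sketch, not the actual EP\_XIOS implementation; the function names are hypothetical, and the assumed layout (user tag in the high 15 bits, then sender local rank, then receiver local rank) follows the description above.

```cpp
#include <cassert>
#include <cstdint>

// Pack: [user tag : 15 bits][sender local rank : 8][receiver local rank : 8]
constexpr int32_t ep_encode_tag(int32_t user_tag, int src_local, int dst_local) {
    return (user_tag << 16) | (src_local << 8) | dst_local;
}
// Unpack the three fields from the MPI-level tag.
constexpr int32_t ep_user_tag(int32_t mpi_tag)  { return mpi_tag >> 16; }
constexpr int     ep_src_local(int32_t mpi_tag) { return (mpi_tag >> 8) & 0xFF; }
constexpr int     ep_dst_local(int32_t mpi_tag) { return mpi_tag & 0xFF; }
```

Under this assumed layout, a user tag of 1 sent from local rank 0 to local rank 1 yields an MPI-level tag of $1\cdot 2^{16} + 1 = 65537$, consistent with the value quoted in Figure \ref{fig:sendrecv}.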
\begin{figure}[ht]
\centering
\includegraphics[scale=0.4]{tag.png}
\caption{}\label{fig:tag}
\end{figure}

In Figure \ref{fig:tag}, Tag contains the user-defined value for a certain communication. MPI\_tag is computed in the endpoint interface with the help of the rank map and is used in the MPI calls.

\begin{figure}[ht]
\centering
\includegraphics[scale = 0.4]{sendrecv.png}
\caption{This figure shows the classic pattern of a P2P communication with the endpoint interface. Thread/endpoint rank 0 sends a message to thread/endpoint rank 3 with tag=1. The underlying MPI function called by the sender is in fact a send to MPI rank 1 with tag=65537. From the receiver's point of view, endpoint 3 is actually receiving a message from MPI rank 0 with tag=65537.}
\label{fig:sendrecv}
\end{figure}

With the rank map, tag extension, and matching-probe techniques, we are now able to call any P2P communication in the endpoint environment. For the collective communications, we apply a step-by-step execution pattern and no special technique is required. A step-by-step execution pattern consists of 3 steps (not necessarily in this order, and not all steps are always needed): arrangement of the source data, execution of the MPI function by all master/root threads, and distribution or arrangement of the resulting data among threads.

%The most representative functions of the collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|.

For example, if we want to perform a broadcast operation, only 2 steps are needed (\textit{c.f.} Figure \ref{fig:bcast}). Firstly, the root thread, along with the master threads of the other processes, performs the classic \verb|MPI_Bcast| operation. Secondly, the root thread and the master threads send the data to the other threads of their process via local memory transfer.
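The 2-step broadcast can be sketched as follows. This is a simulation, not EP\_XIOS code: each process is modeled as a vector with one slot per thread, the first loop stands in for the classic \verb|MPI_Bcast| between master threads (local rank 0), and the second loop stands in for the local memory transfer to the slave threads. The name \texttt{ep\_bcast\_sim} is hypothetical.

```cpp
#include <cassert>
#include <vector>

using Process = std::vector<int>;  // one value slot per thread of a process

void ep_bcast_sim(std::vector<Process>& procs, int root_proc, int root_thread) {
    const int value = procs[root_proc][root_thread];
    for (Process& p : procs)        // step 1: "MPI_Bcast" among the master
        p[0] = value;               //         threads (local rank 0)
    for (Process& p : procs)        // step 2: each master copies the value
        for (int& slot : p)         //         to its slave threads via
            slot = p[0];            //         local memory transfer
}
```

The same decomposition (local gather/scatter around one classic MPI call by the masters) carries over to the other collectives discussed below.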
\begin{figure}[ht]
\centering
\includegraphics[scale=0.3]{bcast.png}
\caption{}
\label{fig:bcast}
\end{figure}

Figure \ref{fig:allreduce} illustrates how the \verb|MPI_Allreduce| function proceeds in the endpoint interface. First of all, we perform an intra-process ``allreduce'' operation: the source data is reduced from the slave threads to the master thread via local memory transfer. Next, all master threads call the classic \verb|MPI_Allreduce| routine. Finally, each master thread sends the updated reduced data to its slaves via local memory transfer.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.3]{allreduce.png}
\caption{}
\label{fig:allreduce}
\end{figure}

Other MPI routines, such as \verb|MPI_Wait|, \verb|MPI_Intercomm_create|, \textit{etc.}, can be found in the technical report on the endpoint interface.

\section{The multi-threaded XIOS and performance results}

The development of the endpoint interface for the thread-friendly XIOS library took about a year and a half. The main difficulty is the co-existence of MPI processes and OpenMP threads: all MPI classes must be redefined in the endpoint interface, along with all the routines. The development is now available on the forge server: \url{http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/dev/branch_openmp}. A technical report is also available in which one can find more detail about how the endpoint works and how the routines are implemented \cite{ep:2018}. We must note that the thread-friendly XIOS library is still in the optimization phase. It will be released in the future as a stable version.

All the functionalities of XIOS are preserved in its thread-friendly version. Single-threaded code works successfully with the new version of XIOS. For multi-threaded models, some modifications are needed in order to work with the multi-threaded XIOS library.
Details can be found in our technical report \cite{ep:2018}.

Even though the multi-threaded XIOS library is not fully accomplished and further optimization is ongoing, we have already run some tests to see the potential of the endpoint framework. We take LMDZ as the target model and have tested it with several workloads.

\begin{comment}
\section{Performance of LMDZ using EP\_XIOS}
…
simulation duration settings: 1 day, 5 days, 15 days, and 31 days.

\begin{figure}[ht]
\centering
\includegraphics[scale = 0.6]{LMDZ_perf.png}
…
histmth with daily output

\end{comment}


\section{Future works for XIOS}