Timestamp:
06/27/18 15:18:29 (6 years ago)
Author:
yushan
Message:

report updated. Todo: add performance results

File:
1 edited

  • XIOS/dev/branch_openmp/Note/rapport ESIWACE.tex.backup

    r1548 r1552  
    55\usepackage[usenames,dvipsnames,svgnames,table]{xcolor} 
    66\usepackage{amsmath} 
     7\usepackage{url} 
    78 
    89% Title Page 
    910 
    10 \title{Developping XIOS with multithread : to accelerate the IO of climate models} 
\title{Developing XIOS with multi-threading to accelerate the I/O of climate models} 
    1112 
    1213\author{} 
     
    1617\maketitle 
    1718 
    18 \section{background} 
     19\section{Context} 
    1920 
The simulation models of climate systems, running on a large number of computing resources, can produce a very large volume of data. At this 
    21 scale, the IO and the post-treatement of data becomes a bottle-neck for the performance. In order to manage efficiently the data flux  
    22 generated by the simulations, we use XIOS developped by the Institut Pierre Simon Laplace and Maison de la simulation. 
    23  
    24 XIOS, a libarary dedicated to intense calculates, allows us to easily and efficiently manage the parallel IO on the storage systems. XIOS  
scale, the I/O and the post-processing of data become a performance bottleneck. In order to efficiently manage the data flow 
generated by the simulations, we use XIOS, developed by the Institut Pierre Simon Laplace and Maison de la Simulation. 

XIOS, a library dedicated to high-performance computing, allows us to easily and efficiently manage parallel I/O on storage systems. XIOS 
uses a client/server scheme in which some computing resources (servers) are reserved exclusively for I/O in order to minimize their impact on 
    26 the performance of the climate models (client). 
    27  
This library, dedicated to intensive computing, makes it possible to manage the parallel input/output of data on storage systems 
efficiently and simply. In this new client/server approach, computing cores are exclusively dedicated to I/O 
so as to minimize their impact on the computing time of the models. The use of asynchronous 
communications between the models (clients) and the I/O servers smooths out the I/O peaks 
by sending a constant data stream to the file system throughout the 
simulation, thus completely overlapping the writes with computation. 
    35  
    36  
The aim of this ESIWACE project is to develop a multi-threaded version of XIOS, a library dedicated to the I/O management of climate codes. 
The current XIOS code relies on a single level of parallelization using MPI. However, many climate models are now designed with two-level 
parallelization through MPI and OpenMP. The difference in parallelization between the climate models and XIOS can lead to performance loss 
because XIOS cannot cope with threads. This fact 


The resulting multi-threaded XIOS is designed to cope with climate models which use a two-level parallelization (MPI/OpenMP) scheme. 
The principal model we work with is the LMDZ code developed at Laboratoire de Météorologie Dynamique. This model has 
    45  
    46  
    47  
    48 \section{Developpement of a thread-friendly MPI for XIOS} 
    49  
    50 XIOS is a library dedicated to IO management of climate code. It has a client-server pattern in which clients are in charge of computations  
    51 and servers manage the reading and writing of files. The communication between clients and servers are handled by MPI. 
    52 However, some of the climate models (\textit{e.g.} LMDZ) nowadays use an hybrid programming policy. Within a shared memory node, OpenMP  
    53 directives are used to manage message exchanges. In such configuration, XIOS can not take full advantages of the computing resources to  
    54 maximize the performance. This is because XIOS can only work with MPI processes. Before each call of XIOS routines, threads of one MPI  
    55 process must gather their information to the master thread who works as an MPI process. After the call, the master thread distributes the  
    56 updated information among its slave threads. As result, all slave threads have to wait while the master thread calls the XIOS routines.  
    57 This introduce extra synchronization into the model and leads to not optimized performance. Aware of this situation, we need to develop a  
    58 new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as they were processes. To do so, we  
    59 introduce the MPI endpoints.   
    60  
    61  
    62 The MPI endpoints (EP) is a layer on top of an existing MPI Implementation. All MPI function, or in our work the functions used in XIOS,  
    63 will be reimplemented in order to cope with OpenMP threads. The idea is that, in the MPI endpoints environment, each OpenMP thread will be  
    64 associated with a unique rank and with an endpoint communicator. This rank (EP rank) will replace the role of the classic MPI rank and will  
    65 be used in MPI communications. In order to successfully execute an MPI communication, for example \verb|MPI_Send|, we know already which  
    66 endpoints to be the receiver but not sufficient. We also need to know which MPI process should be involved in such communication. To  
    67 identify the MPI rank, we added a ``map'' in the EP communicator in which the relation of all EP and MPI ranks can be easily obtained. 
    68  
    69  
    70 In XIOS, we used the ``probe'' technique to search for arrived messages and then performing the receive action. The principle is  
    71 that sender processes execute the send operations as usual. However, to minimise the time spent on waiting incoming messages, the receiver  
    72 processe performs in the first place the \verb|MPI_Probe| function to check if a message destinated to it has been published. If yes, the  
    73 process execute in the second place the \verb|MPI_Recv| to receive the message. In this situation, if we introduce the threads, problems  
    74 occur. The reason why the ``probe'' method is not suitable is that messages destinated to one certain process can be probed by any of  
    75 its threads. Thus the message can be received by the wrong thread which gives errors. 
    76  
    77 To solve this problem, we introduce the ``matching-probe'' technique. The idea of the method is that each process is equiped with a local  
    78 incoming message queue. All incoming message will be probed, sorted, and then stored in this queue according to their destination rank.  
    79 Every time we call an MPI function, we firstly call the \verb|MPI_Mprobe| function to get the handle to  
    80 the incoming message. Then, we identify the destination thread rank and store the message handle inside the local queue of the target  
    81 thread. After this, we perform the usual ``probe'' technique upon the local incoming message queue. In this way, we can assure the messages  
    82 to be received by the right thread. 
    83  
    84 Another issue remains in this technique: how to identify the receiver's rank? The solution is to use the tag argument. In the MPI  
    85 environment, a tag is an integer ranging from 0 to $2^{31}$. We can explore the large range of the tag to store in it information about the  
    86 source and destination thread ranks. We choose to limite the first 15 bits for the tag used in the classic MPI communication, the next 8  
    87 bits to the sender's thread rank, and the last 8 bits to the receiver's thread rank. In such way, with an extra analysis of the EP tag, we  
    88 can identify the ranks of the sender and the receiver in any P2P communication. As results, we a thread probes a message, it knows  
    89 exactly in which local queue should store the probed message.  
    90  
    91  
    92 With the global rank map, tag extension, and the matching-probe techniques, we are able to use any P2P communication in the endpoint  
    93 environment. For the collective communications, we perform a step-by-step execution and no special technique is required. The most  
    94 representative functions is the collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|. A step-by-step execution consists of  
    95 3 steps (not necessarily in this order): arrangement of the source data, execution of the MPI function by all  
    96 master/root threads, distribution or arrangement of the data among threads.  
    97  
    98 For example, if we want to perform a broadcast operation, 2 steps are needed. Firstly, the root thread, along with the master threads of  
    99 other processes, perform the classic \verb|MPI_Bcast| operation. Secondly, the root thread, and the master threads send data to threads  
    100 sharing the same process via local memory transfer. In another example for illustrating the \verb|MPI_Gather| function, we also need 2  
    101 steps. First of all, data is gathered from slave threads to the master thread or the root thread. Next, the master thread and the root  
    102 thread execute the \verb|MPI_Gather| operation of complete the communication. Other collective calls such as \verb|MPI_Scan|,  
    103 \verb|MPI_Reduce|, \verb|MPI_Scatter| \textit{etc} follow the same principle of step-by-step execution. 
    104  
the performance of the climate models (client). The clients and servers are executed in parallel and communicate asynchronously. In this 
way, the I/O peaks can be smoothed out, as data are sent to the servers at a constant rate throughout the simulation, and the time spent on 
writing data on the server side can be completely overlapped by computation on the client side.  
     30 
     31\begin{figure}[ht] 
     32\includegraphics[scale=0.4]{Charge1.png} 
     33\includegraphics[scale=0.4]{Charge2.png} 
\caption{On the left, each peak of computing power corresponds to a valley in memory bandwidth, which shows that the computing resources 
are alternating between computation and I/O. On the right, both curves are smooth, which means that the computing resources have a stable 
workload, either computation or I/O.} 
     37\end{figure} 
     38 
     39 
XIOS works well with many climate simulation codes. For example, LMDZ\footnote{LMDZ is a general circulation model (or global climate model) 
developed since the 70s at the ``Laboratoire de Météorologie Dynamique'', which includes various variants for the Earth and other planets 
(Mars, Titan, Venus, Exoplanets). The `Z' in LMDZ stands for ``zoom'' (and `LMD' is for ``Laboratoire de Météorologie Dynamique''). 
\url{http://lmdz.lmd.jussieu.fr}}, NEMO\footnote{Nucleus for European Modelling of the Ocean, alias NEMO, is a 
state-of-the-art modelling framework for ocean-related engines. \url{https://www.nemo-ocean.eu}}, ORCHIDEE\footnote{the land surface 
model of the IPSL (Institut Pierre Simon Laplace) Earth System Model. \url{https://orchidee.ipsl.fr}}, and DYNAMICO\footnote{The DYNAMICO 
project develops a new dynamical core for LMD-Z, the atmospheric general circulation model (GCM) part of the IPSL-CM Earth System Model. 
\url{http://www.lmd.polytechnique.fr/~dubos/DYNAMICO/}} all use XIOS as their output back end. Météo-France and the Met Office have also chosen XIOS 
to manage the I/O for their models. 
     49 
     50 
     51\section{Development of thread-friendly XIOS} 
     52 
     53Although XIOS copes well with many models, there is one potential optimization in XIOS which needs to be investigated: making XIOS thread-friendly. 
     54 
This topic comes along with the configuration of the climate models. Take LMDZ as an example: it is designed with a two-level parallelization scheme. To be more specific, LMDZ uses a domain decomposition method in which each sub-domain is associated with one MPI process. Inside each sub-domain, the model also uses OpenMP directives to accelerate the computation. We can think of each sub-domain as being further divided into sub-sub-domains, each managed by a thread.  
     56 
     57\begin{figure}[ht] 
     58\centering 
     59\includegraphics[scale=0.5]{domain.pdf} 
     60\caption{Illustration of the domain decomposition used in LMDZ.} 
     61\end{figure} 
     62 
As we know, each sub-domain, or in other words each MPI process, is an XIOS client. The data exchange between clients and XIOS servers is handled by MPI communications. In order to write an output field, all threads must gather their data to the master thread, which acts as the MPI process in order to call MPI routines. There are two disadvantages to this method: first, we have to spend time gathering information to the master thread, which not only increases the memory use but also implies an OpenMP barrier; second, while the master thread calls the MPI routines, the other threads are idle, which is a waste of computing resources. What we want to obtain with the thread-friendly XIOS is that all threads can act like MPI processes: they can call the MPI routines directly, with no waste of memory or computing resources, as shown in Figure \ref{fig:omp}. 
     64 
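To make the current situation concrete, here is a minimal sketch of the ``funnelled'' pattern described above (this is not XIOS code; the field layout, \texttt{server\_rank} and the tag are placeholder assumptions): every thread copies its slice into a shared buffer and then waits while the master thread alone talks to MPI. 

\begin{verbatim}
// Minimal sketch of the "funnelled" pattern (not XIOS code).
// Every thread owns a slice of the output field; only the master thread may
// call MPI, so the slices are first gathered into a shared buffer, then the
// slave threads sit idle during the MPI_Send.
#include <mpi.h>
#include <omp.h>
#include <algorithm>
#include <vector>

// Assumed to be called by every thread of an OpenMP parallel region,
// with all slices having the same size.
void write_field_funnelled(const std::vector<double>& my_slice,
                           int server_rank, MPI_Comm comm)
{
    static std::vector<double> gathered;                 // shared among threads
    const int nthreads = omp_get_num_threads();
    const int tid      = omp_get_thread_num();

    #pragma omp single
    gathered.resize(my_slice.size() * nthreads);         // implicit barrier here

    std::copy(my_slice.begin(), my_slice.end(),
              gathered.begin() + tid * my_slice.size());
    #pragma omp barrier                                   // wait for all slices

    #pragma omp master                                    // only the master calls MPI
    MPI_Send(gathered.data(), static_cast<int>(gathered.size()), MPI_DOUBLE,
             server_rank, /* tag = */ 0, comm);
    #pragma omp barrier                                   // slaves idle until done
}
\end{verbatim}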
     65\begin{figure}[ht] 
     66\centering 
     67\includegraphics[scale=0.6]{omp.pdf} 
\caption{With a thread-friendly XIOS, every OpenMP thread acts as if it were an MPI process and calls the MPI routines directly.} 
     69\label{fig:omp} 
     70\end{figure} 
     71 
There are two ways to make XIOS thread-friendly. The first is to change the structure of XIOS, which demands a lot of modification in the XIOS library. Knowing that XIOS is about 100 000 lines of code, this method would be very time consuming. What is more, the modification would be local to XIOS: if we want to make another code thread-friendly, we would have to redo the modifications. The second choice is to add an extra interface to MPI in order to manage the threads. When a thread wants to call an MPI routine inside XIOS, it first goes through the interface, in which the communication information is analyzed before the MPI routine is invoked. With this method, we only need to modify a very small part of XIOS in order to make it work. What is more interesting is that the interface we created can be adapted to suit other MPI-based libraries. 
     73 
     74 
In this project, we choose to implement an interface to handle the threads. To do so, we introduce the MPI\_endpoint, a 
concept proposed in recent MPI Forum discussions; several papers have already discussed the importance of such an idea and have introduced the 
framework of the MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown in Figure \ref{fig:scheme}. In 
the MPI\_endpoint environment, each OpenMP thread is associated with a unique rank (global endpoint rank), an endpoint communicator, 
and a local rank (rank inside the MPI process), which is very similar to the OpenMP thread number returned by \verb|omp_get_thread_num()|. The global endpoint rank replaces the 
role of the classic MPI rank and is used in MPI communication calls.  
     81 
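As a toy illustration (not the EP\_XIOS code), and assuming for simplicity that every MPI process spawns the same number of OpenMP threads, the three ranks attached to each thread can be pictured as follows: 

\begin{verbatim}
// Toy illustration of the ranks attached to each thread (assuming the same
// number of OpenMP threads in every MPI process; names are ours).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

void show_ranks(MPI_Comm comm)
{
    int mpi_rank;                                   // classic MPI rank
    MPI_Comm_rank(comm, &mpi_rank);
    #pragma omp parallel
    {
        int local_rank  = omp_get_thread_num();     // rank inside the process
        int num_threads = omp_get_num_threads();
        int ep_rank     = mpi_rank * num_threads + local_rank;  // global endpoint rank
        std::printf("MPI rank %d, local rank %d, endpoint rank %d\n",
                    mpi_rank, local_rank, ep_rank);
    }
}
\end{verbatim}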
     82 
     83\begin{figure}[ht] 
     84\begin{center} 
     85\includegraphics[scale=0.4]{scheme.png}  
     86\end{center} 
\caption{The MPI endpoint concept: each OpenMP thread is associated with a global endpoint rank, a local rank inside its MPI process, and an endpoint communicator.} 
     88\label{fig:scheme} 
     89\end{figure} 
     90 
     91%XIOS is a library dedicated to IO management of climate code. It has a client-server pattern in which clients are in charge of computations and servers manage the reading and writing of files. The communication between clients and servers are handled by MPI. However, some of the climate models (\textit{e.g.} LMDZ) nowadays use an hybrid programming policy. Within a shared memory node, OpenMP directives are used to manage message exchanges. In such configuration, XIOS can not take full advantages of the computing resources to maximize the performance. This is because XIOS can only work with MPI processes. Before each call of XIOS routines, threads of one MPI process must gather their information to the master thread who works as an MPI process. After the call, the master thread distributes the updated information among its slave threads. As result, all slave threads have to wait while the master thread calls the XIOS routines. This introduce extra synchronization into the model and leads to not optimized performance. Aware of this situation, we need to develop a new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as they were processes. To do so, we introduce the MPI endpoints.   
     92 
Another important aspect of the MPI\_endpoint interface is that each endpoint has knowledge of the ranks of the other endpoints in the 
same communicator. This knowledge is necessary because, when executing an MPI communication, for example a point-to-point exchange, we need 
to know not only the endpoint ranks of the sender/receiver threads, but also the local thread ranks of the sender/receiver threads and the MPI ranks 
of the sender/receiver processes. This ranking information is implemented inside a map object included in the endpoint communicator 
class. 
     98 
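A minimal sketch of the data carried by such a communicator could look as follows (the names are invented for illustration and do not match the actual EP\_XIOS classes): 

\begin{verbatim}
// Illustrative endpoint communicator: for every global endpoint rank it
// records the MPI rank of the owning process and the thread's local rank.
#include <mpi.h>
#include <map>
#include <utility>

struct ep_communicator
{
    MPI_Comm mpi_comm;        // underlying MPI communicator
    int      ep_rank;         // global endpoint rank of this thread
    int      ep_rank_local;   // local rank of this thread inside its process
    // global endpoint rank -> (MPI rank of the process, local thread rank)
    std::map<int, std::pair<int, int>> rank_map;
};

// Translate an endpoint rank into the pair needed to address the MPI message.
inline std::pair<int, int> resolve(const ep_communicator& comm, int ep_rank)
{
    return comm.rank_map.at(ep_rank);
}
\end{verbatim}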
     99 
     100 
     101 
     102%\newpage 
     103 
     104%The MPI\_endpoint interface we implemented lies on top of an existing MPI Implementation. It consists of wrappers to all MPI functions  
     105%used in XIOS.  
     106 
     107 
In XIOS, we use the ``probe'' technique to detect arrived messages and then perform the receive action. The principle is 
that the sender process executes the send operation as usual. However, to minimize the time spent waiting for incoming messages, the receiver 
process first calls the \verb|MPI_Probe| function to check whether a message destined to it has been published. If so, the 
process then calls \verb|MPI_Recv| to receive the message. If not, the receiver process can carry on with other tasks, 
or repeat the \verb|MPI_Probe| and \verb|MPI_Recv| actions if the required message is needed immediately. This technique works well in 
the current version of XIOS. However, if we introduce threads into this mechanism, problems can occur: an incoming message is labeled only by 
its tag and the receiver's MPI rank. Because threads within a process share the MPI rank, and a probed message remains available in the 
message queue, a data race can arise and the message can be received by the wrong thread. 
     116 
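The pattern can be sketched as follows in plain MPI (a simplified illustration; the ``check and carry on'' behaviour corresponds to the non-blocking \verb|MPI_Iprobe| variant, and the payload type is assumed to be \verb|double|): 

\begin{verbatim}
// Sketch of the classic "probe" pattern (single-threaded, plain MPI).
// The receiver checks for a pending message and only posts the receive once
// a matching message has actually arrived.
#include <mpi.h>
#include <vector>

void poll_and_receive(int source, int tag, MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;
    MPI_Iprobe(source, tag, comm, &flag, &status);    // non-blocking check
    if (flag)
    {
        int count = 0;
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        std::vector<double> buffer(count);
        MPI_Recv(buffer.data(), count, MPI_DOUBLE,
                 status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        // ... hand the data over to the rest of XIOS ...
    }
    // otherwise the process carries on with other work and probes again later
}
\end{verbatim}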
     117 
To solve this problem, we introduce the ``matching-probe'' technique. The idea of the method is that each thread is equipped with a local 
incoming message queue. Each time a thread calls an MPI function, for example \verb|MPI_Recv|, it first calls the \verb|MPI_Mprobe| 
function to query the MPI incoming messages with any tag and from any source. Once a message is probed, the thread gets a handle to the 
incoming message, and this specific message is removed from the MPI message queue. Then, the thread identifies the 
destination thread's local rank and stores the message handle in the local queue of the target thread. The thread repeats these 
steps until the MPI incoming message queue is empty. The thread then performs the usual ``probe'' technique on its local incoming 
message queue to check whether the required message is available. If it is, the message is received with \verb|MPI_Mrecv| using the stored handle. With this ``matching-probe'' 
technique, we can ensure that a message is probed only once and is received by the right receiver. 
     126  
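A much simplified sketch of this dispatching step is given below (invented names, no locking, and the non-blocking \verb|MPI_Improbe| variant so that the loop stops once the MPI queue is empty); the decoding of the destination thread from the tag is explained in the next paragraph: 

\begin{verbatim}
// Simplified "matching-probe" dispatch for one MPI process hosting several
// OpenMP threads.  Each thread owns a local queue of matched messages.
// A real implementation would protect the queues with locks.
#include <mpi.h>
#include <queue>
#include <vector>

struct matched_message { MPI_Message handle; MPI_Status status; };

// one local queue per thread of this process,
// assumed to have been resized to the number of threads beforehand
static std::vector<std::queue<matched_message>> local_queue;

// destination thread's local rank, stored in the lowest 8 bits of the tag
// (see the tag layout discussed below)
static int receiver_local_rank(int mpi_tag) { return mpi_tag & 0xFF; }

// Drain the MPI incoming-message queue and hand every matched message to the
// local queue of the thread it is destined to.
void dispatch_incoming(MPI_Comm comm)
{
    int flag = 1;
    while (flag)
    {
        matched_message m;
        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm,
                    &flag, &m.handle, &m.status);
        if (flag)
            local_queue[receiver_local_rank(m.status.MPI_TAG)].push(m);
    }
}
\end{verbatim}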
     127 
     128 
Another issue that needs to be clarified with this technique is how to identify the receiver's rank. The solution to this question is to 
use the tag argument. In the MPI environment, a tag is an integer that can range from 0 up to $2^{31}-1$, depending on the implementation. We can exploit 
this large range to store in the tag information about the source and destination thread ranks. In our endpoint interface, we 
choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8 bits to store the sender thread's local 
rank, and the last 8 bits to store the receiver thread's local rank (\textit{cf.} Figure \ref{fig:tag}). In this way, with an extra 
analysis of the tag, we can identify the local ranks of the sender and the receiver in any P2P communication. As a result, when a thread 
probes a message, it knows exactly in which local queue it should store the probed message.  
     136 
     137\begin{figure}[ht] 
     138 \centering 
     139 \includegraphics[scale=0.4]{tag.png} 
\caption{Composition of the tag used in the endpoint interface: the user-defined tag plus the local ranks of the sender and receiver threads.}\label{fig:tag} 
     141\end{figure} 
     142 
In Figure \ref{fig:tag}, Tag contains the user-defined value for a certain communication. MPI\_tag is computed in the endpoint interface 
with the help of the rank map and is used in the actual MPI calls.  
     145 
     146\begin{figure}[ht] 
     147\centering 
     148 \includegraphics[scale = 0.4]{sendrecv.png} 
\caption{This figure shows the classic pattern of a P2P communication with the endpoint interface. Thread/endpoint rank 0 sends a message 
to thread/endpoint rank 3 with tag=1. The underlying MPI function called by the sender is in fact a send to MPI rank 1 
with tag=65537. From the receiver's point of view, endpoint 3 is actually receiving a message from MPI rank 0 with 
tag=65537.} 
     153\label{fig:sendrecv} 
     154\end{figure} 
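Putting the rank map and the tag layout together, the send of Figure \ref{fig:sendrecv} can be sketched as below (helper names are ours, not the EP\_XIOS API; the bit ordering is the one implied by the tag value 65537 in the figure, with the user tag in the high bits and the receiver's local rank in the lowest 8 bits): 

\begin{verbatim}
// Illustrative tag packing and endpoint send (not the EP_XIOS API).
// Relies on the ep_communicator/resolve sketch given earlier.
#include <mpi.h>

inline int ep_encode_tag(int user_tag, int src_local, int dst_local)
{
    // 15 high bits: user tag | 8 bits: sender local rank | 8 bits: receiver local rank
    return (user_tag << 16) | (src_local << 8) | dst_local;
}

// Send "count" doubles from this endpoint to endpoint "dst_ep".
void ep_send(const double* buf, int count, int dst_ep, int user_tag,
             const ep_communicator& comm)
{
    const auto [dst_mpi_rank, dst_local] = resolve(comm, dst_ep);
    const int  mpi_tag = ep_encode_tag(user_tag, comm.ep_rank_local, dst_local);
    MPI_Send(buf, count, MPI_DOUBLE, dst_mpi_rank, mpi_tag, comm.mpi_comm);
}

// With two threads per process (so that endpoint 3 is thread 1 of process 1),
// endpoint 0 sending to endpoint 3 with user_tag = 1 yields
// ep_encode_tag(1, 0, 1) == 65537, as in the P2P example figure above.
\end{verbatim}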
     155 
     156 
     157 
     158 
     159With the rank map, tag extension, and the matching-probe techniques, we are now able to call any P2P communication in the endpoint  
     160environment. For the collective communications, we apply a step-by-step execution pattern and no special technique is required. A  
     161step-by-step execution pattern consists of 3 steps (not necessarily in this order and not all steps are needed): arrangement of the source  
     162data, execution of the MPI function by all master/root threads, distribution or arrangement of the resulting data among threads.  
     163 
     164%The most representative functions of the collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|. 
     165 
For example, if we want to perform a broadcast operation, only 2 steps are needed (\textit{cf.} Figure \ref{fig:bcast}). Firstly, the root 
thread, along with the master threads of the other processes, performs the classic \verb|MPI_Bcast| operation. Secondly, the root thread and the 
master threads send the data to the other threads of their process via local memory transfer.  
     169 
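A minimal sketch of this two-step broadcast, under the assumption that it is called by every thread of an OpenMP parallel region (again with invented names, not the EP\_XIOS code), could be: 

\begin{verbatim}
// Two-step broadcast sketch following the description above (not EP_XIOS code).
// Step 1: only the master thread of each process joins the classic MPI_Bcast.
// Step 2: the other threads of the process get the data via shared memory.
#include <mpi.h>
#include <omp.h>
#include <cstring>

void ep_bcast(double* buffer, int count, int root_mpi_rank, MPI_Comm comm)
{
    static double* shared = nullptr;

    #pragma omp master
    {
        shared = buffer;                                  // expose master's buffer
        MPI_Bcast(shared, count, MPI_DOUBLE, root_mpi_rank, comm);
    }
    #pragma omp barrier                                   // broadcast result visible

    if (buffer != shared)                                 // slave threads only
        std::memcpy(buffer, shared, count * sizeof(double));
    #pragma omp barrier
}
\end{verbatim}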
     170\begin{figure}[ht] 
     171\centering 
     172\includegraphics[scale=0.3]{bcast.png}  
\caption{Step-by-step execution of a broadcast in the endpoint interface: a classic \texttt{MPI\_Bcast} among the master threads, followed by local memory transfers to the other threads.} 
     174\label{fig:bcast} 
     175\end{figure} 
     176 
Figure \ref{fig:allreduce} illustrates how the \verb|MPI_Allreduce| function is carried out in the endpoint interface. First of all, we 
perform an intra-process ``allreduce'' operation: source data is reduced from the slave threads to the master thread via local memory transfer. 
Next, all master threads call the classic \verb|MPI_Allreduce| routine. Finally, each master thread sends the reduced data to its 
slave threads via local memory transfer.  
     181 
     182\begin{figure}[ht] 
     183\centering 
     184\includegraphics[scale=0.3]{allreduce.png}  
\caption{Step-by-step execution of \texttt{MPI\_Allreduce} in the endpoint interface: intra-process reduction to the master threads, classic \texttt{MPI\_Allreduce} among the master threads, then redistribution of the result to the slave threads.} 
     186\label{fig:allreduce} 
     187\end{figure} 
     188 
Other MPI routines, such as \verb|MPI_Wait|, \verb|MPI_Intercomm_create|, \textit{etc.}, are described in the technical report of the 
endpoint interface. 
     191 
\section{The multi-threaded XIOS and performance results} 
     193 
The development of the endpoint interface for the thread-friendly XIOS library took about a year and a half. The main difficulty is the 
coexistence of MPI processes and OpenMP threads: all MPI classes had to be redefined in the endpoint interface, along with all the routines. 
The development is now available on the forge server: \url{http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/dev/branch_openmp}. A 
technical report is also available in which one can find more details about how the endpoint interface works and how the routines are implemented 
\cite{ep:2018}. We must note that the thread-friendly XIOS library is still being optimized; a stable version will be released in the 
future. 
     200 
All the functionality of XIOS is preserved in its thread-friendly version. Single-threaded code can work successfully with the new 
version of XIOS. For multi-threaded models, some modifications are needed in order to work with the multi-threaded XIOS library. Details can 
be found in our technical report \cite{ep:2018}. 
     204 
     205Even though the multi-threaded 
    105206 
    106207\section{Performance of LMDZ using EP\_XIOS} 
     
    114215simulation duration settings: 1 day, 5 days, 15 days, and 31 days.  
    115216 
    116 \begin{figure}[h] 
     217\begin{figure}[ht] 
    117218 \centering 
    118219 \includegraphics[scale = 0.6]{LMDZ_perf.png} 
     
decrease in time of 25\%. Even though the 25\% may seem small, it is still a gain in performance obtained with existing computing resources. 
    132233 
     234\section{Performance of EP\_XIOS} 
     235 
\begin{itemize} 
\item workfloz\_cmip6 
\item light output 
\item 24*8+2 
\item 30s - 52s 
\item 32 days 
\item histmth with daily output 
\end{itemize} 
     242 
    133243\section{Perspectives of EP\_XIOS} 
    134244 
     245 
     246\bibliographystyle{plain} 
     247\bibliography{reference} 
     248 
    135249\end{document}           