% source: XIOS/dev/branch_openmp/Note/rapport ESIWACE.tex @ 1560
\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{cprotect}

% Title Page

\title{Developing XIOS with multi-threading: accelerating the I/O of climate models}

\author{}


\begin{document}
\maketitle

\section{Context}

Climate simulation models running on a large number of computing resources can produce a very large volume of data. At this scale, the I/O and the post-processing of the data become a performance bottleneck. In order to manage efficiently the data flow generated by the simulations, we use XIOS, developed by the Institut Pierre Simon Laplace and Maison de la Simulation.

XIOS, a library designed for high-performance computing, allows us to easily and efficiently manage parallel I/O on the storage systems. XIOS uses a client/server scheme in which some computing resources (servers) are reserved exclusively for I/O in order to minimize their impact on the performance of the climate models (clients). The clients and servers are executed in parallel and communicate asynchronously. In this way, the I/O peaks can be smoothed out, as data flows are sent to the servers steadily throughout the simulation, and the time spent writing data on the server side can be completely overlapped by computation on the client side.

\begin{figure}[ht]
\includegraphics[scale=0.4]{Charge1.png}
\includegraphics[scale=0.4]{Charge2.png}
\caption{On the left, each peak of computing power corresponds to a valley of memory bandwidth, which shows that the computing resources alternate between computation and I/O. On the right, both curves are smooth, which means that the computing resources have a stable workload, either computation or I/O.}
\end{figure}


XIOS works well with many climate simulation codes. For example, LMDZ\footnote{LMDZ is a general circulation model (or global climate model) developed since the 70s at the "Laboratoire de Météorologie Dynamique", which includes various variants for the Earth and other planets (Mars, Titan, Venus, Exoplanets). The 'Z' in LMDZ stands for "zoom" (and 'LMD' is for "Laboratoire de Météorologie Dynamique"). \url{http://lmdz.lmd.jussieu.fr}}, NEMO\footnote{Nucleus for European Modelling of the Ocean, alias NEMO, is a state-of-the-art modelling framework of ocean-related engines. \url{https://www.nemo-ocean.eu}}, ORCHIDEE\footnote{The land surface model of the IPSL (Institut Pierre Simon Laplace) Earth System Model. \url{https://orchidee.ipsl.fr}}, and DYNAMICO\footnote{The DYNAMICO project develops a new dynamical core for LMD-Z, the atmospheric general circulation model (GCM) part of the IPSL-CM Earth System Model. \url{http://www.lmd.polytechnique.fr/~dubos/DYNAMICO/}} all use XIOS as their output back end. M\'et\'eo-France and the Met Office have also chosen XIOS to manage the I/O of their models.


\section{Development of a thread-friendly XIOS}

Although XIOS copes well with many models, there is one potential optimization that needs to be investigated: making XIOS thread-friendly.

This topic arises from the configuration of the climate models. Take LMDZ as an example: it is designed with a two-level parallelization scheme. More specifically, LMDZ uses a domain decomposition method in which each sub-domain is associated with one MPI process. Inside a sub-domain, the model also uses OpenMP directives to accelerate the computation. We can thus think of each sub-domain as being divided into sub-sub-domains, each managed by a thread.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.5]{domain.pdf}
\caption{Illustration of the domain decomposition used in LMDZ.}
\end{figure}

As we know, each sub-domain, or in other words each MPI process, is an XIOS client. The data exchange between the clients and the XIOS servers is handled by MPI communications. In order to write an output field, all threads must gather their data on the master thread, which acts as the MPI process, so that it can call the MPI routines. This method has two disadvantages: first, time is spent gathering the information on the master thread, which not only increases the memory use but also implies an OpenMP barrier; second, while the master thread calls the MPI routine, the other threads are idle, which is a waste of computing resources. What we want to obtain with the thread-friendly XIOS is that all threads can act like MPI processes: they can call the MPI routines directly, so that neither memory nor computing resources are wasted, as shown in Figure \ref{fig:omp}.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.6]{omp.pdf}
\caption{With the thread-friendly XIOS, every thread can call MPI routines directly, as if it were an MPI process.}
\label{fig:omp}
\end{figure}

There are two ways to make XIOS thread-friendly. The first is to change the structure of XIOS, which demands a lot of modifications in the XIOS library. Knowing that XIOS is about 100\,000 lines of code, this method would be very time consuming. Moreover, the modifications would be local to XIOS: if we wanted to make another code thread-friendly, we would have to redo them. The second choice is to add an extra interface to MPI in order to manage the threads. When a thread wants to call an MPI routine inside XIOS, it first passes through the interface, in which the communication information is analyzed before the actual MPI routine is invoked. With this method, we only need to modify a very small part of XIOS to make it work. What is more interesting is that the interface we created can be adapted to other MPI-based libraries.


In this project, we chose to implement an interface to handle the threads. To do so, we introduce the MPI\_endpoint, a concept proposed in recent MPI Forum discussions; several papers have already discussed the importance of such an idea and have introduced the framework of the MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown in Figure \ref{fig:scheme}. In the MPI\_endpoint environment, each OpenMP thread is associated with a unique rank (the global endpoint rank), an endpoint communicator, and a local rank (its rank inside the MPI process), which is very similar to the value returned by \verb|omp_get_thread_num|. The global endpoint rank replaces the classic MPI rank and is used in the MPI communication calls.
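
As a minimal illustration (the structure and names below are ours, not the actual XIOS classes), the bookkeeping attached to one endpoint can be summarized by three integers:

\begin{lstlisting}[language=C++]
// Hypothetical sketch of the per-endpoint information described above;
// the real endpoint communicator class in XIOS holds more than this.
struct ep_rank_info
{
  int ep_rank;      // global endpoint rank, replaces the classic MPI rank
  int ep_rank_loc;  // local rank inside the MPI process (cf. omp_get_thread_num())
  int mpi_rank;     // rank of the underlying MPI process
};
\end{lstlisting}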


\begin{figure}[ht]
\begin{center}
\includegraphics[scale=0.4]{scheme.png}
\end{center}
\caption{The MPI\_endpoint concept: each thread of an MPI process is associated with a global endpoint rank, a local rank, and an endpoint communicator.}
\label{fig:scheme}
\end{figure}

%XIOS is a library dedicated to IO management of climate code. It has a client-server pattern in which clients are in charge of computations and servers manage the reading and writing of files. The communication between clients and servers are handled by MPI. However, some of the climate models (\textit{e.g.} LMDZ) nowadays use an hybrid programming policy. Within a shared memory node, OpenMP directives are used to manage message exchanges. In such configuration, XIOS can not take full advantages of the computing resources to maximize the performance. This is because XIOS can only work with MPI processes. Before each call of XIOS routines, threads of one MPI process must gather their information to the master thread who works as an MPI process. After the call, the master thread distributes the updated information among its slave threads. As result, all slave threads have to wait while the master thread calls the XIOS routines. This introduce extra synchronization into the model and leads to not optimized performance. Aware of this situation, we need to develop a new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as they were processes. To do so, we introduce the MPI endpoints.

Another important aspect of the MPI\_endpoint interface is that each endpoint has knowledge of the ranks of the other endpoints in the same communicator. This knowledge is necessary because, when executing an MPI communication, for example a point-to-point exchange, we need to know not only the endpoint ranks of the sender and receiver threads, but also their local thread ranks and the MPI ranks of the sender and receiver processes. This ranking information is stored in a map object included in the endpoint communicator class.
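
A minimal sketch of such a rank map (the container and the example layout are illustrative assumptions, not the actual XIOS implementation; the layout matches the point-to-point example of Figure \ref{fig:sendrecv}, where endpoint 3 is the second thread of the second MPI process):

\begin{lstlisting}[language=C++]
#include <map>
#include <utility>

// For every endpoint rank of a communicator, store the local thread rank
// and the rank of the MPI process that hosts it.
using rank_map = std::map<int, std::pair<int, int>>; // ep_rank -> (ep_rank_loc, mpi_rank)

// Example for 2 MPI processes with 2 threads each (4 endpoints in total):
// endpoint 0 -> (thread 0, process 0)   endpoint 2 -> (thread 0, process 1)
// endpoint 1 -> (thread 1, process 0)   endpoint 3 -> (thread 1, process 1)
rank_map example_map()
{
  return { {0, {0, 0}}, {1, {1, 0}}, {2, {0, 1}}, {3, {1, 1}} };
}
\end{lstlisting}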




%\newpage

%The MPI\_endpoint interface we implemented lies on top of an existing MPI Implementation. It consists of wrappers to all MPI functions
%used in XIOS.


In XIOS, we use the ``probe'' technique to check for arrived messages and then perform the receive action. The principle is that the sender process executes the send operation as usual. However, to minimize the time spent waiting for incoming messages, the receiver process first calls the \verb|MPI_Probe| function to check whether a message destined for it has arrived. If so, the process then calls \verb|MPI_Recv| to receive the message. If not, the receiver process can carry on with other tasks, or repeat the \verb|MPI_Probe| and \verb|MPI_Recv| steps if the message is needed immediately. This technique works well in the current version of XIOS. However, if we introduce threads into this mechanism, problems can occur: an incoming message is identified only by its tag and the receiver's MPI rank. Because the threads of a process share the same MPI rank, and a probed message remains visible in the message queue, a data race can occur and the message can be received by the wrong thread.


To solve this problem, we introduce the ``matching-probe'' technique. The idea is that each thread is equipped with a local incoming-message queue. Each time a thread calls an MPI function, for example \verb|MPI_Recv|, it first calls the \verb|MPI_Mprobe| function to query the MPI incoming-message queue for a message with any tag and from any source. Once a message is probed, the thread gets a handle to the incoming message and this specific message is removed from the MPI message queue. The thread then identifies the message to determine the destination thread's local rank and stores the message handle in the local queue of the target thread. The thread repeats these steps until the MPI incoming-message queue is empty. It then applies the usual ``probe'' technique to its own local incoming-message queue to check whether the required message is available. If so, it performs the receive operation (using \verb|MPI_Mrecv| on the stored message handle). With this ``matching-probe'' technique, we can ensure that a message is probed only once and is received by the right receiver.
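
The sketch below illustrates the dispatch step with the MPI-3 matched-probe routines (\verb|MPI_Improbe| is the non-blocking form of \verb|MPI_Mprobe|); the per-thread queues, the tag decoding and the locking are simplified placeholders for what the endpoint interface actually does:

\begin{lstlisting}[language=C++]
#include <mpi.h>
#include <queue>
#include <vector>

struct probed_message { MPI_Message handle; MPI_Status status; };

// One local incoming-message queue per thread of the process (illustrative);
// it must be resized once to the number of threads before use.
std::vector<std::queue<probed_message>> local_queue;

// Placeholder: extract the destination thread's local rank from the tag
// (see the tag layout discussed below).
int dest_local_rank(int mpi_tag) { return mpi_tag & 0xFF; }

// Drain the MPI incoming-message queue and dispatch the message handles
// to the local queues of the destination threads.
void drain_mpi_queue(MPI_Comm comm)
{
  int flag = 1;
  while (flag) {
    probed_message m;
    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &m.handle, &m.status);
    if (!flag) break;                   // the MPI queue is empty
    int dest = dest_local_rank(m.status.MPI_TAG);
    #pragma omp critical(local_queues)  // the local queues are shared between threads
    local_queue[dest].push(m);          // the message now belongs to thread dest
  }
}
// The owning thread later receives the matched message with
//   MPI_Mrecv(buf, count, datatype, &m.handle, &m.status);
\end{lstlisting}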



Another issue that needs to be clarified with this technique is how to identify the receiver's local rank. The solution is to use the tag argument. In the MPI environment, a tag is an integer ranging from 0 up to $2^{31}-1$, depending on the implementation. We can exploit this large range to store in the tag information about the source and destination thread ranks. In our endpoint interface, we choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8 bits for the sender thread's local rank, and the last 8 bits for the receiver thread's local rank (\textit{c.f.} Figure \ref{fig:tag}). In this way, with an extra analysis of the tag, we can identify the local ranks of the sender and the receiver in any P2P communication. As a result, when a thread probes a message, it knows exactly in which local queue the probed message should be stored.
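
A sketch of this encoding is given below. The exact bit layout (user tag in the highest bits, then the sender's local rank, then the receiver's local rank) is an assumption on our part, but it is consistent with the example of Figure \ref{fig:sendrecv}: a user tag of 1 sent from local rank 0 to local rank 1 gives $1 \times 2^{16} + 0 \times 2^{8} + 1 = 65537$.

\begin{lstlisting}[language=C++]
#include <cassert>

constexpr int USER_TAG_BITS = 15; // classic MPI tag chosen by the caller
constexpr int RANK_BITS     = 8;  // local (intra-process) thread rank

// Pack the user tag and the two local ranks into a single MPI tag.
int encode_tag(int user_tag, int src_local, int dst_local)
{
  assert(user_tag  < (1 << USER_TAG_BITS));
  assert(src_local < (1 << RANK_BITS) && dst_local < (1 << RANK_BITS));
  return (user_tag << (2 * RANK_BITS)) | (src_local << RANK_BITS) | dst_local;
}

int user_tag_of (int mpi_tag) { return mpi_tag >> (2 * RANK_BITS); }
int src_local_of(int mpi_tag) { return (mpi_tag >> RANK_BITS) & 0xFF; }
int dst_local_of(int mpi_tag) { return mpi_tag & 0xFF; }

// encode_tag(1, 0, 1) == 65537, the MPI tag used in the example of the
// point-to-point figure below.
\end{lstlisting}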

\begin{figure}[ht]
 \centering
 \includegraphics[scale=0.4]{tag.png}
 \caption{Structure of the extended tag used in the endpoint interface: the user-defined tag, the sender thread's local rank, and the receiver thread's local rank.}\label{fig:tag}
\end{figure}

In Figure \ref{fig:tag}, Tag contains the user-defined value for a given communication. MPI\_tag is computed in the endpoint interface with the help of the rank map and is the value used in the actual MPI calls.

\begin{figure}[ht]
\centering
 \includegraphics[scale = 0.4]{sendrecv.png}
\caption{This figure shows the classic pattern of a P2P communication with the endpoint interface. Thread/endpoint rank 0 sends a message to thread/endpoint rank 3 with tag=1. The underlying MPI function called by the sender is in fact a send to MPI rank 1 with tag=65537. From the receiver's point of view, endpoint 3 actually receives a message from MPI rank 0 with tag=65537.}
\label{fig:sendrecv}
\end{figure}




With the rank map, the tag extension, and the matching-probe technique, we are now able to perform any P2P communication in the endpoint environment. For the collective communications, we apply a step-by-step execution pattern and no special technique is required. A step-by-step execution pattern consists of up to three steps (not necessarily in this order, and not all steps are always needed): arrangement of the source data, execution of the MPI function by all master/root threads, and distribution or arrangement of the resulting data among the threads.


For example, if we want to perform a broadcast operation, only two steps are needed (\textit{c.f.} Figure \ref{fig:bcast}). First, the root thread, together with the master threads of the other processes, performs the classic \verb|MPI_Bcast| operation. Second, the root thread and the master threads send the data to the other threads of their process via local memory transfer.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.3]{bcast.png}
\cprotect\caption{\verb|MPI_Bcast|}
\label{fig:bcast}
\end{figure}
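
A minimal sketch of this two-step broadcast, with the intra-process step done through shared memory and OpenMP (it assumes that the root endpoint is the master thread of the root process; buffer handling and synchronization are simplified with respect to the actual interface):

\begin{lstlisting}[language=C++]
#include <mpi.h>
#include <omp.h>
#include <cstring>

// Broadcast count integers to every thread of every process.
// Must be called by every thread of the OpenMP parallel region.
void ep_bcast_int(int* thread_buf, int count, int mpi_root, MPI_Comm comm)
{
  static int* shared_buf = nullptr; // shared by the threads of the process

  #pragma omp master
  {
    shared_buf = thread_buf;                               // expose the master's buffer
    MPI_Bcast(shared_buf, count, MPI_INT, mpi_root, comm); // step 1: classic MPI_Bcast
  }
  #pragma omp barrier               // wait until the MPI broadcast is done
  if (omp_get_thread_num() != 0)    // step 2: local memory transfer to the other threads
    std::memcpy(thread_buf, shared_buf, count * sizeof(int));
  #pragma omp barrier               // keep shared_buf valid until everyone has copied
}
\end{lstlisting}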

Figure \ref{fig:allreduce} illustrates how the \verb|MPI_Allreduce| function is carried out in the endpoint interface. First of all, we perform an intra-process ``allreduce'' operation: the source data is reduced from the slave threads to the master thread via local memory transfer. Next, all master threads call the classic \verb|MPI_Allreduce| routine. Finally, each master thread sends the updated, reduced data to its slave threads via local memory transfer.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.3]{allreduce.png}
\cprotect\caption{\verb|MPI_Allreduce|}
\label{fig:allreduce}
\end{figure}
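
A corresponding sketch for the three steps of the endpoint \verb|MPI_Allreduce| (here a sum of one integer per thread; this is a simplified illustration, not the actual implementation):

\begin{lstlisting}[language=C++]
#include <mpi.h>

// All-reduce (sum) of one integer per endpoint, over all endpoints.
// Must be called by every thread of the OpenMP parallel region.
int ep_allreduce_sum(int my_value, MPI_Comm comm)
{
  static int local_sum;      // shared by the threads of the process

  #pragma omp barrier        // make sure any previous use of local_sum is finished
  #pragma omp master
  local_sum = 0;
  #pragma omp barrier

  #pragma omp atomic         // step 1: intra-process reduction via shared memory
  local_sum += my_value;
  #pragma omp barrier

  #pragma omp master         // step 2: classic MPI_Allreduce between the master threads
  MPI_Allreduce(MPI_IN_PLACE, &local_sum, 1, MPI_INT, MPI_SUM, comm);
  #pragma omp barrier

  return local_sum;          // step 3: every thread reads the reduced value
}
\end{lstlisting}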

Other MPI routines, such as \verb|MPI_Wait|, \verb|MPI_Intercomm_create|, \textit{etc.}, are described in the technical report of the endpoint interface \cite{ep:2018}.

\section{The multi-threaded XIOS and performance results}

The development of the endpoint interface for the thread-friendly XIOS library took about one year and a half. The main difficulty is the co-existence of MPI processes and OpenMP threads. One essential requirement for using the endpoint interface is that the underlying MPI implementation must provide level-3 thread support, that is, \verb|MPI_THREAD_MULTIPLE|: if the MPI process is multi-threaded, multiple threads may call MPI at once with no restrictions. Another important aspect to be mentioned is that XIOS contains variables with the \verb|static| attribute, which means that inside an MPI process the threads share these variables. In order to use the endpoint interface correctly, these static variables have to be declared \verb|threadprivate| to limit their visibility to each thread.
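
The two requirements above can be illustrated by the following minimal sketch (the variable is a made-up example):

\begin{lstlisting}[language=C++]
#include <mpi.h>

// A static variable that must not be shared between the threads/endpoints:
// declaring it threadprivate gives each thread its own private copy.
static int buffer_size = 0;
#pragma omp threadprivate(buffer_size)

int main(int argc, char** argv)
{
  // Request the level-3 thread support required by the endpoint interface.
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    MPI_Abort(MPI_COMM_WORLD, 1); // this MPI library cannot run the thread-friendly XIOS

  // ... parallel region in which every thread acts as an XIOS client ...

  MPI_Finalize();
  return 0;
}
\end{lstlisting}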

To develop the endpoint interface, we redefined all the MPI classes along with all the MPI routines that are used in the XIOS library. The current version of the interface includes about 7000 lines of code and is available on the forge server: \url{http://forge.ipsl.jussieu.fr/ioserver/browser/XIOS/dev/branch_openmp}. A technical report is also available, in which one can find more details about how the endpoints work and how the routines are implemented \cite{ep:2018}. We must note that the thread-friendly XIOS library is still in the optimization phase. A stable version will be released in the future.

All the functionalities of XIOS are preserved in the thread-friendly XIOS library. Single-threaded code can work successfully under the endpoint interface with the new version of XIOS. For multi-threaded models, some modifications are needed in order to work with the multi-threaded XIOS library. For example, the MPI initialization has to be modified to request \verb|MPI_THREAD_MULTIPLE| support. Each thread should have its own data set. Most importantly, the OpenMP master region in which the master thread calls the XIOS routines must be removed, so that every thread can call the XIOS routines simultaneously. More details can be found in our technical report \cite{ep:2018}.
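
Schematically, the required change in the model code looks like the sketch below, where \verb|compute_subdomain| and \verb|write_field_via_xios| are placeholders for the model computation and the XIOS output call, not the real APIs:

\begin{lstlisting}[language=C++]
#include <vector>

// Placeholder stand-ins for the model computation and the XIOS output call.
void compute_subdomain(std::vector<double>& data) { /* ... */ }
void write_field_via_xios(const std::vector<double>& data) { /* ... */ }

void output_step_with_classic_xios(std::vector<double>& gathered)
{
  // Before: the threads gather their data and only the master thread calls XIOS.
  #pragma omp parallel
  {
    std::vector<double> my_data(100);
    compute_subdomain(my_data);
    #pragma omp critical  // gather on the shared buffer (simplified)
    gathered.insert(gathered.end(), my_data.begin(), my_data.end());
    #pragma omp barrier
    #pragma omp master
    write_field_via_xios(gathered); // the other threads are idle here
  }
}

void output_step_with_threadfriendly_xios()
{
  // After: every thread is an XIOS client and writes its own data directly.
  #pragma omp parallel
  {
    std::vector<double> my_data(100);
    compute_subdomain(my_data);
    write_field_via_xios(my_data);  // no master region, no gathering
  }
}
\end{lstlisting}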

Even though the multi-threaded XIOS library is not yet complete and further optimization is ongoing, we have already run some tests to assess the potential of the endpoint framework. We take LMDZ as the target model and have tested it with several workloads.

\subsection{LMDZ work-flow}

In the LMDZ work-flow, we produce a daily output file containing up to 413 two-dimensional variables and 187 three-dimensional variables. According to the user's needs, the ``output\_level'' key argument in the \verb|xml| configuration file can be changed to select the variables to be written. In our tests, we set ``output\_level=2'' for a light output and ``output\_level=11'' for a full output. We run the LMDZ code for one-, two-, and three-month simulations using 12 MPI client processes and 1 server process. Each client process includes 8 OpenMP threads, which gives 96 XIOS clients in total.

\subsection{CMIP6 work-flow}

\begin{comment}
\section{Performance of LMDZ using EP\_XIOS}

With the new version of XIOS, we are now capable of taking full advantage of the computing resources allocated to a simulation model when calling XIOS functions. All threads can participate in XIOS as if they were MPI processes. We have tested EP\_XIOS in LMDZ and the performance results are very encouraging.

In our tests, we used 12 client processes with 8 threads each (96 XIOS clients in total) and one single-threaded server process. We have two output densities: the light output gives mainly two-dimensional fields while the heavy output records more 3D fields. We also have different simulation duration settings: 1 day, 5 days, 15 days, and 31 days.

\begin{figure}[ht]
 \centering
 \includegraphics[scale = 0.6]{LMDZ_perf.png}
 \caption{Speedup obtained by using EP in LMDZ simulations.}
\end{figure}

In this figure, we show the speedup, computed as $\displaystyle{\frac{time_{XIOS}}{time_{EP\_XIOS}}}$. The blue bars represent the speedup of the XIOS file output and the red bars the speedup of LMDZ: computation + XIOS file output. In all experiments, we observe a speedup, which represents a gain in performance. One important conclusion we can draw from these results is that the denser the output, the more efficient EP\_XIOS is. With 8 threads per process, we can reach a speedup in XIOS of up to 6, and a speedup of 1.5 in LMDZ, which represents a decrease of the total execution time to about 67\% ($\approx 1/1.5$). This observation clearly confirms the importance of using EP in XIOS.

The reason why LMDZ does not show much speedup is that the model is computation dominant: the time spent on calculation is much longer than that spent on the file output. For example, if 30\% of the execution time is spent on the output, then with a speedup of 6 we obtain a decrease in time of 25\%. Even though 25\% may seem small, it is still a gain in performance with the existing computing resources.

\section{Performance of EP\_XIOS}

workfloz\_cmip6
light output
24*8+2
30s - 52s
32 days
histmth with daily output

\end{comment}


\section{Future work for XIOS}


\bibliographystyle{plain}
\bibliography{reference}

\end{document}