% Source: XIOS/dev/branch_openmp/Note/rapport ESIWACE.tex, revision 1551 (last change checked in by yushan: ``report updated'')
\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}
\usepackage{url}

% Title Page

\title{Developing XIOS with multi-threading: accelerating the I/O of climate models}

\author{}


\begin{document}
\maketitle

\section{Context}

Climate simulation models running on a large number of computing resources can produce a very large volume of data. At this
scale, I/O and data post-processing become a performance bottleneck. To manage the data flow generated by the simulations
efficiently, we use XIOS, developed by the Institut Pierre Simon Laplace and Maison de la Simulation.

XIOS is a library dedicated to high-performance computing that manages parallel I/O on storage systems easily and efficiently. XIOS
uses a client/server scheme in which some computing resources (the servers) are reserved exclusively for I/O in order to minimize its impact on
the performance of the climate models (the clients). The clients and servers run in parallel and communicate asynchronously. In this
way, I/O peaks are smoothed out because data is sent to the servers continuously throughout the simulation, and the time spent
writing data on the server side can be completely overlapped by computation on the client side.

\begin{figure}[h]
\includegraphics[scale=0.4]{Charge1.png}
\includegraphics[scale=0.4]{Charge2.png}
\caption{On the left, each peak of computing power corresponds to a valley of memory bandwidth, which shows that the computing resources
alternate between computation and I/O. On the right, both curves are smooth, which means that the computing resources have a stable
workload, either computation or I/O.}
\end{figure}


XIOS works well with many climate simulation codes. For example, LMDZ\footnote{LMDZ is a general circulation model (or global climate model)
developed since the 1970s at the ``Laboratoire de Météorologie Dynamique'', which includes variants for the Earth and other planets
(Mars, Titan, Venus, exoplanets). The `Z' in LMDZ stands for ``zoom'' (and `LMD' stands for ``Laboratoire de Météorologie Dynamique'').
\url{http://lmdz.lmd.jussieu.fr}}, NEMO\footnote{Nucleus for European Modelling of the Ocean, alias NEMO, is a
state-of-the-art modelling framework for ocean-related engines. \url{https://www.nemo-ocean.eu}}, ORCHIDEE\footnote{The land surface
model of the IPSL (Institut Pierre Simon Laplace) Earth System Model. \url{https://orchidee.ipsl.fr}}, and DYNAMICO\footnote{The DYNAMICO
project develops a new dynamical core for LMD-Z, the atmospheric general circulation model (GCM) part of the IPSL-CM Earth System Model.
\url{http://www.lmd.polytechnique.fr/~dubos/DYNAMICO/}} all use XIOS as their output back end. M\'et\'eo-France and the Met Office have also chosen XIOS
to manage the I/O of their models.


\section{Development of thread-friendly XIOS}

Although XIOS copes well with many models, one potential optimization remains to be investigated: making XIOS thread-friendly.

This topic arises from the configuration of the climate models. Take LMDZ as an example: it is designed with a two-level parallelization scheme. More specifically, LMDZ uses a domain decomposition method in which each sub-domain is associated with one MPI process. Inside each sub-domain, the model also uses OpenMP directives to accelerate the computation. We can think of each sub-domain as being divided into sub-sub-domains, each managed by a thread.

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{domain.pdf}
\caption{Illustration of the domain decomposition used in LMDZ.}
\end{figure}

As we know, each sub-domain, or in other words each MPI process, is an XIOS client. The data exchange between the clients and the XIOS servers is handled by MPI communications. In order to write an output field, all threads must gather their data to the master thread, which acts as the MPI process and calls the MPI routines. This method has two disadvantages: first, time is spent gathering information to the master thread, which not only increases memory use but also implies an OpenMP barrier; second, while the master thread calls the MPI routines, the other threads are idle, which wastes computing resources. What we want to obtain with the thread-friendly XIOS is that all threads can act like MPI processes. They can call the MPI routines directly, so that neither memory nor computing resources are wasted, as shown in Figure \ref{fig:omp}.

\begin{figure}[h!]
\centering
\includegraphics[scale=0.6]{omp.pdf}
\caption{}
\label{fig:omp}
\end{figure}

There are two ways to make XIOS thread-friendly. The first is to change the structure of XIOS, which demands many modifications in the XIOS library. Since XIOS has about 100\,000 lines of code, this method would be very time-consuming. Moreover, the modifications would be local to XIOS: if we wanted to make another code thread-friendly, we would have to redo them. The second choice is to add an extra interface to MPI in order to manage the threads. When a thread wants to call an MPI routine inside XIOS, it first passes through the interface, in which the communication information is analyzed before the MPI routine is invoked. With this method, we only need to modify a very small part of XIOS to make it work. What is more interesting is that the interface we created can be adapted to suit other MPI-based libraries.


In this project, we choose to implement the interface to handle the threads. To do so, we introduce MPI\_endpoint, a concept proposed at the MPI Forum; several papers have already discussed the importance of this idea and introduced the framework of MPI\_endpoint \cite{Dinan:2013}\cite{Sridharan:2014}. The concept of an endpoint is shown in Figure \ref{fig:scheme}. Each thread of an MPI process is associated with a unique rank (the global endpoint rank) and an endpoint communicator. It also has a local rank (its rank inside the MPI process), which is very similar to the thread number returned by \verb|omp_get_thread_num|.

\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.4]{scheme.png}
\end{center}
\caption{}
\label{fig:scheme}
\end{figure}

%XIOS is a library dedicated to IO management of climate code. It has a client-server pattern in which clients are in charge of computations and servers manage the reading and writing of files. The communication between clients and servers are handled by MPI. However, some of the climate models (\textit{e.g.} LMDZ) nowadays use an hybrid programming policy. Within a shared memory node, OpenMP directives are used to manage message exchanges. In such configuration, XIOS can not take full advantages of the computing resources to maximize the performance. This is because XIOS can only work with MPI processes. Before each call of XIOS routines, threads of one MPI process must gather their information to the master thread who works as an MPI process. After the call, the master thread distributes the updated information among its slave threads. As result, all slave threads have to wait while the master thread calls the XIOS routines. This introduce extra synchronization into the model and leads to not optimized performance. Aware of this situation, we need to develop a new version of XIOS (EP\_XIOS) which can work with threads, or in other words, can consider threads as they were processes. To do so, we introduce the MPI endpoints.


The MPI\_endpoints interface we implemented lies on top of an existing MPI implementation. It consists of wrappers for all MPI functions used in XIOS,
which are re-implemented in order to cope with OpenMP threads. The idea is that, in the MPI endpoints environment, each OpenMP thread is
associated with a unique rank and with an endpoint communicator. This rank (the EP rank) replaces the classic MPI rank and is
used in MPI communications. To execute an MPI communication such as \verb|MPI_Send| successfully, knowing which
endpoint is the receiver is not sufficient: we also need to know which MPI process is involved in the communication. To
identify the MPI rank, we added a ``map'' to the EP communicator from which the relation between all EP and MPI ranks can be easily obtained.
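
The rank ``map'' described above can be illustrated with a minimal C++ sketch. The names \verb|EpLocation| and \verb|EpRankMap| are hypothetical and do not come from XIOS, and the sketch assumes that EP ranks are numbered consecutively, process by process; the actual numbering used in EP\_XIOS may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: where a given endpoint lives.
struct EpLocation {
    int mpi_rank;     // rank of the owning MPI process
    int thread_rank;  // local rank of the thread inside that process
};

// Hypothetical map from EP rank to (MPI rank, local thread rank).
// Assumption: EP ranks are assigned consecutively, process by process.
class EpRankMap {
public:
    // threads_per_process[p] = number of endpoints owned by MPI process p
    explicit EpRankMap(const std::vector<int>& threads_per_process) {
        for (int p = 0; p < static_cast<int>(threads_per_process.size()); ++p)
            for (int t = 0; t < threads_per_process[p]; ++t)
                table_.push_back(EpLocation{p, t});
    }

    // Look up the MPI process and local thread behind an EP rank.
    const EpLocation& locate(int ep_rank) const { return table_.at(ep_rank); }

    int size() const { return static_cast<int>(table_.size()); }

private:
    std::vector<EpLocation> table_;
};
```

With such a table, a wrapper around a send operation could look up the MPI rank of the destination endpoint before calling the underlying MPI routine.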


In XIOS, we used the ``probe'' technique to search for arrived messages before performing the receive action. The principle is
that sender processes execute their send operations as usual. However, to minimize the time spent waiting for incoming messages, the receiver
process first calls the \verb|MPI_Probe| function to check whether a message addressed to it has arrived. If so, the
process then calls \verb|MPI_Recv| to receive the message. If we introduce threads in this situation, problems
occur. The ``probe'' method is not suitable because a message addressed to a given process can be probed by any of
its threads. The message can therefore be received by the wrong thread, which leads to errors.

To solve this problem, we introduce the ``matching-probe'' technique. The idea is that each thread of a process is equipped with a local
incoming-message queue. All incoming messages are probed, sorted, and then stored in these queues according to their destination rank.
Every time we call an MPI function, we first call the \verb|MPI_Mprobe| function to get the handle of
the incoming message. Then, we identify the destination thread rank and store the message handle in the local queue of the target
thread. After this, we apply the usual ``probe'' technique to the local incoming-message queue. In this way, we can ensure that messages
are received by the right thread.
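
The routing step of the matching-probe technique can be sketched as follows. This is an illustration only: \verb|MessageHandle| is a hypothetical stand-in for the message handle and status returned by a matched probe, \verb|ProcessInbox| is not an XIOS class, and the sketch assumes that the receiver's thread rank occupies the lowest 8 bits of the tag. A real implementation would also need to protect the queues against concurrent access.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for the handle and status obtained from a matched probe.
struct MessageHandle {
    int source_mpi_rank;  // MPI process that sent the message
    int ep_tag;           // extended tag carrying the thread ranks
};

// One incoming-message queue per thread of the process (not thread-safe).
class ProcessInbox {
public:
    explicit ProcessInbox(int num_threads) : queues_(num_threads) {}

    // After a matched probe: store the handle in the destination thread's queue.
    void route(const MessageHandle& m) {
        int dst_thread = m.ep_tag & 0xFF;  // assumed layout: low 8 bits = receiver thread
        queues_.at(dst_thread).push_back(m);
    }

    // Local "probe": a thread only looks at its own queue.
    bool try_pop(int thread_rank, MessageHandle& out) {
        std::deque<MessageHandle>& q = queues_.at(thread_rank);
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }

private:
    std::vector<std::deque<MessageHandle>> queues_;
};
```

Because each thread only pops from its own queue, a message that was probed by one thread can still be delivered to the thread it was addressed to.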

Another issue remains in this technique: how do we identify the receiver's rank? The solution is to use the tag argument. In the MPI
environment, a tag is a non-negative integer. The MPI standard only guarantees an upper bound (\verb|MPI_TAG_UB|) of at least 32767, but many implementations accept tags up to $2^{31}-1$. We can exploit this large range to store information about the
source and destination thread ranks in the tag. We choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8
bits for the sender's thread rank, and the last 8 bits for the receiver's thread rank. In this way, with an extra analysis of the EP tag, we
can identify the ranks of the sender and the receiver in any P2P communication. As a result, when a thread probes a message, it knows
exactly in which local queue the probed message should be stored.
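
The 15 + 8 + 8 bit layout can be illustrated with a small C++ sketch. The function and type names are hypothetical, and the bit order is an assumption: here the classic tag occupies the 15 most significant bits of the 31-bit value, followed by the sender's thread rank and then the receiver's thread rank in the lowest 8 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decomposition of an extended (EP) tag.
struct EpTag {
    int user_tag;    // tag of the classic MPI communication (15 bits)
    int src_thread;  // sender's thread rank (8 bits)
    int dst_thread;  // receiver's thread rank (8 bits)
};

// Pack the three fields into one non-negative 31-bit integer.
// Assumed order: [user_tag | src_thread | dst_thread], high to low bits.
inline std::int32_t pack_ep_tag(const EpTag& t) {
    assert(t.user_tag >= 0 && t.user_tag < (1 << 15));
    assert(t.src_thread >= 0 && t.src_thread < (1 << 8));
    assert(t.dst_thread >= 0 && t.dst_thread < (1 << 8));
    return (t.user_tag << 16) | (t.src_thread << 8) | t.dst_thread;
}

// Recover the three fields from an extended tag.
inline EpTag unpack_ep_tag(std::int32_t tag) {
    return EpTag{(tag >> 16) & 0x7FFF, (tag >> 8) & 0xFF, tag & 0xFF};
}
```

With this layout the largest possible extended tag is $2^{31}-1$, so the scheme only works on MPI implementations whose tag upper bound is that large, and it limits a process to 256 threads.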


With the global rank map, the tag extension, and the matching-probe technique, we are able to use any P2P communication in the endpoint
environment. For collective communications, we perform a step-by-step execution and no special technique is required. The most
representative collective functions are \verb|MPI_Gather| and \verb|MPI_Bcast|. A step-by-step execution consists of
3 steps (not necessarily in this order): arrangement of the source data, execution of the MPI function by all
master/root threads, and distribution or arrangement of the data among the threads.

For example, a broadcast operation needs 2 steps. First, the root thread, along with the master threads of the
other processes, performs the classic \verb|MPI_Bcast| operation. Second, the root thread and the master threads send the data to the threads
sharing the same process via local memory transfer. As another example, the \verb|MPI_Gather| function also needs 2
steps. First, data is gathered from the slave threads to the master thread or the root thread. Next, the master threads and the root
thread execute the \verb|MPI_Gather| operation to complete the communication. Other collective calls such as \verb|MPI_Scan|,
\verb|MPI_Reduce|, \verb|MPI_Scatter|, \textit{etc.} follow the same principle of step-by-step execution.
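
The two-step gather can be mimicked without MPI in a short C++ sketch. All names are hypothetical, and plain vector copies stand in for both the shared-memory transfer and the inter-process gather; the sketch only shows the order of the steps and the resulting data layout, not the actual EP\_XIOS implementation.

```cpp
#include <cassert>
#include <vector>

// Contributions of the threads of one MPI process, in local thread-rank order.
using ThreadValues = std::vector<int>;

// Step 1 (inside one process): the master thread collects the contributions
// of its local threads through shared memory. No MPI call is involved.
ThreadValues local_gather(const ThreadValues& per_thread) {
    return per_thread;  // a plain copy, in thread-rank order
}

// Step 2 (between processes): the master threads take part in one classic
// gather. Concatenation in MPI-rank order stands in for the MPI call.
std::vector<int> master_gather(const std::vector<ThreadValues>& per_process) {
    std::vector<int> at_root;
    for (const ThreadValues& block : per_process)
        at_root.insert(at_root.end(), block.begin(), block.end());
    return at_root;
}

// Gather over all endpoints: local step first, then the inter-process step.
std::vector<int> ep_gather(const std::vector<ThreadValues>& values) {
    std::vector<ThreadValues> packed;
    for (const ThreadValues& v : values) packed.push_back(local_gather(v));
    return master_gather(packed);
}
```

If the endpoints are numbered process by process, the data arrives at the root ordered by endpoint rank, as it would after a gather in which every thread were an MPI process.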


\section{Performance of LMDZ using EP\_XIOS}

With the new version of XIOS, we can now take full advantage of the computing resources allocated to a simulation model when
calling XIOS functions. All threads can participate in XIOS as if they were MPI processes. We have tested EP\_XIOS in LMDZ and the
performance results are very encouraging.

In our tests, we used 12 client processes with 8 threads each (96 XIOS clients in total) and one single-threaded server process. We have 2
output densities: the light output gives mainly two-dimensional fields, while the heavy output records more 3D fields. We also have different
simulation durations: 1 day, 5 days, 15 days, and 31 days.

\begin{figure}[h]
 \centering
 \includegraphics[scale = 0.6]{LMDZ_perf.png}
 \caption{Speedup obtained by using EP in LMDZ simulations.}
\end{figure}

In this figure, we show the speedup, computed as $\displaystyle{\frac{\mathrm{time}_{\mathrm{XIOS}}}{\mathrm{time}_{\mathrm{EP\_XIOS}}}}$. The blue bars
represent the speedup of the XIOS file output and the red bars the speedup of LMDZ as a whole (computation + XIOS file output). In all experiments,
we observe a speedup, that is, a gain in performance. One important conclusion from this result is that the denser
the output, the more efficient EP\_XIOS is. With 8 threads per process, we reach a speedup of up to 6 in XIOS and a speedup of 1.5 in
LMDZ, which corresponds to a reduction of the total execution time to about 68\% ($\approx 1/1.5$). This observation confirms the importance
of using EP in XIOS.

LMDZ as a whole shows a smaller speedup because the model is computation-dominant: the time spent on computation is much longer than
the time spent on file output. For example, if 30\% of the execution time is spent on output, then a speedup of 6 on the output yields a
decrease in total time of 25\%. Even though 25\% may seem small, it is still a gain in performance obtained with the existing computing resources.
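
The arithmetic behind this example can be written out explicitly. As a sketch, let $f$ denote the fraction of the original execution time $T$ spent on output and $s$ the speedup of the output alone; these symbols are introduced here for illustration, and the computation time is assumed to be unchanged. The new execution time $T_{\mathrm{EP}}$ then satisfies
\begin{equation*}
\frac{T_{\mathrm{EP}}}{T} = (1-f) + \frac{f}{s},
\qquad
1 - \frac{T_{\mathrm{EP}}}{T} = f\left(1-\frac{1}{s}\right) = 0.30\times\left(1-\frac{1}{6}\right) = 0.25 .
\end{equation*}
Under these assumptions, the overall speedup would be $1/0.75 \approx 1.33$.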

\section{Performance of EP\_XIOS}

workfloz\_cmip6
light output
24*8+2
30s - 52s
32 days
histmth with daily output

\section{Perspectives of EP\_XIOS}


\bibliographystyle{plain}
\bibliography{reference}

\end{document}