\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}
\usepackage{url}

% Title Page

\title{Developing XIOS with multithreading to accelerate the IO of climate models}

\author{}


\begin{document}
\maketitle

\section{Context}

The simulation models of climate systems, running on a large number of computing resources, can produce an enormous volume of data. At this
scale, the IO and the post-processing of data become a performance bottleneck. In order to manage efficiently the data flux
generated by the simulations, we use XIOS, developed by the Institut Pierre Simon Laplace and Maison de la Simulation.

XIOS, a library dedicated to intensive computing, allows us to easily and efficiently manage parallel IO on the storage systems. XIOS
uses a client/server scheme in which computing resources (servers) are reserved exclusively for IO in order to minimize their impact on
the performance of the climate models (clients). The clients and servers are executed in parallel and communicate asynchronously. In this
way, the IO peaks can be smoothed out, as data fluxes are sent to the servers steadily throughout the simulation and the time spent on
data writing on the server side can be completely overlapped by computation on the client side.

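
To make the idea concrete, the following minimal MPI sketch (not actual XIOS code; the helper name and buffer handling are purely illustrative) shows how a client can hand its output to a server asynchronously and keep computing while the transfer completes.

\begin{lstlisting}[language=C++]
// Minimal sketch of asynchronous client-to-server output (illustrative,
// not the XIOS implementation): the client copies the field into a
// dedicated send buffer, posts a non-blocking send to the IO server and
// returns to computation; the server receives and writes the data while
// the client works on the next time step.
#include <mpi.h>
#include <vector>

void send_output(const std::vector<double>& field,
                 std::vector<double>& send_buffer,   // reused between steps
                 int server_rank, MPI_Comm comm,
                 MPI_Request& pending)               // MPI_REQUEST_NULL initially
{
    // Make sure the previous output has left the buffer before reusing it.
    MPI_Wait(&pending, MPI_STATUS_IGNORE);

    send_buffer = field;  // snapshot the field so computation can modify it
    MPI_Isend(send_buffer.data(), static_cast<int>(send_buffer.size()),
              MPI_DOUBLE, server_rank, /*tag=*/0, comm, &pending);
    // The caller resumes computation; no blocking wait for the server here.
}
\end{lstlisting}
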
\begin{figure}[h]
\includegraphics[scale=0.4]{Charge1.png}
\includegraphics[scale=0.4]{Charge2.png}
\caption{On the left, each peak of computing power corresponds to a valley of memory bandwidth, which shows that the computing resources
are alternating between computation and IO. On the right, both curves are smooth, which means that the computing resources have a stable
workload, either computation or IO.}
\end{figure}


XIOS works well with many climate simulation codes. For example, LMDZ\footnote{LMDZ is a general circulation model (or global climate model)
developed since the 1970s at the Laboratoire de Météorologie Dynamique, which includes various variants for the Earth and other planets
(Mars, Titan, Venus, exoplanets). The `Z' in LMDZ stands for ``zoom'' (and `LMD' stands for Laboratoire de Météorologie Dynamique).
\url{http://lmdz.lmd.jussieu.fr}}, NEMO\footnote{Nucleus for European Modelling of the Ocean, alias NEMO, is a
state-of-the-art modelling framework of ocean-related engines. \url{https://www.nemo-ocean.eu}}, ORCHIDEE\footnote{The land surface
model of the IPSL (Institut Pierre Simon Laplace) Earth System Model. \url{https://orchidee.ipsl.fr}}, and DYNAMICO\footnote{The DYNAMICO
project develops a new dynamical core for LMD-Z, the atmospheric general circulation model (GCM) part of the IPSL-CM Earth System Model.
\url{http://www.lmd.polytechnique.fr/~dubos/DYNAMICO/}} all use XIOS as their output backend. Météo-France and the Met Office have also
chosen XIOS to manage the IO for their models.


\section{Development of thread-friendly XIOS}


XIOS is a library dedicated to the IO management of climate codes. It has a client-server pattern in which clients are in charge of computations
and servers manage the reading and writing of files. The communication between clients and servers is handled by MPI.
However, some of the climate models (\textit{e.g.} LMDZ) nowadays use a hybrid programming model: MPI handles the message exchanges between
processes, while OpenMP directives provide the parallelism within a shared-memory node. In such a configuration, XIOS cannot take full
advantage of the computing resources to maximize the performance, because XIOS can only work with MPI processes. Before each call to an
XIOS routine, the threads of one MPI process must gather their information onto the master thread, which acts as the MPI process. After
the call, the master thread distributes the updated information among its slave threads. As a result, all slave threads have to wait while
the master thread calls the XIOS routines. This introduces extra synchronization into the model and leads to suboptimal performance. To
address this situation, we developed a new version of XIOS (EP\_XIOS) which can work with threads or, in other words, can treat threads as
if they were processes. To do so, we introduce the MPI endpoints.


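The sketch below illustrates this master-only pattern; \verb|write_field| and the buffer layout are illustrative placeholders, not the actual LMDZ or XIOS interface.

\begin{lstlisting}[language=C++]
// Sketch of the master-only (funneled) pattern described above.
// write_field() is a placeholder for the call into the IO library.
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

void write_field(const std::vector<double>& buffer);      // placeholder

void output_field(const std::vector<double>& local_part,  // per-thread data
                  std::vector<double>& shared_buffer)     // node-shared buffer
{
    const int tid = omp_get_thread_num();
    const std::size_t chunk = local_part.size();

    // Step 1: every thread gathers its sub-domain into the shared buffer.
    std::copy(local_part.begin(), local_part.end(),
              shared_buffer.begin() + tid * chunk);
    #pragma omp barrier

    // Step 2: only the master thread talks to the library; the slave
    // threads idle at the barrier until the call returns.
    #pragma omp master
    {
        write_field(shared_buffer);
    }
    #pragma omp barrier
}
\end{lstlisting}
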
MPI endpoints (EP) form a layer on top of an existing MPI implementation. All MPI functions, or in our work the functions used in XIOS,
are reimplemented in order to cope with OpenMP threads. The idea is that, in the MPI endpoints environment, each OpenMP thread is
associated with a unique rank and with an endpoint communicator. This rank (EP rank) replaces the classic MPI rank and is
used in MPI communications. However, in order to execute an MPI communication, for example \verb|MPI_Send|, knowing which endpoint is the
receiver is not sufficient: we also need to know which MPI process is involved in the communication. To identify the MPI rank, we added a
``map'' to the EP communicator from which the correspondence between EP and MPI ranks can easily be obtained.

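A minimal sketch of such a communicator is given below; the field names and layout are assumptions made for illustration, not the actual EP\_XIOS classes.

\begin{lstlisting}[language=C++]
// Assumed layout of an endpoint communicator (illustrative only):
// each OpenMP thread owns one endpoint, and the rank map translates an
// EP rank into the MPI rank hosting it plus the thread index inside
// that process.
#include <mpi.h>
#include <unordered_map>

struct ep_location {
    int mpi_rank;    // MPI rank of the hosting process
    int thread_idx;  // thread index inside that process
};

struct ep_comm {
    MPI_Comm mpi_comm;  // underlying classic MPI communicator
    int ep_rank;        // rank of this endpoint
    int ep_size;        // total number of endpoints
    std::unordered_map<int, ep_location> rank_map;  // EP rank -> location
};

// A send to endpoint `dest' first looks up which MPI process hosts it.
inline int hosting_mpi_rank(const ep_comm& comm, int dest_ep_rank)
{
    return comm.rank_map.at(dest_ep_rank).mpi_rank;
}
\end{lstlisting}
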

In XIOS, we use the ``probe'' technique to look for arrived messages and then perform the receive action. The principle is
that sender processes execute the send operations as usual. However, to minimise the time spent waiting for incoming messages, the receiving
process first calls \verb|MPI_Probe| to check whether a message destined to it has been published. If so, the
process then calls \verb|MPI_Recv| to receive the message. If we introduce threads into this scheme, problems
occur: the ``probe'' method is no longer suitable because a message destined to one particular thread can be probed by any
thread of the same process. The message can thus be received by the wrong thread, which leads to errors.

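The sketch below shows this probe-then-receive pattern on plain MPI processes (the structure is assumed for illustration, not taken from the XIOS source); it is exactly the gap between the probe and the receive that becomes unsafe once several threads share the same MPI rank.

\begin{lstlisting}[language=C++]
// Probe-then-receive on a single-threaded MPI process (illustrative).
// With several threads per process, another thread could consume the
// message between MPI_Iprobe and MPI_Recv, which is the problem
// described above.
#include <mpi.h>
#include <vector>

void poll_and_receive(MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;

    // Non-blocking check: has a message addressed to this rank arrived?
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag) {
        int count = 0;
        MPI_Get_count(&status, MPI_BYTE, &count);
        std::vector<char> buffer(count);

        // Receive exactly the message that was just probed.
        MPI_Recv(buffer.data(), count, MPI_BYTE,
                 status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        // ... hand the buffer over to the message handler ...
    }
}
\end{lstlisting}
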
To solve this problem, we introduce the ``matching-probe'' technique. The idea is that each thread is equipped with a local
incoming-message queue. Every incoming message is probed, sorted, and then stored in the queue corresponding to its destination rank.
Each time we call an MPI receive function, we first call \verb|MPI_Mprobe| to obtain the handle of
an incoming message. We then identify the destination thread rank and store the message handle in the local queue of the target
thread. After this, we perform the usual ``probe'' technique upon the local incoming-message queues. In this way, we can ensure that
messages are received by the right thread.

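A sketch of this dispatch step is given below (illustrative; the actual EP\_XIOS code may differ). It uses \verb|MPI_Improbe|, the non-blocking variant of \verb|MPI_Mprobe|, and reads the receiver's thread rank from the tag, following the tag layout described in the next paragraph.

\begin{lstlisting}[language=C++]
// Sketch of the matching-probe dispatch (illustrative). MPI_Improbe
// returns a message handle that is bound to the matching MPI_Mrecv,
// so no other thread can steal the message. The receiver's thread rank
// is decoded from the tag (see the tag layout below).
#include <mpi.h>
#include <deque>
#include <mutex>
#include <vector>

struct pending_msg {
    MPI_Message handle;  // handle returned by MPI_Improbe
    MPI_Status  status;  // source, tag and size information
};

// One incoming-message queue per thread of this MPI process.
std::vector<std::deque<pending_msg>> local_queue;
std::mutex queue_mutex;

void dispatch_incoming(MPI_Comm comm)
{
    int flag = 0;
    pending_msg msg;
    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm,
                &flag, &msg.handle, &msg.status);
    if (flag) {
        const int dest_thread = msg.status.MPI_TAG & 0xFF;  // receiver bits
        std::lock_guard<std::mutex> lock(queue_mutex);
        local_queue[dest_thread].push_back(msg);  // only that thread will MPI_Mrecv it
    }
}
\end{lstlisting}
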
Another issue remains with this technique: how do we identify the receiver's thread rank? The solution is to use the tag argument. In the MPI
environment, a tag is an integer ranging from 0 up to $2^{31}-1$ in most implementations. We can exploit this large range to store in the tag
information about the source and destination thread ranks. We choose to reserve the first 15 bits for the tag used in the classic MPI
communication, the next 8 bits for the sender's thread rank, and the last 8 bits for the receiver's thread rank. In this way, with an extra
analysis of the EP tag, we can identify the ranks of the sender and the receiver in any point-to-point communication. As a result, when a
thread probes a message, it knows exactly in which local queue the probed message should be stored.

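A small sketch of this encoding follows; the bit order is taken from the description above, and the exact layout used in EP\_XIOS may differ.

\begin{lstlisting}[language=C++]
// Packing and unpacking of the extended EP tag: 15 bits of classic MPI
// tag, 8 bits of sender thread rank, 8 bits of receiver thread rank
// (31 bits in total, so the value remains a valid non-negative tag).
inline int pack_ep_tag(int user_tag, int src_thread, int dst_thread)
{
    // Requires user_tag < 2^15 and thread ranks < 2^8.
    return (user_tag << 16) | (src_thread << 8) | dst_thread;
}

inline void unpack_ep_tag(int ep_tag,
                          int& user_tag, int& src_thread, int& dst_thread)
{
    dst_thread = ep_tag & 0xFF;            // last 8 bits: receiver thread
    src_thread = (ep_tag >> 8) & 0xFF;     // next 8 bits: sender thread
    user_tag   = (ep_tag >> 16) & 0x7FFF;  // first 15 bits: classic tag
}
\end{lstlisting}
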

With the global rank map, the tag extension, and the matching-probe technique, we are able to use any point-to-point communication in the
endpoint environment. For the collective communications, we perform a step-by-step execution and no special technique is required. The most
representative collective communications are \verb|MPI_Gather| and \verb|MPI_Bcast|. A step-by-step execution consists of
three steps (not necessarily in this order): arrangement of the source data, execution of the MPI function by all
master/root threads, and distribution or rearrangement of the data among the threads.

For example, if we want to perform a broadcast operation, two steps are needed. First, the root thread, along with the master threads of
the other processes, performs the classic \verb|MPI_Bcast| operation. Second, the root thread and the master threads send the data to the
threads of their own process via local memory transfer. As another example, to illustrate the \verb|MPI_Gather| function, we also need two
steps. First of all, data is gathered from the slave threads to the master thread or the root thread. Next, the master threads and the root
thread execute the \verb|MPI_Gather| operation to complete the communication. Other collective calls such as \verb|MPI_Scan|,
\verb|MPI_Reduce|, \verb|MPI_Scatter|, \textit{etc.} follow the same principle of step-by-step execution.

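For illustration, a simplified sketch of such a two-step broadcast is given below; it assumes the root endpoint is the master thread of its process and uses a node-shared buffer for the local memory transfer (the actual EP\_XIOS implementation may differ).

\begin{lstlisting}[language=C++]
// Simplified two-step endpoint broadcast (illustrative, not the EP_XIOS
// code). Assumes the root endpoint is the master thread of its process.
#include <mpi.h>
#include <omp.h>
#include <algorithm>
#include <vector>

void ep_bcast(std::vector<double>& data,        // per-thread buffer
              std::vector<double>& shared_buf,  // shared by the threads of one process
              int root_mpi_rank, MPI_Comm comm)
{
    // Step 1: only the master thread of each MPI process takes part in
    // the classic broadcast.
    #pragma omp master
    {
        shared_buf = data;  // on the root process this holds the source data
        MPI_Bcast(shared_buf.data(), static_cast<int>(shared_buf.size()),
                  MPI_DOUBLE, root_mpi_rank, comm);
    }
    #pragma omp barrier

    // Step 2: local memory transfer, each thread copies the result.
    std::copy(shared_buf.begin(), shared_buf.end(), data.begin());
}
\end{lstlisting}
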

\section{Performance of LMDZ using EP\_XIOS}

With this new version of XIOS, we are now capable of taking full advantage of the computing resources allocated to a simulation model when
calling XIOS functions. All threads can participate in XIOS as if they were MPI processes. We have tested EP\_XIOS in LMDZ and the
performance results are very encouraging.

In our tests, we used 12 client processes with 8 threads each (96 XIOS clients in total) and one single-threaded server process. We used two
output densities: the light output produces mainly 2-dimensional fields, while the heavy output records more 3D fields. We also used
different simulation durations: 1 day, 5 days, 15 days, and 31 days.

\begin{figure}[h]
\centering
\includegraphics[scale = 0.6]{LMDZ_perf.png}
\caption{Speedup obtained by using EP in LMDZ simulations.}
\end{figure}

In this figure, we show the speedup, which is computed as $\displaystyle{\frac{time_{XIOS}}{time_{EP\_XIOS}}}$. The blue bars
represent the speedup of the XIOS file output and the red bars the speedup of LMDZ as a whole: computation plus XIOS file output. In all
experiments, we can observe a speedup, which represents a gain in performance. One important conclusion we can draw from these results is
that the denser the output, the more efficient EP\_XIOS is. With 8 threads per process, we reach a speedup of up to 6 in XIOS and a speedup
of 1.5 in LMDZ, which corresponds to a reduction of the total execution time to about 67\% ($\approx 1/1.5$) of its original value. This
observation firmly confirms the importance of using EP in XIOS.

The reason why LMDZ does not show as much speedup is that the model is computation dominated: the time spent on computation is much longer
than that spent on the file output. For example, if 30\% of the execution time is spent on the output, then with a speedup of 6 we obtain a
decrease in total execution time of 25\%. Even though 25\% may seem small, it is still a gain in performance obtained with the existing
computing resources.
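Explicitly, denoting by $T$ the original execution time,
\[
T_{new} = \underbrace{0.7\,T}_{\text{computation}} + \underbrace{\tfrac{0.3\,T}{6}}_{\text{output}} = 0.75\,T ,
\]
i.e.\ a reduction of the total execution time by 25\%.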

\section{Perspectives of EP\_XIOS}

\end{document}