\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}

% Title Page

\title{Developing XIOS with multithreading: accelerating the IO of climate models}
11
12\author{}
13
14
15\begin{document}
16\maketitle
17
\section{Background}

Climate system simulation models running on a large number of computing resources can produce a very large volume of data. At this
scale, the IO and the post-processing of the data become a performance bottleneck. In order to manage efficiently the data flux
generated by the simulations, we use XIOS, developed by the Institut Pierre Simon Laplace and Maison de la simulation.

XIOS, a library dedicated to high-performance computing, allows us to easily and efficiently manage parallel IO on the storage systems. XIOS
uses a client/server scheme in which some computing resources (servers) are reserved exclusively for IO in order to minimize their impact on
the performance of the climate models (clients).

In this client/server approach, the use of asynchronous communications between the models (clients) and the IO servers smooths out the
IO peaks by sending a constant data flux to the file system throughout the simulation, thus fully overlapping the writes with computation.


The aim of the ESIWACE project is to develop a multithreaded version of XIOS, a library dedicated to the IO management of climate codes.
The current XIOS code relies on a single level of parallelization using MPI. However, many climate models are now designed with two-level
parallelization through MPI and OpenMP. This mismatch in parallelization between the climate models and XIOS can lead to performance loss
because XIOS cannot cope with threads. This fact motivates the development of a thread-friendly version of XIOS.


The resulting multithreaded XIOS is designed to cope with climate models which use a two-level parallelization (MPI/OpenMP) scheme.
The principal model we work with is the LMDZ code developed at the Laboratoire de Météorologie Dynamique. This model has adopted the
hybrid MPI/OpenMP programming approach.


\section{Development of a thread-friendly MPI for XIOS}

XIOS is a library dedicated to the IO management of climate codes. It has a client/server pattern in which clients are in charge of the computations
and servers manage the reading and writing of files. The communication between clients and servers is handled by MPI.
However, some climate models (\textit{e.g.} LMDZ) nowadays use a hybrid programming policy: within a shared-memory node, OpenMP
directives are used to manage the message exchanges. In such a configuration, XIOS cannot take full advantage of the computing resources to
maximize the performance, because XIOS can only work with MPI processes. Before each call to an XIOS routine, the threads of one MPI
process must gather their information on the master thread, which acts as the MPI process. After the call, the master thread distributes the
updated information among its slave threads. As a result, all slave threads have to wait while the master thread calls the XIOS routines.
This introduces extra synchronization into the model and leads to suboptimal performance. Aware of this situation, we need to develop a
new version of XIOS (EP\_XIOS) which can work with threads, or in other words, which can treat threads as if they were processes. To do so, we
introduce the MPI endpoints.


The MPI endpoints (EP) layer is built on top of an existing MPI implementation. All MPI functions, or in our work the functions used in XIOS,
are reimplemented in order to cope with OpenMP threads. The idea is that, in the MPI endpoints environment, each OpenMP thread is
associated with a unique rank and with an endpoint communicator. This rank (EP rank) takes over the role of the classic MPI rank and is
used in MPI communications. In order to successfully execute an MPI communication, for example \verb|MPI_Send|, knowing which
endpoint is the receiver is not sufficient: we also need to know which MPI process is involved in the communication. To
identify the MPI rank, we added a ``map'' to the EP communicator from which the relation between all EP and MPI ranks can be easily obtained.
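
As an illustration, a minimal sketch of such an endpoint communicator is given below; the names and types are illustrative assumptions, not the actual EP\_XIOS data structures.

\begin{lstlisting}[language=C++]
// Minimal sketch of an endpoint communicator: every EP rank is mapped to the
// MPI rank of the process that owns it and to its local thread index.
#include <mpi.h>
#include <map>
#include <utility>

struct ep_communicator
{
  MPI_Comm mpi_comm;   // communicator of the underlying MPI implementation
  int      ep_rank;    // rank of this endpoint (thread) among all endpoints
  int      ep_size;    // total number of endpoints
  // EP rank -> (MPI rank of the owning process, local thread index)
  std::map<int, std::pair<int, int>> rank_map;
};

// Resolve the MPI rank and thread index of a destination endpoint, as needed
// before issuing, e.g., an MPI_Send on the underlying communicator.
inline std::pair<int, int> resolve(const ep_communicator& comm, int dest_ep_rank)
{
  return comm.rank_map.at(dest_ep_rank);
}
\end{lstlisting}
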


In XIOS, we use the ``probe'' technique to search for arrived messages before performing the receive action. The principle is
that sender processes execute the send operations as usual. However, to minimise the time spent waiting for incoming messages, the receiver
process first calls \verb|MPI_Probe| to check whether a message destined to it has been posted. If so, the
process then calls \verb|MPI_Recv| to receive the message. Problems occur as soon as threads are introduced:
the ``probe'' method is no longer suitable because messages destined to one process can be probed by any of
its threads, so a message can be received by the wrong thread, which gives errors.
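
The receive path just described can be sketched as follows; the buffer handling is illustrative and simplified with respect to the actual XIOS code, and the non-blocking \verb|MPI_Iprobe| variant is used so that the check returns immediately.

\begin{lstlisting}[language=C++]
// Classic probe-then-receive pattern on the receiver side (simplified sketch).
// When several OpenMP threads share one MPI process, any of them may match
// the probe, so the message can be consumed by the wrong thread.
#include <mpi.h>
#include <vector>

void classic_receive(MPI_Comm comm)
{
  MPI_Status status;
  int flag = 0;
  // Non-blocking check for a pending message addressed to this process.
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
  if (flag)
  {
    int count = 0;
    MPI_Get_count(&status, MPI_CHAR, &count);
    std::vector<char> buffer(count);
    MPI_Recv(buffer.data(), count, MPI_CHAR, status.MPI_SOURCE,
             status.MPI_TAG, comm, MPI_STATUS_IGNORE);
    // ... hand the buffer over to the message-processing code ...
  }
}
\end{lstlisting}
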
To solve this problem, we introduce the ``matching-probe'' technique. The idea of the method is that each process is equipped with local
incoming-message queues, one per thread. All incoming messages are probed, sorted, and then stored in these queues according to their destination rank.
Every time we call an MPI receive function, we first call \verb|MPI_Mprobe| to obtain a handle to
the incoming message. Then, we identify the destination thread rank and store the message handle in the local queue of the target
thread. After this, we apply the usual ``probe'' technique to the local incoming-message queue. In this way, we can ensure that messages
are received by the right thread.
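
A sketch of this dispatch loop is given below; it is illustrative, not the actual EP\_XIOS implementation, and it assumes one queue per thread and that the receiver's thread rank sits in the lowest 8 bits of the extended tag (one possible layout of the tag extension described next).

\begin{lstlisting}[language=C++]
// Matching-probe dispatch loop (sketch). MPI_Improbe removes the matched
// message from the MPI matching queue, so no other thread can receive it by
// accident; its handle is stored in the queue of the destination thread and
// completed later by that thread with MPI_Mrecv.
#include <mpi.h>
#include <deque>
#include <vector>

struct pending_message
{
  MPI_Message handle;  // matched handle, completed later with MPI_Mrecv
  MPI_Status  status;  // carries the source rank and the extended tag
};

// One queue per thread of the process, indexed by the local thread rank
// (sized to the number of threads at initialization, omitted here).
static std::vector<std::deque<pending_message>> local_queues;

void poll_incoming(MPI_Comm mpi_comm)
{
  int flag = 1;
  while (flag)
  {
    pending_message msg;
    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, mpi_comm,
                &flag, &msg.handle, &msg.status);
    if (flag)
    {
      // Assumption: destination thread rank in the low 8 bits of the tag.
      int dest_thread = msg.status.MPI_TAG & 0xFF;
      local_queues[dest_thread].push_back(msg);
    }
  }
}
\end{lstlisting}

The actual receive is then completed by the destination thread, which pops the handle from its own queue and calls \verb|MPI_Mrecv| on it.
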
Another issue remains in this technique: how do we identify the receiver's thread rank? The solution is to use the tag argument. In the MPI
environment, a tag is an integer which can range from 0 up to $2^{31}-1$, depending on the implementation. We can exploit this large range to store in the tag information about the
source and destination thread ranks. We choose to reserve the first 15 bits for the tag used in the classic MPI communication, the next 8
bits for the sender's thread rank, and the last 8 bits for the receiver's thread rank. In this way, with an extra analysis of the EP tag, we
can identify the ranks of the sender and the receiver in any point-to-point communication. As a result, when a thread probes a message, it knows
exactly in which local queue the probed message should be stored.
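
One possible packing of this 15/8/8 split is sketched below; the exact bit order used in EP\_XIOS may differ, so the layout shown here is an assumption.

\begin{lstlisting}[language=C++]
// Illustrative packing of the extended EP tag: bits 16..30 hold the classic
// user tag (15 bits), bits 8..15 the sender's thread rank and bits 0..7 the
// receiver's thread rank; bit 31 stays 0 so the tag remains non-negative.
#include <cassert>

inline int make_ep_tag(int user_tag, int src_thread, int dest_thread)
{
  assert(user_tag < (1 << 15));
  assert(src_thread < (1 << 8) && dest_thread < (1 << 8));
  return (user_tag << 16) | (src_thread << 8) | dest_thread;
}

inline int user_tag_of(int ep_tag)    { return  ep_tag >> 16; }
inline int src_thread_of(int ep_tag)  { return (ep_tag >> 8) & 0xFF; }
inline int dest_thread_of(int ep_tag) { return  ep_tag & 0xFF; }
\end{lstlisting}

For instance, \verb|make_ep_tag(3, 1, 2)| would be passed as the tag of the underlying send, and the receiving side recovers the two thread ranks with the extraction helpers.
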


With the global rank map, the tag extension, and the matching-probe technique, we are able to use any point-to-point communication in the endpoint
environment. For the collective communications, we perform a step-by-step execution and no special technique is required. The most
representative collective functions are \verb|MPI_Gather| and \verb|MPI_Bcast|. A step-by-step execution consists of
three steps (not necessarily in this order): arrangement of the source data, execution of the MPI function by all
master/root threads, and distribution or arrangement of the data among the threads.

For example, if we want to perform a broadcast operation, two steps are needed (a simplified sketch is given below). Firstly, the root thread, along with the master threads of
the other processes, performs the classic \verb|MPI_Bcast| operation. Secondly, the root thread and the master threads send the data to the threads
sharing their process via local memory transfer. In another example, illustrating the \verb|MPI_Gather| function, we also need two
steps. First of all, data is gathered from the slave threads to the master thread or the root thread of each process. Next, the master threads and the root
thread execute the \verb|MPI_Gather| operation to complete the communication. Other collective calls such as \verb|MPI_Scan|,
\verb|MPI_Reduce|, \verb|MPI_Scatter|, \textit{etc.}, follow the same principle of step-by-step execution.
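
The following is a simplified sketch of the two-step broadcast, assuming that the EP root is the master thread of its process and that the datatype is contiguous; the real EP\_XIOS routine also translates EP ranks and handles the general case.

\begin{lstlisting}[language=C++]
// Two-step endpoint broadcast (illustrative sketch, not the EP_XIOS code).
#include <mpi.h>
#include <omp.h>
#include <cstring>

void ep_bcast(void* buf, int count, MPI_Datatype datatype,
              int root_mpi_rank, MPI_Comm mpi_comm)
{
  static void* master_buf = nullptr;  // pointer shared by the threads of a process

  #pragma omp master
  {
    // Step 1: only the master thread of each process joins the classic MPI_Bcast.
    MPI_Bcast(buf, count, datatype, root_mpi_rank, mpi_comm);
    master_buf = buf;
  }
  #pragma omp barrier                 // broadcast done, master_buf is now visible

  // Step 2: slave threads copy the data from the master's buffer (local memory).
  if (buf != master_buf)
  {
    int type_size = 0;
    MPI_Type_size(datatype, &type_size);
    std::memcpy(buf, master_buf, static_cast<std::size_t>(count) * type_size);
  }
  #pragma omp barrier                 // keep master_buf valid until all copies finish
}
\end{lstlisting}
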


\section{Performance of LMDZ using EP\_XIOS}

With the new version of XIOS, we are now capable of taking full advantage of the computing resources allocated to a simulation model when
calling XIOS functions. All threads can participate in XIOS as if they were MPI processes. We have tested EP\_XIOS in LMDZ and the
performance results are very encouraging.

In our tests, we used 12 client processes with 8 threads each (96 XIOS clients in total) and one single-threaded server process. We used two
output densities: the light output produces mainly two-dimensional fields, while the heavy output records more 3D fields. We also used
different simulation durations: 1 day, 5 days, 15 days, and 31 days.

\begin{figure}[h]
 \centering
 \includegraphics[scale = 0.6]{LMDZ_perf.png}
 \caption{Speedup obtained by using EP in LMDZ simulations.}
\end{figure}

In this figure, we show the speedup, computed as $\displaystyle{\frac{time_{XIOS}}{time_{EP\_XIOS}}}$. The blue bars
represent the speedup of the XIOS file output and the red bars the speedup of LMDZ as a whole (computation + XIOS file output). In all experiments,
we observe a speedup, which represents a gain in performance. One important conclusion we can draw from these results is that the denser
the output, the more efficient EP\_XIOS is. With 8 threads per process, we reach a speedup of up to 6 in XIOS, and a speedup of 1.5 in
LMDZ, which corresponds to a decrease of the total execution time to 68\% ($\approx 1/1.5$) of its original value. This observation clearly
confirms the importance of using EP in XIOS.

The reason why LMDZ does not show as much speedup is that the model is computation dominant: the time spent on calculation is much longer than
that spent on file output. For example, if 30\% of the execution time is spent on the output, then with a speedup of 6 we obtain a
decrease of the total execution time of 25\%. Even though 25\% may seem small, it is still a gain in performance obtained with the existing computing resources.
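
The 25\% figure follows from an Amdahl-type estimate: if a fraction $f$ of the execution time is spent in the IO part and the IO is accelerated by a factor $s$, the total time becomes
\[
 T_{new} = \Big( (1-f) + \frac{f}{s} \Big)\, T_{old},
 \qquad f = 0.3,\; s = 6 \;\Rightarrow\; T_{new} = (0.7 + 0.05)\, T_{old} = 0.75\, T_{old},
\]
that is, a 25\% reduction of the total execution time.
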
\section{Perspectives of EP\_XIOS}

\end{document}