\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}

% Title Page
\title{Note for MPI Endpoints}
\author{}


\begin{document}
\maketitle

%\begin{abstract}
%\end{abstract}

\section{Purpose}

Use threads as if they were MPI processes. Each thread is assigned a rank and is associated with an endpoints communicator
(EP\_Comm). Convention: one OpenMP thread corresponds to one endpoint.

\section{MPI Endpoints Semantics}

\begin{center}
\includegraphics[scale=0.4]{scheme.png}
\end{center}

Endpoints are created from one MPI communicator and the number of available threads:
\begin{center}
 \begin{verbatim}
int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int num_ep,
                              MPI_Info info, MPI_Comm out_comm_hdls[])
 \end{verbatim}
\end{center}

``In this collective call, a single output communicator is created, and an array of \verb|num_ep| handles to this new communicator
are returned, where the $i^{th}$ handle corresponds to the $i^{th}$ rank requested by the caller of \verb|MPI_Comm_create_endpoints|. Ranks
in the output communicator are ordered sequentially and in the same order as the parent communicator. After it has been created, the
output communicator behaves as a normal communicator, and MPI calls on each endpoint (i.e., communicator handle) behave as though
they originated from a separate MPI process. In particular, collective calls must be made once per endpoint.''\cite{Dinan:2013}

``Once created, endpoints behave as MPI processes. For example, all ranks in an endpoints communicator must participate in
collective operations. A consequence of this semantic is that endpoints also have MPI process progress requirements; that operations on
that endpoint are required to make progress only when an MPI operation (e.g. \verb|MPI_Test|) is performed on that endpoint. This semantic
enables an MPI implementation to logically separate endpoints, treat them independently within the progress engine, and eliminate
synchronization in updating their state.''\cite{Sridharan:2014}
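
As an illustration of these semantics, the sketch below shows how the proposed routine could be used together with OpenMP, following the convention of one endpoint per thread stated above; the surrounding code (thread-level \verb|MPI_Init_thread|, variable names) is illustrative only.

\begin{verbatim}
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int num_ep = omp_get_max_threads();        // one endpoint per OpenMP thread
  MPI_Comm *ep_comms = new MPI_Comm[num_ep];

  // collective over the parent communicator: one handle per endpoint
  MPI_Comm_create_endpoints(MPI_COMM_WORLD, num_ep, MPI_INFO_NULL, ep_comms);

  #pragma omp parallel
  {
    MPI_Comm my_comm = ep_comms[omp_get_thread_num()];
    int ep_rank, ep_size;
    MPI_Comm_rank(my_comm, &ep_rank);        // rank of this endpoint
    MPI_Comm_size(my_comm, &ep_size);        // total number of endpoints
    // ... point-to-point and collective calls, one per endpoint ...
  }

  delete [] ep_comms;
  MPI_Finalize();
  return 0;
}
\end{verbatim}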

\section{EP types}
\subsection{MPI\_Comm}
\verb|MPI_Comm| is composed of:
\begin{itemize}
 \item[$\bullet$] \verb|bool is_ep|: true $\implies$ EP, false $\implies$ classic MPI;
 \item[$\bullet$] \verb|int mpi_comm|: handle to the parent MPI communicator;
 \item[$\bullet$] \verb|OMPbarrier *ep_barrier|: OpenMP barrier used for in-process synchronization; it is distinct from the
 \verb|omp barrier| directive;
 \item[$\bullet$] \verb|int[2] size_rank_info[3]|: topology information of the current endpoint:
 \begin{itemize}
  \item rank of the parent MPI process;
  \item size of the parent MPI communicator;
  \item rank of the endpoint, returned by \verb|MPI_Comm_rank|;
  \item size of the EP communicator, returned by \verb|MPI_Comm_size|;
  \item in-process rank of the endpoint;
  \item in-process size of the EP communicator, also noted as the number of endpoints in one MPI process.
 \end{itemize}
 \item[$\bullet$] \verb|MPI_Comm *comm_list|: pointer to the first endpoint communicator of the process;
 \item[$\bullet$] \verb|Message_list *message_queue|: queue of incoming messages for each endpoint;
 \item[$\bullet$] \verb|RANK_MAP *rank_map|: a map from an integer to a pair of integers. The integer key is the rank of an
endpoint. The mapped pair gives the in-process rank of the endpoint and the rank of its parent MPI process:
 \begin{center}
 \begin{verbatim}
rank_map->at(ep_rank)=(ep_rank_local, mpi_rank)
 \end{verbatim}
\end{center}
 \item[$\bullet$] \verb|BUFFER *ep_buffer|: buffer (of type \verb|int|, \verb|float|, \verb|double|, \verb|char|, \verb|long|,
and \verb|unsigned long|) used for in-process communication.
\end{itemize}
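
For reference, a condensed declaration of this handle class is sketched below. The member names follow the list above; \verb|OMPbarrier|, \verb|Message_list| and \verb|BUFFER| stand for the helper types of the EP library, and the actual layout in the code may differ.

\begin{verbatim}
#include <map>
#include <utility>

class OMPbarrier;    // EP-library helper types, declarations omitted here
class Message_list;
class BUFFER;

typedef std::map<int, std::pair<int, int> > RANK_MAP;

class MPI_Comm
{
  public:
    bool          is_ep;                 // true: endpoints comm, false: classic MPI
    int           mpi_comm;              // handle to the parent MPI communicator
    OMPbarrier   *ep_barrier;            // in-process barrier shared by the endpoints
    int           size_rank_info[3][2];  // (rank, size) pairs: parent MPI, EP, in-process
    MPI_Comm     *comm_list;             // first endpoint communicator of the process
    Message_list *message_queue;         // local incoming-message queue of each endpoint
    RANK_MAP     *rank_map;              // rank_map->at(ep_rank) = (ep_rank_local, mpi_rank)
    BUFFER       *ep_buffer;             // typed buffers for in-process communication
};
\end{verbatim}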

\subsection{MPI\_Request}
\verb|MPI_Request| is composed of:
\begin{itemize}
 \item[$\bullet$] \verb|int mpi_request|: handle to the MPI request;
 \item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
 \item[$\bullet$] \verb|MPI_Comm comm|: handle to the EP communicator;
 \item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
 \item[$\bullet$] \verb|int ep_tag|: tag of the communication;
 \item[$\bullet$] \verb|int type|: type of the communication:
 \begin{itemize}
  \item 1 $\implies$ non-blocking send;
  \item 2 $\implies$ pending non-blocking receive;
  \item 3 $\implies$ non-blocking matching receive.
 \end{itemize}

\end{itemize}

\subsection{MPI\_Status}
\verb|MPI_Status| consists of:
\begin{itemize}
 \item[$\bullet$] \verb|int mpi_status|: handle to the MPI status;
 \item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
 \item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
 \item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}

\subsection{MPI\_Message}
\verb|MPI_Message| includes:
\begin{itemize}
 \item[$\bullet$] \verb|int mpi_message|: handle to the MPI message;
 \item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
 \item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}

Other types, such as \verb|MPI_Info|, \verb|MPI_Aint|, and \verb|MPI_Fint|, are defined in the same way.
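
As an illustration, a condensed declaration of \verb|MPI_Request| (the most used of these wrapper types) is sketched below; the member names follow the list above, and the exact declaration in the EP library may differ. \verb|MPI_Status| and \verb|MPI_Message| are analogous.

\begin{verbatim}
class MPI_Request
{
  public:
    int      mpi_request;   // handle to the underlying MPI request
    int      ep_datatype;   // data type of the communication
    MPI_Comm comm;          // EP communicator on which the request was posted
    int      ep_src;        // rank of the source endpoint
    int      ep_tag;        // user (EP) tag
    int      type;          // 1: non-blocking send, 2: pending non-blocking
                            // receive, 3: non-blocking matching receive
};
\end{verbatim}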

\section{P2P communication}

All EP point-to-point communications use the tag to distinguish the source and destination endpoints. To be able to encode this extra
information in the tag, we require that the tag value be represented using 31 bits in the underlying MPI implementation.

\includegraphics[scale=0.4]{tag.png}

EP\_tag is user defined. MPI\_tag is internally computed and used inside the MPI calls. Because of this extension of the tag, wild-cards such as
\verb|MPI_ANY_SOURCE| and \verb|MPI_ANY_TAG| cannot be used directly. An extra step of tag analysis is needed, which leads to the
message dequeuing mechanism.
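
To illustrate the idea (purely as an illustration: the actual bit layout is the one shown in the figure above, and the helper names below are not those of the EP library), the internal MPI tag can be built by packing the local ranks of the two endpoints next to the user tag, and decoded again on the receiving side:

\begin{verbatim}
// Illustrative packing: 15 bits of user tag, 8 bits for the local rank of the
// destination endpoint, 8 bits for the local rank of the source endpoint.
int encode_tag(int ep_tag, int src_local, int dest_local)
{
  return (ep_tag << 16) | (dest_local << 8) | src_local;
}

void decode_tag(int mpi_tag, int *ep_tag, int *src_local, int *dest_local)
{
  *ep_tag     =  mpi_tag >> 16;
  *dest_local = (mpi_tag >> 8) & 0xFF;
  *src_local  =  mpi_tag       & 0xFF;
}
\end{verbatim}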

\begin{center}
 \includegraphics[scale = 0.5]{sendrecv.png}
\end{center}


In an MPI environment, each MPI process has an incoming message queue. In the EP case, the messages for all threads inside one MPI process are stored
in this single MPI queue. With the MPI-3 standard, we use the \verb|MPI_Improbe| routine to inspect the message queue and move each incoming
message into the local message queue of the corresponding thread/endpoint.
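
A minimal sketch of this dequeuing step is given below; \verb|decode_tag| is the illustrative helper of the previous sketch, and the queue type and function name are also illustrative, not the actual EP routine.

\begin{verbatim}
#include <mpi.h>
#include <list>
#include <vector>

// Drain the MPI queue of the process and route each queued message to the
// local queue of the endpoint it is addressed to.
void ep_dequeue(MPI_Comm mpi_comm,
                std::vector< std::list<MPI_Message> > &local_queue)
{
  while (true)
  {
    int         flag;
    MPI_Message msg;
    MPI_Status  status;
    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, mpi_comm, &flag, &msg, &status);
    if (!flag) break;                     // no more queued messages

    int ep_tag, src_local, dest_local;
    decode_tag(status.MPI_TAG, &ep_tag, &src_local, &dest_local);

    // the message is now matched; it will be received later with MPI_Imrecv
    local_queue[dest_local].push_back(msg);
  }
}
\end{verbatim}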


\includegraphics[scale=0.3]{dequeue.png}

% Any EP calls will trigger the message dequeuing and the probing, (matched-)receiving operations are performed upon the local message queue.
%
% Example: \verb|EP_Recv(src=2, tag=10, comm1)|:
% \begin{itemize}
%  \item[1.] Dequeue MPI message queue;
%  \item[2.] call \verb|EP_Improb(src=2, tag=10, comm1, message)|;
%  \item[3.] if find corresponding triple (src, tag, comm1), call \verb|EP_Mrecv(src=2, tag=10, comm1, message)|;
%  \item[4.] else, repeat from step 2.
% \end{itemize}


\paragraph{Messages are \textit{non-overtaking}} The order of incoming messages is important. If one thread receives multiple messages from
the same source with the same tag, the receive order must be the order in which the messages were sent. That is to say, the n-th sent
message should be the n-th received message.



\paragraph{Progress} ``If a pair of matching send and receives have been initiated on two processes, then at least one of these
two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is
satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching
receive that was posted at the same destination process.'' \cite{MPI}


When an \verb|EP_Irecv| is issued, we first dequeue the MPI incoming message queue and distribute all incoming messages to the local
queues according to the destination identifier. Next, the non-blocking receive request is added at the end of the pending request list.
Third, the pending list is checked and the requests with matching source, tag, and communicator are completed.


Because of the importance of the message order, some communication completion functions must be discussed here, such as \verb|MPI_Test| and
\verb|MPI_Wait|. ``The functions \verb|MPI_Wait| and \verb|MPI_Test| are used to complete a nonblocking communication. The completion of a
send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the
content of the send buffer unchanged). It does not indicate that the message has been received, rather, it may have been buffered by
the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a
matching receive was initiated, and that the message will eventually be received by this matching receive. The completion of a
receive operation indicates that the receive buffer contains the received message, the receiver is now free to access it, and that the
status object is set. It does not indicate that the matching send operation has completed (but indicates, of course, that the send
was initiated).'' \cite{MPI}

\paragraph{Example 1}
\verb|MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)| (a sketch combining the three cases follows the list)
\begin{itemize}
 \item[1.] If \verb|request->type == 1|, the communication to be tested was issued by a non-blocking send. The completion status is
returned by:
 \begin{center}
  \begin{verbatim}
   MPI_Test(& request->mpi_request, flag, & status->mpi_status)
  \end{verbatim}
 \end{center}
 \item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. All incoming messages are probed once again and all pending requests are
checked. If, after this second check, the matching message is found, \verb|MPI_Imrecv| is called and the type is set to 3.
Otherwise, the type stays 2 and \verb|flag = false| is returned.
 \item[3.] If \verb|request->type == 3|, the request was issued by a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. The
completion result is returned by:
 \begin{center}
  \begin{verbatim}
   MPI_Test(& request->mpi_request, flag, & status->mpi_status)
  \end{verbatim}
 \end{center}
\end{itemize}
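
Combining the three cases, the EP test routine can be sketched as follows. \verb|ep_dequeue_and_check_pending| stands for the probing/pending-list step described above (an illustrative name), \verb|::MPI_Test| denotes the classic routine of the underlying MPI library, and the code follows the notation of the items above rather than the exact implementation.

\begin{verbatim}
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
{
  if (request->type == 2)
  {
    // pending non-blocking receive: probe the incoming messages once more and
    // re-check the pending list; if the matching message is found, MPI_Imrecv
    // is issued inside this helper and request->type becomes 3
    ep_dequeue_and_check_pending(request->comm);

    if (request->type == 2)      // still no matching message
    {
      *flag = false;
      return MPI_SUCCESS;
    }
  }

  // type 1 (non-blocking send) or type 3 (MPI_Imrecv already issued): the
  // completion status comes from the underlying (classic) MPI_Test
  return ::MPI_Test(& request->mpi_request, flag, & status->mpi_status);
}
\end{verbatim}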

\paragraph{Example 2}
\verb|MPI_Wait(MPI_Request *request, MPI_Status *status)|
\begin{itemize}
 \item[1.] If \verb|request->type == 1|, the communication to be completed was issued by a non-blocking send. Jump to step 4.
 \item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. We repeat the incoming message probing and the pending request checking until the
matching message is found; then \verb|MPI_Imrecv| is called and the type is set to 3. Jump to step 4.
 \item[3.] If \verb|request->type == 3|, the request was issued by a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. Jump to step 4.
 \item[4.] We force the completion by calling:
 \begin{center}
  \begin{verbatim}
   MPI_Wait(& request->mpi_request, & status->mpi_status)
  \end{verbatim}
 \end{center}
\end{itemize}

\section{Collective communication}

All classic MPI collective communications are performed following the pattern below:

\begin{itemize}
 \item[1.] Intra-process communication using OpenMP, \textit{e.g.} collect data from the slave threads onto the master thread.
 \item[2.] Inter-process communication using MPI collective calls on the master threads.
 \item[3.] Intra-process communication using OpenMP, \textit{e.g.} distribute data from the master thread to the slave threads.
\end{itemize}

\paragraph{Example 1}
\verb|EP_Bcast(buffer, count, datatype, root = 4, comm)| with \verb|comm| composed of 4 MPI
processes and 3 threads per process:

We can consider the communicator as $\{\underbrace{(0,1,2)}_\textrm{proc 0} \quad \underbrace{(3,\textcolor{red}{4},5)}_\textrm{proc
1}\quad \underbrace{(6,7,8)}_\textrm{proc 2}\quad \underbrace{(9,10,11)}_\textrm{proc 3}\}$.

This collective communication is performed by the following steps (a schematic code sketch follows the figure):

\begin{itemize}
 %\item[1.] EP process with rank 4 send the buffer to EP process rank 3 which is a master thread.
 \item[1.] We call \verb|MPI_Bcast(buffer, count, datatype, mpi_root = 1, mpi_comm)| with the EP processes of rank 0, 4, 6, and 9.
 \item[2.] The EP processes of rank 0, 4, 6, and 9 send the buffer to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{bcast.png}
\end{center}
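
A very schematic version of this pattern is sketched below, following the general three-step pattern given at the beginning of the section: the master thread of each process calls the classic \verb|MPI_Bcast| (in Example 1 above, the root endpoint itself plays this role on the root process). The helper name \verb|ep_bcast_sketch| and the way the data is shared between threads are illustrative only; the EP library relies on its internal buffers and barriers.

\begin{verbatim}
// Schematic EP broadcast: one classic MPI_Bcast per process (performed by a
// single thread), followed by in-process sharing. Assumes the call is made
// inside an "omp parallel" region, one endpoint per thread, and that "buffer"
// points to memory visible to all threads of the process.
void ep_bcast_sketch(void *buffer, int count, MPI_Datatype datatype,
                     int root_mpi_rank, MPI_Comm mpi_comm)
{
  #pragma omp master
  {
    // step 1: inter-process broadcast among the per-process representatives
    MPI_Bcast(buffer, count, datatype, root_mpi_rank, mpi_comm);
  }
  // step 2: the slave threads wait for the data, then read/copy it
  #pragma omp barrier
}
\end{verbatim}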


\paragraph{Example 2}
\verb|EP_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)| with \verb|comm| the same
as in example 1.

This collective communication is performed by the following three steps:

\begin{itemize}
 \item[1.] We perform an intra-process ``allreduce'' operation: the master threads collect data from their slaves and perform the reduce
operation.
 \item[2.] The master threads call the classic \verb|MPI_Allreduce| routine.
 \item[3.] All master threads send the updated reduced data to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{allreduce.png}
\end{center}



Other collective communications follow a similar execution pattern.

\section{Inter-communicator}

In XIOS, the inter-communicator is a very important component. Thus, our EP library must support inter-communications.

\subsection{The splitting of intra-communicator}

Before talking about the inter-communicator, we start with the splitting of an intra-communicator.
The C prototype of the splitting routine is
\begin{center}
 \begin{verbatim}
int  MPI_Comm_split(MPI_Comm comm, int color, int key,
                                    MPI_Comm *newcomm)
 \end{verbatim}
\end{center}

``This function partitions the group associated with \verb|comm| into disjoint subgroups, one for
each value of \verb|color|. Each subgroup contains all processes of the same color. Within each
subgroup, the processes are ranked in the order defined by the value of the argument
\verb|key|, with ties broken according to their rank in the old group. A new communicator is
created for each subgroup and returned in \verb|newcomm|. A process may supply the color value
\verb|MPI_UNDEFINED|, in which case \verb|newcomm| returns \verb|MPI_COMM_NULL|. This is a collective
call, but each process is permitted to provide different values for color and key.''\cite{MPI}

By definition of the routine, in the case of EP, each thread participating in the split operation will have exactly one color
(\verb|MPI_UNDEFINED| is also considered to be a color). However, from the process's point of view, a process can have multiple colors, as shown
in the following figure.

\begin{center}
\includegraphics[scale=0.4]{split.png}
\end{center}

This figure shows the result of the EP communicator splitting. Here we used the EP rank as key to assign the new rank of each thread in the
resulting split intra-communicator. If the key is anything other than the EP rank, we follow the convention that the key takes effect only
inside a process. This means that the threads are first ordered by the MPI process rank and then by the value of the key.

Due to the fact that one process can have multiple colors among its threads, the splitting operation is executed by the following steps (a sketch follows the list):

\begin{itemize}
 \item[1.] The master threads collect all colors from their slaves and communicate with each other to determine the total number of colors across
the communicator.
 \item[2.] For each color, the master thread checks all its slave threads to obtain the number of threads having this color.
 \item[3.] If at least one of the slave threads holds the color, then the master thread takes this color. If not, the master thread takes
the color \verb|MPI_UNDEFINED|. All master threads call the classic communicator splitting routine with key $=$ MPI rank.
 \item[4.] For master threads holding a defined color, we execute the endpoint creation routine according to the number of slave
threads holding the same color. The resulting EP communicators are then assigned to these slave threads.
\end{itemize}
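
One possible realisation of steps 1--4, seen from the master thread of one process, is sketched below; the argument names are illustrative (\verb|local_colors| holds the colors of the threads of this process, \verb|all_colors| the set of colors over the whole communicator, both gathered in step 1), and the endpoint creation of step 4 is only indicated by a comment.

\begin{verbatim}
#include <mpi.h>
#include <set>
#include <vector>

void ep_split_sketch(MPI_Comm mpi_comm, int mpi_rank,
                     const std::vector<int> &local_colors,
                     const std::set<int>    &all_colors)
{
  for (std::set<int>::const_iterator it = all_colors.begin();
       it != all_colors.end(); ++it)
  {
    int color = *it;

    // step 2: number of local threads having this color
    int num_local = 0;
    for (size_t i = 0; i < local_colors.size(); ++i)
      if (local_colors[i] == color) ++num_local;

    // step 3: participate with this color only if one local thread holds it;
    // the key is the MPI rank
    int my_color = (num_local > 0) ? color : MPI_UNDEFINED;
    MPI_Comm split_comm;
    MPI_Comm_split(mpi_comm, my_color, mpi_rank, &split_comm);

    // step 4: if num_local > 0, create num_local endpoints on split_comm and
    // assign them to the threads holding this color (EP library internals)
  }
}
\end{verbatim}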

\begin{center}
\includegraphics[scale=0.4]{split2.png}
\end{center}

\subsection{The creation of inter-communicator}

In XIOS, the inter-communicators are created by the routine \verb|MPI_Intercomm_create|, which is used to bind two intra-communicators into
an inter-communicator. The C prototype is
\begin{center}
 \begin{verbatim}
int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
                         MPI_Comm peer_comm, int remote_leader,
                               int tag, MPI_Comm *newintercomm)
 \end{verbatim}
\end{center}


According to the MPI standard, ``an inter-communication is a point-to-point communication between processes in
different groups''. ``All inter-communicator constructors are blocking except for \verb|MPI_COMM_IDUP| and
require that the local and remote groups be disjoint.''

As in EP the threads are considered as processes, the non-overlapping condition translates to ``non-overlapping'' at the thread
level, which means that one thread cannot belong to both the local group and the remote group. However, the parent processes of the threads can
overlap. As the EP library is built upon an existing MPI implementation, which enforces the non-overlapping condition at the process
level, we can have an issue in this case.

Before digging into this issue, we shall first look at the case where the non-overlapping condition is perfectly respected.
\begin{center}
 \includegraphics[scale=0.3]{intercomm.png}
\end{center}

As shown in the figure, we have two intra-communicators A and B which are totally disjoint both at the thread and at the process level. Each of
the communicators has a local leader. We also assume that both leaders belong to a peer communicator and have rank 4 and 9 respectively.

To create the inter-communicator, all threads of the left intra-comm call:
\begin{verbatim}
MPI_Intercomm_create(commA, local_leader = 2, peer_comm,
                     remote_leader = 9, tag, inter_comm)
\end{verbatim}
and all threads of the right intra-comm call:
\begin{verbatim}
MPI_Intercomm_create(commB, local_leader = 3, peer_comm,
                     remote_leader = 4, tag, inter_comm)
\end{verbatim}

To perform the inter-communicator creation, we follow three steps:
\begin{itemize}
 \item[1.] Determine the leaders and ranks at the process level;
 \item[2.] Call the classic \verb|MPI_Intercomm_create|;
 \item[3.] Create endpoints from the processes and assign them to the threads.
\end{itemize}

\begin{center}
 \includegraphics[scale=0.25]{intercomm_step.png}
\end{center}

If some process is overlapped in the creation of the inter-communicator, we add a \textit{priority check} to assign the process to only
one intra-communicator.
Several possibilities arise:
\begin{itemize}
 \item[1.] The process is shared and contains no local leader $\implies$ the process belongs to the group with the higher rank in the peer comm;
 \item[2.] The process is shared and contains one local leader $\implies$ the process belongs to the group with the leader;
 \item[3.] The process is shared and contains both local leaders: a leader change is performed, the peer communicator is
\verb|MPI_COMM_WORLD|, and we call ``group A'' the group with the smaller peer rank and ``group B'' the group with the higher peer rank.
   \begin{itemize}
    \item[3a.] If group A has at least two processes, the leader of group A is changed to the master thread of the process with the smallest
rank, excluding the overlapped process. The overlapped process belongs to group B.
    \item[3b.] If group A has only one process and group B has at least two processes, then the leader of group B is changed to the
master thread of the process with the smallest rank, excluding the overlapped process. The overlapped process belongs to group A.
    \item[3c.] If both group A and group B have only one process, then a one-process intra-communicator is created, though it will be
considered (labeled) as an inter-communicator.
   \end{itemize}
\end{itemize}

\begin{center}
 \includegraphics[scale=0.25]{intercomm2.png}
\end{center}

\subsection{The merge of inter-communicators}
\verb|MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)| creates an intra-communicator by merging the
local and remote groups of an inter-communicator. All processes should provide the same \verb|high| value within each of the two groups. If
processes in one group provided the value \verb|high=false| and processes in the other group provided the value \verb|high=true|, then the
union orders the ``low'' group before the ``high'' group. If all processes provided the same \verb|high| argument, then the order of the union
is arbitrary. This call is blocking and collective within the union of the two groups. \cite{MPI}

This routine can be considered as the inverse of \verb|MPI_Intercomm_create|. In the inter-communicator creation function, all 5 cases are
eventually transformed into the case where no MPI process is shared by the two groups. It is from this case that the merge function proceeds.

\begin{itemize}
 \item[1.] The classic \verb|MPI_Intercomm_merge| is called and an MPI intra-communicator is created from the two disjoint groups; the MPI
processes are ordered by the \verb|high| value of their local leader.
 \item[2.] Endpoints are created based on the MPI intra-communicator, and the new EP ranks are ordered first according to the \verb|high| value of
each thread and then according to the original EP ranks in the inter-communicator.
\end{itemize}

\begin{center}
 \includegraphics[scale=0.25]{merge.png}
\end{center}

\section{P2P communication on inter-communicators}

In the case of inter-communicators, the \verb|MPI_Comm| class has 3 extra members to determine the topology, along with the original
\verb|rank_map|:
\begin{itemize}
 \item \verb|RANK_MAP local_rank_map[size of commA]|: composed of the EP ranks in commA' or commB';
 \item \verb|RANK_MAP remote_rank_map[size of commB]|: = \verb|local_rank_map| of the remote group;
 \item \verb|RANK_MAP intercomm_rank_map[size of commB']|: = \verb|rank_map| of the remote group';
 \item \verb|RANK_MAP rank_map|: rank map of commA' or commB'.
\end{itemize}

For example, consider the following configuration:
\begin{center}
 \includegraphics[scale = 0.3]{ranks.png}
\end{center}

For all endpoints in commA,
\begin{verbatim}
 local_rank_map={(rank in commA' or commB',
                  rank of leader in MPI_Comm_world)}
               ={(1,0), (0,1), (2,1), (4,1)}

 remote_rank_map={(remote endpoints' rank in commA' or commB',
                   rank of remote leader in MPI_Comm_world)}
                ={(0,0), (1,1), (3,1), (5,1)}
\end{verbatim}

For all endpoints in commA',
\begin{verbatim}
 intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                      remote endpoints' MPI rank in commA' or commB')}
                   ={(0,0), (1,0)}
 rank_map={(local rank in commA', mpi rank in commA')}
         ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
\end{verbatim}


For all endpoints in commB,
\begin{verbatim}
 local_rank_map={(rank in commA' or commB',
                  rank of leader in MPI_Comm_world)}
               ={(0,0), (1,1), (3,1), (5,1)}

 remote_rank_map={(remote endpoints' rank in commA' or commB',
                   rank of remote leader in MPI_Comm_world)}
                ={(1,0), (0,1), (2,1), (4,1)}
\end{verbatim}

For all endpoints in commB',
\begin{verbatim}
 intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                      remote endpoints' MPI rank in commA' or commB')}
                   ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
 rank_map={(local rank in commB', mpi rank in commB')}
         ={(0,0), (1,0)}
\end{verbatim}

When calling a p2p communication on an inter-communicator, we proceed as follows (a sketch of the rank translation follows the list):
\begin{itemize}
 \item[1.] Determine whether the source and the destination endpoints are in the same group by checking their ``labels'':
   \begin{itemize}
    \item[$\bullet$] \verb|src_label = local_rank_map->at(src).second|
    \item[$\bullet$] \verb|dest_label = remote_rank_map->at(dest).second|
   \end{itemize}

 \item[2.] If \verb|src_label == dest_label|, then the communication is in fact an intra-communication. The new source rank and
destination rank, as well as the local ranks, are deduced by:
     \begin{verbatim}
      src_rank = local_rank_map->at(src).first
      dest_rank = remote_rank_map->at(dest).first
      src_rank_local = rank_map->at(src_rank).first
      dest_rank_local = rank_map->at(dest_rank).first
     \end{verbatim}
 \item[3.] If \verb|src_label != dest_label|, then an inter-communication is required. The new ranks are obtained by:
    \begin{verbatim}
      src_rank = local_rank_map->at(src).first
      dest_rank = remote_rank_map->at(dest).first
      src_rank_local = intercomm_rank_map->at(src_rank).first
      dest_rank_local = rank_map->at(dest_rank).first
     \end{verbatim}
 \item[4.] Call the MPI p2p function to start the communication:
 \begin{itemize}
  \item[$\bullet$] if intra-communication, \verb|mpi_comm = commA'_mpi or commB'_mpi|;
  \item[$\bullet$] if inter-communication, \verb|mpi_comm = inter_comm_mpi|.
 \end{itemize}
\end{itemize}
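
A compact sketch of this rank translation is given below; the map members are the ones listed at the beginning of this section, while the function name and the way the maps are accessed through the communicator are illustrative only.

\begin{verbatim}
// Returns true for an intra-communication, false for an inter-communication,
// and computes the ranks used by the underlying MPI call (steps 1-3 above).
bool translate_ranks(const MPI_Comm &comm, int src, int dest,
                     int &src_rank_local, int &dest_rank_local)
{
  int src_label  = comm.local_rank_map->at(src).second;    // step 1
  int dest_label = comm.remote_rank_map->at(dest).second;

  int src_rank  = comm.local_rank_map->at(src).first;
  int dest_rank = comm.remote_rank_map->at(dest).first;

  bool is_intra = (src_label == dest_label);
  if (is_intra)  // step 2: same group, plain intra-communication
    src_rank_local = comm.rank_map->at(src_rank).first;
  else           // step 3: a true inter-communication
    src_rank_local = comm.intercomm_rank_map->at(src_rank).first;

  dest_rank_local = comm.rank_map->at(dest_rank).first;
  return is_intra;  // step 4 then picks the corresponding mpi_comm
}
\end{verbatim}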

\begin{center}
 \includegraphics[scale = 0.3]{sendrecv2.png}
\end{center}

\section{One-sided communications}

One-sided communication is a type of communication in which only one process specifies all communication parameters, both for the
sending side and the receiving side \cite[Chapter~11]{MPI}. Extending this type of communication to the context of endpoints, we encounter
some limitations. In the current work, one-sided communication can only be used in client-server mode, which means that
RMA (remote memory access) can occur only between a server and a client.

The construction of RMA windows is illustrated by the following figure:
\begin{center}
 \includegraphics[scale=0.5]{RMA_schema.pdf}
\end{center}

\begin{itemize}
 \item we determine the maximum number of threads N in the endpoint environment (N=3 in the example);
 \item on the server side, N windows are declared and associated with the same memory address;
 \item we start a loop over $i = 0, \ldots, N-1$ (a sketch of this loop follows the list):
 \begin{itemize}
  \item each endpoint with thread number $i$ declares an RMA window;
  \item the link between the windows on the client side and the $i$-th window on the server side is created via \verb|MPI_Win_create|;
  \item if the number of threads on a certain process is less than N, then a \verb|NULL| pointer is used as the memory address.
 \end{itemize}
\end{itemize}
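
The loop can be sketched as follows. It would be executed once per MPI process (server and clients) since \verb|MPI_Win_create| is collective over the communicator; all names, sizes and the exact call site are illustrative assumptions rather than the actual EP implementation.

\begin{verbatim}
#include <mpi.h>
#include <vector>

// local_buffer[i]: on the server, all entries point to the same memory; on a
// client, entry i is the buffer of its i-th thread, or NULL if this process
// has fewer than N threads.
void create_ep_windows(MPI_Comm mpi_comm, int N,
                       const std::vector<void*> &local_buffer,
                       MPI_Aint buffer_size,
                       std::vector<MPI_Win> &windows)
{
  windows.resize(N);
  for (int i = 0; i < N; ++i)
  {
    void     *base = local_buffer[i];
    MPI_Aint  size = (base != NULL) ? buffer_size : 0;
    // collective over mpi_comm: creates the i-th window, linking the i-th
    // thread of each client with the i-th window on the server
    MPI_Win_create(base, size, 1, MPI_INFO_NULL, mpi_comm, &windows[i]);
  }
}
\end{verbatim}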

With the RMA windows created, we can then perform the one-sided communications: \verb|MPI_Put|, \verb|MPI_Get|, \verb|MPI_Accumulate|,
\verb|MPI_Get_accumulate|, \verb|MPI_Fetch_and_op|, \verb|MPI_Compare_and_swap|, \textit{etc}.

The main idea of all the mentioned communications is to identify the threads involved in the connection. For example, suppose we want
to perform a put operation from EP 2 to the server. We know that EP 2 is thread 0 of process 1. Thus the 0-th window (win A) on the
server side should be used. Once the sender and the receiver are identified, the \verb|MPI_Put| communication can be established.

Other RMA functions, such as \verb|MPI_Win_allocate|, \verb|MPI_Win_fence|, and \verb|MPI_Win_free|, remain nearly the same and we will
skip their details in this document.


\bibliographystyle{plain}
\bibliography{reference}

\end{document}