Changeset 11528
 Timestamp:
 2019-09-10T18:28:56+02:00
 File:

 1 edited

NEMO/branches/2019/dev_r10984_HPC13_IRRMANN_BDY_optimization/doc/latex/NEMO/subfiles/chap_LBC.tex
r11512 r11528
20 20 {Boundary condition at the coast (\protect\np{rn\_shlat})}
21 21 \label{sec:LBC_coast}
22 %nam_lbc
22 %namlbc
23 23
24 24 \nlst{namlbc}
… …
251 251 \label{sec:LBC_mpp}
252 252
253 For massively parallel processing (mpp), a domain decomposition method is used.
254 The basic idea of the method is to split the large computation domain of a numerical experiment into
255 several smaller domains and solve the set of equations by addressing independent local problems.
256 Each processor has its own local memory and computes the model equation over a subdomain of the whole model domain.
257 The subdomain boundary conditions are specified through communications between processors which
258 are organized by explicit statements (message passing method).
259
260 A big advantage is that the method does not need many modifications of the initial \fortran code.
261 From the modeller's point of view, each sub domain running on a processor is identical to the "monodomain" code.
262 In addition, the programmer manages the communications between subdomains,
263 and the code is faster when the number of processors is increased.
264 The porting of OPA code on an iPSC860 was achieved during Guyon's PhD [Guyon et al. 1994, 1995]
265 in collaboration with CETIIS and ONERA.
266 The implementation in the operational context and the studies of performance on
267 a T3D and T3E Cray computers have been made in collaboration with IDRIS and CNRS.
268 The present implementation is largely inspired by Guyon's work [Guyon 1995].
253 %nammpp
254
255 \nlst{nammpp}
256 %
257
258 For massively parallel processing (mpp), a domain decomposition method is used. The basic idea of the method is to split the large computation domain of a numerical experiment into several smaller domains and solve the set of equations by addressing independent local problems. Each processor has its own local memory and computes the model equation over a subdomain of the whole model domain.
The subdomain boundary conditions are specified through communications between processors which are organized by explicit statements (message passing method). The present implementation is largely inspired by Guyon's work [Guyon 1995].
269 259
270 260 The parallelization strategy is defined by the physical characteristics of the ocean model.
… …
272 262 depend at the very most on one neighbouring point.
273 263 The only nonlocal computations concern the vertical physics
274 (implicit diffusion, turbulent closure scheme, ...) (delocalization over the whole water column),
275 and the solving of the elliptic equation associated with the surface pressure gradient computation
276 (delocalization over the whole horizontal domain).
264 (implicit diffusion, turbulent closure scheme, ...).
277 265 Therefore, a pencil strategy is used for the data substructuration:
278 266 the 3D initial domain is laid out on local processor memories following a 2D horizontal topological splitting.
… …
284 272 each processor sends to its neighbouring processors the update values of the points corresponding to
285 273 the interior overlapping area to its neighbouring subdomain (\ie\ the innermost of the two overlapping rows).
286 The communication is done through the Message Passing Interface (MPI).
274 Communications are first done according to the east-west direction and next according to the north-south direction. There are no specific communications for the corners. The communication is done through the Message Passing Interface (MPI) and requires \key{mpp\_mpi}. Use also \key{mpi2} if MPI3 is not available on your computer.
287 275 The data exchanges between processors are required at the very place where
288 276 lateral domain boundary conditions are set in the monodomain computation:
289 277 the \rou{lbc\_lnk} routine (found in \mdl{lbclnk} module) which manages such conditions is interfaced with
290 routines found in \mdl{lib\_mpp} module when running on an MPP computer (\ie\ when \key{mpp\_mpi} defined).
291 It has to be pointed out that when using the MPP version of the model,
292 the east-west cyclic boundary condition is done implicitly,
293 whilst the south-symmetric boundary condition option is not available.
278 routines found in \mdl{lib\_mpp} module.
279 The output file \textit{communication\_report.txt} lists how many communications
280 each routine performs during one time step of the model.\\
294 281
295 282 %>>>>>>>>>>>>>>>>>>>>>>>>>>>>
… …
305 292 %>>>>>>>>>>>>>>>>>>>>>>>>>>>>
306 293
307 In the standard version of \NEMO, the splitting is regular and arithmetic.
308 The i-axis is divided by \jp{jpni} and
309 the j-axis by \jp{jpnj} for a number of processors \jp{jpnij} most often equal to $jpni \times jpnj$
310 (parameters set in \nam{mpp} namelist).
311 Each processor is independent and without message passing or synchronous process,
312 programs run alone and access just its own local memory.
313 For this reason, the main model dimensions are now the local dimensions of the subdomain (pencil) that
314 are named \jp{jpi}, \jp{jpj}, \jp{jpk}.
315 These dimensions include the internal domain and the overlapping rows.
316 The number of rows to exchange (known as the halo) is usually set to one (\jp{jpreci}=1, in \mdl{par\_oce}).
317 The whole domain dimensions are named \jp{jpiglo}, \jp{jpjglo} and \jp{jpk}.
318 The relationship between the whole domain and a subdomain is:
319 \[
320 jpi = ( jpiglo-2*jpreci + (jpni-1) ) / jpni + 2*jpreci
321 jpj = ( jpjglo-2*jprecj + (jpnj-1) ) / jpnj + 2*jprecj
322 \]
323 where \jp{jpni}, \jp{jpnj} are the number of processors following the i- and j-axis.
324
325 One also defines variables nldi and nlei which correspond to the internal domain bounds,
326 and the variables nimpp and njmpp which are the position of the (1,1) gridpoint in the global domain.
294 In \NEMO, the splitting is regular and arithmetic. The total number of subdomains corresponds to the number of MPI processes allocated to \NEMO\ when the model is launched (\ie\ mpirun -np x ./nemo will automatically give x subdomains). The i-axis is divided by \np{jpni} and the j-axis by \np{jpnj}. These parameters are defined in \nam{mpp} namelist. If \np{jpni} and \np{jpnj} are < 1, they will be automatically redefined in the code to give the best domain decomposition (see below).
295
296 Each processor is independent and without message passing or synchronous process, programs run alone and access just its own local memory. For this reason, the main model dimensions are now the local dimensions of the subdomain (pencil) that are named \jp{jpi}, \jp{jpj}, \jp{jpk}.
297 These dimensions include the internal domain and the overlapping rows. The number of rows to exchange (known as the halo) is usually set to one (nn\_hls=1, in \mdl{par\_oce}, and must be kept to one until further notice). The whole domain dimensions are named \jp{jpiglo}, \jp{jpjglo} and \jp{jpk}.
The relationship between the whole domain and a subdomain is:
298 \[
299 jpi = ( jpiglo-2\times nn\_hls + (jpni-1) ) / jpni + 2\times nn\_hls
300 \]
301 \[
302 jpj = ( jpjglo-2\times nn\_hls + (jpnj-1) ) / jpnj + 2\times nn\_hls
303 \]
304
305 One also defines variables nldi and nlei which correspond to the internal domain bounds, and the variables nimpp and njmpp which are the position of the (1,1) gridpoint in the global domain (\autoref{fig:mpp}). Note that since version 4, there is no longer an extra-halo area as defined in \autoref{fig:mpp}, so \jp{jpi} is now always equal to nlci and \jp{jpj} equal to nlcj.
306
327 307 An element of $T_{l}$, a local array (subdomain) corresponds to an element of $T_{g}$,
328 308 a global array (whole domain) by the relationship:
… …
331 311 T_{g} (i+nimpp-1,j+njmpp-1,k) = T_{l} (i,j,k),
332 312 \]
333 with $1 \leq i \leq jpi$, $1 \leq j \leq jpj $ , and $1 \leq k \leq jpk$.
334
335 Processors are numbered from 0 to $jpnij-1$, the number is saved in the variable nproc.
336 In the standard version, a processor has no more than
337 four neighbouring processors named nono (for north), noea (east), noso (south) and nowe (west) and
338 two variables, nbondi and nbondj, indicate the relative position of the processor:
339 \begin{itemize}
340 \item nbondi = -1 an east neighbour, no west processor,
341 \item nbondi = 0 an east neighbour, a west neighbour,
342 \item nbondi = 1 no east processor, a west neighbour,
343 \item nbondi = 2 no splitting following the i-axis.
344 \end{itemize}
345 During the simulation, processors exchange data with their neighbours.
346 If there is effectively a neighbour, the processor receives variables from this processor on its overlapping row,
347 and sends the data issued from internal domain corresponding to the overlapping row of the other processor.
348
349
350 The \NEMO\ model computes equation terms with the help of mask arrays (0 on land points and 1 on sea points).
351 It is easily readable and very efficient in the context of a computer with vectorial architecture.
352 However, in the case of a scalar processor, computations over the land regions become more expensive in
353 terms of CPU time.
354 It is worse when we use a complex configuration with a realistic bathymetry like the global ocean where
355 more than 50 \% of points are land points.
356 For this reason, a preprocessing tool can be used to choose the mpp domain decomposition with a maximum number of
357 only land points processors, which can then be eliminated (\autoref{fig:mppini2})
358 (For example, the mpp\_optimiz tools, available from the DRAKKAR web site).
359 This optimisation is dependent on the specific bathymetry employed.
360 The user then chooses optimal parameters \jp{jpni}, \jp{jpnj} and \jp{jpnij} with $jpnij < jpni \times jpnj$,
361 leading to the elimination of $jpni \times jpnj - jpnij$ land processors.
362 When those parameters are specified in \nam{mpp} namelist,
363 the algorithm in the \rou{inimpp2} routine sets each processor's parameters (nbound, nono, noea,...) so that
364 the land-only processors are not taken into account.
365
366 \gmcomment{Note that the inimpp2 routine is general so that the original inimpp
367 routine should be suppressed from the code.}
368
369 When land processors are eliminated,
370 the value corresponding to these locations in the model output files is undefined.
371 Note that this is a problem for the meshmask file which requires to be defined over the whole domain.
372 Therefore, user should not eliminate land processors when creating a meshmask file
373 (\ie\ when setting a nonzero value to \np{nn\_msh}).
313 with $1 \leq i \leq jpi$, $1 \leq j \leq jpj $ , and $1 \leq k \leq jpk$.
314
315 The 1d arrays $mig(1:\jp{jpi})$ and $mjg(1:\jp{jpj})$, defined in \rou{dom\_glo} routine (\mdl{domain} module), should be used to get global domain indices from local domain indices.
The 1d arrays, $mi0(1:\jp{jpiglo})$, $mi1(1:\jp{jpiglo})$ and $mj0(1:\jp{jpjglo})$, $mj1(1:\jp{jpjglo})$ have the reverse purpose and should be used to define loop indices expressed in global domain indices (see examples in \mdl{dtastd} module).\\
316
317 The \NEMO\ model computes equation terms with the help of mask arrays (0 on land points and 1 on sea points). It is therefore possible that an MPI subdomain contains only land points. To save resources, we try to suppress from the computational domain as many land subdomains as possible. For example if $N_{mpi}$ processes are allocated to NEMO, the domain decomposition will be given by the following equation:
318 \[
319 N_{mpi} = jpni \times jpnj - N_{land} + N_{useless}
320 \]
321 $N_{land}$ is the total number of land subdomains in the domain decomposition defined by \np{jpni} and \np{jpnj}. $N_{useless}$ is the number of land subdomains that are kept in the computational domain in order to make sure that $N_{mpi}$ MPI processes are indeed allocated to a given subdomain. The values of $N_{mpi}$, \np{jpni}, \np{jpnj}, $N_{land}$ and $N_{useless}$ are printed in the output file \texttt{ocean.output}. $N_{useless}$ must, of course, be as small as possible to limit the waste of resources. A warning is issued in \texttt{ocean.output} if $N_{useless}$ is not zero. Note that a nonzero value of $N_{useless}$ is usually required when using AGRIF as, up to now, the parent grid and each of the child grids must use all the $N_{mpi}$ processes.
322
323 If the domain decomposition is automatically defined (when \np{jpni} and \np{jpnj} are < 1), the decomposition chosen by the model will minimise the subdomain size (defined as $max_{all domains}(jpi \times jpj)$) and maximize the number of eliminated land subdomains. This means that no other domain decomposition (a set of \np{jpni} and \np{jpnj} values) will use fewer processes than $(jpni \times jpnj - N_{land})$ and get a smaller subdomain size.
324 In order to specify $N_{mpi}$ properly (minimize $N_{useless}$), you must run the model once with \np{ln\_list} activated. In this case, the model will start the initialisation phase, print the list of optimum decompositions ($N_{mpi}$, \np{jpni} and \np{jpnj}) in \texttt{ocean.output} and directly abort. The maximum value of $N_{mpi}$ tested in this list is given by $max(N_{MPI\_tasks}, \np{jpni} \times \np{jpnj})$. For example, running the model on 40 nodes with ln\_list activated and $\np{jpni} = 10000$ and $\np{jpnj} = 1$ will print the list of optimum domain decompositions from 1 to about 10000.
325
326 Processors are numbered from 0 to $N_{mpi} - 1$. Subdomains containing some ocean points are numbered first from 0 to $jpni * jpnj - N_{land} - 1$. The remaining $N_{useless}$ land subdomains are numbered next, which means that, for a given (\np{jpni}, \np{jpnj}), the numbers attributed to the ocean subdomains do not vary with $N_{useless}$.
327
328 When land processors are eliminated, the value corresponding to these locations in the model output files is undefined. \np{ln\_mskland} must be activated in order to avoid Not a Number values in output files. Note that it is better to not eliminate land processors when creating a meshmask file (\ie\ when setting a nonzero value to \np{nn\_msh}).
374 329
375 330 %>>>>>>>>>>>>>>>>>>>>>>>>>>>>
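
Illustration (not part of the changeset): the decomposition arithmetic described in the revised text can be checked with a few lines of Python. This is only a sketch under stated assumptions. The function names are hypothetical and are not NEMO routines. The sample sizes used in the usage note are made up. Integer division is assumed to truncate, as it does in Fortran for positive operands.

```python
def local_size(n_glo: int, n_sub: int, nn_hls: int = 1) -> int:
    """Local dimension (halo included) along one axis.

    Follows jpi = (jpiglo - 2*nn_hls + (jpni - 1)) / jpni + 2*nn_hls,
    using truncating integer division (assumption: all values positive).
    """
    return (n_glo - 2 * nn_hls + (n_sub - 1)) // n_sub + 2 * nn_hls


def local_to_global(i: int, nimpp: int) -> int:
    """Global index of the 1-based local index i: i + nimpp - 1."""
    return i + nimpp - 1


def useless_subdomains(n_mpi: int, jpni: int, jpnj: int, n_land: int) -> int:
    """Solve N_mpi = jpni*jpnj - N_land + N_useless for N_useless."""
    n_useless = n_mpi - (jpni * jpnj - n_land)
    # N_useless counts land subdomains that are kept, so it must lie
    # between 0 and N_land for the decomposition to be usable.
    if not 0 <= n_useless <= n_land:
        raise ValueError("decomposition does not match the number of MPI processes")
    return n_useless
```

For a hypothetical 182 x 149 global grid split with jpni = 4 and jpnj = 2, local_size gives the jpi and jpj values each process would allocate. useless_subdomains(30, 8, 5, 12) gives the number of land subdomains that would have to be kept if 30 MPI processes were requested for an 8 x 5 decomposition containing 12 land subdomains.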