WorkingGroups/HPC (diff) – NEMO

Changes between Version 40 and Version 41 of WorkingGroups/HPC


Timestamp:
2017-04-27T20:32:35+02:00
Author:
nicolasmartin
  • WorkingGroups/HPC

    v40 v41  
    1 [[TOC(heading=NEMO_HPC,NEMO_HPC/*, depth=1)]] 
     1[[PageOutline()]] 
    22 
    3 = '''NEMO HPC''' = 
     3= '''HPC WG''' 
    44Working group leader (and responsible for wiki pages) : Mike Bell 
    55 
    66---- 
    7 == Members of the Working group: == 
     7 
     8== Members of the Working group: 
    89 * Jeremy Appleyards (NVIDIA) 
    910 * Lucien Anton (CRAY) 
     
    1516 * Tim Graham (Met Office) 
    1617 * Matt Glover (Met Office) 
    17  * Jason Holt (NOC) 
     18 * Jason Holt (NOC) 
    1819 * Dmitry Kuts (Intel)  
    1920 * Claire Levy (CNRS) 
     
    3132---- 
    3233 
    33 == Minutes of Meetings == 
     34== Minutes of Meetings 
     35[[TOC(WorkingGroups/NEMO_HPC/Mins_2*, depth=1)]] 
    3436[[TitleIndex(WorkingGroups/NEMO_HPC/Mins_2)]] 
    3537 
    3638== Minutes of Subgroup Meetings == 
     39[[TOC(WorkingGroups/NEMO_HPC/Mins_s*, depth=1)]] 
    3740[[TitleIndex(WorkingGroups/NEMO_HPC/Mins_s)]] 
    3841 
    3942---- 
    40 == Documents == 
    4143 
    42  * [https://forge.ipsl.jussieu.fr/nemo/wiki/WorkingGroups/NEMO_HPC/Working_document Current version of NEMO HPC working document]  
    43 ---- 
    44 == Old version of page == 
    45  
    46 Working group leader (and responsible for wiki pages) : Sébastien Masson.[[BR]] 
    47  
     44== Documents 
     45 * [/WorkingGroups/NEMO_HPC/Working_document Current version of NEMO HPC working document]  
    4846 
    4947---- 
    50 == Members of the Working group: == 
     48 
     49== Old version of page 
     50 
     51Working group leader (and responsible for wiki pages) : Sébastien Masson. 
     52 
     53---- 
     54 
     55== Members of the Working group: 
    5156 * Sébastien Masson 
    5257 * Italo Epicoco 
     
    5863 
    5964---- 
    60 == Objectives: == 
     65 
     66== Objectives: 
    6167 * make short term recommendations for improving the performance of the existing system 
    6268 * propose criteria for taking decisions at Gateway 2025 regarding HPC. 
     
    6571 * define a simple configuration (with IO and complex geometry) that will serve as a proof of concept for validating the proposed approach for the future system. 
    6672 
    67 == Some ideas...: == 
    68 Document by Seb detailing some short term actions to help reduce communications (https://forge.ipsl.jussieu.fr/nemo/attachment/wiki/WorkingGroups/NEMO_HPC/HPC_tasks_Masson.doc). 
     73== Some ideas...: 
     74[attachment:wiki:WorkingGroups/NEMO_HPC:HPC_tasks_Masson.doc Document by Seb] detailing some short term actions to help reduce communications. 
    6975 
    7076A strong improvement of NEMO scalability is needed to be able to take advantage of the new machines. This probably means a deep review/rewrite of NEMO code at some point in the future (beyond 5 years from now?). At the same time, we already know that CMIP7 won't use an ocean model that has not been strongly tested and validated and will stick to a NEMO model not so far from the existing one. [[BR]] This means that we need to: 
    7177 
    72   1) keep improving the current structure of NEMO so it works quite efficiently for almost 10 more years (until the end of CMPI7). [[BR]]      2) start to work on a new structure that would fully tested and validated at least for CMIP8 in about 10 years. [[BR]] 
     78 1. keep improving the current structure of NEMO so it works quite efficiently for almost 10 more years (until the end of CMIP7). 
     79 2. start to work on a new structure that would be fully tested and validated at least for CMIP8 in about 10 years. 
    7380 
    74 Based on this, we propose to divide the work according to 3 temporal windows [[BR]] 
     81Based on this, we propose to divide the work according to 3 temporal windows 
    7582 
    76 '''0-3 years''': improvements with existing code: [[BR]] 
     83'''0-3 years''': improvements with existing code: 
    7784 
    78   1) remove solvers and global sums (to be done in 3.7) 1) reduce the number of communications: do less and bigger communications (group communications, use larger halo). main priority: communications in the time splitting and sea-ice rheology. [[BR]]      2) reduce the number of communications: remove useless communications (a lot of them are simply associated with output...) [[BR]]      3) introduce asynchronous communications  [[BR]]      4) check code vectorization (SIMD instructions) [[BR]] 
     85 1. remove solvers and global sums (to be done in 3.7); reduce the number of communications: do fewer and bigger communications (group communications, use larger halo). Main priority: communications in the time splitting and sea-ice rheology. 
     86 2. reduce the number of communications: remove useless communications (a lot of them are simply associated with output...) 
     87 3. introduce asynchronous communications 
     88 4. check code vectorization (SIMD instructions) 
    7989 
    80 '''0-5 years''': improvements through the introduction of OpenMP:  [[BR]] 
     90'''0-5 years''': improvements through the introduction of OpenMP: 
    8191 
    8292  Work initiated by CMCC. Could an implementation such as tiling be efficient with many-core processors? Review lbclnk to be able to deal with MPI and OpenMP. OpenMP along the vertical axis? Find a way to remove implicit schemes? Test different ways to find new sources of parallelism, for example with the help of OpenMP4. Test OpenACC (not that far from OpenMP)? 
    8393 
    84 '''beyond 5 years''': [[BR]] 
     94'''beyond 5 years''': 
    8595 
    8696  GungHo or not GungHo, that is the question... 
     
    8999For the next 2 years, as a start, a workshop to be organized in 2015 on “NEMO in 2025 : routes toward multi-resolution approaches”. 
    90100 
    91 ==  == 
    92 ==  == 
    93 == Comments of group members : == 
    94 '''gurvan''' -- (2014 November 11): 
     101---- 
     102 
     103== Comments of group members : 
     104'''gurvan''' -- (2014 November 11): 
    95105 
    96106  • improving the code efficiency implies using more processors for a given application. This means breaking the current limit of a 35x35 local horizontal domain. The 0-3 years proposals go in that direction. One point is missing: a target for an ORCA 1/36° is a 10x10 local domain to be able to use 1 million cores... In this case, the number of horizontal grid points is the same as the vertical one (about 100 levels is currently what we are running). So, do we have to consider changing the array indexing from i-j-k to k-j-i? 
     
    104114  • Remark: the version of MOM currently under development (MOM5: switch to C-grid, use of finite volume approach,...) is using FMS, a GungHo type approach... and "There are '''dozens''' of scientists and engineers at GFDL focused on meeting the evolving needs of climate scientists pushing the envelope of computational tools for studying climate" 
    105115 
    106 '''Sebastien''' -- (2014 November 17): some ideas I heard about asynchronous communications: 
     116'''Sebastien''' -- (2014 November 17): some ideas I heard about asynchronous communications: 
    107117 
    108118  • compute inner domain during communication of the halo: 
     
    114124  • in 3D loops: hide communication at each level behind the computation of the next level... -> do not do 3D communications but n 2D asynchronous communications. Good only if communications are really hidden by the computation at each level (which will be less and less the case as the size of the subdomain decreases...). Is it really a good idea? 
    115125 
    116 '''Silvia''' -- (2014 November 25): about asynchronous communications: 
     126'''Silvia''' -- (2014 November 25): about asynchronous communications: 
    117127 
    118 • compute inner domain during communication of the halo: 
     128 • compute inner domain during communication of the halo: 
    119129 
    120 In the past, at CMCC we have carried out some optimization activities on a regional configuration (covering the mediterranean basin at 1/16°) of NEMO (v3.2). The performance analysis highlighted the SOR as one of the most computational intensive kernels, so our optimizations have been focused also on it. One activity aimed at overlapping communication and computation changing the algorithm in the following way: (i) halo computation, (ii) asynchronous communications and (iii) computation over the inner domain (overlapped with communication). The new algorithm has been evaluated on the old MareNostrum system (dismissed in 2013), in the context of an HPC-Europa application. It has been theoretically evaluated using the Dimemas tool (developed at BSC) showing that the new algorithm performed better than the old one, but the experimental results did not confirm the expectations. However, we can plan to test the communication/computation overlap paradigm on new architectures. The idea could be to extract some kernels characterized by the "do loops" you talked about, to change the communication algorithm and to test it before deciding to extend the modification to the entire the code. 
     130In the past, at CMCC we carried out some optimization activities on a regional configuration (covering the Mediterranean basin at 1/16°) of NEMO (v3.2). The performance analysis highlighted the SOR as one of the most computationally intensive kernels, so our optimizations also focused on it. One activity aimed at overlapping communication and computation by changing the algorithm in the following way: (i) halo computation, (ii) asynchronous communications and (iii) computation over the inner domain (overlapped with communication). The new algorithm was evaluated on the old MareNostrum system (decommissioned in 2013), in the context of an HPC-Europa application. A theoretical evaluation using the Dimemas tool (developed at BSC) showed that the new algorithm performed better than the old one, but the experimental results did not confirm the expectations. However, we can plan to test the communication/computation overlap paradigm on new architectures. The idea could be to extract some kernels characterized by the "do loops" you talked about, to change the communication algorithm and to test it before deciding to extend the modification to the entire code. 
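
The three-step reordering described here, (i) halo computation, (ii) asynchronous communications, (iii) inner-domain computation, can be illustrated with a toy Python sketch. This is not NEMO code and not the CMCC implementation: a background thread stands in for non-blocking MPI calls, and `exchange_halo`, `step_with_overlap`, the doubling "stencil" and the `delay` parameter are all hypothetical names chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_halo(boundary, delay):
    """Hypothetical stand-in for a non-blocking halo exchange.

    Sleeps to mimic network latency, then echoes the boundary values back
    as if a neighbouring subdomain had sent them.
    """
    time.sleep(delay)
    return list(boundary)


def step_with_overlap(field, delay=0.05):
    """One toy update of a 1D subdomain with communication/computation overlap."""
    # (i) update the points that neighbours need first
    boundary = [2.0 * field[0], 2.0 * field[-1]]
    with ThreadPoolExecutor(max_workers=1) as pool:
        # (ii) start the exchange without waiting for it to finish
        pending = pool.submit(exchange_halo, boundary, delay)
        # (iii) update the inner points while the exchange is in flight
        inner = [2.0 * x for x in field[1:-1]]
        # block only once the inner work is done
        halo = pending.result()
    return [boundary[0]] + inner + [boundary[1]], halo
```

The overlap only pays off when step (iii) takes at least as long as the exchange; as the subdomain shrinks, the inner work shrinks with it and less of the communication time is hidden.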
    121131 
    122 • larger halo: could be done only on some variables that are for example using neighbor of neighbor (see what was done on sor solver). 
     132 • larger halo: could be done only on some variables that, for example, use the neighbor of a neighbor (see what was done on the SOR solver). 
    123133 
    124 Larger halo allows to decrease the communication frequency in spite of the computation of a larger domain. It is needed to identify the best trade-off between communication time decrease and computation time increase, that is the halo dimension which minimizes the total execution time. This dimension could depend on the number of MPI processes, the domain size and some architectural parameters such as the communication latency, … (we have published a work on this aspect  "The performance model for a parallel SOR algorithm using the red-black scheme", Int. J. of High Performance Systems Architecture, 2012 Vol.4, No.2, pp.101 - 109) 
     134A larger halo reduces the communication frequency at the cost of computing over a larger domain. The best trade-off between the decrease in communication time and the increase in computation time has to be identified, that is, the halo dimension which minimizes the total execution time. This dimension could depend on the number of MPI processes, the domain size and some architectural parameters such as the communication latency, … (we have published a work on this aspect: "The performance model for a parallel SOR algorithm using the red-black scheme", Int. J. of High Performance Systems Architecture, 2012 Vol.4, No.2, pp.101 - 109)
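
To make the trade-off concrete, here is a minimal cost-model sketch in Python. It is an assumption-laden illustration, not the model from the cited paper: it supposes a square n x n subdomain, a nearest-neighbour stencil (so a halo of width h allows h steps per exchange), four neighbour messages per exchange, and a simple latency plus per-byte cost. The names `time_per_step` and `best_halo` and all parameters are hypothetical.

```python
def time_per_step(n, h, t_cell, latency, t_byte, word=8):
    """Modelled wall time per time step for an n x n subdomain with halo width h.

    Assumptions (illustrative only): one exchange every h steps, and the
    redundantly computed region shrinks by one ring of cells per step.
    """
    # Cells updated over the h steps that follow one exchange.
    compute = sum((n + 2 * (h - 1 - s)) ** 2 for s in range(h)) * t_cell
    # Four messages, each carrying h rows or columns of n words (corners ignored).
    comm = 4 * (latency + h * n * word * t_byte)
    return (compute + comm) / h


def best_halo(n, t_cell, latency, t_byte, h_max=8):
    """Halo width in 1..h_max that minimizes the modelled time per step."""
    return min(range(1, h_max + 1),
               key=lambda h: time_per_step(n, h, t_cell, latency, t_byte))
```

In this model the latency term is divided by h while the redundant computation grows with h, so a wider halo is only worthwhile when latency dominates, which is more likely for small subdomains.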