Changes between Version 16 and Version 17 of ticket/0829
Timestamp: 2011-10-24T12:15:26+02:00
[[PageOutline]] Last edited [[Timestamp]]

'''Author''' : rblod (Rachid Benshila)

'''ticket''' : #829

'''Branch''' : [https://forge.ipsl.jussieu.fr/nemo/browser/branches/2011/dev_r2769_LOCEAN_dynamic_mem dev_r2769_LOCEAN_dynamic_mem]

----
=== Description ===
Computing aspects of the dynamic memory implementation are already described at http://forge.ipsl.jussieu.fr/nemo/wiki/2011WP/2011Stream2/DynamicMemory, and possible consequences at https://forge.ipsl.jussieu.fr/nemo/wiki/2011Stream2/DynamicMemory_improvments. This branch deals with the first aspect, i.e. the practical implementation. [[BR]]
Dynamic memory implementation is clearly a step forward, and the current implementation from branch dev_r2586_dynamic_mem is quite clean, with careful checks of the availability of the work arrays.
However:

 * Assigning the work arrays by hand leads to difficulties, given the number of options and combinations of options available in NEMO
 * In terms of memory, the number of work arrays has to be hard-coded to the maximum combination, i.e. we always use more memory than needed

The investigation of improvements follows these steps:

 * Implementation of timing functionality: this topic has been discussed within the NEMO group for years; since the dynamic memory developments impact all the routines, implementing timing at the same time makes sense
 * Small changes to the current implementation: the work arrays are held in a list that is automatically incremented and decremented, and no longer assigned by hand. This solves limitation 1 above.
 * A more radical change: the working space is built dynamically. This solves limitation 2.
==== 1- Timing ====
This functionality does not aim to replace the advanced software used for optimisation, but rather:

 * to give a rough idea of performance (CPU and elapsed)
 * to use the same tools and format on all computers (the Fortran intrinsic CPU_TIME and MPI_WTIME)

It is based on a linked chain of information, so that new sections and sub-sections can be added dynamically.[[BR]]
Implementation:

 * CALL timing_init in nemogcm_init
 * CALL timing_finalize at the end of nemogcm
 * at the end of step, IF( kt == nit000 ) CALL timing_reset (once the list of variables has been built)
 * in each routine to instrument: CALL timing_start('NAME') ... CALL timing_stop('NAME')

Nested sub-sections are allowed, and their time is then subtracted from the parent section, unless the section is timed in the following way: CALL timing_start('NAME') ... CALL timing_stop('NAME',section)

Sample of output:

{{{
CNRS - NERC - Met OFFICE - MERCATOR-ocean - CMCC - INGV
…
}}}

Comparison with prof output:

{{{
Name                  %Time  Seconds  Cumsecs  #Calls  msec/call
…
.__traadv_tvd_NMOD_n    4.6     8.91   139.55     400      22.27
}}}
It was done in https://forge.ipsl.jussieu.fr/nemo/changeset/2771[[BR]]
Note that timing is not needed for the dynamic allocation change; it is just an opportunity, since we are editing all the routines anyway.
==== 2- Auto-assignment ====
Instead of choosing the number of a work array by hand, we introduce for each type of work array a structure of arrays, with an associated counter:

{{{
TYPE work_space_3d
…
}}}
Then, in each routine, we declare the local arrays as pointers

{{{
REAL(wp), DIMENSION (:,:,:), POINTER :: zwi, zwz
}}}
and we call the subroutine nemo_allocate, which points to a work array and increments the counter, and nemo_deallocate, which decrements it

{{{
CALL nemo_allocate(zwi)   ! begin routine
CALL nemo_deallocate(zwi) ! end routine
}}}
This was implemented in wrk_nemo_2 and tested in traadv_tvd.
To avoid changing all the routines before a definitive choice is made, we chose to keep the old way and to duplicate wrk_nemo in wrk_nemo_2 (later saved as wrk_nemo_2_simple); we therefore declare twice the amount of memory, which would of course not be the case if it were implemented in all routines.[[BR]]
To avoid memory leaks, we could check, for instance at the end of step, that each counter is equal to one.[[BR]]
This was implemented here: http://forge.ipsl.jussieu.fr/nemo/changeset/2775 [[BR]]
and wrk_nemo_2 is here: http://forge.ipsl.jussieu.fr/nemo/browser/branches/2011/dev_r2769_LOCEAN_dynamic_mem/NEMOGCM/NEMO/OPA_SRC/wrk_nemo_2.F90_simple

==== 3- Dynamic dynamic memory ====
The point here is to avoid hard-coding the maximum number of potential work arrays in use, and to optimise the memory size, especially for applications that are expensive in memory (biogeochemistry, assimilation).[[BR]]
Here again, as a preliminary test, the implementation is done in wrk_nemo_2.F90 and the previous one is renamed wrk_nemo_2_simple.[[BR]]
For each type of work array, we use an associated chained list to build the working arrays needed.
But when we exit a routine, we do not destroy the working arrays created; we just point back to the beginning of the list. If the following routine needs one more array, we just add an element to the chain. In this way, at the end of the first time step we should have built exactly the total amount of memory needed.

{{{
TYPE work_space_3d
…
TYPE(work_space_3d), POINTER :: s_wrk_3d_root, s_wrk_3d
}}}
Then, in the same way as above, in each routine we declare the local arrays as pointers

{{{
REAL(wp), DIMENSION (:,:,:), POINTER :: zwi, zwz
…
CALL nemo_deallocate(zwi) ! to come back
}}}
At this point, this looks very nice, but I wonder whether we should not simply use a standard dynamic memory implementation, instead of trying to do complicated things. I do not have the knowledge to answer. I guess the answer can be formulated in terms of performance:

 * is it more expensive to allocate/deallocate at each call in the standard way,
 * or to call a subroutine pointing to an existing, already allocated array?

I also have some concerns about future evolutions.
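The chained-list scheme itself can be sketched as follows (again Python for illustration; `WrkChain` and `WrkNode` are hypothetical names standing in for the Fortran `s_wrk_3d_root`/`s_wrk_3d` pointers above). The key property is that nothing is ever freed: releasing an array only moves the cursor back, and the chain grows only while a new peak demand is being discovered.

```python
class WrkNode:
    """One element of the chained list: a work array plus a link."""
    def __init__(self, shape):
        self.array = bytearray(8 * shape[0] * shape[1] * shape[2])
        self.next = None

class WrkChain:
    def __init__(self, shape):
        self.shape = shape
        self.root = None        # first element (s_wrk_3d_root)
        self.cur = None         # last element handed out (s_wrk_3d)
        self.allocated = 0      # total number of arrays ever built

    def allocate(self):
        """Hand out the next array, growing the chain only if exhausted."""
        nxt = self.root if self.cur is None else self.cur.next
        if nxt is None:         # peak demand not reached before: grow
            nxt = WrkNode(self.shape)
            self.allocated += 1
            if self.cur is None:
                self.root = nxt
            else:
                self.cur.next = nxt
        self.cur = nxt
        return nxt.array

    def deallocate(self):
        """Come back one step: the array is kept, only the cursor moves."""
        if self.cur is self.root:
            self.cur = None
        else:
            p = self.root       # linear rewind is enough for a sketch
            while p.next is not self.cur:
                p = p.next
            self.cur = p
```

After the first time step the chain holds exactly the peak number of arrays that were ever in use at once, which is the memory optimisation claimed above.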
If we imagine having a large variety of arrays (not only jpi,jpj,jpk), it could become hard to maintain.[[BR]]
[[BR]]

Anyway, an example can be found at http://forge.ipsl.jussieu.fr/nemo/changeset/2776[[BR]]
and wrk_nemo_2 is at http://forge.ipsl.jussieu.fr/nemo/browser/branches/2011/dev_r2769_LOCEAN_dynamic_mem/NEMOGCM/NEMO/OPA_SRC/wrk_nemo_2.F90

----
Testing could consider (where appropriate) other configurations in addition to NVTK.

|| NVTK Tested || !'''YES/NO!''' ||
|| Other model configurations || !'''YES/NO!''' ||
|| Processor configurations tested || [ Enter processor configs tested here ] ||
|| If adding new functionality please confirm that the [[BR]]new code doesn't change results when it is switched off [[BR]]and !''works!'' when switched on || !'''YES/NO/NA!''' ||

(Answering UNSURE is likely to generate further questions from reviewers.)

'''Testing dynamical memory:'''

A sequence of tests has been done on Power6 (vargas) and on titane (Bull Novascale) in order to compare the three ways of coding the dynamical allocation.[[BR]]
The GYRE configuration has been used, and the interface for the new dynamical allocation has been coded for this configuration.[[BR]]
Testing has been done for two dimensions (CFG=24 and CFG=96, equivalent to global 1/4°).
For all tests, the model runs properly, and elapsed and CPU times are equivalent for the three solutions for a given configuration. Since the interface is identical for the two new routines, it seems reasonable to implement the best solution, i.e. the last one, which optimises memory.

Implementation for GYRE took around 2 days.

 * Processor configurations tested
…

=== Bit Comparability ===
|| Does this change preserve answers in your tested standard configurations (to the last bit)? || !'''YES/NO !''' ||
|| Does this change bit compare across various processor configurations? (1xM, Nx1 and MxN are recommended) || !'''YES/NO!''' ||
|| Is this change expected to preserve answers in all possible model configurations? || !'''YES/NO!''' ||
|| Is this change expected to preserve all diagnostics? [[BR]]!,,!''Preserving answers in model runs does not necessarily imply preserved diagnostics. !'' || !'''YES/NO!''' ||

If you answered !'''NO!''' to any of the above, please provide further details:

----
=== System Changes ===
|| Does your change alter namelists? || !'''YES/NO !''' ||
|| Does your change require a change in compiler options? || !'''YES/NO !''' ||

If any of these apply, please document the changes required here.......

----
=== IPR issues ===
|| Has the code been wholly (100%) produced by NEMO developers staff working exclusively on NEMO? || !'''YES/NO !''' ||

If No: