[[PageOutline]]
Last edited [[Timestamp]] [[BR]]

'''Author''' : rblod (Rachid Benshila) '''ticket''' : #829 '''Branch''' : [https://forge.ipsl.jussieu.fr/nemo/browser/branches/2011/dev_r2769_LOCEAN_dynamic_mem dev_r2769_LOCEAN_dynamic_mem]

----
=== Description ===

The computing aspects of the dynamic memory implementation are already described at http://forge.ipsl.jussieu.fr/nemo/wiki/2011WP/2011Stream2/DynamicMemory, and the possible consequences at https://forge.ipsl.jussieu.fr/nemo/wiki/2011Stream2/DynamicMemory_improvments. This branch deals with the first aspect, i.e. the practical implementation. [[BR]]
Dynamic memory is clearly a step forward, and the current implementation from branch dev_r2586_dynamic_mem is quite clean, with careful checks of the availability of the work arrays. However:
 * assigning work arrays by hand leads to some difficulties, given the number of options and combinations of options available in NEMO
 * in terms of memory, the number of work arrays has to be hard-coded for the maximum combination, i.e. we always use more memory than needed

The investigation of improvements follows these steps:
 * implementation of timing functionality: this topic has been discussed within the NEMO group for years, and since the dynamic memory developments impact all the routines, implementing timing at the same time makes sense
 * small changes to the current implementation: the work arrays are held in a list whose counter is automatically incremented and decremented, so they are no longer assigned by hand. This solves limitation 1 above.
 * a more radical change: the working space is built dynamically. This solves limitation 2.

==== 1- Timing ====

This functionality does not aim to replace the advanced software used for optimisation, but:
 * to give a rough idea of performance (CPU and elapsed time)
 * to use the same tools and format on all computers (the Fortran intrinsic CPU_TIME and MPI_WTIME)

It is based on a linked chain of information, so that new sections and sub-sections can be added dynamically. [[BR]]
Implementation:
 * CALL timing_init in nemogcm_init
 * CALL timing_finalize at the end of nemogcm
 * at the end of step: IF( kt == nit000 ) CALL timing_reset (once the list of variables has been built)
 * in each routine to instrument:
{{{
CALL timing_start('NAME')
CALL timing_stop('NAME')
}}}
Nested sub-sections are allowed, and their time is then subtracted from the parent section, unless the timing calls for the section are written in the following way:
{{{
CALL timing_start('NAME')
CALL timing_stop('NAME',section)
}}}
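As an illustration of the calling pattern above, here is a minimal sketch of an instrumented routine; the routine name and the module name timing are hypothetical, only timing_start/timing_stop come from the description above:
{{{
SUBROUTINE tra_ldf( kt )          ! hypothetical routine, for illustration only
   USE timing                     ! assumed home of timing_start / timing_stop
   INTEGER, INTENT(in) :: kt      ! ocean time-step index
   !
   CALL timing_start('tra_ldf')   ! open the 'tra_ldf' section
   !
   ! ... body of the routine; any nested timing_start/timing_stop
   !     pair opened here is subtracted from the 'tra_ldf' total ...
   !
   CALL timing_stop('tra_ldf')    ! close the section
END SUBROUTINE tra_ldf
}}}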
Sample of output:
{{{
         CNRS - NERC - Met OFFICE - MERCATOR-ocean - CMCC - INGV
     NEMO team - Ocean General Circulation Model version 3.3 (2010)

                         Timing Informations

 Total timing (sum) :
 --------------------
 Elapsed Time (s)  CPU Time (s)
          779.309       773.780

 Averaged timing on all processors :
 -----------------------------------
 Section      Elapsed Time (s)  Elapsed Time (%)  CPU Time (s)  CPU Time (%)  CPU/Elapsed  Max Elapsed (%)  Min elapsed (%)  Frequency
 traldf_iso             25.732            13.207        25.758        13.315        1.001           13.247           13.126     200.00
 ldf_slp                22.223            11.406        22.155        11.453        0.997           11.430           11.392     200.00
 dynspg_ts              20.326            10.433        20.302        10.495        0.999           10.592           10.333     200.00
 zdf_tke                16.236             8.334        16.280         8.416        1.003            8.334            8.334     200.00
 traadv_tvd             15.721             8.069        15.752         8.143        1.002            8.071            8.067     200.00
 nonosc                  9.383             4.816         9.293         4.804        0.990            4.817            4.816     400.00
 ssh_wzv                 3.435             1.763         3.220         1.665        0.937            1.765            1.762     200.00

 MPI summary report :
 --------------------
 Process Rank | Elapsed Time (s) | CPU Time (s) | Ratio CPU/Elapsed
 -------------|------------------|--------------|------------------
            0 |          194.824 |      190.070 |             0.976
            1 |          194.828 |      194.610 |             0.999
            2 |          194.828 |      194.490 |             0.998
            3 |          194.828 |      194.610 |             0.999
 -------------|------------------|--------------|------------------
 Total        |          779.309 |      773.780 |             3.972
 -------------|------------------|--------------|------------------
 Minimum      |          194.824 |      190.070 |             0.976
 -------------|------------------|--------------|------------------
 Maximum      |          194.828 |      194.610 |             0.999
 -------------|------------------|--------------|------------------
 Average      |          194.827 |      193.445 |             0.993
}}}
Comparison with prof output:
{{{
 Name                  %Time  Seconds  Cumsecs  #Calls  msec/call
 .__traldf_iso_NMOD_t   13.4    25.68    53.95     200     128.40
 .__ldfslp_NMOD_ldf_s   10.8    20.79    74.74     200     103.95
 .__dynspg_ts_NMOD_dy   10.1    19.50    94.24     200      97.50
 .__traadv_tvd_NMOD_t    7.8    14.98   109.22     200      74.90
 .__zdftke_NMOD_tke_t    6.1    11.81   121.03     200      59.05
 .__traadv_tvd_NMOD_n    4.6     8.91   139.55     400      22.27
}}}
This was done in https://forge.ipsl.jussieu.fr/nemo/changeset/2771. [[BR]]
Note that timing is not needed for the dynamic allocation work itself; it is just an opportunity, since we are editing all the routines anyway.

==== 2- Auto-assignment ====

Instead of choosing the number of a working array by hand, we introduce, for each type of work array, a structure of arrays with an associated counter:
{{{
TYPE work_space_3d
   LOGICAL :: in_use
   REAL(wp), DIMENSION(:,:,:), POINTER :: wrk
END TYPE
TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces) :: s_wrk_3d
INTEGER :: n_wrk_3d
}}}
Then in each routine we declare the local arrays as pointers:
{{{
REAL(wp), DIMENSION(:,:,:), POINTER :: zwi, zwz
}}}
and we call the subroutine nemo_allocate, which points to a free work array and increments the counter, and nemo_deallocate, which decrements it:
{{{
CALL nemo_allocate(zwi)     ! begin routine
CALL nemo_deallocate(zwi)   ! end routine
}}}
This was implemented in wrk_nemo_2 and tested in traadv_tvd (a sketch of the mechanism is given at the end of this section). To avoid changing all the routines before a definitive choice is made, we keep the old way and duplicate wrk_nemo in wrk_nemo_2 (later saved as wrk_nemo_2_simple); we therefore declare twice the amount of memory, which would of course not be the case if the new scheme were implemented in all routines. [[BR]]
To avoid memory leaks, we could for instance check at the end of step that each counter is equal to one. [[BR]]
This was implemented here: http://forge.ipsl.jussieu.fr/nemo/changeset/2775
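To make the counter mechanism concrete, here is a minimal sketch of how nemo_allocate and nemo_deallocate could manage the structure of arrays. It is not the branch code: the kind parameter, the error handling and the LIFO (last allocated, first released) assumption are stand-ins for illustration:
{{{
MODULE wrk_nemo_sketch                   ! illustrative sketch, not the branch code
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND(12,307)  ! stand-in for NEMO's wp
   INTEGER, PARAMETER :: num_3d_wrkspaces = 10            ! hard-coded maximum (limitation 2)
   TYPE work_space_3d
      LOGICAL                             :: in_use = .FALSE.
      REAL(wp), DIMENSION(:,:,:), POINTER :: wrk => NULL() ! assumed allocated once at init
   END TYPE
   TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces) :: s_wrk_3d
   INTEGER :: n_wrk_3d = 0                ! counter of work arrays currently in use
CONTAINS

   SUBROUTINE nemo_allocate( pwrk )
      !! Point the caller's pointer to the next free work array, increment the counter
      REAL(wp), DIMENSION(:,:,:), POINTER :: pwrk
      IF( n_wrk_3d == num_3d_wrkspaces )   STOP 'no free 3D work array'
      n_wrk_3d = n_wrk_3d + 1
      s_wrk_3d(n_wrk_3d)%in_use = .TRUE.
      pwrk => s_wrk_3d(n_wrk_3d)%wrk
   END SUBROUTINE nemo_allocate

   SUBROUTINE nemo_deallocate( pwrk )
      !! Release the most recently taken work array (assumes LIFO usage), decrement
      REAL(wp), DIMENSION(:,:,:), POINTER :: pwrk
      s_wrk_3d(n_wrk_3d)%in_use = .FALSE.
      n_wrk_3d = n_wrk_3d - 1
      NULLIFY( pwrk )
   END SUBROUTINE nemo_deallocate

END MODULE wrk_nemo_sketch
}}}
With this discipline, the end-of-step check against memory leaks reduces to verifying that the counter has returned to its initial value.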
==== 3- Dynamic dynamic memory ====

The point here is to avoid hard-coding the maximum number of potential work arrays in use, and to optimise the memory size, especially for applications that are expensive in memory (biogeochemistry, assimilation). [[BR]]
Here again, as a preliminary test, the implementation is done in wrk_nemo_2 and the previous one is renamed wrk_nemo_2_simple. [[BR]]
For each type of work array, we use an associated linked list to build the working arrays needed. When we exit a routine, we do not destroy the work array that was created; we just point back to the beginning of the list. If the following routine needs one more array, we simply add an element to the chain. By the end of the first time step we should therefore have allocated exactly the total amount of memory needed.
{{{
TYPE work_space_3d
   LOGICAL :: in_use
   INTEGER :: indic
   REAL(wp), DIMENSION(:,:,:), POINTER :: wrk
   TYPE(work_space_3d), POINTER :: next => NULL()
   TYPE(work_space_3d), POINTER :: prev => NULL()
END TYPE
TYPE(work_space_3d), POINTER :: s_wrk_3d_root, s_wrk_3d
}}}
Then, in the same way as above, in each routine we declare the local arrays as pointers:
{{{
REAL(wp), DIMENSION(:,:,:), POINTER :: zwi, zwz

CALL nemo_allocate(zwi)     ! begin routine
CALL nemo_deallocate(zwi)   ! end routine: point back in the list
}}}
At this point this looks very nice, but I wonder whether we should not simply use a standard dynamic memory implementation instead of trying to do complicated things. I do not have the knowledge to answer. I guess the answer can be formulated in terms of performance:
 * is it more expensive to allocate/deallocate at each call in the standard way,
 * or to call a subroutine pointing to an existing, already allocated array?
Anyway, an example can be found at http://forge.ipsl.jussieu.fr/nemo/changeset/2776

----
=== Testing ===

Testing could consider (where appropriate) other configurations in addition to NVTK.

||NVTK Tested||!'''YES/NO!'''||
||Other model configurations||!'''YES/NO!'''||
||Processor configurations tested||[ Enter processor configs tested here ]||
||If adding new functionality please confirm that the [[BR]]New code doesn't change results when it is switched off [[BR]]and !''works!'' when switched on||!'''YES/NO/NA!'''||

(Answering UNSURE is likely to generate further questions from reviewers.)

'Please add further summary details here'
 * Processor configurations tested
 * etc.

----
=== Bit Comparability ===

||Does this change preserve answers in your tested standard configurations (to the last bit)?||!'''YES/NO!'''||
||Does this change bit compare across various processor configurations (1xM, Nx1 and MxN are recommended)?||!'''YES/NO!'''||
||Is this change expected to preserve answers in all possible model configurations?||!'''YES/NO!'''||
||Is this change expected to preserve all diagnostics? [[BR]]!''Preserving answers in model runs does not necessarily imply preserved diagnostics.!''||!'''YES/NO!'''||

If you answered !'''NO!''' to any of the above, please provide further details:
 * Which routine(s) are causing the difference?
 * Why the changes are not protected by a logical switch or new section-version
 * What is needed to achieve regression with the previous model release (e.g. a regression branch, hand-edits etc). If this is not possible, explain why not.
 * What do you expect to see occur in the test harness jobs?
 * Which diagnostics have you altered and why have they changed?

Please add details here........

----
=== System Changes ===

||Does your change alter namelists?||!'''YES/NO!'''||
||Does your change require a change in compiler options?||!'''YES/NO!'''||

If any of these apply, please document the changes required here.......

----
=== Resources ===

!''Please summarize any changes in runtime or memory use caused by this change......!''

----
=== IPR issues ===

||Has the code been wholly (100%) produced by NEMO developers staff working exclusively on NEMO?||!'''YES/NO!'''||

If No:
 * Identify the collaboration agreement details
 * Ensure the code routine header is in accordance with the agreement (Copyright/Redistribution etc).

Add further details here if required..........