Version 8 (modified by rblod, 13 years ago)
Author : rblod (Rachid Benshila)
ticket : #829
Branch : dev_r2769_LOCEAN_dynamic_mem
Description
The computing aspects of the dynamic memory implementation are already described at http://forge.ipsl.jussieu.fr/nemo/wiki/2011WP/2011Stream2/DynamicMemory, and the possible consequences at https://forge.ipsl.jussieu.fr/nemo/wiki/2011Stream2/DynamicMemory_improvments . This branch deals with the first aspect, i.e. the practical implementation.
Dynamic memory implementation is clearly a step forward, and the current implementation from branch dev_r2586_dynamic_mem is quite clean, with careful checks of the availability of the work arrays. However:
- Assigning work arrays by hand leads to difficulties, given the number of options and combinations of options available in NEMO
- In terms of memory, the number of work arrays has to be hard-coded to the maximum over all combinations, i.e. we always use more memory than needed
The investigation of improvements proceeds in the following steps:
- Implementation of timing functionalities: this topic has been discussed within the NEMO group for years; since the dynamic memory developments affect all the routines, implementing timing at the same time makes sense
- Small changes to the current implementation: the work arrays are in a list that is automatically incremented and decremented, and are no longer assigned by hand. This solves limitation 1 above.
- More radical change: the working space is built dynamically. This solves limitation 2.
1- Timing
This functionality does not aim to replace advanced software used for optimisation, but:
- to give a rough idea of performance (CPU and elapsed time)
- to use the same tools and format on all computers (the Fortran intrinsic CPU_TIME and the MPI routine MPI_WTIME)
It is based on a linked list of records, so that new sections and sub-sections can be added dynamically.
Implementation:
- CALL timing_init in nemogcm_init
- CALL timing_finalize at the end of nemogcm
- at the end of step, IF( kt == nit000) CALL timing_reset (once the list of variables has been built)
- in each routine to instrument: CALL timing_start('NAME') and CALL timing_stop('NAME')
Nested sub-sections are allowed, and their time is then subtracted from the parent section unless the timing calls for the section are made in the following way: CALL timing_start('NAME') and CALL timing_stop('NAME',section)
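The nesting rule above can be illustrated with a minimal, self-contained sketch. It is not the code from the branch: the module name sketch_timing, the fixed-size tables (instead of a linked list), the limit jpmax and the routine timing_report are all assumptions made for illustration, and only CPU time is measured. Time is always charged to the innermost open section, which is one simple way to obtain the "sub-section time subtracted from the parent" behaviour; the variant with the extra section argument is not covered.

```fortran
MODULE sketch_timing
   ! Illustrative sketch only: fixed-size tables stand in for the linked list
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: timing_start, timing_stop, timing_report
   INTEGER, PARAMETER :: dp    = KIND(1.0d0)
   INTEGER, PARAMETER :: jpmax = 32             ! hypothetical limit on sections and nesting depth
   CHARACTER(len=20), SAVE :: cname(jpmax)      ! section names
   REAL(dp), SAVE :: tself(jpmax) = 0._dp       ! CPU time per section, nested sections excluded
   INTEGER , SAVE :: ncall(jpmax) = 0           ! number of calls per section
   INTEGER , SAVE :: istack(jpmax)              ! indices of the currently open sections
   INTEGER , SAVE :: nsec = 0, ndepth = 0       ! known sections, current nesting depth
   REAL(dp), SAVE :: tlast = 0._dp              ! CPU time at the last start/stop event
CONTAINS
   SUBROUTINE tick()
      ! Charge the time elapsed since the last event to the innermost open section
      REAL(dp) :: tnow
      CALL CPU_TIME(tnow)
      IF( ndepth > 0 ) tself(istack(ndepth)) = tself(istack(ndepth)) + (tnow - tlast)
      tlast = tnow
   END SUBROUTINE tick

   SUBROUTINE timing_start(cdname)
      CHARACTER(len=*), INTENT(in) :: cdname
      INTEGER :: ji, id
      CALL tick()
      id = 0
      DO ji = 1, nsec                            ! look for an existing section
         IF( cname(ji) == cdname ) id = ji
      END DO
      IF( id == 0 ) THEN                         ! first call: register a new section
         IF( nsec == jpmax ) STOP 'sketch_timing: too many sections'
         nsec = nsec + 1
         id = nsec
         cname(id) = cdname
      END IF
      IF( ndepth == jpmax ) STOP 'sketch_timing: nesting too deep'
      ncall(id) = ncall(id) + 1
      ndepth = ndepth + 1
      istack(ndepth) = id
   END SUBROUTINE timing_start

   SUBROUTINE timing_stop(cdname)
      CHARACTER(len=*), INTENT(in) :: cdname
      CALL tick()
      IF( ndepth == 0 ) STOP 'sketch_timing: stop without start'
      IF( cname(istack(ndepth)) /= cdname ) STOP 'sketch_timing: unbalanced start/stop'
      ndepth = ndepth - 1
   END SUBROUTINE timing_stop

   SUBROUTINE timing_report()
      INTEGER :: ji
      DO ji = 1, nsec
         WRITE(*,'(a20,i8,f12.3)') cname(ji), ncall(ji), tself(ji)
      END DO
   END SUBROUTINE timing_report
END MODULE sketch_timing

PROGRAM demo_timing
   USE sketch_timing
   IMPLICIT NONE
   INTEGER :: jt
   DO jt = 1, 3
      CALL timing_start('step')
      CALL work(200000)
      CALL timing_start('tra_adv')               ! nested: not counted in 'step'
      CALL work(400000)
      CALL timing_stop('tra_adv')
      CALL timing_stop('step')
   END DO
   CALL timing_report()
CONTAINS
   SUBROUTINE work(kn)
      ! Dummy load standing in for a model routine
      INTEGER, INTENT(in) :: kn
      INTEGER :: ji
      REAL(KIND(1.0d0)) :: zsum
      zsum = 0.d0
      DO ji = 1, kn
         zsum = zsum + SQRT( REAL(ji, KIND(1.0d0)) )
      END DO
      IF( zsum < 0.d0 ) PRINT *, zsum            ! uses zsum so the loop is not optimised away
   END SUBROUTINE work
END PROGRAM demo_timing
```

In this sketch each section reports only its own time, so the per-section times add up to the total instrumented time. A linked list, as used in the branch, would remove the hard-coded jpmax limit.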
Sample of output:
    CNRS - NERC - Met OFFICE - MERCATOR-ocean - CMCC - INGV
    NEMO team
    Ocean General Circulation Model
    version 3.3 (2010)

    Timing Informations

    Total timing (sum) :
    --------------------
    Elapsed Time (s)  CPU Time (s)
             779.309       773.780

    Averaged timing on all processors :
    -----------------------------------
    Section      Elapsed Time (s)  Elapsed Time (%)  CPU Time(s)  CPU Time (%)  CPU/Elapsed  Max Elapsed (%)  Min elapsed (%)  Frequency
    traldf_iso             25.732            13.207       25.758        13.315        1.001           13.247           13.126     200.00
    ldf_slp                22.223            11.406       22.155        11.453        0.997           11.430           11.392     200.00
    dynspg_ts              20.326            10.433       20.302        10.495        0.999           10.592           10.333     200.00
    zdf_tke                16.236             8.334       16.280         8.416        1.003            8.334            8.334     200.00
    traadv_tvd             15.721             8.069       15.752         8.143        1.002            8.071            8.067     200.00
    nonosc                  9.383             4.816        9.293         4.804        0.990            4.817            4.816     400.00
    ssh_wzv                 3.435             1.763        3.220         1.665        0.937            1.765            1.762     200.00

    MPI summary report :
    --------------------
    Process Rank | Elapsed Time (s) | CPU Time (s) | Ratio CPU/Elapsed
    -------------|------------------|--------------|------------------
    0            |          194.824 |      190.070 |             0.976
    1            |          194.828 |      194.610 |             0.999
    2            |          194.828 |      194.490 |             0.998
    3            |          194.828 |      194.610 |             0.999
    -------------|------------------|--------------|------------------
    Total        |          779.309 |      773.780 |             3.972
    -------------|------------------|--------------|------------------
    Minimum      |          194.824 |      190.070 |             0.976
    -------------|------------------|--------------|------------------
    Maximum      |          194.828 |      194.610 |             0.999
    -------------|------------------|--------------|------------------
    Average      |          194.827 |      193.445 |             0.993
Comparison with prof output:
    Name                    %Time  Seconds  Cumsecs  #Calls  msec/call
    .__traldf_iso_NMOD_t     13.4    25.68    53.95     200     128.40
    .__ldfslp_NMOD_ldf_s     10.8    20.79    74.74     200     103.95
    .__dynspg_ts_NMOD_dy     10.1    19.50    94.24     200      97.50
    .__traadv_tvd_NMOD_t      7.8    14.98   109.22     200      74.90
    .__zdftke_NMOD_tke_t      6.1    11.81   121.03     200      59.05
    .__traadv_tvd_NMOD_n      4.6     8.91   139.55     400      22.27
It was done in https://forge.ipsl.jussieu.fr/nemo/changeset/2771
Note that timing is not required in order to change the dynamic allocation; it is simply an opportunity, given that all the routines have to be edited anyway.
2- Auto-assignment
Instead of choosing the index of a work array by hand, we introduce, for each type of work array, an array of structures with an associated counter:
    TYPE work_space_3d
       LOGICAL :: in_use
       REAL(wp), DIMENSION(:,:,:), POINTER :: wrk
    END TYPE
    TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces) :: s_wrk_3d
    INTEGER :: n_wrk_3d
Then in each routine, we declare local arrays as pointers
REAL(wp), DIMENSION (:,:,:), POINTER :: zwi, zwz
We then call the subroutine nemo_allocate, which points the local pointer to a work array and increments the counter, and nemo_deallocate, which decrements it:
    CALL nemo_allocate(zwi)     ! begin routine
    CALL nemo_deallocate(zwi)   ! end routine
This is implemented in wrk_nemo_2 and used, as a test, in traadv_tvd. To avoid changing all routines before a definitive choice is made, we keep the old approach and duplicate wrk_nemo in wrk_nemo_2 (later saved as wrk_nemo_2_simple), so twice the amount of memory is declared.
To avoid memory leaks, we could check, for instance at the end of step, that each counter is equal to one, i.e. that every work array taken by a routine has been released.
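The auto-assignment scheme and the end-of-step check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the content of wrk_nemo_2: the module name sketch_wrk, the routines wrk_init and wrk_check, the pool size of 4 and the array sizes are hypothetical, and only the 3D case is shown. The counter n_wrk_3d is taken to hold the index of the next free work array, so it starts at one.

```fortran
MODULE sketch_wrk
   ! Illustrative sketch only: a pool of 3D work arrays handed out through a counter
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: wp, wrk_init, nemo_allocate, nemo_deallocate, wrk_check
   INTEGER, PARAMETER :: wp = KIND(1.0d0)
   INTEGER, PARAMETER :: num_3d_wrkspaces = 4       ! hypothetical pool size
   TYPE work_space_3d
      LOGICAL :: in_use = .FALSE.
      REAL(wp), DIMENSION(:,:,:), POINTER :: wrk => NULL()
   END TYPE work_space_3d
   TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces), SAVE :: s_wrk_3d
   INTEGER, SAVE :: n_wrk_3d = 1                    ! index of the next free work array
CONTAINS
   SUBROUTINE wrk_init(kx, ky, kz)
      ! Allocate the whole pool once, at start-up
      INTEGER, INTENT(in) :: kx, ky, kz
      INTEGER :: ji
      DO ji = 1, num_3d_wrkspaces
         ALLOCATE( s_wrk_3d(ji)%wrk(kx,ky,kz) )
      END DO
   END SUBROUTINE wrk_init

   SUBROUTINE nemo_allocate(pwrk)
      ! Point pwrk to the next free work array and increment the counter
      REAL(wp), DIMENSION(:,:,:), POINTER :: pwrk
      IF( n_wrk_3d > num_3d_wrkspaces ) STOP 'nemo_allocate: no 3D work array left'
      pwrk => s_wrk_3d(n_wrk_3d)%wrk
      s_wrk_3d(n_wrk_3d)%in_use = .TRUE.
      n_wrk_3d = n_wrk_3d + 1
   END SUBROUTINE nemo_allocate

   SUBROUTINE nemo_deallocate(pwrk)
      ! Release the most recently taken work array and decrement the counter
      REAL(wp), DIMENSION(:,:,:), POINTER :: pwrk
      IF( n_wrk_3d <= 1 ) STOP 'nemo_deallocate: nothing to release'
      n_wrk_3d = n_wrk_3d - 1
      s_wrk_3d(n_wrk_3d)%in_use = .FALSE.
      NULLIFY( pwrk )
   END SUBROUTINE nemo_deallocate

   LOGICAL FUNCTION wrk_check()
      ! .TRUE. when every work array has been released (counter back to one)
      wrk_check = ( n_wrk_3d == 1 )
   END FUNCTION wrk_check
END MODULE sketch_wrk

PROGRAM demo_wrk
   USE sketch_wrk
   IMPLICIT NONE
   REAL(wp), DIMENSION(:,:,:), POINTER :: zwi, zwz
   CALL wrk_init(10, 10, 5)
   CALL nemo_allocate(zwi)                          ! begin routine
   CALL nemo_allocate(zwz)
   zwi(:,:,:) = 1._wp
   zwz(:,:,:) = 2._wp * zwi(:,:,:)
   CALL nemo_deallocate(zwz)                        ! end routine, reverse order
   CALL nemo_deallocate(zwi)
   IF( .NOT. wrk_check() ) STOP 'work array leak detected'   ! e.g. at the end of step
   PRINT *, 'all work arrays released'
END PROGRAM demo_wrk
```

With a single counter per array type, the pool behaves like a stack: the sketch assumes that each routine releases its work arrays before returning, in the reverse order of acquisition. Under that assumption, a counter different from one at the end of a time step indicates a routine that took a work array without releasing it.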
3- Dynamic dynamic memory
Testing
Testing could consider (where appropriate) other configurations in addition to NVTK.
NVTK Tested | '''YES/NO''' |
Other model configurations | '''YES/NO''' |
Processor configurations tested | [ Enter processor configs tested here ] |
If adding new functionality please confirm that the New code doesn't change results when it is switched off and ''works'' when switched on | '''YES/NO/NA''' |
(Answering UNSURE is likely to generate further questions from reviewers.)
'Please add further summary details here'
- Processor configurations tested
- etc
Bit Comparability
Does this change preserve answers in your tested standard configurations (to the last bit) ? | '''YES/NO ''' |
Does this change bit compare across various processor configurations. (1xM, Nx1 and MxN are recommended) | '''YES/NO''' |
Is this change expected to preserve answers in all possible model configurations? | '''YES/NO''' |
Is this change expected to preserve all diagnostics? ''Preserving answers in model runs does not necessarily imply preserved diagnostics.'' | '''YES/NO''' |
If you answered '''NO''' to any of the above, please provide further details:
- Which routine(s) are causing the difference?
- Why the changes are not protected by a logical switch or new section-version
- What is needed to achieve regression with the previous model release (e.g. a regression branch, hand-edits etc). If this is not possible, explain why not.
- What do you expect to see occur in the test harness jobs?
- Which diagnostics have you altered and why have they changed?
Please add details here........
System Changes
Does your change alter namelists? | '''YES/NO ''' |
Does your change require a change in compiler options? | '''YES/NO ''' |
If any of these apply, please document the changes required here.......
Resources
''Please ''summarize'' any changes in runtime or memory use caused by this change......''
IPR issues
Has the code been wholly (100%) produced by NEMO developers staff working exclusively on NEMO? | '''YES/ NO ''' |
If No:
- Identify the collaboration agreement details
- Ensure the code routine header is in accordance with the agreement (Copyright/Redistribution etc).
Add further details here if required..........
Attachments (1)
- bench_allocdyn.pdf (29.0 KB), added by clevy 13 years ago: bench new codes dynamic allocation