
Author : rblod (Rachid Benshila)

ticket : #829

Branch : dev_r2769_LOCEAN_dynamic_mem


The computing aspects of the dynamic memory implementation are already described there, and the possible consequences there. This branch deals with the first aspect, i.e. the practical implementation.
Dynamic memory implementation is clearly a step forward, and the current implementation from branch dev_r2586_dynamic_mem is quite clean, with careful checks of the availability of the work arrays. However:

  • Assigning work arrays by hand leads to difficulties, given the number of options and combinations of options available in NEMO
  • In terms of memory, the number of work arrays has to be hard-coded to the maximum over all combinations, i.e. we always use more memory than needed

The investigation of improvements proceeds in the following steps:

  • Implementation of timing functionalities: this topic has been discussed within the NEMO group for years; since the dynamic memory developments impact all the routines, implementing timing at the same time makes sense
  • Small changes to the current implementation: the work arrays are in a list that is automatically incremented and decremented, and are no longer assigned by hand. This solves limitation 1 above.
  • More radical change: the working space is built dynamically. This solves limitation 2.

1- Timing

This functionality does not aim to replace the advanced software used for optimisation, but:

  • to give a rough idea of performance (CPU and elapsed time)
  • to use the same tools and format on all computers (the Fortran intrinsic CPU_TIME and the MPI function MPI_WTIME)

It is based on a linked list of information, so that new sections and sub-sections can be added dynamically:

  • CALL timing_init in nemogcm_init
  • CALL timing_finalize at the end of nemogcm
  • at the end of step, IF( kt == nit000 ) CALL timing_reset (once the list of variables has been built)
  • in each routine to instrument: CALL timing_start('NAME') at the beginning and CALL timing_stop('NAME') at the end

Nested sub-sections are allowed, and their time is then subtracted from that of the parent section, unless the timing calls for the section are made in the following way: CALL timing_start('NAME') CALL timing_stop('NAME',section)
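As an illustration of the linked-list idea, here is a minimal, self-contained sketch. It is not the code of the branch: the module name timing_sketch, the type timer and the routine timing_report are hypothetical, SYSTEM_CLOCK stands in for the MPI timer so that the example runs without MPI, and the subtraction of nested sub-sections is left out.

```fortran
MODULE timing_sketch
   ! Hypothetical, simplified stand-in for the timing module described above
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: timing_start, timing_stop, timing_report

   INTEGER, PARAMETER :: dp = KIND(1.0d0)
   INTEGER, PARAMETER :: i8 = SELECTED_INT_KIND(18)

   TYPE timer
      CHARACTER(LEN=20)    :: name
      REAL(dp)             :: t_cpu   = 0._dp   ! accumulated CPU time (s)
      REAL(dp)             :: t_clock = 0._dp   ! accumulated elapsed time (s)
      REAL(dp)             :: cpu0    = 0._dp   ! CPU time at the last start
      REAL(dp)             :: clock0  = 0._dp   ! wall clock at the last start
      INTEGER              :: ncalls  = 0       ! number of timing_start calls
      TYPE(timer), POINTER :: next => NULL()
   END TYPE timer

   TYPE(timer), POINTER :: s_root => NULL()     ! head of the linked list

CONTAINS

   FUNCTION find_or_add( cdname ) RESULT( s )
      ! Return the section called cdname, adding it to the list if it is new
      CHARACTER(LEN=*), INTENT(in) :: cdname
      TYPE(timer), POINTER :: s
      s => s_root
      DO WHILE( ASSOCIATED(s) )
         IF( s%name == cdname ) RETURN
         s => s%next
      END DO
      ALLOCATE( s )
      s%name = cdname
      s%next => s_root                          ! new sections go to the head
      s_root => s
   END FUNCTION find_or_add

   FUNCTION wall_time() RESULT( pt )
      ! Elapsed time in seconds from the system clock
      REAL(dp)    :: pt
      INTEGER(i8) :: icount, irate
      CALL SYSTEM_CLOCK( COUNT=icount, COUNT_RATE=irate )
      pt = REAL( icount, dp ) / REAL( irate, dp )
   END FUNCTION wall_time

   SUBROUTINE timing_start( cdname )
      CHARACTER(LEN=*), INTENT(in) :: cdname
      TYPE(timer), POINTER :: s
      s => find_or_add( cdname )
      s%ncalls = s%ncalls + 1
      CALL CPU_TIME( s%cpu0 )
      s%clock0 = wall_time()
   END SUBROUTINE timing_start

   SUBROUTINE timing_stop( cdname )
      CHARACTER(LEN=*), INTENT(in) :: cdname
      TYPE(timer), POINTER :: s
      REAL(dp) :: zcpu
      s => find_or_add( cdname )
      CALL CPU_TIME( zcpu )
      s%t_cpu   = s%t_cpu   + zcpu        - s%cpu0
      s%t_clock = s%t_clock + wall_time() - s%clock0
   END SUBROUTINE timing_stop

   SUBROUTINE timing_report
      ! Walk the list and print one line per section
      TYPE(timer), POINTER :: s
      s => s_root
      DO WHILE( ASSOCIATED(s) )
         WRITE(*,'(a20,2f12.3,i8)') s%name, s%t_clock, s%t_cpu, s%ncalls
         s => s%next
      END DO
   END SUBROUTINE timing_report

END MODULE timing_sketch

PROGRAM demo_timing
   USE timing_sketch
   IMPLICIT NONE
   INTEGER :: jt, ji
   REAL(KIND(1.0d0)) :: zsum
   zsum = 0.d0
   DO jt = 1, 200                               ! 200 "time steps"
      CALL timing_start('work')
      DO ji = 1, 100000
         zsum = zsum + SQRT( REAL( ji, KIND(1.0d0) ) )
      END DO
      CALL timing_stop('work')
   END DO
   CALL timing_report
   WRITE(*,*) 'checksum :', zsum
END PROGRAM demo_timing
```

Because sections are looked up by name and created on first use, a new section needs no declaration anywhere else, which is what allows every routine to be instrumented independently.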

Sample of output:

       CNRS - NERC - Met OFFICE - MERCATOR-ocean - CMCC - INGV
                              NEMO team
                   Ocean General Circulation Model
                         version 3.3  (2010)

                         Timing Informations

 Total timing (sum) :
 Elapsed Time (s)  CPU Time (s)
          779.309       773.780

 Averaged timing on all processors :
 Section             Elapsed Time (s)  Elapsed Time (%)  CPU Time(s)  CPU Time (%)  CPU/Elapsed  Max Elapsed (%)  Min elapsed (%)  Frequency
 traldf_iso                    25.732            13.207       25.758        13.315        1.001           13.247           13.126     200.00
 ldf_slp                       22.223            11.406       22.155        11.453        0.997           11.430           11.392     200.00
 dynspg_ts                     20.326            10.433       20.302        10.495        0.999           10.592           10.333     200.00
 zdf_tke                       16.236             8.334       16.280         8.416        1.003            8.334            8.334     200.00
 traadv_tvd                    15.721             8.069       15.752         8.143        1.002            8.071            8.067     200.00
 nonosc                         9.383             4.816        9.293         4.804        0.990            4.817            4.816     400.00
 ssh_wzv                        3.435             1.763        3.220         1.665        0.937            1.765            1.762     200.00

 MPI summary report :

 Process Rank | Elapsed Time (s) | CPU Time (s) | Ratio CPU/Elapsed
    0         |     194.824      |     190.070  |      0.976
    1         |     194.828      |     194.610  |      0.999
    2         |     194.828      |     194.490  |      0.998
    3         |     194.828      |     194.610  |      0.999
 Total        |     779.309      |     773.780  |      3.972
 Minimum      |     194.824      |     190.070  |      0.976
 Maximum      |     194.828      |     194.610  |      0.999
 Average      |     194.827      |     193.445  |      0.993

Comparison with prof output:

Name                 %Time     Seconds     Cumsecs  #Calls   msec/call
.__traldf_iso_NMOD_t  13.4       25.68       53.95     200    128.40
.__ldfslp_NMOD_ldf_s  10.8       20.79       74.74     200    103.95
.__dynspg_ts_NMOD_dy  10.1       19.50       94.24     200     97.50
.__traadv_tvd_NMOD_t   7.8       14.98      109.22     200     74.90
.__zdftke_NMOD_tke_t   6.1       11.81      121.03     200     59.05
.__traadv_tvd_NMOD_n   4.6        8.91      139.55     400     22.27

It was done in
Note that timing is not needed to change the dynamic allocation; it is just an opportunity, since we are editing all the routines anyway.

1.1 Test

The implementation has been done in the GYRE configuration and works well. An important aspect of the implementation: each routine starts with a call to timing_start and ends with the matching call to timing_stop. If those calls are not correct, the model hangs with no error message. This is probably related to the pointers used in the timing routine.

2- Auto-assignment

Instead of choosing the number of a work array by hand, we introduce, for each type of work array, an array of structures with an associated counter:

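   TYPE work_space_3d
     LOGICAL ::  in_use                                 ! is this work array currently taken?
     REAL(wp), DIMENSION(:,:,:), POINTER :: wrk         ! the work array itself
   END TYPE work_space_3d
   TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces) :: s_wrk_3d   ! pool of 3d work arrays
   INTEGER :: n_wrk_3d                                  ! associated counter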

Then in each routine, we declare local arrays as pointers

  REAL(wp), DIMENSION (:,:,:), POINTER ::   zwi, zwz

We then call the subroutine nemo_allocate, which points to a work array and increments the counter, and nemo_deallocate, which decrements it:

 CALL nemo_allocate(zwi)     ! begin routine
 CALL nemo_deallocate(zwi)     ! end routine
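A minimal sketch of how such a pair of routines could work is given below. It is an assumption-laden illustration, not the content of wrk_nemo_2: the module name wrk_sketch, the routine wrk_init and the toy values of jpi, jpj, jpk and num_3d_wrkspaces are hypothetical, only the 3d case is shown (no generic interface), and work arrays are assumed to be released in the reverse order of their assignment.

```fortran
MODULE wrk_sketch
   ! Hypothetical sketch: a fixed pool of 3d work arrays handed out by a counter
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp  = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 4, jpj = 3, jpk = 2     ! toy domain size
   INTEGER, PARAMETER :: num_3d_wrkspaces = 3          ! hard-coded pool size

   TYPE work_space_3d
      LOGICAL :: in_use = .FALSE.
      REAL(wp), DIMENSION(:,:,:), POINTER :: wrk => NULL()
   END TYPE work_space_3d

   TYPE(work_space_3d), DIMENSION(num_3d_wrkspaces) :: s_wrk_3d
   INTEGER :: n_wrk_3d = 1                             ! index of the next free work array

CONTAINS

   SUBROUTINE wrk_init
      ! Allocate the whole pool once, at initialisation
      INTEGER :: jn
      DO jn = 1, num_3d_wrkspaces
         ALLOCATE( s_wrk_3d(jn)%wrk(jpi,jpj,jpk) )
      END DO
   END SUBROUTINE wrk_init

   SUBROUTINE nemo_allocate( ptab )
      ! Point ptab to the next free work array and increment the counter
      REAL(wp), DIMENSION(:,:,:), POINTER :: ptab
      IF( n_wrk_3d > num_3d_wrkspaces )   STOP 'no 3d work array left'
      s_wrk_3d(n_wrk_3d)%in_use = .TRUE.
      ptab => s_wrk_3d(n_wrk_3d)%wrk
      n_wrk_3d = n_wrk_3d + 1
   END SUBROUTINE nemo_allocate

   SUBROUTINE nemo_deallocate( ptab )
      ! Release the most recently assigned work array and decrement the counter
      REAL(wp), DIMENSION(:,:,:), POINTER :: ptab
      n_wrk_3d = n_wrk_3d - 1
      s_wrk_3d(n_wrk_3d)%in_use = .FALSE.
      NULLIFY( ptab )
   END SUBROUTINE nemo_deallocate

END MODULE wrk_sketch

PROGRAM demo_auto_assign
   USE wrk_sketch
   IMPLICIT NONE
   REAL(wp), DIMENSION(:,:,:), POINTER :: zwi => NULL(), zwz => NULL()

   CALL wrk_init
   CALL nemo_allocate( zwi )                  ! begin routine
   CALL nemo_allocate( zwz )
   zwi(:,:,:) = 1._wp
   zwz(:,:,:) = 2._wp * zwi(:,:,:)
   WRITE(*,*) 'sum(zwz) =', SUM( zwz )
   CALL nemo_deallocate( zwz )                ! end routine, reverse order
   CALL nemo_deallocate( zwi )

   ! Leak check: once everything is released the counter is back to one
   IF( n_wrk_3d /= 1 )   STOP 'work array leak'
   WRITE(*,*) 'counter back to', n_wrk_3d
END PROGRAM demo_auto_assign
```

In this sketch the counter always designates the next free slot, so a routine never needs to know which work arrays its callers are using.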

It was coded in wrk_nemo_2 and applied as a test in traadv_tvd. To avoid changing all routines before a definitive choice is made, we keep the old way and duplicate wrk_nemo in wrk_nemo_2 (later saved as wrk_nemo_2_simple), so we declare double the amount of memory; this would of course not be the case if the approach were implemented in all routines.
To avoid memory leaks we could check, for instance at the end of step, that each counter is equal to one.
This was implemented here:
and wrk_nemo_2 is there

3- Dynamic dynamic memory

The point here is to avoid hard-coding the maximum number of work arrays potentially in use, and to optimise the memory size, especially for memory-hungry applications (biogeochemistry, assimilation).
Here again, as a preliminary test, the implementation is done in wrk_nemo_2.F90 and the previous one is renamed wrk_nemo_2_simple.
For each type of work array, we use an associated linked list to build the work arrays needed. When we exit a routine, we do not destroy the work arrays created; we just point back to the beginning of the list. If the following routine needs one more array, we just add an element to the chain. At the end of the first time step we should therefore have built exactly the total amount of memory needed.

   TYPE work_space_3d
     LOGICAL ::  in_use
     INTEGER :: indic
     REAL(wp), DIMENSION(:,:,:), POINTER :: wrk
     TYPE (work_space_3d), POINTER :: next => NULL()    ! next element of the chain
     TYPE (work_space_3d), POINTER :: prev => NULL()    ! previous element of the chain
   END TYPE work_space_3d
   TYPE(work_space_3d), POINTER :: s_wrk_3d_root, s_wrk_3d

Then, in the same way as above, we declare local arrays as pointers in each routine:

 REAL(wp), DIMENSION (:,:,:), POINTER ::   zwi, zwz

 CALL nemo_allocate(zwi)     ! begin routine
 CALL nemo_deallocate(zwi)     ! to come back 
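The growing chain can be sketched as follows. Again this is only an illustration under stated assumptions, not the code of wrk_nemo_2.F90: the module name wrk_list_sketch, the helper new_element, the counter n_built and the toy domain size are hypothetical, s_wrk_3d is taken to point to the last element in use, and each nemo_deallocate steps back one element instead of jumping straight to the root.

```fortran
MODULE wrk_list_sketch
   ! Hypothetical sketch: work arrays kept in a doubly linked list that grows on demand
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp  = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 4, jpj = 3, jpk = 2     ! toy domain size

   TYPE work_space_3d
      LOGICAL :: in_use = .FALSE.
      INTEGER :: indic  = 0                            ! here: rank of creation
      REAL(wp), DIMENSION(:,:,:), POINTER :: wrk => NULL()
      TYPE(work_space_3d), POINTER :: next => NULL()
      TYPE(work_space_3d), POINTER :: prev => NULL()
   END TYPE work_space_3d

   TYPE(work_space_3d), POINTER :: s_wrk_3d_root => NULL()   ! first element of the chain
   TYPE(work_space_3d), POINTER :: s_wrk_3d      => NULL()   ! last element in use
   INTEGER :: n_built = 0                                    ! number of arrays built so far

CONTAINS

   FUNCTION new_element() RESULT( s )
      ! Build one more element of the chain, with its work array
      TYPE(work_space_3d), POINTER :: s
      ALLOCATE( s )
      ALLOCATE( s%wrk(jpi,jpj,jpk) )
      n_built = n_built + 1
      s%indic = n_built
   END FUNCTION new_element

   SUBROUTINE nemo_allocate( ptab )
      ! Reuse the next element of the chain, building it only if it does not exist yet
      REAL(wp), DIMENSION(:,:,:), POINTER :: ptab
      TYPE(work_space_3d), POINTER :: s_new
      IF( .NOT. ASSOCIATED( s_wrk_3d ) ) THEN          ! nothing in use: start from the root
         IF( .NOT. ASSOCIATED( s_wrk_3d_root ) )   s_wrk_3d_root => new_element()
         s_new => s_wrk_3d_root
      ELSE
         IF( .NOT. ASSOCIATED( s_wrk_3d%next ) ) THEN  ! end of chain: add an element
            s_wrk_3d%next      => new_element()
            s_wrk_3d%next%prev => s_wrk_3d
         END IF
         s_new => s_wrk_3d%next
      END IF
      s_new%in_use = .TRUE.
      s_wrk_3d => s_new
      ptab     => s_wrk_3d%wrk
   END SUBROUTINE nemo_allocate

   SUBROUTINE nemo_deallocate( ptab )
      ! Step back in the chain; the element and its array are kept for later reuse
      REAL(wp), DIMENSION(:,:,:), POINTER :: ptab
      s_wrk_3d%in_use = .FALSE.
      s_wrk_3d => s_wrk_3d%prev                        ! disassociated once the root is released
      NULLIFY( ptab )
   END SUBROUTINE nemo_deallocate

END MODULE wrk_list_sketch

PROGRAM demo_dynamic
   USE wrk_list_sketch
   IMPLICIT NONE
   REAL(wp), DIMENSION(:,:,:), POINTER :: za => NULL(), zb => NULL(), zc => NULL()
   INTEGER :: jt

   DO jt = 1, 2                                        ! two "time steps"
      CALL nemo_allocate( za )   ;   CALL nemo_allocate( zb )       ! routine needing 2 arrays
      za(:,:,:) = 1._wp          ;   zb(:,:,:) = za(:,:,:) + 1._wp
      CALL nemo_deallocate( zb ) ;   CALL nemo_deallocate( za )

      CALL nemo_allocate( za )   ;   CALL nemo_allocate( zb )       ! routine needing 3 arrays
      CALL nemo_allocate( zc )
      zc(:,:,:) = 0._wp
      CALL nemo_deallocate( zc ) ;   CALL nemo_deallocate( zb )
      CALL nemo_deallocate( za )
      WRITE(*,*) 'step', jt, ': work arrays built so far =', n_built
   END DO

   ! The chain stops growing after the first step: 3 arrays cover both routines
   IF( n_built /= 3 )   STOP 'unexpected number of work arrays'
END PROGRAM demo_dynamic
```

The second time step builds nothing new, which is the property stated above: after the first time step the chain holds exactly the number of work arrays the run needs.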

At this point, this looks very nice, but I wonder whether we shouldn't simply use a standard dynamic memory implementation instead of trying to do complicated things. I don't have the knowledge to answer. I guess the answer can be formulated in terms of performance:

  • is it more expensive to allocate/deallocate at each call in the standard way,
  • or to CALL a subroutine that points to an existing, already allocated array?
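One way to get a first feeling for this question is a small stand-alone timing test such as the sketch below. Everything in it is hypothetical (array size, number of calls, routine names), and the result depends strongly on the compiler, its options and the machine, so it can only hint at the answer for the real model.

```fortran
PROGRAM alloc_vs_pointer
   ! Hypothetical micro-benchmark: ALLOCATE/DEALLOCATE at each call
   ! versus pointing to an array allocated once
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp  = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 64, jpj = 64, jpk = 31   ! toy domain size
   INTEGER, PARAMETER :: ncall = 2000                   ! number of calls timed
   REAL(wp), DIMENSION(:,:,:), ALLOCATABLE, TARGET :: pool
   REAL(wp) :: t0, t1, t2, zsum
   INTEGER  :: jn

   ALLOCATE( pool(jpi,jpj,jpk) )                        ! allocated once, before timing
   zsum = 0._wp

   CALL CPU_TIME( t0 )
   DO jn = 1, ncall
      CALL work_standard( jn, zsum )
   END DO
   CALL CPU_TIME( t1 )
   DO jn = 1, ncall
      CALL work_pooled( jn, zsum )
   END DO
   CALL CPU_TIME( t2 )

   WRITE(*,'(a,f10.4)') 'allocate/deallocate at each call (s) : ', t1 - t0
   WRITE(*,'(a,f10.4)') 'pointer to preallocated array    (s) : ', t2 - t1
   WRITE(*,*) 'checksum (keeps the work from being optimised away) :', zsum

CONTAINS

   SUBROUTINE work_standard( kn, psum )
      ! Standard way: the work array is allocated and freed at each call
      INTEGER , INTENT(in   ) :: kn
      REAL(wp), INTENT(inout) :: psum
      REAL(wp), DIMENSION(:,:,:), ALLOCATABLE :: zw
      ALLOCATE( zw(jpi,jpj,jpk) )
      zw(:,:,:) = REAL( kn, wp )
      psum = psum + zw(1,1,1)
      DEALLOCATE( zw )
   END SUBROUTINE work_standard

   SUBROUTINE work_pooled( kn, psum )
      ! Pool way: the work array is a pointer to memory that already exists
      INTEGER , INTENT(in   ) :: kn
      REAL(wp), INTENT(inout) :: psum
      REAL(wp), DIMENSION(:,:,:), POINTER :: zw
      zw => pool
      zw(:,:,:) = REAL( kn, wp )
      psum = psum + zw(1,1,1)
      NULLIFY( zw )
   END SUBROUTINE work_pooled

END PROGRAM alloc_vs_pointer
```

Both routines do the same arithmetic on the same amount of data, so any difference between the two timings comes from the way the work array is obtained.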

I also have some concerns about future evolution. If we imagine having a large variety of arrays (not only jpi,jpj,jpk), it could become hard to maintain.

Anyway, an example can be found there

and wrk_nemo_2 is there


Testing could consider (where appropriate) other configurations in addition to NVTK.

NVTK Tested '''YES/NO'''
Other model configurations '''YES/NO'''
Processor configurations tested [ Enter processor configs tested here ]
If adding new functionality please confirm that the
New code doesn't change results when it is switched off
and ''works'' when switched on

(Answering UNSURE is likely to generate further questions from reviewers.)

Testing dynamical memory:

A sequence of tests has been done on Power6 (vargas) and titane (Bull novascale) in order to compare the 3 ways of coding the dynamic allocation.
The GYRE configuration has been used, and the interface for the new dynamic allocation has been coded for this configuration.
Testing has been done for 2 domain sizes (CFG=24 and CFG=96, equivalent to global ¼°).

For all tests, the model runs properly, and elapsed and CPU times are equivalent for the 3 solutions for a given configuration. Since the interface is identical for the 2 new routines, it seems reasonable to implement the best solution, i.e. the last one, which optimises memory.

Implementation for GYRE took around 2 days. Detailed results of the tests are available in the attached document.

  • Processor configurations tested
  • etc.

Bit Comparability

Does this change preserve answers in your tested standard configurations (to the last bit) ? '''YES/NO '''
Does this change bit compare across various processor configurations. (1xM, Nx1 and MxN are recommended) '''YES/NO'''
Is this change expected to preserve answers in all possible model configurations? '''YES/NO'''
Is this change expected to preserve all diagnostics?
''Preserving answers in model runs does not necessarily imply preserved diagnostics.''

If you answered '''NO''' to any of the above, please provide further details:

  • Which routine(s) are causing the difference?
  • Why the changes are not protected by a logical switch or new section-version
  • What is needed to achieve regression with the previous model release (e.g. a regression branch, hand-edits etc). If this is not possible, explain why not.
  • What do you expect to see occur in the test harness jobs?
  • Which diagnostics have you altered and why have they changed? Please add details here…

System Changes

Does your change alter namelists? '''YES/NO '''
Does your change require a change in compiler options? '''YES/NO '''

If any of these apply, please document the changes required here…….


''Please ''summarize'' any changes in runtime or memory use caused by this change……''

IPR issues

Has the code been wholly (100%) produced by NEMO developers staff working exclusively on NEMO? '''YES/ NO '''

If No:

  • Identify the collaboration agreement details
  • Ensure the code routine header is in accordance with the agreement (Copyright/Redistribution etc.). Add further details here if required…
Last modified on 2011-11-08T13:41:45+01:00

Attachments (1)
