New URL for NEMO forge!   http://forge.nemo-ocean.eu

Since March 2022 along with NEMO 4.2 release, the code development moved to a self-hosted GitLab.
This present forge is now archived and remained online for history.
2011WP/2011Stream2/DynamicMemory – NEMO
wiki:2011WP/2011Stream2/DynamicMemory

Version 9 (modified by trackstand, 14 years ago) (diff)

--

Discussion of coding approach/style for dynamic memory

Last edited Timestamp?

S2-1 : Andrew P. : opening of the discusion

As a basis for the discussion, here's how I've currently coded NEMO (v.3.2) to use dynamic memory. I've used ''...'' to indicate places where I've missed-out chunks of code for clarity/brevity.

Highest-level changes are in OPA_SRC/opa.F90 and the routine opa_init():

#if defined key_mpp_dyndist
      ! Decide on size of grid now that we have our communicator size
      ! If we're not using dynamic memory then mpp_partition does nothing.

#if   defined key_mpp_mpi   ||   defined key_mpp_shmem
      CALL opa_partition()
#else
      jpni = 1
      jpnj = 1
      jpnij = jpni*jpnj
#endif
      ! Calculate domain dimensions given calculated jpi and jpj
      jpi = ( jpiglo-2*jpreci + (jpni-1) ) / jpni + 2*jpreci !: first  dim.
      jpj = ( jpjglo-2*jprecj + (jpnj-1) ) / jpnj + 2*jprecj !: second dim.
      jpim1 = jpi-1                                          !: inner domain indices
      jpjm1 = jpj-1                                          !:   "           "
      jpkm1 = jpk-1                                          !:   "           "
      jpij  = jpi*jpj                                        !:  jpi x j

      ! Now we know the dimensions of the grid, allocate arrays
      CALL opa_alloc()
#endif

The inclusion of the new code is currently controlled by the cpp key 'key_mpp_dyndist' but it would greatly improve the cleanliness of the code if we make a complete break from the static-memory version and thus can drop the use of this cpp key. My understanding of the conclusion of the meeting in Southampton was that this is what we're going to do.

This addition to opa_init() calls two further new routines, opa_partition() and opa_alloc():

   SUBROUTINE opa_partition
     USE par_oce
     IMPLICIT none
     INTEGER, PARAMETER :: nfactmax = 20
     INTEGER :: nfact ! The no. of factors returned
     INTEGER :: ierr  ! Error flag
     INTEGER :: i
     INTEGER :: idiff, mindiff, imin ! For choosing pair of factors that are
                                     ! closest in value
     INTEGER, DIMENSION(nfactmax) :: ifact ! Array of factors
     ierr = 0

#if ! defined key_mpp_dyndist
     RETURN ! If we aren't using dynamic memory then jpnj and jpni are set
            ! at compile time
#else

     CALL factorise(ifact, nfactmax, nfact, mppsize, ierr)

     IF(nfact <= 1)THEN
        WRITE (numout, *) 'WARNING: factorisation of number of PEs failed'
        WRITE (numout, *) '       : using grid of ',mppsize,' x 1'
        jpnj = 1
        jpni = mppsize
     ELSE
        ! Search through factors for the pair that are closest in value
        mindiff = 1000000
        imin    = 1
        DO i=1,nfact-1,2
           idiff = ABS(ifact(i) - ifact(i+1))
           IF(idiff < mindiff)THEN
              mindiff = idiff
              imin = i
           END IF
        END DO
        jpnj = ifact(imin)
        jpni = ifact(imin + 1)
     ENDIF
     jpnij = jpni*jpnj

     WRITE(*,*) 'ARPDBG: jpni = ',jpni,'jpnj = ',jpnj,'jpnij = ',jpnij

#endif

   END SUBROUTINE opa_partition

where factorise() returns the prime factors of its argument.

opa_alloc() oversees the allocation of all of the module-level arrays:

   SUBROUTINE opa_alloc
#if defined key_mpp_dyndist

#if defined key_lim2
     USE limwri_2,   ONLY: lim_wri_2_malloc
     USE limdmp_2,   ONLY: lim_dmp_2_malloc
     USE limhdf_2,   ONLY: lim_hdf_2_malloc
     USE ice_2,      ONLY: ice_2_malloc
     USE limsbc_2,   ONLY: lim_sbc_2_malloc
     USE thd_ice_2,  ONLY: thd_ice_2_malloc
     USE limdia_2,   ONLY: lim_dia_2_malloc
     USE dom_ice_2,  ONLY: dom_ice_2_malloc
#endif
     USE ldfslp,     ONLY: ldf_slp_malloc
     USE ldftra_oce, ONLY: ldftra_oce_malloc
     USE ldfdyn_oce, ONLY: ldfdyn_oce_malloc
#if defined key_vvl
     USE domvvl,     ONLY: dom_vvl_malloc
#endif
     USE dom_oce,    ONLY: dom_oce_malloc
...
...
     IMPLICIT none
     INTEGER, PARAMETER           :: NUMCALLS = 58
     INTEGER, DIMENSION(NUMCALLS) :: ierr
     INTEGER                      :: i, icount

     ierr = 0
     icount = 1

#if defined key_lim2
     ierr = ierr + lim_wri_2_malloc()
     ierr = ierr + lim_dmp_2_malloc()
     ierr = ierr + lim_hdf_2_malloc()
     ierr = ierr + ice_2_malloc()
     ierr = ierr + lim_sbc_2_malloc()
     ierr = ierr + thd_ice_2_malloc()
     ierr = ierr + lim_dia_2_malloc()
     ierr = ierr + dom_ice_2_malloc()
#endif
     ierr = ierr + ldf_slp_malloc()
     ierr = ierr + ldftra_oce_malloc()
     ierr = ierr + ldfdyn_oce_malloc()
#if defined key_vvl
     ierr = ierr + dom_vvl_malloc()
#endif
     ierr = ierr + dom_oce_malloc()
...
...
! Should do an MPI_SUM on ierr and then everyone can MPI_FINALIZE() if the
! memory allocation has failed on any one PE
     IF(ierr > 0)THEN
        WRITE(numout,*) 
        WRITE(numout,*) 'ERROR: Allocation of memory failed in opa_alloc'
        STOP 'ERROR: Allocation of memory failed in opa_alloc'
     END IF
#else
     RETURN ! Not using dynamic memory therefore do nothing
#endif

   END SUBROUTINE opa_alloc

Each of the modules must then be edited and the declaration of arrays with dimensions that depend upon jpi or jpj be changed. e.g. for OPA_SRC/DOM/dom_oce.F90:

MODULE dom_oce
   !!======================================================================
   !!                       ***  MODULE dom_oce  ***
   !!       
   !! ** Purpose :   Define in memory all the ocean space domain variables
   !!======================================================================
   !! History :  1.0  ! 2005-10  (A. Beckmann, G. Madec)  reactivate s-coordinate 
   !!----------------------------------------------------------------------
   !! NEMO/OPA 3.2 , LOCEAN-IPSL (2009) 
   !! $Id: dom_oce.F90 1601 2009-08-11 10:09:19Z ctlod $ 
   !! Software governed by the CeCILL licence (modipsl/doc/NEMO_CeCILL.txt)
   !!----------------------------------------------------------------------
   USE par_oce      ! ocean parameters

   IMPLICIT NONE
...
...
#if defined key_mpp_dyndist
   REAL(wp), PUBLIC, ALLOCATABLE, SAVE, DIMENSION(:,:,:) :: & !: 
#else
   REAL(wp), PUBLIC,              DIMENSION(jpi,jpj,jpk) :: & !:
#endif
      tmask, umask,    &  !: land/ocean mask at T-, U-, V- and F-points
      vmask, fmask        !:

   REAL(wp), PUBLIC, DIMENSION(jpiglo) ::   tpol, fpol          !: north fold mask (nperio= 3 or 4)

#if defined key_noslip_accurate
   INTEGER, PUBLIC,                    DIMENSION(4,jpk) ::   npcoa   !: ???
#if defined key_mpp_dyndist
   INTEGER, PUBLIC, ALLOCATABLE, SAVE, DIMENSION(:,:,:) ::   nicoa, njcoa        !: ???
#else
   INTEGER, PUBLIC,        DIMENSION(2*(jpi+jpj),4,jpk) ::   nicoa, njcoa        !: ???
#endif
                                         
#endif

   !!----------------------------------------------------------------------
   !! agrif domain
   !!----------------------------------------------------------------------
#if defined key_agrif
   LOGICAL, PUBLIC, PARAMETER ::   lk_agrif = .TRUE.    !: agrif flag
#else
   LOGICAL, PUBLIC, PARAMETER ::   lk_agrif = .FALSE.   !: agrif flag
#endif


#if defined key_mpp_dyndist
   PUBLIC dom_oce_malloc
#endif

#if defined key_mpp_dyndist

CONTAINS

  FUNCTION dom_oce_malloc()
    USE par_oce, Only: jpi, jpj, jpk, jpnij
    IMPLICIT none
    INTEGER :: dom_oce_malloc
    INTEGER :: ierr, ierr_tmp
    
    ierr = 0
...
...
    ALLOCATE(tmask(jpi,jpj,jpk), umask(jpi,jpj,jpk),    & 
             vmask(jpi,jpj,jpk), fmask(jpi,jpj,jpk), Stat=ierr_tmp)
    ierr = ierr + ierr_tmp

#if defined key_noslip_accurate
    ALLOCATE(nicoa(2*(jpi+jpj),4,jpk), njcoa(2*(jpi+jpj),4,jpk), &
             Stat=ierr_tmp)
    ierr = ierr + ierr_tmp
#endif

    IF(ierr != 0)THEN
      WRITE() "Some error message detailing which module's ..._malloc() routine failed"
    END IF

    ! Error status passed back up
    dom_oce_malloc = ierr

  END FUNCTION dom_oce_malloc
#endif

Finally (I think), there's the issue of work-space or local arrays. These are arrays local to subroutines that are declared with their dimensions in terms of jpi and jpj. Since jpi and jpj are no longer parameters, these arrays are now 'automatic' -- that is, they are allocated on the stack when the program enters the subroutine and free'd again when it leaves. Running out of stack memory will cause the program to abort abruptly and therefore it is more controlled to use allocate for big workspace arrays. (Although the amount of available stack can be set to 'unlimited' by using the shell's ulimit command, what limit this actually results in will be OS/set-up dependent.) I suggest therefore that we edit the code such that any large (3d or higher?) workspace arrays are made explicitly allocatable and then allocated the first time the subroutine is called. If we declare them with the SAVE attribute then the allocated memory will remain available to us in future calls, e.g.:

   SUBROUTINE dia_ptr( kt )
      !!----------------------------------------------------------------------
      !!                  ***  ROUTINE dia_ptr  ***
      !!----------------------------------------------------------------------
      IMPLICIT none
      INTEGER, INTENT(in) ::   kt   ! ocean time step index
      !!
      INTEGER  ::   jk, jj, ji   ! dummy loop
      REAL(wp) ::   zsverdrup    ! conversion from m3/s to Sverdrup
      REAL(wp) ::   zpwatt       ! conversion from W    to PW
      REAL(wp) ::   zggram       ! conversion from g    to Pg
#if defined key_mpp_dyndist
      REAL(wp), ALLOCATABLE, SAVE, DIMENSION(:,:,:) :: vt, vs   ! 3D workspace
#else
      REAL(wp),              DIMENSION(jpi,jpj,jpk) :: vt, vs   ! 3D workspace
#endif
      !!----------------------------------------------------------------------

      IF( kt == nit000 .OR. MOD( kt, nf_ptr ) == 0 )   THEN

#if defined key_mpp_dyndist
         IF(.not. ALLOCATED(vt))THEN
            ALLOCATE(vt(jpi,jpj,jpk), vs(jpi,jpj,jpk), Stat=ji)
            IF(ji .ne. 0)THEN
               ! Need an interface to MPI_Abort() here really
               STOP 'ERROR: Allocation of vt/vs failed in dia_ptr'
            END IF
         END IF
#endif
...
...

If an allocate() is done in this way then handling any failure is difficult since there's no guarantee that the allocate() call will have failed for all MPI processes. Therefore, I think the best bet is to print out an error message and then do an MPI_Abort on MPI_COMM_WORLD.


S2-2 : Richard H. comments

I agree with the sentiment about dropping 'key_mpp_dyndist' in favour of only supporting the dynamic memory code. (On the basis that proliferation of cpp keys makes maintenance and development difficult in the long term and implies the need to test model developments using equivalent configurations under both static and dynamic configurations).

Re the local array work space. Agree that we can't rely on ulimit. With regard to allocating and saving large workspace arrays, would it be viable to allocate space for these in some generic sense at the start of the run rather than locally within each subroutine or code area? That might give us the opportunity of recycling the same space rather than allocating space specifically for each subroutine or code area. It might, however imply the need to pass the generic array around through argument lists e.g.:

    Declare/allocate GEN_WORK_ARRAY1(jpi,jpj,jpk), GEN_WORK_ARRAY2(jpi,jpj,jpk), etc. somewhere during model initialisation

    bla bla bla
    "   "    "
    CALL SOME_ROUTINE_X(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc)
    "   "    "
    CALL SOME_ROUTINE_Y(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc)
    "   "    "
    CALL SOME_ROUTINE_Z(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc)

Then in the routines we have:

    SUBROUTINE SOME_ROUTINE_X(local_work_array_X1, local_work_array_X2, etc)
    bla bla bla
    "   "   "     
    END SUBROUTINE    

    SUBROUTINE SOME_ROUTINE_Y(local_work_array_Y1, local_work_array_Y2, etc)
    bla bla bla
    "   "   "     
    END SUBROUTINE    
    SUBROUTINE SOME_ROUTINE_Z(local_work_array_Z1, local_work_array_Z2, etc)
    bla bla bla
    "   "   "     
    END SUBROUTINE    

Anyway, just a thought.

Aborting using MPI_COMM_WORLD is particularly pertinent to coupled (OASIS based) models (otherwise things just tend to dangle).


S2-3 : Gurvan M. comments

  • Definitively, we have to make a complete break from static-memory version. The key_mpp_dyndist should disappear. We have all agreed on that at the developer committee.
  • A namelist (namcfg, cfg stands for ConFiGuration?) will provide the domain size (jpiglo, jpjglo, jpk) and the total number of processors and the cutting in i and j direction (jpnij, jpni, jpnj), as well as the configuration name and resolution (cp_cfg, jp_cfg) etc... In fact all the information given in par_oce_...h90
  • Obviously the full dynamical allocation will result in the suppression of almost all CPP keys. A priori, the only exceptions are the keys related to the CPP "substitute" (vertical coordinate, eddy coefficient 2D and 3D...). Please, for the first implementation, do not suppress the keys. It should be done as a second step. This will require to significantly change the namelists and the OPA documentation... In other word, this will be a version 4.0 not simply a 3.4....
  • Note that since v3.3 there is a step_oce.F90 module corresponding to almost all modules used in step.F90. A simple USE step_oce in opa_alloc will simplify the routine.
  • Note also that, for modularity reasons, the sea-ice and biogeochemical tracers should not be allocated by the opa_alloc routine. Instead, a lim_alloc can be created in LIM_SRC_3 (lim_alloc_2 in LIM_SRC_2) called by sbc_init (itself called in opa_init) if sea-ice is activated and trc_alloc (or even pisces_alloc, lobster_alloc etc for TOP_SRC...) called by trc_init.
  • issue of work-space or local arrays:

In my opinion, we can simply return back to what was done in earlier versions of OPA (v1.0 to v6.0 !!). Declare and allocate one for all 4 3D work arrays, and 4 2D wok arrays. Then use them as workspace in the subroutines. I say 4, as ti was sufficient in those release. Currently, some more can be required, and with the Griffies operator and the merge of TRA and TRC routines some 4D local arrays have appeared arrays.

We can check in the code the maximum number of 4D, 3D and 2D arrays are required to decide the exact number. It should not be that large.

Note that such a technique is already used in some modules.For example in zdftke, I use the fact that after field (ua, va, ta, sa) are only used in the momentum and tracer part, so that in the computation of the physics there are considered as workspace.

So what I suggest a new module wrk_nemo (_nemo since it will be probably used in OPA, LIM, CICE, TOP...) :

MODULE wrk_nemo
   !!======================================================================
   !!                       ***  MODULE  wrk_nemo  ***
   !! NEMO work space:  define and allocated work space arrays used in all component of NEMO
   !!=====================================================================
   !! History :  4.0  !  2011-01  (xxxx)  Original code
   !!----------------------------------------------------------------------

   !!----------------------------------------------------------------------
   !!   wrk_malloc    : update momentum and tracer Kz from a tke scheme
   !!----------------------------------------------------------------------
   USE par_oce        ! ocean parameters
   USE in_out_manager ! I/O manager

   IMPLICIT NONE
   PRIVATE

   PUBLIC   wrk_malloc   ! routine called in opa module (opa_init routine)

   LOGICAL , PUBLIC, PARAMETER              ::   lk_zdftke = .TRUE.  !: TKE vertical mixing flag

   REAL(wp), DIMENSION(:,:)    , PUBLIC ::   wrk_2d_1, wrk_2d_1, ...   !: 2D workspace
   REAL(wp), DIMENSION(:,:,:)  , PUBLIC ::   wrk_3d_1, wrk_3d_1, ...   !: 3D workspace
   REAL(wp), DIMENSION(:,:,:,:), PUBLIC ::   wrk_4d_1, wrk_4d_1, ...   !: 4D workspace

   !! * Substitutions
#  include "domzgr_substitute.h90"
#  include "vectopt_loop_substitute.h90"
   !!----------------------------------------------------------------------
   !! NEMO/OPA 4.0 , NEMO Consortium (2010)
   !! $Id$
   !! Software governed by the CeCILL licence     (NEMOGCM/NEMO_CeCILL.txt)
   !!----------------------------------------------------------------------
CONTAINS

   SUBROUTINE wrk_malloc
      !!----------------------------------------------------------------------
      !!                   ***  ROUTINE wrk_malloc  ***
      !!
      !! ** Purpose :   Define in memory one for all the NEMO 2D, 3D and 4d work space arrays
      !!----------------------------------------------------------------------
      INTEGER :: ierror   ! local integer
      !!----------------------------------------------------------------------
      !
      ALLOCATE(wrk_2d_1(jpi,jpj)           , wrk_2d_1(jpi,jpj)          , ...     & 
         &     wrk_3d_1(jpi,jpj,jpk)       , wrk_3d_1(jpi,jpj,jpk)      , ...     &
         &     wrk_4d_1(jpi,jpj,jpk, jpts) , wrk_4d_1(jpi,jpj,jpk,jpts) , ...     , Stat=ierror )
      !
      IF( ierror /= 0 )   CALL ctl_stop( 'wrk_malloc: unable to allocate work arrays' )
      !    
   END SUBROUTINE wrk_malloc

   !!======================================================================
END MODULE wrk_nemo

Then, your example of dia_ptr routine becomes:

   SUBROUTINE dia_ptr( kt )
      !!----------------------------------------------------------------------
      !!                  ***  ROUTINE dia_ptr  ***
      ...
      !!----------------------------------------------------------------------
      USE wrk_nemo,   vt  =>   wrk_3D_1   ! use ua as workspace
      USE wrk_nemo,   vs  =>   wrk_3D_2   ! use va as workspace
      !
      INTEGER, INTENT(in) ::   kt   ! ocean time step index
      !
      INTEGER  ::   jk, jj, ji   ! dummy loop
      REAL(wp) ::   zsverdrup    ! conversion from m3/s to Sverdrup
      REAL(wp) ::   zpwatt       ! conversion from W    to PW
      REAL(wp) ::   zggram       ! conversion from g    to Pg
      !!----------------------------------------------------------------------
      !! 
      IF( kt == nit000 .OR. MOD( kt, nf_ptr ) == 0 )   THEN

      ...
      ...

Note that in this example, I have already introduced a 'USE oce, vt => ua' ... since dia_ptr is a diagnostics, so that after arrays are available as work space.

The main DANGER of this approach is that the developer must carefull check that the work space he wants to use is not already used to store some information. In particular in case of a subroutine, sub1, calling another one, sub2, if sub2 needs a work space it will be more secured to have the work space in argument of sub2.

I think, we can impose a rule for all subroutine called a lower level than step or opa_init, the required work space have to be passed in argument.

I don't see any other drawbacks for this technique... I also think that if we carefully check and rewrote some part of the code, then hopefully, only 2D and 3D work space arrays will be necessary.


S2-4 : Italo E. comments

Hi all, I have just a couple of comments.

Re the opa_partition routine and the policy for choosing the "best" partition, I suggest to set jpni and jpnj such that the local subdomain is as much "square" as possible. Indeed the "best" performance, with the current domain decomposition, is reached when the local subdomain has a square shape. I suggest to modify the opa_patition as follows

...

    CALL factorise(ifact, nfactmax, nfact, mppsize, ierr)


    IF(nfact <= 1)THEN
       WRITE (numout, *) 'WARNING: factorisation of number of PEs failed'
       WRITE (numout, *) '       : using grid of ',mppsize,' x 1'
       jpnj = 1
       jpni = mppsize
    ELSE

       ! Search through factors for the pair that are closest in value
       mindiff = 1000000
       imin    = 1
       DO i=1,nfact-1,2
          idiff = ABS(jpjglo/ifact(i) - jpiglo/ifact(i+1))
          IF(idiff < mindiff)THEN
             mindiff = idiff
             imin = i
          END IF
       END DO
       jpnj = ifact(imin)
       jpni = ifact(imin + 1)
    ENDIF
...

Re the allocation of work arrays. The sharing of work arrays among different routines gives us the possibility to save relevant memory space; so the idea to have a module such as wrk_nemo could be useful. However the usage of those arrays could introduce several contraindications: 1. the code could be less readable; 2. when I write a new routine that calls some other already available, I must be sure that I will not use the same work arrays.

Some actions can be adopted in order to reduce the dangerously of such work arrays, but I would avoid to use routine arguments for passing work arrays. Typically the usage of work arrays is strictly related to the kind of implementation of the routine; on the other hand, the routine prototype should be as stable as possible during the refinement/optimization/modification of the routine implementation. The maintenance of the code becomes very heavy if updating the implementation of one routine implies also the modification of its prototype.

For those routines, at lower level, I suggest to declare locally their work allocatable arrays with the SAVE attribute


S2-5 : Andrew P's follow-up comments

I like Gurvan's suggestion of a module containing globally-accessible work-space arrays. We could add some error-checking functionality to this by having an 'in_use' flag for each work-space array in the module. Before using a work-space array, a developer should check that the appropriate flag is .FALSE. and if it is, set it to .TRUE. while they are using it. Once they are done using the array the flag should be set back to .FALSE.

I also note that there are different views on how to specify the domain decomposition (jpni and jpnj values). I was thinking that we would calculate them, choosing values such that the domains are as square as possible (following Italo's suggestion above) whereas Gurvan mentions reading them from a namelist file. I suggest that we support both options: by default jpni and jpnj can be calculated at run-time (this saves the user from having to change the namelist file when changing the number of processors they're running on) but this behaviour can be overridden by specifying them in the namelist file.

(Note that this choice of domain decomposition and the shape of the sub-domains forms a part of the optimisation project presented by Stephen Pickles at the meeting in Southampton.)


S2-x : XXX' comments


Attachments (1)

Download all attachments as: .zip