Version 13 (modified by trackstand, 14 years ago) (diff) |
---|
Discussion of coding approach/style for dynamic memory
Last edited Timestamp?
S2-1 : Andrew P. : opening of the discusion
As a basis for the discussion, here's how I've currently coded NEMO (v.3.2) to use dynamic memory. I've used ''...'' to indicate places where I've missed-out chunks of code for clarity/brevity.
Highest-level changes are in OPA_SRC/opa.F90 and the routine opa_init():
#if defined key_mpp_dyndist ! Decide on size of grid now that we have our communicator size ! If we're not using dynamic memory then mpp_partition does nothing. #if defined key_mpp_mpi || defined key_mpp_shmem CALL opa_partition() #else jpni = 1 jpnj = 1 jpnij = jpni*jpnj #endif ! Calculate domain dimensions given calculated jpi and jpj jpi = ( jpiglo-2*jpreci + (jpni-1) ) / jpni + 2*jpreci !: first dim. jpj = ( jpjglo-2*jprecj + (jpnj-1) ) / jpnj + 2*jprecj !: second dim. jpim1 = jpi-1 !: inner domain indices jpjm1 = jpj-1 !: " " jpkm1 = jpk-1 !: " " jpij = jpi*jpj !: jpi x j ! Now we know the dimensions of the grid, allocate arrays CALL opa_alloc() #endif
The inclusion of the new code is currently controlled by the cpp key 'key_mpp_dyndist' but it would greatly improve the cleanliness of the code if we make a complete break from the static-memory version and thus can drop the use of this cpp key. My understanding of the conclusion of the meeting in Southampton was that this is what we're going to do.
This addition to opa_init() calls two further new routines, opa_partition() and opa_alloc():
SUBROUTINE opa_partition USE par_oce IMPLICIT none INTEGER, PARAMETER :: nfactmax = 20 INTEGER :: nfact ! The no. of factors returned INTEGER :: ierr ! Error flag INTEGER :: i INTEGER :: idiff, mindiff, imin ! For choosing pair of factors that are ! closest in value INTEGER, DIMENSION(nfactmax) :: ifact ! Array of factors ierr = 0 #if ! defined key_mpp_dyndist RETURN ! If we aren't using dynamic memory then jpnj and jpni are set ! at compile time #else CALL factorise(ifact, nfactmax, nfact, mppsize, ierr) IF(nfact <= 1)THEN WRITE (numout, *) 'WARNING: factorisation of number of PEs failed' WRITE (numout, *) ' : using grid of ',mppsize,' x 1' jpnj = 1 jpni = mppsize ELSE ! Search through factors for the pair that are closest in value mindiff = 1000000 imin = 1 DO i=1,nfact-1,2 idiff = ABS(ifact(i) - ifact(i+1)) IF(idiff < mindiff)THEN mindiff = idiff imin = i END IF END DO jpnj = ifact(imin) jpni = ifact(imin + 1) ENDIF jpnij = jpni*jpnj WRITE(*,*) 'ARPDBG: jpni = ',jpni,'jpnj = ',jpnj,'jpnij = ',jpnij #endif END SUBROUTINE opa_partition
where factorise() returns the prime factors of its argument.
opa_alloc() oversees the allocation of all of the module-level arrays:
SUBROUTINE opa_alloc #if defined key_mpp_dyndist #if defined key_lim2 USE limwri_2, ONLY: lim_wri_2_malloc USE limdmp_2, ONLY: lim_dmp_2_malloc USE limhdf_2, ONLY: lim_hdf_2_malloc USE ice_2, ONLY: ice_2_malloc USE limsbc_2, ONLY: lim_sbc_2_malloc USE thd_ice_2, ONLY: thd_ice_2_malloc USE limdia_2, ONLY: lim_dia_2_malloc USE dom_ice_2, ONLY: dom_ice_2_malloc #endif USE ldfslp, ONLY: ldf_slp_malloc USE ldftra_oce, ONLY: ldftra_oce_malloc USE ldfdyn_oce, ONLY: ldfdyn_oce_malloc #if defined key_vvl USE domvvl, ONLY: dom_vvl_malloc #endif USE dom_oce, ONLY: dom_oce_malloc ... ... IMPLICIT none INTEGER, PARAMETER :: NUMCALLS = 58 INTEGER, DIMENSION(NUMCALLS) :: ierr INTEGER :: i, icount ierr = 0 icount = 1 #if defined key_lim2 ierr = ierr + lim_wri_2_malloc() ierr = ierr + lim_dmp_2_malloc() ierr = ierr + lim_hdf_2_malloc() ierr = ierr + ice_2_malloc() ierr = ierr + lim_sbc_2_malloc() ierr = ierr + thd_ice_2_malloc() ierr = ierr + lim_dia_2_malloc() ierr = ierr + dom_ice_2_malloc() #endif ierr = ierr + ldf_slp_malloc() ierr = ierr + ldftra_oce_malloc() ierr = ierr + ldfdyn_oce_malloc() #if defined key_vvl ierr = ierr + dom_vvl_malloc() #endif ierr = ierr + dom_oce_malloc() ... ... ! Should do an MPI_SUM on ierr and then everyone can MPI_FINALIZE() if the ! memory allocation has failed on any one PE IF(ierr > 0)THEN WRITE(numout,*) WRITE(numout,*) 'ERROR: Allocation of memory failed in opa_alloc' STOP 'ERROR: Allocation of memory failed in opa_alloc' END IF #else RETURN ! Not using dynamic memory therefore do nothing #endif END SUBROUTINE opa_alloc
Each of the modules must then be edited and the declaration of arrays with dimensions that depend upon jpi or jpj be changed. e.g. for OPA_SRC/DOM/dom_oce.F90:
MODULE dom_oce !!====================================================================== !! *** MODULE dom_oce *** !! !! ** Purpose : Define in memory all the ocean space domain variables !!====================================================================== !! History : 1.0 ! 2005-10 (A. Beckmann, G. Madec) reactivate s-coordinate !!---------------------------------------------------------------------- !! NEMO/OPA 3.2 , LOCEAN-IPSL (2009) !! $Id: dom_oce.F90 1601 2009-08-11 10:09:19Z ctlod $ !! Software governed by the CeCILL licence (modipsl/doc/NEMO_CeCILL.txt) !!---------------------------------------------------------------------- USE par_oce ! ocean parameters IMPLICIT NONE ... ... #if defined key_mpp_dyndist REAL(wp), PUBLIC, ALLOCATABLE, SAVE, DIMENSION(:,:,:) :: & !: #else REAL(wp), PUBLIC, DIMENSION(jpi,jpj,jpk) :: & !: #endif tmask, umask, & !: land/ocean mask at T-, U-, V- and F-points vmask, fmask !: REAL(wp), PUBLIC, DIMENSION(jpiglo) :: tpol, fpol !: north fold mask (nperio= 3 or 4) #if defined key_noslip_accurate INTEGER, PUBLIC, DIMENSION(4,jpk) :: npcoa !: ??? #if defined key_mpp_dyndist INTEGER, PUBLIC, ALLOCATABLE, SAVE, DIMENSION(:,:,:) :: nicoa, njcoa !: ??? #else INTEGER, PUBLIC, DIMENSION(2*(jpi+jpj),4,jpk) :: nicoa, njcoa !: ??? #endif #endif !!---------------------------------------------------------------------- !! agrif domain !!---------------------------------------------------------------------- #if defined key_agrif LOGICAL, PUBLIC, PARAMETER :: lk_agrif = .TRUE. !: agrif flag #else LOGICAL, PUBLIC, PARAMETER :: lk_agrif = .FALSE. !: agrif flag #endif #if defined key_mpp_dyndist PUBLIC dom_oce_malloc #endif #if defined key_mpp_dyndist CONTAINS FUNCTION dom_oce_malloc() USE par_oce, Only: jpi, jpj, jpk, jpnij IMPLICIT none INTEGER :: dom_oce_malloc INTEGER :: ierr, ierr_tmp ierr = 0 ... ... ALLOCATE(tmask(jpi,jpj,jpk), umask(jpi,jpj,jpk), & vmask(jpi,jpj,jpk), fmask(jpi,jpj,jpk), Stat=ierr_tmp) ierr = ierr + ierr_tmp #if defined key_noslip_accurate ALLOCATE(nicoa(2*(jpi+jpj),4,jpk), njcoa(2*(jpi+jpj),4,jpk), & Stat=ierr_tmp) ierr = ierr + ierr_tmp #endif IF(ierr != 0)THEN WRITE() "Some error message detailing which module's ..._malloc() routine failed" END IF ! Error status passed back up dom_oce_malloc = ierr END FUNCTION dom_oce_malloc #endif
Finally (I think), there's the issue of work-space or local arrays. These are arrays local to subroutines that are declared with their dimensions in terms of jpi and jpj. Since jpi and jpj are no longer parameters, these arrays are now 'automatic' -- that is, they are allocated on the stack when the program enters the subroutine and free'd again when it leaves. Running out of stack memory will cause the program to abort abruptly and therefore it is more controlled to use allocate for big workspace arrays. (Although the amount of available stack can be set to 'unlimited' by using the shell's ulimit command, what limit this actually results in will be OS/set-up dependent.) I suggest therefore that we edit the code such that any large (3d or higher?) workspace arrays are made explicitly allocatable and then allocated the first time the subroutine is called. If we declare them with the SAVE attribute then the allocated memory will remain available to us in future calls, e.g.:
SUBROUTINE dia_ptr( kt ) !!---------------------------------------------------------------------- !! *** ROUTINE dia_ptr *** !!---------------------------------------------------------------------- IMPLICIT none INTEGER, INTENT(in) :: kt ! ocean time step index !! INTEGER :: jk, jj, ji ! dummy loop REAL(wp) :: zsverdrup ! conversion from m3/s to Sverdrup REAL(wp) :: zpwatt ! conversion from W to PW REAL(wp) :: zggram ! conversion from g to Pg #if defined key_mpp_dyndist REAL(wp), ALLOCATABLE, SAVE, DIMENSION(:,:,:) :: vt, vs ! 3D workspace #else REAL(wp), DIMENSION(jpi,jpj,jpk) :: vt, vs ! 3D workspace #endif !!---------------------------------------------------------------------- IF( kt == nit000 .OR. MOD( kt, nf_ptr ) == 0 ) THEN #if defined key_mpp_dyndist IF(.not. ALLOCATED(vt))THEN ALLOCATE(vt(jpi,jpj,jpk), vs(jpi,jpj,jpk), Stat=ji) IF(ji .ne. 0)THEN ! Need an interface to MPI_Abort() here really STOP 'ERROR: Allocation of vt/vs failed in dia_ptr' END IF END IF #endif ... ...
If an allocate() is done in this way then handling any failure is difficult since there's no guarantee that the allocate() call will have failed for all MPI processes. Therefore, I think the best bet is to print out an error message and then do an MPI_Abort on MPI_COMM_WORLD.
S2-2 : Richard H. comments
I agree with the sentiment about dropping 'key_mpp_dyndist' in favour of only supporting the dynamic memory code. (On the basis that proliferation of cpp keys makes maintenance and development difficult in the long term and implies the need to test model developments using equivalent configurations under both static and dynamic configurations).
Re the local array work space. Agree that we can't rely on ulimit. With regard to allocating and saving large workspace arrays, would it be viable to allocate space for these in some generic sense at the start of the run rather than locally within each subroutine or code area? That might give us the opportunity of recycling the same space rather than allocating space specifically for each subroutine or code area. It might, however imply the need to pass the generic array around through argument lists e.g.:
Declare/allocate GEN_WORK_ARRAY1(jpi,jpj,jpk), GEN_WORK_ARRAY2(jpi,jpj,jpk), etc. somewhere during model initialisation bla bla bla " " " CALL SOME_ROUTINE_X(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc) " " " CALL SOME_ROUTINE_Y(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc) " " " CALL SOME_ROUTINE_Z(GEN_WORK_ARRAY1(1,1,1,1), GEN_WORK_ARRAY2(1,1,1,2), etc)
Then in the routines we have:
SUBROUTINE SOME_ROUTINE_X(local_work_array_X1, local_work_array_X2, etc) bla bla bla " " " END SUBROUTINE SUBROUTINE SOME_ROUTINE_Y(local_work_array_Y1, local_work_array_Y2, etc) bla bla bla " " " END SUBROUTINE SUBROUTINE SOME_ROUTINE_Z(local_work_array_Z1, local_work_array_Z2, etc) bla bla bla " " " END SUBROUTINE
Anyway, just a thought.
Aborting using MPI_COMM_WORLD is particularly pertinent to coupled (OASIS based) models (otherwise things just tend to dangle).
S2-3 : Gurvan M. comments
- Definitively, we have to make a complete break from static-memory version. The key_mpp_dyndist should disappear. We have all agreed on that at the developer committee.
- A namelist (namcfg, cfg stands for ConFiGuration?) will provide the domain size (jpiglo, jpjglo, jpk) and the total number of processors and the cutting in i and j direction (jpnij, jpni, jpnj), as well as the configuration name and resolution (cp_cfg, jp_cfg) etc... In fact all the information given in par_oce_...h90
- Obviously the full dynamical allocation will result in the suppression of almost all CPP keys. A priori, the only exceptions are the keys related to the CPP "substitute" (vertical coordinate, eddy coefficient 2D and 3D...). Please, for the first implementation, do not suppress the keys. It should be done as a second step. This will require to significantly change the namelists and the OPA documentation... In other word, this will be a version 4.0 not simply a 3.4....
- Note that since v3.3 there is a step_oce.F90 module corresponding to almost all modules used in step.F90. A simple USE step_oce in opa_alloc will simplify the routine.
- Note also that, for modularity reasons, the sea-ice and biogeochemical tracers should not be allocated by the opa_alloc routine. Instead, a lim_alloc can be created in LIM_SRC_3 (lim_alloc_2 in LIM_SRC_2) called by sbc_init (itself called in opa_init) if sea-ice is activated and trc_alloc (or even pisces_alloc, lobster_alloc etc for TOP_SRC...) called by trc_init.
- issue of work-space or local arrays:
In my opinion, we can simply return back to what was done in earlier versions of OPA (v1.0 to v6.0 !!). Declare and allocate one for all 4 3D work arrays, and 4 2D wok arrays. Then use them as workspace in the subroutines. I say 4, as ti was sufficient in those release. Currently, some more can be required, and with the Griffies operator and the merge of TRA and TRC routines some 4D local arrays have appeared arrays.
We can check in the code the maximum number of 4D, 3D and 2D arrays are required to decide the exact number. It should not be that large.
Note that such a technique is already used in some modules.For example in zdftke, I use the fact that after field (ua, va, ta, sa) are only used in the momentum and tracer part, so that in the computation of the physics there are considered as workspace.
So what I suggest a new module wrk_nemo (_nemo since it will be probably used in OPA, LIM, CICE, TOP...) :
MODULE wrk_nemo !!====================================================================== !! *** MODULE wrk_nemo *** !! NEMO work space: define and allocated work space arrays used in all component of NEMO !!===================================================================== !! History : 4.0 ! 2011-01 (xxxx) Original code !!---------------------------------------------------------------------- !!---------------------------------------------------------------------- !! wrk_malloc : update momentum and tracer Kz from a tke scheme !!---------------------------------------------------------------------- USE par_oce ! ocean parameters USE in_out_manager ! I/O manager IMPLICIT NONE PRIVATE PUBLIC wrk_malloc ! routine called in opa module (opa_init routine) LOGICAL , PUBLIC, PARAMETER :: lk_zdftke = .TRUE. !: TKE vertical mixing flag REAL(wp), DIMENSION(:,:) , PUBLIC :: wrk_2d_1, wrk_2d_1, ... !: 2D workspace REAL(wp), DIMENSION(:,:,:) , PUBLIC :: wrk_3d_1, wrk_3d_1, ... !: 3D workspace REAL(wp), DIMENSION(:,:,:,:), PUBLIC :: wrk_4d_1, wrk_4d_1, ... !: 4D workspace !! * Substitutions # include "domzgr_substitute.h90" # include "vectopt_loop_substitute.h90" !!---------------------------------------------------------------------- !! NEMO/OPA 4.0 , NEMO Consortium (2010) !! $Id$ !! Software governed by the CeCILL licence (NEMOGCM/NEMO_CeCILL.txt) !!---------------------------------------------------------------------- CONTAINS SUBROUTINE wrk_malloc !!---------------------------------------------------------------------- !! *** ROUTINE wrk_malloc *** !! !! ** Purpose : Define in memory one for all the NEMO 2D, 3D and 4d work space arrays !!---------------------------------------------------------------------- INTEGER :: ierror ! local integer !!---------------------------------------------------------------------- ! ALLOCATE(wrk_2d_1(jpi,jpj) , wrk_2d_1(jpi,jpj) , ... & & wrk_3d_1(jpi,jpj,jpk) , wrk_3d_1(jpi,jpj,jpk) , ... & & wrk_4d_1(jpi,jpj,jpk, jpts) , wrk_4d_1(jpi,jpj,jpk,jpts) , ... , Stat=ierror ) ! IF( ierror /= 0 ) CALL ctl_stop( 'wrk_malloc: unable to allocate work arrays' ) ! END SUBROUTINE wrk_malloc !!====================================================================== END MODULE wrk_nemo
Then, your example of dia_ptr routine becomes:
SUBROUTINE dia_ptr( kt ) !!---------------------------------------------------------------------- !! *** ROUTINE dia_ptr *** ... !!---------------------------------------------------------------------- USE wrk_nemo, vt => wrk_3D_1 ! use ua as workspace USE wrk_nemo, vs => wrk_3D_2 ! use va as workspace ! INTEGER, INTENT(in) :: kt ! ocean time step index ! INTEGER :: jk, jj, ji ! dummy loop REAL(wp) :: zsverdrup ! conversion from m3/s to Sverdrup REAL(wp) :: zpwatt ! conversion from W to PW REAL(wp) :: zggram ! conversion from g to Pg !!---------------------------------------------------------------------- !! IF( kt == nit000 .OR. MOD( kt, nf_ptr ) == 0 ) THEN ... ...
Note that in this example, I have already introduced a 'USE oce, vt => ua' ... since dia_ptr is a diagnostics, so that after arrays are available as work space.
The main DANGER of this approach is that the developer must carefull check that the work space he wants to use is not already used to store some information. In particular in case of a subroutine, sub1, calling another one, sub2, if sub2 needs a work space it will be more secured to have the work space in argument of sub2.
I think, we can impose a rule for all subroutine called a lower level than step or opa_init, the required work space have to be passed in argument.
I don't see any other drawbacks for this technique... I also think that if we carefully check and rewrote some part of the code, then hopefully, only 2D and 3D work space arrays will be necessary.
S2-4 : Italo E. comments
Hi all, I have just a couple of comments.
Re the opa_partition routine and the policy for choosing the "best" partition, I suggest to set jpni and jpnj such that the local subdomain is as much "square" as possible. Indeed the "best" performance, with the current domain decomposition, is reached when the local subdomain has a square shape. I suggest to modify the opa_patition as follows
... CALL factorise(ifact, nfactmax, nfact, mppsize, ierr) IF(nfact <= 1)THEN WRITE (numout, *) 'WARNING: factorisation of number of PEs failed' WRITE (numout, *) ' : using grid of ',mppsize,' x 1' jpnj = 1 jpni = mppsize ELSE ! Search through factors for the pair that are closest in value mindiff = 1000000 imin = 1 DO i=1,nfact-1,2 idiff = ABS(jpjglo/ifact(i) - jpiglo/ifact(i+1)) IF(idiff < mindiff)THEN mindiff = idiff imin = i END IF END DO jpnj = ifact(imin) jpni = ifact(imin + 1) ENDIF ...
Re the allocation of work arrays. The sharing of work arrays among different routines gives us the possibility to save relevant memory space; so the idea to have a module such as wrk_nemo could be useful. However the usage of those arrays could introduce several contraindications: 1. the code could be less readable; 2. when I write a new routine that calls some other already available, I must be sure that I will not use the same work arrays.
Some actions can be adopted in order to reduce the dangerously of such work arrays, but I would avoid to use routine arguments for passing work arrays. Typically the usage of work arrays is strictly related to the kind of implementation of the routine; on the other hand, the routine prototype should be as stable as possible during the refinement/optimization/modification of the routine implementation. The maintenance of the code becomes very heavy if updating the implementation of one routine implies also the modification of its prototype.
For those routines, at lower level, I suggest to declare locally their work allocatable arrays with the SAVE attribute
S2-5 : Andrew P's follow-up comments
I like Gurvan's suggestion of a module containing globally-accessible work-space arrays. We could add some error-checking functionality to this by having an 'in_use' flag for each work-space array in the module. Before using a work-space array, a developer should check that the appropriate flag is .FALSE. and if it is, set it to .TRUE. while they are using it. Once they are done using the array the flag should be set back to .FALSE.
I also note that there are different views on how to specify the domain decomposition (jpni and jpnj values). I was thinking that we would calculate them, choosing values such that the domains are as square as possible (following Italo's suggestion above) whereas Gurvan mentions reading them from a namelist file. I suggest that we support both options: by default jpni and jpnj can be calculated at run-time (this saves the user from having to change the namelist file when changing the number of processors they're running on) but this behaviour can be overridden by specifying them in the namelist file.
(Note that this choice of domain decomposition and the shape of the sub-domains forms a part of the optimisation project presented by Stephen Pickles at the meeting in Southampton.)
S2-6 : Marie-Alice Foujols' comments
As this modification will impact all the code, I suggest to use a script to easily redo modification. It'll be usefull for NEMO users to compare old part of their own copie of code with new one. If this script is distributed, they could use it to change their code and to easily incorpore their modifications to the new version. I suggest also to avoid cosmetic changse (move of comments, line splitting, ....) for the same reason : reduce time for users to compare their own copie with new version of NEMO including dynamic allocation.
In the past, we regret to not do that for : conversion from fixed format to free format, switching iim, jjm to jpi, jpj, ... and other modification that affected large subset of routines.
Hope this helps.
S2-7 : Andy Porter's 3rd set of comments
I can appreciate that an almost global change like this will be difficult for users who have locally modified versions. Ideally the source-code revision-control system would facilitate applying the changes to a locally-modified version/branch - one that isn't in the official repository. Unfortunately I don't think subversion has this functionality (although I'd be very pleased to learn otherwise). Certainly I'll do my best to avoid unnecessary cosmetic changes. However, while I can imagine that scripting the change of module arrays from static to dynamic might be possible, I don't think the same can be said of the work-space arrays and they account for a lot of the code changes.
In fact, I'm discovering that some routines have an awful lot of workspace arrays. e.g.:
SUBROUTINE dyn_spg_ts( kt ) !!---------------------------------------------------------------------- !! *** routine dyn_spg_ts *** !! !! ** Purpose : Compute the now trend due to the surface pressure ... !! References : Griffies et al., (2003): A technical guide to MOM4. NOAA/GFDL !!--------------------------------------------------------------------- INTEGER, INTENT(in) :: kt ! ocean time-step index !! INTEGER :: ji, jj, jk, jn ! dummy loop indices INTEGER :: icycle ! temporary scalar REAL(wp) :: zraur, zcoef, z2dt_e, z2dt_b ! temporary scalars ... !! REAL(wp), DIMENSION(jpi,jpj) :: zhdiv, zsshb_e !! REAL(wp), DIMENSION(jpi,jpj) :: zbfru , zbfrv ! 2D workspace !! REAL(wp), DIMENSION(jpi,jpj) :: zsshun_e, zsshvn_e ! 2D workspace !! REAL(wp), DIMENSION(jpi,jpj) :: zcu, zwx, zua, zun ! 2D workspace REAL(wp), DIMENSION(jpi,jpj) :: zcv, zwy, zva, zvn ! - - REAL(wp), DIMENSION(jpi,jpj) :: zun_e, zub_e, zu_sum ! 2D workspace REAL(wp), DIMENSION(jpi,jpj) :: zvn_e, zvb_e, zv_sum ! - - REAL(wp), DIMENSION(jpi,jpj) :: zssh_sum ! - - !!---------------------------------------------------------------------- IF( kt == nit000 ) THEN !* initialisation }} I make that 21 2D workspace arrays! Should the global workspace module contain that many or should we make some of these into module-wide arrays? Do people want jpk to be treated like jpi and jpj and have it become a run-time variable or is it OK to leave it as a compile-time parameter? My thinking is that one doesn't change the no. of levels in a model lightly and it has no bearing on the MPP domain decomposition. ---- == S2-x : XXX' comments == ----
Attachments (1)
-
Dynamic_Memory_in_NEMO.pdf
(75.5 KB) -
added by rblod 14 years ago.
Andrew Porter Presentation
Download all attachments as: .zip