[land-proc] ORCA2_LIM setup fails while removing land-only grid cells (#65)
Hi all,
This topic is a follow-up to http://forge.ipsl.jussieu.fr/nemo/discussion/topic/64. I am now trying a standard input configuration (based on the ORCA2_LIM configuration) of nemo-3.6 + xios-2.0 (compiled with GNU compilers) with minimal tweaks, hence posting this issue as a new topic. The purpose is to see whether the setup works after eliminating land-only processors.
As the keys were not being picked up, I had to add them manually to the FPPKEYS variable (in ./EXTERNAL/fcm/lib/Fcm/Config.pm):
key_trabbl key_lim2 key_dynspg_flt key_diaeiv key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdftke key_zdfddm key_zdftmx key_iomput key_mpp_mpi key_diaobs key_asminc key_nosignedzero key_xios2
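(For reference, the more usual place for these keys in a NEMO 3.6 build would be the configuration's cpp file rather than Config.pm. A sketch of what that would look like, with the path assumed from the standard CONFIG layout:)

```
# CONFIG/MY_ORCA2_LIM/cpp_MY_ORCA2_LIM.fcm  (assumed path)
bld::tool::fppkeys key_trabbl key_lim2 key_dynspg_flt key_diaeiv key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdftke key_zdfddm key_zdftmx key_iomput key_mpp_mpi key_diaobs key_asminc key_nosignedzero key_xios2
```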
Here are the steps I followed to set up NEMO's ORCA2_LIM configuration:
./makenemo -m XC40_NCM_gnu -r ORCA2_LIM -n MY_ORCA2_LIM 2>&1|tee logs
cd MY_ORCA2_LIM/EXP00
cp ../../ORCA2_LIM/EXP00/* .
cp ../../GYRE_XIOS/EXP00/*.xml .
tar xvf ../../ORCA2_LIM_nemo_v3.6.tar
gunzip *.gz */*.gz
Here are the modifications made in the namelist_cfg file:

!-----------------------------------------------------------------------
&nammpp        !   Massively Parallel Processing                ("key_mpp_mpi")
!-----------------------------------------------------------------------
   cn_mpi_send  = 'I'     ! mpi send/receive type = 'S', 'B', or 'I' for standard send,
                          ! buffer blocking send or immediate non-blocking sends, resp.
   nn_buffer    = 0       ! size in bytes of exported buffer ('B' case), 0 no exportation
   ln_nnogather = .true.  ! activate code to avoid mpi_allgather use at the northfold
   jpni         = 10      ! jpni  number of processors following i (set automatically if < 1)
   jpnj         = 8       ! jpnj  number of processors following j (set automatically if < 1)
   jpnij        = 76      ! jpnij number of local domains (set automatically if < 1)
On running the simulation with
aprun -n 76 ./nemo.exe
the error message I get is:
....
-> info : If domain grid_T does not have overlapped regions between processes something must be wrong with mask index
*** Error in `./nemo.exe': malloc(): memory corruption: 0x0000000005e2a590 ***
*** Error in `./nemo.exe': malloc(): memory corruption: 0x0000000005e297a0 ***
*** Error in `./nemo.exe': malloc(): memory corruption: 0x0000000005e297f0 ***
*** Error in `./nemo.exe': malloc(): memory corruption: 0x0000000005db6980 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x721af)[0x2aaaadde71af]
....
(the four malloc errors come from different MPI ranks, interleaved in the original output)
I have uploaded the relevant files for this run at https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/ORCA2_LIM/1672152/
standard error/output - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/ORCA2_LIM/1672152/ORCA2_LIM.o1672152
namelist - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/ORCA2_LIM/1672152/namelist_cfg
ocean.output - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/ORCA2_LIM/1672152/ocean.output
On changing jpnij to 80 in namelist_cfg:

!-----------------------------------------------------------------------
&nammpp        !   Massively Parallel Processing                ("key_mpp_mpi")
!-----------------------------------------------------------------------
   cn_mpi_send  = 'I'     ! mpi send/receive type = 'S', 'B', or 'I' for standard send,
                          ! buffer blocking send or immediate non-blocking sends, resp.
   nn_buffer    = 0       ! size in bytes of exported buffer ('B' case), 0 no exportation
   ln_nnogather = .true.  ! activate code to avoid mpi_allgather use at the northfold
   jpni         = 10      ! jpni  number of processors following i (set automatically if < 1)
   jpnj         = 8       ! jpnj  number of processors following j (set automatically if < 1)
   jpnij        = 80      ! jpnij number of local domains (set automatically if < 1)
aprun -n 80 ./nemo.exe
the simulation runs to completion (5475 steps). The relevant output/configuration files for this run can be found at https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/ORCA2_LIM/1672307/
I am unable to figure out why the setup with land-only processors eliminated fails to run. Am I missing something in an input file?
1) Is there a rule of thumb for selecting jpnij when eliminating land-only processors?
2) Also, could you please suggest an appropriate (any working) combination of jpni and jpnj with jpni x jpnj > jpnij for this ORCA2_LIM configuration?
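Regarding question 1, my understanding from the nammpp comments is that jpnij should be the number of jpni x jpnj rectangular subdomains that contain at least one ocean point, so land-only subdomains are dropped. A toy sketch of that counting, with a made-up land/sea mask (not the real ORCA2 bathymetry, which is what the actual count must come from):

```python
# Toy illustration: count the subdomains of a jpni x jpnj decomposition
# that contain at least one ocean point. Land-only subdomains are the
# ones eliminated, so jpnij = this count (sketch only; NEMO's mppini
# does the real bookkeeping on the actual bathymetry).

def count_ocean_domains(mask, jpni, jpnj):
    """mask[j][i] is 1 for ocean, 0 for land."""
    nj, ni = len(mask), len(mask[0])
    count = 0
    for pj in range(jpnj):
        for pi in range(jpni):
            # index range of this rectangular subdomain
            i0, i1 = pi * ni // jpni, (pi + 1) * ni // jpni
            j0, j1 = pj * nj // jpnj, (pj + 1) * nj // jpnj
            if any(mask[j][i] for j in range(j0, j1) for i in range(i0, i1)):
                count += 1  # keep: contains at least one ocean point
    return count

# Made-up 8x8 mask with an all-land block in one corner
mask = [[0 if (i < 4 and j < 4) else 1 for i in range(8)] for j in range(8)]
print(count_ocean_domains(mask, 2, 2))  # 3 of the 4 subdomains touch ocean
```

With that count as jpnij, the executable would then be launched on exactly jpnij ranks (as in the aprun commands above).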
Please let me know if I can provide any further information from my end.
Eagerly awaiting your replies.
-
Message #188
Both of the following commands were failing with the error above:

aprun -n 76 ./nemo.exe
aprun -n 76 ./nemo.exe : -n 4 ./xios_server.exe    (iodef.xml: using_server = true)
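(For the MPMD case, the server switch referred to above is XIOS's using_server variable in iodef.xml; a sketch of the relevant fragment, assuming the standard xios-2.0 layout:)

```
<context id="xios">
  <variable_definition>
    <variable id="using_server" type="bool">true</variable>
  </variable_definition>
</context>
```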
with the nammpp ("key_mpp_mpi") section of namelist_cfg having:
   jpni  = 10   ! jpni  number of processors following i (set automatically if < 1)
   jpnj  = 8    ! jpnj  number of processors following j (set automatically if < 1)
   jpnij = 76   ! jpnij number of local domains (set automatically if < 1)
I found out that the issue shows up when multiple gcc versions are loaded in the environment while compiling xios.
Why I had to load multiple gcc versions: my default gcc (/usr/bin/gcc) was 4.8, and I was trying to compile nemo 3.6 + xios r1630 (with Intel 2017). xios r1630 uses "regex" types/constructs in its source code, which required gcc >= 4.9 (via module load).
Solution: nemo_3.6 + xios_r1322 compiled fine with gcc 4.8 (hence no need to load additional modules), and the ORCA2_LIM setup worked fine under both MPMD and non-MPMD configurations. Please mark this issue as solved.
It is highly likely that multiple gcc versions are also the root cause of http://forge.ipsl.jussieu.fr/nemo/discussion/topic/64.
puneet336, 2018-12-31 06:48 CET