[land-proc] NEMO 3.6 + XIOS 2.0 segfault issue while removing land grid cells (invalid pointer - xios_server)
Hi,
I am trying to run NEMO v3.6 (with XIOS in detached/MPMD mode) on a Cray machine. NEMO 3.6 was compiled with Intel v17 and XIOS 2.0 support.
1) ISSUE: While trying to remove land-only grid cells (https://www.researchgate.net/publication/265020219_The_NEMO_Ocean_Modelling_Code_A_Case_Study), the simulation terminates abruptly with the following error message in the stdout file:
-> info : If domain grid_T does not have overlapped regions between processes something must be wrong with mask index
......
Error in `./xios_server.exe': free(): invalid pointer: 0x000000000211e490
======= Backtrace: =========
/lib64/libc.so.6(+0x721af)[0x2aaaaf58a1af]
.....
I have uploaded the error log to a Bitbucket repo: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_notraceback_intel/1650382/test_exp.o1650382
The configuration files for this simulation are at: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_notraceback_intel/1650382/
and the XIOS XML file is at: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/iodef.xml
So far, I have tried the following combinations, and both failed with a similar error message:
a) jpni= 59 jpnj= 24 jpnij= 1080
b) jpi= 76 jpj= 134 jpnij= 1080
Here are more details on the combination used:
iresti= 13 irestj= 1
Total number of domains 1416
Number of ocean processors 1080
Number of land processors 336
Mean ocean coverage per domain 0.7769847
Minimum ocean coverage 2.9457975E-04
Maximum ocean coverage 0.9925373
nb of proc with coverage < 10 % 81
nb of proc with coverage 10 < nb < 30 % 78
nb of proc with coverage 30 < nb < 50 % -18
Number of computed points 10998720
Overhead of computed points -2602614
% sup (computed / global) 0.8086501
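As a sanity check on the numbers above, here is a minimal sketch of the arithmetic behind the decomposition summary. It assumes that jpi and jpj from combination (b) are the local subdomain dimensions of each ocean process; that reading is my assumption, although it does reproduce the reported number of computed points.

```python
# Values copied from the post; the interpretation of jpi/jpj as the local
# subdomain size per process is an assumption.
jpni, jpnj = 59, 24      # requested process grid (combination a)
jpnij = 1080             # MPI ranks actually given to nemo.exe
jpi, jpj = 76, 134       # assumed local subdomain dimensions

total_domains = jpni * jpnj            # all subdomains, land included
land_domains = total_domains - jpnij   # subdomains dropped as land-only
computed_points = jpnij * jpi * jpj    # points computed by ocean ranks

print("total domains  :", total_domains)
print("land domains   :", land_domains)
print("computed points:", computed_points)
```

These three values agree with "Total number of domains", "Number of land processors" and "Number of computed points" in the summary, so the land elimination itself looks consistent on the NEMO side.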
I am using the following command to run NEMO and xios_server in MPMD mode:
aprun -n1080 -N36 ./nemo.exe : -n108 -N9 ./xios_server.exe
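For reference, a small sketch of the launch geometry implied by that aprun line, where -n is the total number of ranks and -N the number of ranks per node. The clients-per-server figure assumes NEMO ranks are spread evenly over the XIOS servers, which I have not verified.

```python
# Launch geometry taken from the aprun command in the post.
nemo_ranks, nemo_per_node = 1080, 36
xios_ranks, xios_per_node = 108, 9

nemo_nodes = nemo_ranks // nemo_per_node   # nodes running nemo.exe
xios_nodes = xios_ranks // xios_per_node   # nodes running xios_server.exe

# Assumption: clients are distributed evenly across servers.
clients_per_server = nemo_ranks // xios_ranks

print("NEMO nodes        :", nemo_nodes)
print("XIOS nodes        :", xios_nodes)
print("clients per server:", clients_per_server)
```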
2) I recompiled XIOS 2.0 with Intel's traceback option to get more details on the failure. The resulting traceback (https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_traceback_intel/1650393/test_exp.o1650393) points to a problem in domain.cpp:
Image PC Routine Line Source
libifcore.so.5 00002AAAADBE6956 for__signal_handl Unknown Unknown
libpthread-2.22.s 00002AAAAF00EB10 Unknown Unknown Unknown
xios_server.exe 00000000007A8F08 _ZN4xios7CDomain8 1162 hashtable.h
xios_server.exe 00000000007A73D0 _ZN4xios7CDomain1 2530 domain.cpp
xios_server.exe 0000000000776DEC _ZN4xios14CContex 271 context_server.cpp
xios_server.exe 0000000000C1E4E3 _ZN4xios7CServer9 750 server.cpp
xios_server.exe 000000000077C4DA _ZN4xios5CXios14i 190 cxios.cpp
xios_server.exe 0000000000439E2E MAIN__ 7 xios_server.f90
xios_server.exe 0000000000DD2222 Unknown Unknown Unknown
libc-2.22.so 00002AAAAF5386E5 __libc_start_main Unknown Unknown
xios_server.exe 0000000000439D29 Unknown Unknown Unknown
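The Routine column above holds C++ mangled names cut off by the fixed column width, so a tool like c++filt cannot decode them. The sketch below is a simplified, hypothetical helper (not a full demangler) that reads only the leading length-prefixed names of a `_ZN...` symbol and reports when the last name is cut short.

```python
import re

def nested_prefix(symbol):
    """Decode the leading length-prefixed names of a '_ZN' mangled symbol.

    Returns (names, complete). complete is False when the symbol ends
    before the last announced name length is satisfied.
    """
    if not symbol.startswith("_ZN"):
        return [], True
    pos, names = 3, []
    while pos < len(symbol):
        m = re.match(r"\d+", symbol[pos:])
        if not m:
            break  # reached 'E' or another non-name encoding
        length = int(m.group())
        start = pos + m.end()
        name = symbol[start:start + length]
        if name:
            names.append(name)
        if len(name) < length:
            return names, False  # symbol was truncated here
        pos = start + length
    return names, True

# Symbols copied verbatim from the traceback above.
for sym in ["_ZN4xios7CDomain8", "_ZN4xios7CDomain1",
            "_ZN4xios14CContex", "_ZN4xios7CServer9", "_ZN4xios5CXios14i"]:
    names, complete = nested_prefix(sym)
    print(sym, "->", "::".join(names), "" if complete else "(truncated)")
```

The two innermost frames both resolve to the xios::CDomain class, which is consistent with the domain.cpp line in the traceback; the method names themselves are not recoverable from the truncated symbols.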
The relevant XIOS files and configuration can be found at: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_traceback_intel/1650393/
The issue seems similar to http://forge.ipsl.jussieu.fr/ioserver/ticket/140, but I am already using XIOS 2.0.
3) Suspecting an issue with Intel's memory allocation strategy, I recompiled NEMO with the GNU/GCC compilers. The simulation appears to fail at the same point: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_gnu/1651839/ar_new_input.o1651839
4) This issue shows up only when trying to eliminate land processors (jpni x jpnj > jpnij). When running with jpni x jpnj = jpnij, the simulation (both Intel and GCC builds) runs to completion. Logs and configuration for a successful run are at https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/good_traceback_intel/1650413/ . The following combinations of jpni and jpnj have been tested and work fine:
jpni = 0 jpnj = 0 jpnij = 0 (/default)
jpni = 40 jpnj = 27 jpnij = 1080
jpni = 2 jpnj = 540 jpnij = 1080
jpni = 10 jpnj = 108 jpnij = 1080
jpni = 20 jpnj = 54 jpnij = 1080
jpni = 120 jpnj = 9 jpnij = 1080
jpni = 540 jpnj = 2 jpnij = 1080
aprun -n1080 -N36 ./nemo.exe : -n108 -N9 ./xios_server.exe
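To make the pattern in point 4 explicit, here is a short sketch that classifies each tested combination by whether jpni x jpnj exceeds jpnij. The tuples are copied from this post; the labels are just my shorthand.

```python
# (jpni, jpnj, jpnij, observed result) as reported in this post.
runs = [
    (0, 0, 0, "ok"),          # default setting
    (40, 27, 1080, "ok"),
    (2, 540, 1080, "ok"),
    (10, 108, 1080, "ok"),
    (20, 54, 1080, "ok"),
    (120, 9, 1080, "ok"),
    (540, 2, 1080, "ok"),
    (59, 24, 1080, "crash"),  # combination a)
]

def eliminates_land(jpni, jpnj, jpnij):
    """True when more subdomains are requested than ranks are available."""
    return jpni * jpnj > jpnij

for jpni, jpnj, jpnij, result in runs:
    mode = "land elimination" if eliminates_land(jpni, jpnj, jpnij) else "full grid"
    print(f"{jpni:4d} x {jpnj:4d} = {jpni * jpnj:5d} vs jpnij={jpnij:5d}: {mode:16s} -> {result}")

# Every crash coincides with land elimination, and every successful run does not.
pattern_holds = all(eliminates_land(a, b, c) == (r == "crash") for a, b, c, r in runs)
print("pattern holds:", pattern_holds)
```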
Please let me know if I can provide more input or details on this issue. Any help or hint would be very useful.