nemo3.6 + xios 2.0 Segfault (invalid pointer - xios_server) Issue - while removing land grid cells
 unsolved

Hi, I am trying to run NEMO v3.6 (with XIOS in detached/MPMD mode) on a Cray machine. NEMO 3.6 was compiled with Intel v17 and XIOS 2.0 support.

1) ISSUE: When I try to remove land-only grid cells (https://www.researchgate.net/publication/265020219_The_NEMO_Ocean_Modelling_Code_A_Case_Study ), the simulation terminates abruptly with the following error message in the stdout file -

    -> info : If domain grid_T does not have overlapped regions between processes something must be wrong with mask index
    ......
     Error in `./xios_server.exe': free(): invalid pointer: 0x000000000211e490
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x721af)[0x2aaaaf58a1af]
    .....

I have uploaded the files for this run to a Bitbucket repo:
Error log - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_notraceback_intel/1650382/test_exp.o1650382
Configuration files for this simulation - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_notraceback_intel/1650382/
XIOS XML file - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/iodef.xml

So far, I have tried the following combinations, and both have failed with a similar error message -
a)  jpni=          59     jpnj=          24       jpnij=        1080
b)  jpi=           76     jpj=          134       jpnij=        1080


Here are more details on the combination used:

  iresti=          13  irestj=           1
  Total number of domains         1416
  Number of ocean processors               1080
  Number of land processors                 336
  Mean ocean coverage per domain     0.7769847
  Minimum ocean coverage             2.9457975E-04
  Maximum ocean coverage             0.9925373
  nb of proc with coverage         < 10 %           81
  nb of proc with coverage 10 < nb < 30 %           78
  nb of proc with coverage 30 < nb < 50 %          -18
  Number of computed points            10998720
  Overhead of computed points          -2602614
  % sup (computed / global)          0.8086501
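As a sanity check on the summary above, here is a minimal Python sketch that recomputes the reported figures from the values quoted in this post. The variable names are only illustrative, and the relation "overhead = computed points - global points" is an assumption that happens to be consistent with the numbers printed by NEMO, not something taken from the NEMO source.

```python
# Values copied from the decomposition summary in this post.
jpni, jpnj, jpnij = 59, 24, 1080      # processor grid and number of ocean processes
jpi, jpj = 76, 134                    # local subdomain size
computed_points = 10998720            # "Number of computed points"
overhead = -2602614                   # "Overhead of computed points"
reported_ratio = 0.8086501            # "% sup (computed / global)"

total_domains = jpni * jpnj               # all subdomains, land included
land_domains = total_domains - jpnij      # subdomains dropped as land-only
points_from_subdomains = jpi * jpj * jpnij

# Assumption: overhead = computed - global, so global = computed - overhead.
global_points = computed_points - overhead
ratio = computed_points / global_points

print("total domains      :", total_domains)
print("land domains       :", land_domains)
print("jpi * jpj * jpnij  :", points_from_subdomains)
print("computed / global  :", round(ratio, 7))
```

The products match the log (1416 domains, 336 of them land-only, and 10998720 computed points), so the decomposition summary itself looks internally consistent.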

I am using the following command to run nemo and xios_server in MPMD mode:

    aprun -n1080 -N36 ./nemo.exe : -n108 -N9 ./xios_server.exe
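For reference, the node footprint implied by this launch line can be worked out with a small sketch. It assumes the usual aprun meaning of -n (total processing elements) and -N (processing elements per node). The helper name nodes_needed is hypothetical.

```python
import math

def nodes_needed(total_pes: int, pes_per_node: int) -> int:
    """Nodes required to place total_pes processes at pes_per_node per node."""
    return math.ceil(total_pes / pes_per_node)

nemo_nodes = nodes_needed(1080, 36)   # NEMO clients
xios_nodes = nodes_needed(108, 9)     # XIOS servers
clients_per_server = 1080 // 108      # NEMO processes per XIOS server

print(nemo_nodes, xios_nodes, clients_per_server)
```

Under that assumption the job uses 30 nodes for NEMO and 12 nodes for XIOS, with 10 NEMO processes per XIOS server.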



2) I recompiled xios-2.0 with Intel's traceback option to get more details on the failure. The resulting traceback (https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_traceback_intel/1650393/test_exp.o1650393) points to a problem in the domain.cpp file:

 Image              PC                Routine            Line        Source
   libifcore.so.5     00002AAAADBE6956  for__signal_handl     Unknown  Unknown
   libpthread-2.22.s  00002AAAAF00EB10  Unknown               Unknown  Unknown
   xios_server.exe    00000000007A8F08  _ZN4xios7CDomain8        1162  hashtable.h
   xios_server.exe    00000000007A73D0  _ZN4xios7CDomain1        2530  domain.cpp
   xios_server.exe    0000000000776DEC  _ZN4xios14CContex         271  context_server.cpp
   xios_server.exe    0000000000C1E4E3  _ZN4xios7CServer9         750  server.cpp
   xios_server.exe    000000000077C4DA  _ZN4xios5CXios14i         190  cxios.cpp
   xios_server.exe    0000000000439E2E  MAIN__                      7  xios_server.f90
   xios_server.exe    0000000000DD2222  Unknown               Unknown  Unknown
   libc-2.22.so       00002AAAAF5386E5  __libc_start_main     Unknown  Unknown
   xios_server.exe    0000000000439D29  Unknown               Unknown  Unknown

The relevant XIOS files and configuration can be found at - https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_traceback_intel/1650393/

The issue looks similar to http://forge.ipsl.jussieu.fr/ioserver/ticket/140 , but I am already using xios-2.0.

3) Suspecting an issue with Intel's memory allocation strategy, I recompiled NEMO with the GNU/GCC compilers. The simulation appears to fail at the same point: https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/fail_gnu/1651839/ar_new_input.o1651839

4) This issue shows up only when I try to eliminate land processors (jpni x jpnj > jpnij). With jpni x jpnj = jpnij, the simulation runs to completion with both the Intel and GCC builds. Logs and configuration for a successful run are at https://bitbucket.org/puneet336/nemo3.6_issue1/src/master/good_traceback_intel/1650413/ . The following combinations of jpni and jpnj have been tested and work fine -

   jpni  =  0    jpnj  =  0    jpnij    =  0    (/default)
   jpni  =  40   jpnj  =  27   jpnij    =  1080
   jpni  = 2     jpnj  =  540  jpnij    =  1080
   jpni  = 10    jpnj  =  108  jpnij    =  1080
   jpni  = 20    jpnj  =  54   jpnij    =  1080
   jpni  = 40    jpnj  =  27   jpnij    =  1080
   jpni  = 120   jpnj  =  9    jpnij    =  1080
   jpni  = 540   jpnj  =  2    jpnij    =  1080

aprun -n1080 -N36 ./nemo.exe : -n108 -N9 ./xios_server.exe
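The pattern in point 4 can be summarised with a short Python sketch that separates the tested decompositions by whether land elimination is active (jpni x jpnj > jpnij). The combinations and their outcomes are copied from this post. The function name eliminates_land is hypothetical.

```python
def eliminates_land(jpni: int, jpnj: int, jpnij: int) -> bool:
    """True when some subdomains are dropped, i.e. jpni * jpnj > jpnij."""
    return jpni * jpnj > jpnij

# (jpni, jpnj, jpnij, outcome reported in this post)
runs = [
    (0, 0, 0, "ok"),        # default
    (40, 27, 1080, "ok"),
    (2, 540, 1080, "ok"),
    (10, 108, 1080, "ok"),
    (20, 54, 1080, "ok"),
    (120, 9, 1080, "ok"),
    (540, 2, 1080, "ok"),
    (59, 24, 1080, "crash"),
]

for jpni, jpnj, jpnij, outcome in runs:
    flag = eliminates_land(jpni, jpnj, jpnij)
    print(f"{jpni:4d} x {jpnj:4d} (jpnij={jpnij:4d}) land elimination={flag!s:5} -> {outcome}")
```

Only the 59 x 24 case has land elimination active, and it is also the only case listed here that crashes.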

Please let me know if I can provide more input or details on this issue. Any help or hint would be very useful.
