Opened 2 months ago

Last modified 2 months ago

#172 new defect

Segmentation violation with trunk on ARCHER2 (AMD/Cray compilers). Bug in xios_set_domain_attr?

Reported by: acc Owned by: jderouillat
Priority: major Component: XIOS
Version: trunk Keywords:
Cc:

Description

Problems have been encountered when testing the latest development of NEMO on ARCHER2. This development includes tiling options and therefore needs a recent trunk version of XIOS. The problems occur with r2136 and also in tests using r2133. Only one of the NEMO SETTE tests fails, and it turns out to be the only one that activates the ln_mskland option. This option sets masks via xios_set_domain_attr and xios_set_grid_attr (mask_1D in the former and mask_3D in the latter). The calls to these routines return cleanly, but xios_server.exe crashes with a segmentation violation, and no further information, on the first attempt to write data to disk. The code is well behaved and applies the mask correctly without the call to xios_set_domain_attr (i.e. with just the xios_set_grid_attr call), but it crashes whenever xios_set_domain_attr is used to set a mask. The arguments have been checked and are confirmed to be valid.

This looks to be a generic XIOS issue on this architecture/compiler, since the generic_testcase.exe executable also fails with what looks like a very similar trace:

srun -n 12 ./generic_testcase.exe
srun: error: nid001013: task 6: Segmentation fault
srun: Terminating job step 261079.3

gdb generic_testcase.exe core

Reading symbols from generic_testcase.exe...

Core was generated by `/lus/cls01095/work/n01/shared/acc/xios-test/generic_testcase/./generic_testcase'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00002b4f84e823ba in std::local_Rb_tree_decrement (__x=0x3002d20) at ../../../../../cray-gcc-10.1.0-202007190429.e9e165705b91a/libstdc++-v3/src/c++98/tree.cc:123
123	../../../../../cray-gcc-10.1.0-202007190429.e9e165705b91a/libstdc++-v3/src/c++98/tree.cc: No such file or directory.
[Current thread is 1 (Thread 0x2b4f84382bc0 (LWP 106856))]


(gdb) where
#0  0x00002b4f84e823ba in std::local_Rb_tree_decrement (__x=0x3002d20) at ../../../../../cray-gcc-10.1.0-202007190429.e9e165705b91a/libstdc++-v3/src/c++98/tree.cc:123
#1  std::_Rb_tree_decrement (__x=0x3002d20) at ../../../../../cray-gcc-10.1.0-202007190429.e9e165705b91a/libstdc++-v3/src/c++98/tree.cc:123
#2  0x00000000005fbe17 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<int const, int> >, int const&) ()
#3  0x00000000005fbc0f in std::_Rb_tree_iterator<std::pair<int const, int> > std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, int> >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) ()
#4  0x0000000000649a74 in xios::CDomain::computeWrittenCompressedIndex(int) ()
#5  0x00000000008a7bf5 in xios::CNc4DataOutput::writeDomain_(xios::CDomain*) ()
#6  0x0000000000a4f84d in xios::CDataOutput::writeGrid(xios::CGrid*, bool) ()
#7  0x0000000000698031 in xios::CFile::createHeader() ()
#8  0x0000000000696d63 in xios::CFile::checkWriteFile() ()
#9  0x00000000006776cc in xios::CField::writeField() ()
#10 0x0000000000677437 in xios::CField::writeUpdateData(xios::CArray<double, 1> const&) ()
#11 0x000000000087d5fc in xios::CInputPin::setInput(unsigned long, std::shared_ptr<xios::CDataPacket>) ()
#12 0x00000000009b3872 in xios::COutputPin::deliverOuput(std::shared_ptr<xios::CDataPacket>) ()
#13 0x00000000009b345d in xios::COutputPin::onOutputReady(std::shared_ptr<xios::CDataPacket>) ()
#14 0x00000000009e4cbd in void xios::CSourceFilter::streamData<1>(xios::CDate, xios::CArray<double, 1> const&, bool) ()
#15 0x0000000000690357 in void xios::CField::setData<1>(xios::CArray<double, 1> const&, int) ()
#16 0x0000000000676bf0 in xios::CField::recvUpdateData(std::map<int, xios::CBufferIn*, std::less<int>, std::allocator<std::pair<int const, xios::CBufferIn*> > >&) ()
#17 0x0000000000675046 in xios::CField::recvUpdateData(xios::CEventServer&) ()
#18 0x0000000000674ba7 in xios::CField::dispatchEvent(xios::CEventServer&) ()
#19 0x0000000000627ea2 in xios::CContextServer::dispatchEvent(xios::CEventServer&) ()
#20 0x0000000000627166 in xios::CContextServer::processEvents() ()
#21 0x0000000000626b32 in xios::CContextServer::eventLoop(bool) ()
#22 0x00000000009c53f8 in xios::CServer::contextEventLoop(bool) ()
#23 0x00000000009c402d in xios::CServer::eventLoop() ()
#24 0x000000000062a42e in xios::CXios::initServerSide() ()
#25 0x000000000043b1bc in main () at /lus/cls01095/work/n01/shared/acc/xios-test/src/test/generic_testcase.f90:99

For the record, this is using:

CC --version
Cray clang version 10.0.4 (ffb772459f6195ccd74395703e557b253fc8ee27)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/cray/pe/cce/10.0.4/cce-clang/x86_64/share/../bin

ftn --version
Cray Fortran : Version 10.0.4

Other partners have successfully run this test with the same NEMO/XIOS combination elsewhere with GNU or Intel compilers.

Change History (4)

comment:1 Changed 2 months ago by ymipsl

  • Owner changed from ymipsl to jderouillat

comment:2 Changed 2 months ago by jderouillat

Thank you for this report.
Did you try running NEMO or the testcase with XIOS compiled in debug mode (./make_xios --debug ...) to get a more explicit trace?
For the testcase, can you provide the set of XML files that you are using?

comment:3 Changed 2 months ago by acc

This is a sensible suggestion, but the code completes successfully when built with the --debug option, which rather points to an unsafe optimisation as the cause. I'll try to work out at which level of optimisation it breaks. --prod is using -O2 and --debug is using -g.

The XMLs I'm using are just those in the generic_testcase subdirectory. I simply copy in the binary from the bin directory and run:

srun -n 12 ./generic_testcase.exe

For the debug version this runs:

srun -n 12 ./generic_testcase.exe
-> info : The number of secondary server pools is 1
-> info : The number of secondary server pools is 1
-> info : The number of secondary server pools is 1
-> info : The number of secondary server pools is 1
-> info : intercommCreate::client 1 intraCommSize : 4 intraCommRank :1  clientLeader 4
-> info : intercommCreate::client 2 intraCommSize : 4 intraCommRank :2  clientLeader 4
-> info : intercommCreate::client 3 intraCommSize : 4 intraCommRank :3  clientLeader 4
-> info : intercommCreate::server (server level 1) 5 intraCommSize : 4 intraCommRank :1  clientLeader 0
-> info : intercommCreate::server (server level 1) 6 intraCommSize : 4 intraCommRank :2  clientLeader 0
-> info : intercommCreate::server (server level 2) 9 intraCommSize : 4 intraCommRank :1  clientLeader 4
-> info : intercommCreate::server (server level 2) 10 intraCommSize : 4 intraCommRank :2  clientLeader 4
-> info : intercommCreate::server (server level 2) 11 intraCommSize : 4 intraCommRank :3  clientLeader 4
-> info : intercommCreate::client 0 intraCommSize : 4 intraCommRank :0  clientLeader 4
-> info : intercommCreate::server (server level 1) 4 intraCommSize : 4 intraCommRank :0  clientLeader 0
-> info : intercommCreate::server (server level 1) 7 intraCommSize : 4 intraCommRank :3  clientLeader 0
-> info : intercommCreate::server (server level 2) 8 intraCommSize : 4 intraCommRank :0  clientLeader 4
-> info : intercommCreate::client (server level 1) 4 intraCommSize : 4 intraCommRank :0  clientLeader 8
-> info : intercommCreate::client (server level 1) 5 intraCommSize : 4 intraCommRank :1  clientLeader 8
-> info : intercommCreate::client (server level 1) 6 intraCommSize : 4 intraCommRank :2  clientLeader 8
-> info : intercommCreate::client (server level 1) 7 intraCommSize : 4 intraCommRank :3  clientLeader 8
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully
 finished Successfully

comment:4 Changed 2 months ago by acc

It also works at -O1. We could live with this if it is the only solution available. The crayCC man page doesn't provide any useful information that might help pinpoint the type of optimisation at fault, just this:

              Specify which optimization level to use:
                 -O0 Means "no optimization": this level compiles the fastest and generates the most debuggable code.

                 -O1 Somewhere between -O0 and -O2.

                 -O2 Moderate level of optimization which enables most optimizations.

                 -O3  Like -O2, except that it enables optimizations that take longer to perform or that may generate larger code (in an attempt to make the program run faster)

Wonderful!
