Opened 16 months ago

Closed 6 months ago

#757 closed defect (fixed)

XIOS crash in global debug mode

Reported by: mmcgrath
Owned by: somebody
Priority: minor
Milestone: ORCHIDEE 4.1
Component: Tools
Version:
Keywords:
Cc:

Description

When running r6999 of the TRUNK, compiled in debug mode, on Irene with 128 processors (1 for XIOS, 127 for ORCHIDEE), the following crash happens very early in the first year of the SVN version of SPINUP_ANALYTIC_FG1.

==== backtrace ====
 2 0x000000000006bc9c mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3112/src/mxm/util/debug/debug.c:641
 3 0x000000000006c1ec mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3112/src/mxm/util/debug/debug.c:616
 4 0x00000000000363f0 killpg()  ??:0
 5 0x00000000039331b0 _ZNK4xios5CTypeINS_9CDurationEE3getEv()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/type/type_impl.hpp:96
 6 0x0000000003b0f074 _ZNK4xios18CAttributeTemplateINS_9CDurationEE8getValueEv()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/attribute_template_impl.hpp:86
 7 0x0000000003fee21e _ZN4xios6CField19checkTimeAttributesEPNS_9CDurationE()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/field.cpp:2061
 8 0x0000000004008549 _ZN4xios6CField21getTemporalDataFilterERNS_17CGarbageCollectorENS_9CDurationE()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/field.cpp:1510
 9 0x00000000055347b2 _ZNK4xios28CFilterTemporalFieldExprNode6reduceERNS_17CGarbageCollectorERNS_6CFieldExx()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/parse_expr/filter_expr_node.cpp:88
10 0x0000000005538dd9 _ZNK4xios28CFilterFieldScalarOpExprNode6reduceERNS_17CGarbageCollectorERNS_6CFieldExx()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/parse_expr/filter_expr_node.cpp:165
11 0x0000000003ff4382 _ZN4xios6CField16buildFilterGraphERNS_17CGarbageCollectorEbxx()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/field.cpp:1223
12 0x000000000416838a _ZN4xios5CFile31buildFilterGraphOfEnabledFieldsERNS_17CGarbageCollectorE()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/file.cpp:974
13 0x0000000003c97c5e _ZN4xios8CContext31buildFilterGraphOfEnabledFieldsEv()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/context.cpp:846
14 0x0000000003c5d491 _ZN4xios8CContext15closeDefinitionEv()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/node/context.cpp:709
15 0x00000000048db5c8 cxios_context_close_definition()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/interface/c/icdata.cpp:122
16 0x00000000034dc136 idata_mp_xios_close_context_definition_()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/ppsrc/xios/interface/fortran/idata.f90:440
17 0x0000000001b33439 xios_orchidee_mp_xios_orchidee_close_definition_()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/ORCHIDEE/build/ppsrc/parallel/xios_orchidee.f90:670
18 0x00000000005953fe intersurf_mp_intersurf_initialize_2d_()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/ORCHIDEE/build/ppsrc/sechiba/intersurf.f90:273
19 0x00000000004ddf00 MAIN__()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/ORCHIDEE/build/ppsrc/orchidee_ol/dim2_driver.f90:1105
20 0x000000000044d68e main()  ??:0
21 0x0000000000022545 __libc_start_main()  ??:0
22 0x000000000044d5a9 _start()  ??:0

The intersurf line in frame 18 (intersurf.f90:273) corresponds to

    CALL xios_orchidee_close_definition

which takes no arguments from ORCHIDEE, so it is not at all clear what might be causing the problem. Some initial analysis of XIOS suggests that the crash occurs when the code is getting ready to create the database of variables on the server, perhaps in the indexing.
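To make a trace like the one above easier to scan, the frames can be reduced to their source locations. The following is only a minimal sketch: it assumes the frame layout seen in this backtrace (index, address, `symbol()`, then `path:line`), and the name `parse_frames` is hypothetical, not part of XIOS or ORCHIDEE.

```python
import re

# One backtrace frame: index, hex address, symbol(), then path:line
# ("??:0" appears when no debug information is available).
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+0x(?P<addr>[0-9a-fA-F]+)\s+"
    r"(?P<symbol>\S+)\(\)\s+(?P<path>\S+):(?P<line>\d+)\s*$"
)


def parse_frames(text):
    """Return (index, symbol, file name, line) for every frame found in text."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw)
        if match is None:
            continue  # skip non-frame lines such as "==== backtrace ===="
        file_name = match.group("path").rsplit("/", 1)[-1]
        frames.append(
            (
                int(match.group("index")),
                match.group("symbol"),
                file_name,
                int(match.group("line")),
            )
        )
    return frames


# Three frames copied from the backtrace in this ticket.
sample = """\
==== backtrace ====
 4 0x00000000000363f0 killpg()  ??:0
 5 0x00000000039331b0 _ZNK4xios5CTypeINS_9CDurationEE3getEv()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/XIOS/src/type/type_impl.hpp:96
18 0x00000000005953fe intersurf_mp_intersurf_initialize_2d_()  /ccc/work/cont003/gen6328/mcgrathm/TRUNK.HEAD/modeles/ORCHIDEE/build/ppsrc/sechiba/intersurf.f90:273
"""

for index, symbol, file_name, line in parse_frames(sample):
    print(f"{index:>2}  {file_name}:{line}  {symbol}")
```

The C++ symbols stay mangled here; a demangler such as `c++filt` can be applied to the symbol column separately if readable names are needed.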

Not too long ago, some changes were made to xios_orchidee.f90 in the trunk. These were re-examined, but nothing obvious was found.

I tried reverting xios_orchidee.f90 to revision 6616 (just after the integration of CAN with the TRUNK) while keeping the rest of the files at r6999. With this combination, the crash described above does not seem to occur (a crash still happens in XIOS, but later, in stomate_lpj; see ticket #756). The problem therefore appears to have been introduced after r6616.

Change History (1)

comment:1 Changed 6 months ago by jgipsl

  • Resolution set to fixed
  • Status changed from new to closed

Revision 7345 has been tested successfully on Irene in debug mode with the experiment SPINUP_ANALYTIC_FG1 for 5 days (PeriodLength=1D) and for 1 year (PeriodLength=1Y). The tests passed with both 32 and 128 processors.

Revision 6999 was also tested with the same experiment, but it crashed during the first day with a different error message, pointing to the xios_orchidee_send_field line for rhSoil in stomate_lpj.f90. When that variable was deactivated in file_def_orchidee.xml, the xios_orchidee_send_field line above it, for rhLitter, crashed instead.

We expect that the error has been resolved by one of the commits made since r6999. Nothing more will be done.
