Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#752 closed defect (fixed)

XIOS crash in maint_resp

Reported by: mmcgrath
Owned by: somebody
Priority: blocker
Milestone: ORCHIDEE 4.1
Component: Anthropogenic processes
Version:
Keywords:
Cc:

Description

When launching FG1 with 15 PFTs and one age class at r6980 on Irene, using 128 CPUs in production mode (127 CPUs for ORCHIDEE, 1 CPU for XIOS), I ran into an XIOS crash at the end of the first year of a SPINUP_ANALYTIC_FG1 run. According to out_orchidee the first year completes, and then the crash happens.


In file "nc4_data_output.cpp", function "void xios::CNc4DataOutput::writeFieldData_(xios::CField *)", line 2669 -> On writing field data: maint_resp
In the context : orchidee_server
Error when calling function ncPutVaraType(ncid, varId, start, count, data)
NetCDF: Numeric conversion not representable
Unable to write data given the location id: 65536 and the variable whose id: 104 and name: maint_resp

(1) void cxios_init_server()

(2) bool xios::CContext::checkBuffersAndListen(bool)
Object id="orchidee_server" object type="context"
*** XIOS attributes as defined in XML file(s) or via Fortran interface:
[]
*** Additional information:
[enabled files="sechiba1 sechiba3 stomate1 stomate2 "]

(3) static bool xios::CField::dispatchEvent(xios::CEventServer &)

(4) static void xios::CField::recvUpdateData(xios::CEventServer &)

(5) void xios::CField::recvUpdateData(std::map<int, xios::CBufferIn *, std::less<int>, std::allocator<st...)
Object id="field_undef_id_99" object type="field"
*** XIOS attributes as defined in XML file(s) or via Fortran interface:
[compression_level="2" default_value="9.96921e+36" detect_missing_value="true" enabled="true" field_ref="maint_resp" freq_offset="0ts" freq_op="1ts" grid_ref="grid_nvm_out" level="0" long_name="Maintenance respiration per PFT" name="maint_resp" operation="average" ]

terminate called after throwing an instance of 'xios::CException'


I tested by putting write statements before and after

CALL xios_orchidee_send_field("MAINT_RESP",resp_maint)

line 2352 in stomate_lpj. All processors passed this point. There are too many points to check visually, as this code is executed every day, so I added a check that prints any non-zero value greater than 10000.0 or less than 1d-16 (many values are legitimately zero).

    ! Print every non-zero value above 10000.0 or below 1d-16 (negatives included)
    DO ipts = 1, npts
       DO ivm = 1, nvm
          IF ( (resp_maint(ipts,ivm) .NE. 0.0) .AND. &
               ((resp_maint(ipts,ivm) .GT. 10000.0) .OR. (resp_maint(ipts,ivm) .LT. 1d-16)) ) THEN
             WRITE(numout,*) "TESTTT ", ipts, ivm, resp_maint(ipts,ivm)
          ENDIF
       ENDDO
    ENDDO

There is also a similar xios_orchidee_send_field call at line 963 in slowproc.

The call in slowproc occasionally sends values such as 1e-323 and 1e+200, which looks like something uninitialized. If I comment out the slowproc line, the code runs until the XIOS call for npp later on. This makes sense: npp is calculated from maint_resp in a previous line, so bad values in maint_resp give bad values in npp.
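A value like 1e+200 would explain the "Numeric conversion not representable" message if the output variable is stored as a 32-bit float, since NetCDF reports a range error when a double cannot be converted to the target type. The ticket does not state the output precision, so that is an assumption here. The sketch below is a standalone program, not ORCHIDEE code; `fits_real32` is a hypothetical helper, and the sample values are the 1e+200 quoted above, the default_value from the XIOS field attributes, and an ordinary number.

```fortran
program check_real32_range
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none

  ! Largest finite 32-bit float, held as a double for comparison.
  real(real64), parameter :: max32 = real(huge(1.0_real32), real64)
  real(real64) :: samples(3)
  integer :: i

  ! 1e+200 is the suspicious magnitude from the ticket; 9.96921e+36 is the
  ! default_value shown in the XIOS field attributes; 2.5 is an ordinary value.
  samples = [1.0e+200_real64, 9.96921e+36_real64, 2.5_real64]

  do i = 1, size(samples)
     if (fits_real32(samples(i))) then
        write(*,*) samples(i), " fits in a 32-bit float"
     else
        write(*,*) samples(i), " exceeds the 32-bit float range"
     end if
  end do

contains

  ! Hypothetical helper: true when x can be converted to real32 without
  ! overflowing. A NaN compares false and is therefore reported as not fitting.
  pure logical function fits_real32(x)
    real(real64), intent(in) :: x
    fits_real32 = abs(x) <= max32
  end function fits_real32

end program check_real32_range
```

Only the first sample is out of range. A tiny value such as 1e-323 would pass this check, because it would simply round towards zero on conversion rather than overflow.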

Change History (3)

comment:1 Changed 2 years ago by mmcgrath

  • Milestone set to ORCHIDEE 4.1

comment:2 Changed 2 years ago by mmcgrath

There were a few arrays, including resp_maint, that are passed from stomate_main to slowproc and then passed to XIOS. However, they are not initialized in stomate_main except at the stomate timestep. So XIOS is accumulating 47 values during the day that are uninitialized.
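The effect of initializing the array can be sketched with a toy program. This is not the content of r6985. The grid size, the value 1.5 and the choice of zero as the initial value are assumptions for illustration, and the 48 steps per day are inferred from the 47 uninitialized values mentioned above. The running sum stands in for the averaging that XIOS performs on the sent field.

```fortran
program init_before_accumulate
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none

  integer, parameter :: npts = 2, nvm = 3   ! hypothetical tiny grid
  integer, parameter :: nsteps = 48         ! assumed number of steps per day
  real(real64) :: resp_maint(npts, nvm)
  real(real64) :: daily_sum(npts, nvm)
  integer :: istep

  ! Give the array a defined value before it is sent for the first time.
  ! Without this line, the first 47 additions below would read garbage.
  resp_maint(:,:) = 0.0_real64
  daily_sum(:,:)  = 0.0_real64

  do istep = 1, nsteps
     ! Stand-in for the stomate timestep, the only step that computes the array.
     if (istep == nsteps) resp_maint(:,:) = 1.5_real64

     ! Stand-in for the send to XIOS: the output layer sees the array at every step.
     daily_sum = daily_sum + resp_maint
  end do

  write(*,*) "daily mean:", daily_sum(1,1) / real(nsteps, real64)

  ! Self-check: 47 zeros plus one computed value must sum to exactly 1.5.
  if (any(daily_sum /= 1.5_real64)) error stop "unexpected accumulation"
end program init_before_accumulate
```

Whether zero, the previous day's value, or a missing-value marker is the right initial value depends on how the averaged output is meant to be interpreted; the sketch only shows that the array must hold defined values at every step at which it is sent.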

I changed this in r6985.

Last edited 2 years ago by mmcgrath

comment:3 Changed 2 years ago by mmcgrath

  • Resolution set to fixed
  • Status changed from new to closed