Opened 6 years ago

Last modified 6 years ago

#136 new defect

Intel compiler optimization causing floating invalid in average.cpp function

Reported by: mcastril
Owned by: ymipsl
Priority: major
Component: XIOS
Version: 2.0
Keywords: floating invalid Intel average
Cc:

Description

Hi,

Sorry, I don't know if this bug has been previously reported. I tried some simple searches and I didn't find anything related.

After updating XIOS2 in EC-Earth to r1475, we are unable to complete coupled runs because of a floating invalid error in average.cpp.

The problem has been reported on both CCA and MN4 when compiling with -O3. It seems to disappear with -O1.

What is strange to me is that this function does not seem to have changed much in recent months, so I wonder whether this could be a side effect of some other change.

The problem seems to be that the optimization skips the evaluation of a conditional in CAverage::final (see screenshot).
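For readers without the screenshot, the pattern under discussion can be sketched roughly as below. This is not the XIOS source: the function names, the use of std::vector, and the assumption that the loop divides an accumulated sum by a per-cell call count are all hypothetical. The second variant is one possible way to make sure x / 0 is never evaluated if an optimizer were to compute the division for every element and select the result afterwards; that behaviour is an assumption here, not something confirmed in this ticket.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the averaging step; NOT the XIOS source.
// out[i] holds an accumulated sum, nc[i] the number of contributions.
void finalize_average_guarded(std::vector<double>& out,
                              const std::vector<double>& nc)
{
    assert(out.size() == nc.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        if (nc[i] != 0.0)          // skip cells that received no data
            out[i] /= nc[i];
}

// Hypothetical variant: the divisor is never zero, so no 0/0 is
// evaluated even if the division is executed for every element.
void finalize_average_safe_divisor(std::vector<double>& out,
                                   const std::vector<double>& nc)
{
    assert(out.size() == nc.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        // Dividing by 1.0 leaves out[i] unchanged for empty cells.
        const double divisor = (nc[i] != 0.0) ? nc[i] : 1.0;
        out[i] /= divisor;
    }
}
```

Both functions give the same result for valid input. Whether the second form changes anything for the Intel -O3 build discussed here has not been tested.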

Attachments (1)

XIOS_average_cpp.png (217.3 KB) - added by mcastril 6 years ago.

Change History (4)

Changed 6 years ago by mcastril

comment:1 Changed 6 years ago by ymipsl

Hi Miguel,

What is the value of nc?

In an optimized build it can be difficult to inspect data values in the debugger.
I think there is an invalid value in the averaging buffer and the crash occurs when the division by nc is performed.
Could you check the values of the array *out, especially at the indices where the crash occurs?

One possibility is that in the optimized build (-O3) an uninitialized value coming from the model takes a different value than in the non-optimized build. The first thing to do is to check the values of the incoming data. A simple way to do that is to compute a checksum of the offending field in the model and print it out.
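A minimal sketch of such a check is given here in C++ to match average.cpp; in the model itself the equivalent would be written in Fortran. The struct and function names are hypothetical and not part of XIOS or NEMO. Besides the sum, it counts zeroes and non-finite values, since both matter for this kind of crash.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Hypothetical summary of one field; names are illustrative only.
struct FieldSummary {
    double sum = 0.0;            // sum of all finite values
    std::size_t zeros = 0;       // number of exact zeroes
    std::size_t non_finite = 0;  // number of NaN or +/-Inf values
};

// Scan n values and collect a simple checksum plus diagnostics.
FieldSummary summarize_field(const double* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    FieldSummary s;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (!std::isfinite(data[i])) { ++s.non_finite; continue; }
        if (data[i] == 0.0) ++s.zeros;
        s.sum += data[i];
    }
    return s;
}

// Print the summary so that runs with different flags can be compared.
void print_summary(const char* name, const FieldSummary& s)
{
    std::printf("%s: sum=%.17g zeros=%zu non_finite=%zu\n",
                name, s.sum, s.zeros, s.non_finite);
}
```

Comparing the printed line between the -O1 and -O3 builds for the same time step would show whether the incoming data already differ before they reach the averaging step.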

On our side, we use XIOS compiled with -O3 for CMIP6 production, writing thousands of variables per simulation, and we do not see this problem.

Are you using the DEV_CMIP6 branch or trunk?

Regards,

Yann

comment:2 Changed 6 years ago by mcastril

Hi Yann,

Thank you for your prompt response. We are using the trunk version, as far as I know.

The value of *nc can be seen in the screenshot from ARM DDT; it is 0 at that point. I put breakpoints on lines 73 and 77, and execution stopped on line 79. That strengthens our assumption that the conditional is not being evaluated because of the optimization.

The value of nc was also examined with gdb (ARM DDT also uses gdb) on CCA (ECMWF) and was also 0. We did not check the *out array because the crash happens exactly when the division by 0 is performed (but we can do so if you think it would be useful). I also checked the values of the incoming array: there were many zeroes (because the call is made from lim_rhg), but I did not see any NaN.
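A count of 0 combined with many zeroes in the data would be consistent with a "floating invalid" rather than a "divide by zero": under IEEE 754, 0.0 / 0.0 raises the invalid flag, while a non-zero value divided by 0.0 raises the divide-by-zero flag. The sketch below illustrates this with the standard <cfenv> interface. The helper name is hypothetical, the volatile variables are only there to keep the compiler from folding the division at compile time, and strictly conforming code may also need `#pragma STDC FENV_ACCESS ON` where the compiler supports it.

```cpp
#include <cfenv>

// Hypothetical helper: returns which of FE_INVALID / FE_DIVBYZERO
// were raised by computing num / den.
int flags_raised_by_division(double num, double den)
{
    volatile double n = num;   // volatile prevents constant folding
    volatile double d = den;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double q = n / d; // the division under test
    (void)q;
    return std::fetestexcept(FE_INVALID | FE_DIVBYZERO);
}
```

Here the flags are only recorded. When trapping is enabled for the invalid exception, the same 0.0 / 0.0 aborts the program instead.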

We had problems in the past with uninitialised variables in NEMO LIM, and for that reason we initialise all variables to zero in the ice.F90 routine, so I think the problem is more likely due to the condition being skipped than to uninitialised variables.

In ARM DDT it is possible to view the assembly code of the routine, so another option is to analyse how the compiler has compiled it.

comment:3 Changed 6 years ago by mcastril

Hi,

This is to clarify that this issue is related to XIOS2.0 (trunk version r1475), which is the one used in EC-Earth trunk.

The problem has been reproduced on other clusters, always or almost always with Intel compilers.

Some users have reported that they could work around the problem by changing the netCDF version. Others have removed the -fpe0 flag from NEMO because, even with optimization disabled, they could not avoid the issue by other means.
