
Version 7 (modified by mmcgrath, 5 years ago)

In response to ticket #149

How to prepare for debugging

You are running ORCHIDEE, just like every other day, when it stops for no apparent reason. You don't have output files from the simulation, and the run.card lists "Fatal". What can you do?

In order to pin down exactly what the problem is, you can recompile ORCHIDEE with debug flags. These flags enable extra run-time checks that detect unwanted behavior, such as accessing an array out of bounds, and report where it happens.

Why aren't these checks enabled by default? Speed. If the compiler can make certain assumptions, the code runs much faster. For example, take a one-dimensional array with ten elements, ARRAY(1:10). Asking "are we accessing this array correctly?" is easy (check that the element number we are trying to access is between 1 and 10), but it costs a test on every access. The Fortran standard does require every subscript to be within bounds (see Section 6.5.3 of the Fortran 2008 standard: "The value of a subscript in an array element shall be within the bounds for its dimension."), but this is a rule the program must follow, not a check the compiler has to perform. Without bounds checking, if you have an array A(1:10,1:5) and you access A(11,1), the program will usually not crash at all: Fortran stores arrays in column-major order, so A(11,1) has the same memory offset as A(1,2), which is still inside the memory allocated for the array, and the code silently reads or writes the wrong element. Catching errors like this requires extra checks, and these take time, in particular if you want to know exactly which line number is causing the crash.
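The memory arithmetic behind the A(11,1) example can be written out explicitly. Fortran stores arrays in column-major order, so for A(1:10,1:5), counting offsets in elements from A(1,1):

```latex
\begin{aligned}
\operatorname{offset}\big(A(i,j)\big) &= (i-1) + 10\,(j-1) \\
\operatorname{offset}\big(A(11,1)\big) &= 10 + 10 \cdot 0 = 10 \\
\operatorname{offset}\big(A(1,2)\big)  &= 0 + 10 \cdot 1 = 10
\end{aligned}
```

Both subscripts land on the same element. The largest valid offset is offset(A(10,5)) = 9 + 40 = 49, so offset 10 is well inside the 50 elements allocated. Nothing at the memory level signals an error, and only a bounds check inserted by the compiler can catch it.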

To turn on all these checks, we change the compiler flags. Adding these checks can make your code run 10 times slower, so after turning on these flags, the first step is often to find the conditions that reproduce your crash in the shortest time possible (reducing the number of processors, reducing the spatial domain or using restart files to start the simulation the day before the crash).

With the new FCM, adding debug flags is fairly straightforward.

cd modipsl/config/ORCHIDEE_OL
vi AA_make                       # change -prod into -debug everywhere you see it (three locations?)
../../util/ins_make
gmake clean && gmake             # "gmake clean" is only needed if you previously compiled with -prod
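If you prefer not to edit AA_make by hand, the substitution can be scripted. This is a minimal sketch that assumes every `-prod` in AA_make is a build-mode option that should become `-debug`. The function name `switch_to_debug` is hypothetical.

```shell
# Sketch (hypothetical helper): switch ORCHIDEE_OL's AA_make from
# production to debug mode. Assumes every "-prod" in the file is a
# build-mode option that should become "-debug".
switch_to_debug() {
    aa_make="$1"
    # -i.bak edits the file in place and keeps the original as AA_make.bak
    sed -i.bak 's/-prod/-debug/g' "$aa_make" || return 1
    # Show the changed lines; diff exits 1 when the files differ, so ignore that
    diff "$aa_make.bak" "$aa_make" || true
}
```

From modipsl/config/ORCHIDEE_OL, run `switch_to_debug AA_make`, check the printed differences, and then continue with `../../util/ins_make` and `gmake clean && gmake`. Copying AA_make.bak back over AA_make undoes the change.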

While the lines are scrolling by, you can look for strings such as "-DBZ_DEBUG -g -fno-inline" and "-fpe0 -O0 -g -traceback -fp-stack-check -ftrapuv -check bounds -check all -check noarg_temp_created" to confirm that the debug flags are being used.
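Rather than watching the lines scroll by, you can save the build output and search it. This is a sketch that assumes the Intel compiler flags quoted above. `check_debug_flags` is a hypothetical helper, and you should adapt the flag list for other compilers.

```shell
# Sketch (hypothetical helper): report which debug flags appear in a
# saved build log. The list mirrors the Intel flags quoted on this page.
check_debug_flags() {
    log="$1"
    missing=0
    for flag in -g -O0 -traceback -fpe0 "-check bounds"; do
        # "--" stops grep from reading the flag as one of its own options
        if grep -q -- "$flag" "$log"; then
            echo "found:   $flag"
        else
            echo "MISSING: $flag"
            missing=1
        fi
    done
    return "$missing"   # non-zero if any flag was not found
}
```

For example, build with `gmake clean && gmake 2>&1 | tee build.log` and then run `check_debug_flags build.log`. With the pipe to tee, the shell reports tee's exit status rather than gmake's, so also check the end of build.log for compilation errors.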

Note that this will NOT compile IOIPSL in debug mode. Sometimes it is useful to have IOIPSL in debug mode as well. You can do this by modifying lines in modipsl/util/AA_make.gdef.

For example, the following lines are for the obelix machine at LSCE, using the Intel Fortran compiler. The first line (with a single #) is the one used, while the second line (with multiple #s) is ignored.

#-Q- lxiv8    F_O = -DCPP_PARA -O3 $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR) -fp-model precise
#####-Q- lxiv8    F_O = -DCPP_PARA -p -g -traceback -fp-stack-check -ftrapuv -check bounds $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR)

The debug flags are in the second line. To use debug flags with IOIPSL on obelix, comment out the first line (by adding ####) and uncomment the second (by removing #s until only one remains at the beginning of the line), so that the two lines read:

#####-Q- lxiv8    F_O = -DCPP_PARA -O3 $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR) -fp-model precise
#-Q- lxiv8    F_O = -DCPP_PARA -p -g -traceback -fp-stack-check -ftrapuv -check bounds $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR)
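You can make the same edit with sed. This is a minimal sketch for the two obelix lines quoted above. It assumes they appear in modipsl/util/AA_make.gdef exactly as shown. Other machines use a different target name than lxiv8. `gdef_lxiv8_debug` is a hypothetical name.

```shell
# Sketch (hypothetical helper): swap the lxiv8 IOIPSL flags in
# AA_make.gdef from production (-O3) to debug (-p -g ...).
# Assumes the two F_O lines look exactly like the ones on this page.
gdef_lxiv8_debug() {
    gdef="$1"
    # 1st expression: comment out the active -O3 line (#  -> #####)
    # 2nd expression: activate the debug line        (##### -> #)
    # Each pattern only matches its own line, so they cannot undo each other.
    sed -i.bak \
        -e 's/^#-Q- lxiv8\( *F_O = -DCPP_PARA -O3\)/#####-Q- lxiv8\1/' \
        -e 's/^#####-Q- lxiv8\( *F_O = -DCPP_PARA -p -g\)/#-Q- lxiv8\1/' \
        "$gdef"
}
```

Run `gdef_lxiv8_debug AA_make.gdef` in modipsl/util, check the two lines, then re-run ins_make and recompile from scratch. To switch back, restore the AA_make.gdef.bak backup written by sed. A second run of the function would overwrite that backup.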

If you find unwanted NaN values in the code, it is recommended to compile ORCHIDEE using the flag below. It triggers an exception and stops execution instead of continuing with NaN values. NaN means "Not a Number" and results from invalid operations such as 0/0 or taking the square root of a negative number. Dividing a non-zero number by zero gives Infinity rather than NaN.

Note: if you modify AA_make.gdef, you have to run ins_make again so that the changes propagate to all of ORCHIDEE's folders. Then you must recompile from scratch (using "gmake clean && gmake", as above).

With the steps on this page, you should be able to compile XIOS, IOIPSL, and ORCHIDEE all with the same debug flags. I once found this necessary to catch a memory error. The error was in ORCHIDEE, but it showed up as a crash in XIOS until I compiled everything with full debug flags. Then the error message pointed straight to the line number in ORCHIDEE that was writing out of bounds.

For XIOS:

cd ../modeles/XIOS
vi ../arch/arch-ifort_LSCE.fcm   # make sure the debug lines read:
%DEBUG_CFLAGS   -DBZ_DEBUG -g -fno-inline -ggdb --debug
%DEBUG_FFLAGS   -g -ggdb -debug all -traceback
./make_xios --arch ifort_LSCE --debug

Other useful tips for debugging

You can set l_dbg = .TRUE. in errioipsl.f90 (part of IOIPSL) to print more information about the .nc files being read.
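This is a sketch of the l_dbg change. It assumes the flag is initialised on a single line of errioipsl.f90, in a form such as `LOGICAL,SAVE :: l_dbg = .FALSE.`. The exact declaration is an assumption, so check the file first. `ioipsl_enable_dbg` is a hypothetical name.

```shell
# Sketch (hypothetical helper): set l_dbg to .TRUE. in errioipsl.f90.
# Assumes a line like "LOGICAL,SAVE :: l_dbg = .FALSE." (spacing and the
# case of FALSE may vary; the name l_dbg is matched in lower case only).
ioipsl_enable_dbg() {
    src="$1"
    # Keep "l_dbg =" (group 1) and replace the .FALSE. value after it
    sed -i.bak 's/\(l_dbg *= *\)\.[Ff][Aa][Ll][Ss][Ee]\./\1.TRUE./' "$src" || return 1
    # Print the resulting assignment; grep fails if l_dbg was not found
    grep -n 'l_dbg *=' "$src"
}
```

Pass the path to errioipsl.f90 in your IOIPSL source tree. Then recompile IOIPSL and relink ORCHIDEE so the change takes effect.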

You can also set the following variables in iodef.xml to make XIOS print more information.

  <variable id="info_level"                type="int">100</variable>
  <variable id="print_file" type="bool">true</variable>

When you run libIGCM/ins_job to set up a job with libIGCM, it creates the Job_ file. This file has a Verbosity setting. If you run into problems that seem to be related to libIGCM, make sure it is set as high as possible (currently 3 is the maximum value).
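This is a sketch for setting the level from the command line. It assumes the Job_ file sets the level with a shell assignment of the form `Verbosity=N` at the start of a line. That form is an assumption, so look at your Job_ file first. `set_job_verbosity` is a hypothetical name, and `Job_MYEXP` below is a placeholder for your own job file.

```shell
# Sketch (hypothetical helper): set the libIGCM Verbosity level in a Job_ file.
# Assumes a line starting with "Verbosity=" (an assumption; check your file).
set_job_verbosity() {
    job="$1"
    level="${2:-3}"   # default to 3, the highest level mentioned on this page
    sed -i.bak "s/^Verbosity=.*/Verbosity=${level}/" "$job" || return 1
    # Show the result; grep fails if no Verbosity line exists
    grep -n '^Verbosity=' "$job"
}
```

For example, run `set_job_verbosity Job_MYEXP` before submitting the job.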

The printlev flags can be very useful for finding out which routine the code is crashing in. If you can't get a line number any other way, turn printlev up as high as possible and note the last line printed before the crash. Then look in the source for the next line that should have been printed. The crash must be happening somewhere between those two lines.

And if you are up against a bug that seems to change every single time you run the code, even with all the above flags on, you might want to check out Valgrind.
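This is a minimal sketch of running a reduced test case under Valgrind's default tool, memcheck. Memcheck reports problems such as invalid memory reads and writes and the use of uninitialised values. `--track-origins=yes` also reports where uninitialised values came from, and `--log-file=valgrind.%p.log` writes one report per process ID. The function name and the `DRY_RUN` switch are hypothetical conveniences. Valgrind slows the code down considerably, so start from the shortest case that reproduces the crash. Keep `-g` in the flags so the reports include line numbers.

```shell
# Sketch (hypothetical helper): run an executable under Valgrind memcheck.
# Set DRY_RUN=1 to print the command instead of running it.
run_with_valgrind() {
    exe="$1"
    # Build the full command line in the function's positional parameters
    set -- valgrind --tool=memcheck --track-origins=yes \
        --log-file=valgrind.%p.log "$exe"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

For example, run `run_with_valgrind /path/to/your/orchidee/executable` and then read the valgrind.*.log files it leaves in the run directory.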