wiki:Documentation/UserGuide/valgrind

Version 3 (modified by luyssaert, 4 years ago) (diff)

--

Debugging with Valgrind

Objective

Background of this item: This is a programmer's worst nightmare: your code is crashing, but the crash is not reproducible. Even with an executable that was compiled with full debug flags, checking array bounds, and checking uninitialized values. You are seeing errors on lines which cannot possibly contain errors, and when you add a WRITE statement, the error moves to a different line. Sound familiar? You probably have a memory bug. One common way these occur is when your program attempts to write to memory outside of the block of memory allocated to a particular variable. For example, my_vector(1:10), and your code attempts to set the value of my_vector(11). Compiler flags catch a lot of these problems (e.g., check-bounds=all). But compilers are just tools, written by people, and they may be buggy themselves. They may catch 99.9% of bugs, but leave you hanging on the last 0.1%. Or, the way the code is structured, the compiler may not be able to tell that memory outside of allocated memory is being used. Valgrind is a powerful tool for cases like this. And, thankfully for us, it is currently installed on Obelix.

Valgrind on Obelix

Authors: M. McGrath
Last revision: M. McGrath (2019/03/08)

Valgrind works by tracking every single piece of memory reference in the code. As such, it knows if you try to write to (or read from) a bad piece of memory, e.g. something which has already been deallocated or which was never allocated in the first place. The downside is that this is really expensive. Tracking every piece of memory requires a lot of computational power and memory. Expect a code run with Valgrind to be at least 10 times slower than a normal debug code (which is already 10 times slower than optimized code). Just getting to the first timestep of the model takes about 15 minutes, due to loading in maps (from what I can tell). I ran a 5x6 grid (30 pixels over Europe) for one month, and the run took 3 hours. I also do not believe it can be run in parallel.

On the other hand, the information you can get is worth its weight in gold. Here is an example.

valgrind --leak-check=full --expensive-definedness-checks=yes --track-origins=yes -v ./orchidee_ol

Sometimes the run crashes with a regular stack trace, which you can treat like a normal debugging run (even if the error did not happen in a normal run). This is pretty straightforward if you are used to debugging. In this way, Valgrind acts like a compiler on steroids.

Additionally, Valgrind will print out information at the end of the run (if it crashes, or if it completes). The lines look something like this.

==20278== Invalid write of size 8
==20278==    at 0x1A0BDCD: stomate_prescribe_mp_make_saplings_ (stomate_prescribe.f90:1406)
==20278==    by 0x198C4F0: stomate_prescribe_mp_prescribe_ (stomate_prescribe.f90:728)
==20278==    by 0x16196EE: stomate_lpj_mp_stomate_lpj_vegetation_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0xF384A2: stomate_mp_stomate_main_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0xCA35B9: slowproc_mp_slowproc_main_ (slowproc.f90:957)
==20278==    by 0xA480F9: sechiba_mp_sechiba_main_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x5F9E19: intersurf_mp_intersurf_main_2d_ (intersurf.f90:580)
==20278==    by 0x51D0C9: MAIN__ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x40B96B: main (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==  Address 0x23b2bab8 is 8 bytes inside a block of size 760 free'd
==20278==    at 0x19D43CBD: free (vg_replace_malloc.c:530)
==20278==    by 0x476A6CA: for__free_vm (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x474FDBC: for__release_lun (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x476DD6D: for_write_int_fmt_xmit (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x476BC9B: for_write_int_fmt (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x4489EF6: getincom_mp_getin_dump_ (getincom.f90:1974)
==20278==    by 0x5E92FC: intersurf_mp_intersurf_initialize_2d_ (intersurf.f90:409)
==20278==    by 0x4D5D1A: MAIN__ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x40B96B: main (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==  Block was alloc'd at
==20278==    at 0x19D42BC3: malloc (vg_replace_malloc.c:299)
==20278==    by 0x476A4D2: for__get_vm (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x475089B: for__acquire_lun (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x476B2FD: for_write_int_fmt (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x4489EF6: getincom_mp_getin_dump_ (getincom.f90:1974)
==20278==    by 0x5E92FC: intersurf_mp_intersurf_initialize_2d_ (intersurf.f90:409)
==20278==    by 0x4D5D1A: MAIN__ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)
==20278==    by 0x40B96B: main (in /home/scratch01/mmcgrath/DEBUG/MEMBUG_SMALL/orchidee_ol)

NOTE: Valgrind writes to standard output. It helps to change numout in ORCHIDEE to be equal to 6 so that all of the out_orchidee output and the Valgrid output line up, so you can see where the problem statements are (Valgrid writes error information on a line before it executes).

"Invalid write" is one of the errors that Valgrind reports. "Invalid read" and "Conditional jump or move depends on uninitialised value(s)" are two others which I have found useful. "Invalid write" and "invalid read" are given when the program tries to write to (or read from) memory that it should not be accessing. The "conditional jump" is something like an "IF(blah)" where the variable "blah" has not yet been set to "TRUE" or "FALSE" (by default, it will have one value or the other, but Valgrind warns you that you haven't set this yourself, so this behavior may not be what you expect, as the default value may change between runs).

In the case above, Valgrind tells us that a piece of memory is being written to on line 1406 of stomate_prescribe.f90. This particular piece of memory has previously been allocated (line 1974 of getincom.f90) and then deallocated (by the same line...it was a temporary variable used by a built-in function). The line of code in question is:

      bm_sapl_2D_loc(icarbres,icarbon)=zero

By checking the declared array dimensions, nothing looked out of bounds with this line (icarbres=7, icarbon=1, array declared from (1:9,1:2)). Even using the FORTRAN functions UBOUND and LBOUND showed the array was not out of bounds. I added a few more statements to see if the error was real (e.g., putting lines of "bm_sapl_2D_loc(icarbres,icarbon)=1.0" and "WRITE(numout,*) bm_sapl_2D_loc(icarbres,icarbon)" before this line), and ran tests with different numbers of pixels. This line was consistently flagged as being a problem regardless of number of pixels, as were other lines that I added beforehand which accessed the same variable. Interestingly, accessing other elements of this array (e.g., bm_sapl_2D_loc(ileaf,icarbon)) did not cause error statements.

With this, one can use write statements as a tracker. By writing out the values of ipts, ivm, and icir, I was able to see that bm_sapl_2D(4,2,1,icarbres,icarbon) used "bad" memory (i.e., Valgrind threw an error), whereas bm_sapl_2D(4,2,1,1,1) used "good" memory. Therefore, I sprinkled lines like the following through the code:

  WRITE(numout,*) 'RIEUORE  GOOD 1 ',&
               bm_sapl_2D(test_grid,test_pft,1,1,1)
  WRITE(numout,*) 'RIEUORE  BAD  1 ',&
               bm_sapl_2D(test_grid,test_pft,1,icarbres,icarbon)

which produced output (after setting numout=6) of

 RIEUORE  GOOD 1   0.000000000000000E+000
==11219== Invalid read of size 8
==11219==    at 0x198C5D5: stomate_prescribe_mp_prescribe_ (stomate_prescribe.f90:732)
==11219==    by 0x16196F6: stomate_lpj_mp_stomate_lpj_vegetation_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG/orchidee_ol)
==11219==    by 0xF384AA: stomate_mp_stomate_main_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG/orchidee_ol)
==11219==    by 0xCA35C1: slowproc_mp_slowproc_main_ (slowproc.f90:957)
==11219==    by 0xA48101: sechiba_mp_sechiba_main_ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG/orchidee_ol)
==11219==    by 0x5F9E19: intersurf_mp_intersurf_main_2d_ (intersurf.f90:580)
==11219==    by 0x51D0C9: MAIN__ (in /home/scratch01/mmcgrath/DEBUG/MEMBUG/orchidee_ol)
==11219==    by 0x40B96B: main (in /home/scratch01/mmcgrath/DEBUG/MEMBUG/orchidee_ol)
==11219==  Address 0x24275688 is 24 bytes after a block of size 256 in arena "client"
==11219==
 RIEUORE  BAD  1   1.581010066691989E-321

By putting these at various points throughout the code (and changing the number and string of letters in the write string to make each place uniquely identifiable), I figured I could see when the memory first became corrupted. For example, putting one set of write statements before a subroutine call and another after a subroutine call. If the first set of statements produced no Valgrind error and the second set did, the memory must have been corrupted at the subroutine call in-between.

In summary, Valgrind can be a powerful tool for finding memory errors in ORCHIDEE, though the expensive nature of the tracking it does prevents it from being used on long runs, global runs, and runs in parallel (to my knowledge). Even if your bug doesn't show up in a small test case, it's still useful to try a small test case if you're experiencing strange behavior. This particular bug did not appear in a year-long run on a single pixel, but when run with Valgrind, Valgrind produced an error pointing to the above line on the first day...even though the code did not crash.