wiki:Documentation/UserGuide/ProfileValgrind

Version 1 (modified by mmcgrath, 7 years ago)


Once you have a working model, it can be helpful to profile it to see where it spends most of its time. It doesn't make sense to optimize parts of the code that account for 0.001% of the runtime while leaving unoptimized something that accounts for 10% of it. Here is an introduction to profiling with valgrind, which is installed on obelix.

The first step is to compile with the proper options. It's best to profile a production executable, although some debug flags will automatically disable optimization. Still, I compile with the following in modeles/ORCHIDEE/arch.fcm:

%PROD_FFLAGS         -O3 -p -g -fno-inline-functions

I also make sure I have the following line in util/AA_make.gdef

#-Q- lxiv8    F_O = -p -g -fno-inline-functions -O3 $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR) -fp-model precise

since it appears that IOIPSL uses this line to create the makefiles. Then run ins_make with the -prod option in config/ORCHIDEE_OL/AA_make. Note the -fno-inline-functions flag. Inlining functions makes your code faster, but you also lose line information. Since we want to see which lines take the most time, it's good to disable inlining.
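Before starting a long profiling run, it can be worth confirming that the flags really are present in the build files. The following is only a minimal sketch: the function name check_profile_flags is hypothetical, it assumes a grep that supports -w (GNU and BSD grep do), and it only checks that each flag is mentioned somewhere in the file, not that the compiler actually received it.

```shell
#!/bin/sh
# Hypothetical helper: report whether a build file mentions each flag needed
# for line-level profiling. Returns non-zero if any flag is missing.
check_profile_flags() {
    file=$1
    status=0
    for flag in -g -fno-inline-functions; do
        # -q: quiet, -F: fixed string, -w: whole word,
        # -e: lets the pattern start with a dash
        if grep -q -F -w -e "$flag" "$file"; then
            echo "ok: $flag found in $file"
        else
            echo "MISSING: $flag in $file"
            status=1
        fi
    done
    return $status
}

# Usage (not run here), with the files named above:
#   check_profile_flags modeles/ORCHIDEE/arch.fcm
#   check_profile_flags util/AA_make.gdef
```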

Once you have done this, you need to run ORCHIDEE without using libIGCM. You can find that information elsewhere on the wiki. I create a run.def for a single pixel but multiple years. The beginning of a run carries a lot of overhead from other tasks, so we want as much of the computation as possible to happen in sechiba and stomate and not so much in interpolating maps (unless that is what you are trying to profile, of course).

This method is expensive. Very expensive. A good rule of thumb seems to be to multiply the expected optimized runtime by a factor of 100. So if it takes 30 seconds to run your run.def normally, it will take about 50 minutes, or roughly an hour, with valgrind. This is because valgrind collects a lot of information instead of just sampling the program like many profilers do. I use a submission file for obelix like the following:

######################
## OBELIX      LSCE ##
######################
#PBS -N AGEC
#PBS -m a
#PBS -j oe
#PBS -q long
#PBS -o out_execution
#PBS -S /bin/ksh
#PBS -v BATCH_NUM_PROC_TOT=1
#PBS -l nodes=1:ppn=1

cd /home/orchidee03/mmcgrath/PROFILE/AGEC
rm -fr *rest_out.nc out_execution* out_orchidee callgrind.out*
valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./orchidee_ol

This creates a file called callgrind.out.???? (the suffix is the process ID of the run), which I open in kcachegrind (on Linux) to examine. Instructions for using kcachegrind are available elsewhere on the Web.
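If no graphical session is available, the callgrind_annotate script that ships with valgrind prints a text summary of the same file. Since the process ID in the file name changes with every run, a small helper to pick the most recent output file can save some typing. This is a sketch only; the function name newest_callgrind_out is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified callgrind output file
# in the given directory (default: current directory).
newest_callgrind_out() {
    dir=${1:-.}
    # ls -t sorts by modification time, newest first. Errors are silenced
    # so that a directory without any output file simply prints nothing.
    ls -t "$dir"/callgrind.out.* 2>/dev/null | head -n 1
}

# Usage (not run here): text summary of the latest profile.
#   callgrind_annotate "$(newest_callgrind_out)"
```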

One word of caution: according to an expert at Intel, valgrind was not originally developed for FORTRAN (it seems that very few debuggers/profilers are). The information is therefore not 100% reliable, although it still seems to be useful.
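When choosing a queue for the submission file, the factor-of-100 rule of thumb mentioned earlier can be turned into a quick estimate. The sketch below assumes that factor as a default; it is a rule of thumb from this page, not a measurement, and the function name estimate_valgrind_minutes is hypothetical.

```shell
#!/bin/sh
# Rough valgrind runtime estimate: normal runtime in seconds multiplied by a
# slowdown factor (default 100, a rule of thumb rather than a measurement).
estimate_valgrind_minutes() {
    normal_seconds=$1
    factor=${2:-100}
    # Integer arithmetic, rounded up to the next whole minute.
    echo $(( (normal_seconds * factor + 59) / 60 ))
}

estimate_valgrind_minutes 30    # 30 s * 100 = 3000 s, prints 50
```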