Profiling with Valgrind
Objective
Background of this item: Valgrind is an instrumentation framework whose Callgrind tool acts as a profiler. It is available on Obelix and Irene and has been used successfully to analyze single- and multi-processor jobs. Once you have a working model, it can be helpful to profile it to see where it spends most of its time. It doesn't make sense to optimize parts of the code that account for only 0.001% of the runtime while leaving untouched something that accounts for 10% of it. Profiling shows you where your program spent its time and which functions called which other functions while it was executing. This information can show you which pieces of your program are slower than you expected and might be candidates for rewriting. One word of caution: according to an expert at Intel, Valgrind was not originally developed for Fortran (it seems that very few debuggers/profilers are), so the information is not 100% reliable. It still seems to be useful, though.
Valgrind on Irene
Authors: M. McGrath
Last minor revision: S. Luyssaert (2020/03/19)
The first step is to compile with the proper options. It's good to compile a production executable, although some debug flags automatically disable optimization. I compile with the following line in modeles/ORCHIDEE/arch.fcm:
%PROD_FFLAGS -O3 -p -g -fno-inline-functions
I also make sure I have the following line in util/AA_make.gdef:
#-Q- lxiv8 F_O = -p -g -fno-inline-functions -O3 $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR) -fp-model precise
This line matters because IOIPSL appears to use it to create the makefiles. Then run ins_make with the -prod option in config/ORCHIDEE_OL/AA_make. Note the -fno-inline-functions flag: inlining functions makes your code faster, but you also lose line information. Since we would like to see which lines take the most time, it's good to disable inlining. Note that around revision 6610 the way the model is compiled changed, so the instructions above are no longer valid when compiling with the ./compile_orchidee_ol.sh script. The remainder of this item should still be valid.
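Before launching a long profiling run, it can be worth confirming that the flags really made it into the build configuration. The following is a minimal sketch, assuming a POSIX-compatible shell; check_profile_flags is a hypothetical helper written for this page, not part of ORCHIDEE or IOIPSL, and the flag list is simply the one quoted above.

```shell
# Hypothetical helper: report which of the profiling flags used on this page
# are missing from a compiler-flags line passed as the first argument.
check_profile_flags() {
    # Collapse tabs/newlines/multiple blanks to single spaces so that
    # whole-word matching below works.
    flags_line=$(printf '%s' "$1" | tr -s '[:space:]' ' ')
    missing=""
    for flag in -g -p -fno-inline-functions; do
        case " $flags_line " in
            *" $flag "*) ;;                      # flag present as a whole word
            *) missing="$missing $flag" ;;       # remember the missing flag
        esac
    done
    if [ -z "$missing" ]; then
        echo "ok"
    else
        echo "missing:$missing"
    fi
}
```

One possible use is check_profile_flags "$(grep '^%PROD_FFLAGS' modeles/ORCHIDEE/arch.fcm)", which prints ok when all three flags are present on that line.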
Once you have done this, you need to run ORCHIDEE without using libIGCM. You can find that information elsewhere on the wiki. I create a run.def for a single pixel but multiple years. At the beginning of the run, there is a lot of overhead involved in doing other tasks, so we want to add as much computation as possible in sechiba and stomate and not so much in interpolating maps (unless that is what you are trying to profile, of course). I also turn off (or minimize) writing history files, since file I/O also takes a lot of time.
This method is expensive. Very expensive. A good rule of thumb seems to be to multiply the expected optimized runtime by a factor of 100. So if your run.def normally takes 30 seconds to run, expect about 50 minutes (3000 seconds) with valgrind. This is because valgrind records information for everything the program executes instead of just doing random sampling like many other profilers. I use a submission file for Obelix like the following:
######################
## OBELIX      LSCE ##
######################
#PBS -N AGEC
#PBS -m a
#PBS -j oe
#PBS -q long
#PBS -o out_execution
#PBS -S /bin/ksh
#PBS -v BATCH_NUM_PROC_TOT=1
#PBS -l nodes=1:ppn=1
cd /home/orchidee03/mmcgrath/PROFILE/AGEC
rm -fr *rest_out.nc out_execution* out_orchidee callgrind.out*
valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./orchidee_ol
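The factor-of-100 rule of thumb can be turned into a quick estimate when choosing a queue. This is only a sketch: the function names are hypothetical, the default factor of 100 is the rough figure quoted above, and the real slowdown will vary from run to run.

```shell
# Hypothetical helper: estimate wall-clock seconds under valgrind.
# $1 = normal runtime in seconds
# $2 = slowdown factor (default 100, the rule of thumb from this page)
estimate_valgrind_seconds() {
    normal_seconds=$1
    slowdown=${2:-100}
    echo $(( normal_seconds * slowdown ))
}

# Same estimate expressed in whole minutes, rounded up.
estimate_valgrind_minutes() {
    seconds=$(estimate_valgrind_seconds "$1" "${2:-100}")
    echo $(( (seconds + 59) / 60 ))
}
```

For the 30-second example, estimate_valgrind_minutes 30 prints 50. That estimate helps pick a queue whose time limit covers the valgrind job above.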
The valgrind run creates a file called callgrind.out.<pid>, where the suffix is the process ID of the run. I open this file in kcachegrind (on Linux) to examine it. Instructions for kcachegrind are available elsewhere on the Web.
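Where kcachegrind is not at hand, for example on a machine without a display, the callgrind_annotate script that ships with Valgrind can print a text summary instead. The sketch below assumes the output files sit in the directory given as the first argument (default: the current directory); latest_callgrind_out and show_hotspots are hypothetical helper names, and the cut-off of 40 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the most recently modified callgrind output file
# in a directory (default: current directory); prints nothing if none exists.
latest_callgrind_out() {
    dir=${1:-.}
    ls -t "$dir"/callgrind.out.* 2>/dev/null | head -n 1
}

# Hypothetical helper: show the top of the text report for the newest profile.
show_hotspots() {
    out=$(latest_callgrind_out "${1:-.}")
    if [ -z "$out" ]; then
        echo "no callgrind.out.* file found" >&2
        return 1
    fi
    # --inclusive=yes charges each function with the cost of its callees.
    callgrind_annotate --inclusive=yes "$out" | head -n 40
}
```

Calling show_hotspots in the run directory lists the most expensive functions first, which is often enough to decide whether a closer look in kcachegrind is worthwhile.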