C++ Performance Analysis & Profiling Tools

Posted on August 13, 2012



At the moment (10th Aug, 2012) I working at CERN with the computing department of ALICE project. “Root” is a framework developed by CERN to handle and analyze large amounts which are generated from the collides here at CERN. AliRoot (Alice+ROOT) is a framework based on Root which has specialized classes for analysis of ALICE experiment (A Large Ion Collider Experiment). My task here is to make an analysis of performance of AliRoot and to suggest places where optimizations can be made in terms of memory as well as CPU usage. To accomplish this task, it was needed that I profile the code. There are a few excellent tools for profiling, heap analysis, memory utilization analysis, CPU performance analysis etc. In this article I am summarizing the usage of the tools that I had to get accustomed to.

This article shall go on to explain the basic usages and other references related to following :

  1. GDB / DDD (GNU Debugger)
  2. time
  3. valgrind
  4. GNU gprof (GNU Profiler)
  5. gperftools (Google Performance Tools)

If you are looking to get your hands on a fairly long and complex code, you may download my code for connected component labelling which is available here. Please follow the instructions in README file in the tar ball.

GDB / DDD (GNU Debugger)

GDB, the GNU Project debugger, allows you to see what is going on `inside’ another program while it executes. In the sense it shall allow you to pause the code wherever you like and inspect contents of variables. It shall also allow you to step through the code, watch changes to a variable. It is a commandline utility. To enable GDB to inspect code, it is needed that you compile the code with -g option.

$ gcc -g -o testProg testProg.c

A GDB tutorial which teaches you to use GDB commands can be found here. Although there are several others. I have followed that above listed tutorial. Yet another way to debug the program is to use DDD (Data Display Debugger). DDD provides a GUI front-end for GDB. Often when your program grows large GDB can be painful to use (my personal option) with have to remember all the places you put breakpoints etc. However, DDD provides a GUI front-end to GDB, which is much easier to use and visualize your code. DDD is fairly straight forward to install and use. A DDD usage tutorial can be found from here.

DDD Screenshot

Unix `time` Utility

One of the simplest thing that can be done to analyse the running time of the program is to use the Unix command `time`. Assume that your executable is called “a.out” in the current directory run it as

$ time ./a.out
real 0m0.088s
user 0m0.001s
sys 0m0.002s

The Unix `time` command writes a message to standard output giving timing statistics about the program run. The term “real time” in this context refers to elapsed “wall clock” time, like using a stop watch. “user” or “user CPU time” is the time required by the CPU to perform user-mode operations viz add, subtract, loop in array etc etc. “sys” or “system CPU time” is the time required by the CPU in kernel mode, ie. for execution of system calls like fork(), exec() etc. More information of this is available on the Wikipedia page.


Valgrind provides tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache and branch-prediction profiler, and a heap profiler.

If you are totally unfamiliar with valgrind, you can have a look at the Valgrind Quick Start Guide. The quick start guide will give you enough information to get you started. If you are looking for a detailed valgrind tutorial describing each of its tools you should have a look at the Valgrind User Manual. The tools available with valgrind are —

  • Memcheck is a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.
  • Cachegrind is a cache and branch-prediction profiler. It helps you make your programs run faster.
  • Callgrind is a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.
  • Helgrind is a thread error detector. It helps you make your multi-threaded programs more correct.
  • DRD is also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.
  • Massif is a heap profiler. It helps you make your programs use less memory.
  • DHAT is a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.
  • SGcheck is an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can’t, and vice versa..
  • BBV is an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.

Few commands in daily use for me :

$ gcc -g -o myProg myProg.c
# Assume you normally run the program like $ myProg arg1 arg2
$ valgrind --tool=memcheck myProg arg1 arg2
$ valgrind --tool=callgrind myProg arg1 arg2

Please be informed that when running programs under valgrind simulation the program tend to be slowed down by a factor of 10-50. Slow down really depends on how complex is your program, number of variables functions etc. The programs which executed instantly may required a few seconds. Programs which required seconds may consume minutes and programs which required minutes may run into hours. Just to share an experience, one of the AliRoot test program “Gun” requires about 2 minutes normal executed for about 2 days under valgrind simulation.

Tools I have been using here at CERN are (i) memcheck, (ii) cachegrind, (iii) callgrind. Following few sub-sections shall give usage details on these tools.


Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs. Complete manual/tutorial on “memcheck” can be accessed here.

  • Accessing memory you shouldn’t, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
  • Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
  • Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
  • Overlapping src and dst pointers in memcpy and related functions.
  • Memory leaks.
# Basic Usage 
$ valgrind --tool=memcheck myProg arg1 arg2

# Memory Leak Detection. Use, "yes", "full", "summary", "no" 
$ valgrind --leak-check=yes.


Cachegrind simulates how your program interacts with a machine’s cache. In particular it tells you exactly how many times data was retrieved from CPU cache (L1, L2, L3) and how many times there was a miss. It also gives an analysis of CPU branch predictions. This utility is particularly help if in detecting slow downs in execution due to frequent cache misses. Usually a mis percentage of 2% to 5% is acceptable. A complete cachegrind manual can be accessed from here.

It is helpful to know and understand the following terms / shorts when using callgrind. Please also consult the cachegrind manual for more details. It is worth mentioning here that cachegrind simulates two levels of caches only L1 and LL (Last level). The author of this article is very well aware that most CPUs do have 3 levels of caches L1, L2, L3 (LL). But from an analysis point of view only two levels are significant. This is because L1 miss shall cause about 10 cycles. Where as L3 (LL) cache shall cause about 200 cycles and a mis-predicted branch may cause 20-30 cycles.

  • Ir : Instruction cache reads
  • I1mr : L1 cache misses (Instruction Cache)
  • LLmr : Last level (L3) cache misses. (Instruction Cache)
  • Dr : Data cache reads
  • D1mw : 1st level data cache miss
  • DLmw : Last level data cache miss
  • Bc : Conditional branches executed
  • Bcm : Conditional branches mis-predicted
  • Bi : Indirect branches executed
  • Bim : Indirect branches missed

If you are not familiar with terms like cache, L1 cache, branch prediction, Instruction cache, Data cache, Conditional branches, Indirect branches etc please refer to a Computer Organization book like the one by Andrew Tanenbaum. Alternatively there are articles on wikipedia of each of these terms.

Execution of the tool produces a file containing profile data which may not be human-readable. Use tools like “cg_annotate” and/or a GUI tool “KCachegrind” for visualization of profile data obtained from cachegrind.  Here is a summary of the commands that are commonly executed :

$ valgrind --tool=cachegrind ./prog                   
$ valgrind --tool=cachegrind --branch-sim=yes ./prog  # with branch prediction stats

# Upon completion produces a file "cachegrind.out.<pid> (not human readable) also 
# ouputs cumulative stats

# Function-wise cache stats
$ cg_annotate cachegrind.out.15886 
# Source code annotation. Gives cache miss-hit analysis of each line of C-code
$ cg_annotate --auto=yes cachegrind.out.15886 file1.c file2.c > annotation_dump

More details and usage available in cachegrind manual.


It is a CPU profiler. It collects data, which consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls. Optionally, cache simulation and/or branch prediction (similar to Cachegrind) can produce further information about the runtime behavior of an application. The profile data is written out to a file at program termination. For presentation of the data use of tool “callgrind_annotate” is imminent. It may also be of interest to know that a GUI tool “KCachegrind” is also available for visualization of profile data.

Common commands :

$ valgrind --tool=callgrind ./myProg
# produces callgrind.out.<pid>

# Basic CPU profile (function-wise)
$ callgrind_annotate callgrind.out.16010

# Source code annotation
$ callgrind_annotate callgrind.out.16010 file1.c > callgrind.dump

# Visualization with Kcachegrind. I usually use this and it is nicer to see with this one
$ kcachegrind callgrind.out.16010 &

Please be aware that I have not put up basic usage commands for “callgrind_control”, for callgrind for programs with threads. More details of which can be obtained from callgrind manual which is available here.

GNU gprof (GNU Profiler)

This is another tool which allows you to learn where your program spent its time and which functions called which other functions while it was executing. Although its working are essentially overlapping with callgrind it is a fundamental different program. It might be useful to know that unlike callgrind, “gprof” does not do a simulation of the run. An excellent tutorial on gprof written by Richard Stallman himself can be found here.

In order that gprof is usable, it is required that the programs be compiled with “-pg” argument of GCC.

$ gcc -g -pg -o myprog myprog.c 
$ ./myprog arg1 arg2

Upon successful termination of the program a file called “gmon.out” is automatically written in the current directory. It contains the profile data which is not human-readable. It basically contains the “Flat Profile” and the “Call-Graph“. Use the command “gprof” to read the profile data.

$ gprof ./myprog > gprofdump.txt

The call-graph which is produced may be difficult to read and visualize. It is possible to visualize this call-graph. A python script called `pgrof2dot.py` converts the output of gprof into the dot format which can be converted easily to either an image or a pdf. You may need to have the `graphviz` libraries installed for this visualization. More info on pgprof2dot can be found here.

# Installation of graphviz
$ sudo yum install graphviz 
#       or
$ sudo apt-get install graphviz

Make sure you download the python script `gprof2dot.py` and make sure it is searchable in the PATH.

$ gprof ./myProg | gprof2dot.py | dot -Tpng -o out.png

You should be able to get an output similar to one shown in the figure below :

gperftools (Google Performance Tools)

Performance analysis tools by Google which are in use at Google. It includes  TCMalloc, heap-checker, heap-profiler and cpu-profiler. You can find more information of gperftools by visiting its home page. Source code or compiled binaries of gperftools can be downloaded from here. Best way to install these tools is using source code. If you do not know how to compile Linux softwares with source code you may like to have a look at here. Please note that you need to install `libunwind` before installing gperftools, it is a pre-requisite. Find it here.

In order that gperftools analyse your code you need to compile your source with `-ltcmalloc` (for heapprofiler and heapchecker) and `-lprofiler`. A tutorial of these tools can be found here. This para explains the everyday use commands. In use are the following 3 tools

  1. Heap Checker : Just set the environment variable “HEAPCHECK” to point to the executable. Run the executable normally. gperftools shall output after exit of the program. Details.
  2. Heap Profiler : Set environment variable “HEAPPROFILE”. Run executable normally, outcome by gperftools produced at the end of execution. Details.
  3. CPU Profiler : Set environment variable “CPUPROFILE” before you run the program. It is also possible to convert the outcome into callgrind format and to visualize the results with kcachegrind and/or to convert to call-graphs. Details.

Please be aware that “Heap profiler” and “Heap checker” cannot be run at the same time. Google performance tools are significantly faster than valgrind. But beware that google tools may give an approximate answer.

Finally, I would also like to sugest the readers to also read a performance analysis tutorial from Lawrence Livermore National Laboratory which can be found here. This concludes my discussion on the performance analysis and profiling tools. Reviews, comments, suggestions welcome. If you may please also share any experience that you may have in performance analysis and profiling.

Perf – Linux kernel profiling with perf

Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple commandline interface. The perf tool offers a rich set of commands to collect and analyze performance and trace data. With perf it is possible to get readings directly from hardware registers which measure performance. Note that this is only available if you have Linux Kernel 2.6 or up. Refer to Perf Wiki to know detailed usage for perf.

Basic usage for perf is as follows :

$ perf stat -B ./myProg arg1 arg2

Most likely use of ‘perf’ is to have an accurate idea about number of CPU cycles used, cache performance, branch prediction performance, memory access patterns (bus cycles). Following is the measurements list that can be obtained for your program using ‘perf’

cpu-cycles OR cycles                       [Hardware event]
 instructions                               [Hardware event]
 cache-references                           [Hardware event]
 cache-misses                               [Hardware event]
 branch-instructions OR branches            [Hardware event]
 branch-misses                              [Hardware event]
 bus-cycles                                 [Hardware event]

 cpu-clock                                  [Software event]
 task-clock                                 [Software event]
 page-faults OR faults                      [Software event]
 minor-faults                               [Software event]
 major-faults                               [Software event]
 context-switches OR cs                     [Software event]
 cpu-migrations OR migrations               [Software event]
 alignment-faults                           [Software event]
 emulation-faults                           [Software event]

 L1-dcache-loads                            [Hardware cache event]
 L1-dcache-load-misses                      [Hardware cache event]
 L1-dcache-stores                           [Hardware cache event]
 L1-dcache-store-misses                     [Hardware cache event]
 L1-dcache-prefetches                       [Hardware cache event]
 L1-dcache-prefetch-misses                  [Hardware cache event]
 L1-icache-loads                            [Hardware cache event]
 L1-icache-load-misses                      [Hardware cache event]
 L1-icache-prefetches                       [Hardware cache event]
 L1-icache-prefetch-misses                  [Hardware cache event]
 LLC-loads                                  [Hardware cache event]
 LLC-load-misses                            [Hardware cache event]
 LLC-stores                                 [Hardware cache event]
 LLC-store-misses                           [Hardware cache event]

 LLC-prefetch-misses                        [Hardware cache event]
 dTLB-loads                                 [Hardware cache event]
 dTLB-load-misses                           [Hardware cache event]
 dTLB-stores                                [Hardware cache event]
 dTLB-store-misses                          [Hardware cache event]
 dTLB-prefetches                            [Hardware cache event]
 dTLB-prefetch-misses                       [Hardware cache event]
 iTLB-loads                                 [Hardware cache event]
 iTLB-load-misses                           [Hardware cache event]
 branch-loads                               [Hardware cache event]
 branch-load-misses                         [Hardware cache event]

Other Resources

Here are some resources, which I happen to know after writing this blog entry.