gprof GNU Profile tool
GNU Profile (gprof) is a performance analysis tool that helps developers identify code bottlenecks and optimize their programs. It provides detailed information about the execution time and call frequency of functions within a program.
gprof can be used to:
Detect performance bottlenecks in your code
Identify which functions consume the most execution time
Analyze the call graph of your program
Help prioritize optimization efforts
Usage
QEMU example
For this example, we’re using QEMU and aarch64-none-elf-gcc with the qemu-armv8a board.
Configure
./tools/configure.sh -E qemu-armv8a:nsh
and make sure CONFIG_SYSTEM_GPROF and CONFIG_PROFILE_MINI are enabled.
Build
make -j
Launch qemu:
qemu-system-aarch64 -cpu cortex-a53 -smp 4 -nographic \
  -machine virt,virtualization=on,gic-version=3 \
  -chardev stdio,id=con,mux=on -serial chardev:con \
  -mon chardev=con,mode=readline -semihosting -kernel ./nuttx
Mount hostfs for saving data later:
nsh> mount -t hostfs -o fs=. /mnt
Start profiling:
nsh> gprof start
Run some tests, then stop profiling:
nsh> gprof stop
Dump profiling data:
nsh> gprof dump /mnt/gmon.out
Analyze the data on the host using the gprof tool:
$ aarch64-none-elf-gprof nuttx gmon.out -b
Note
The saved file format complies with the standard gprof format. For detailed instructions on gprof command usage, please refer to the GNU gprof manual: https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html
Example output:
$ aarch64-none-elf-gprof nuttx gmon.out -b
Flat profile:
Each sample counts as 0.001 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 75.58     12.44    12.44    12462     0.00     0.00  up_idle
 24.30     16.44     4.00        4     1.00     1.00  up_ndelay
  0.05     16.45     0.01      177     0.00     0.00  pl011_txint
  0.02     16.45     0.00       35     0.00     0.00  uart_readv
This output is the flat profile of the program. For each function it lists the share of sampled execution time, the cumulative and self seconds, and the call count. In this run, up_idle accounts for 75.58% of the sampled time and up_ndelay for 24.30%, so these two functions dominate the profile. Use the table to see where the program spends most of its time and to decide which functions to optimize first.
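To pick out the dominant functions from a longer report, the flat profile can be filtered on the host. The following is a minimal sketch, assuming the gprof output has been saved to a text file and that function names contain no spaces. The helper name top_functions and the 1.0 default threshold are illustrative, not part of gprof or NuttX.

```shell
#!/bin/sh
# top_functions FILE [MIN_PERCENT]
# Hypothetical helper: print "name percent" for every flat-profile row
# whose "% time" column is at least MIN_PERCENT (default 1.0).
top_functions() {
    awk -v min="${2:-1.0}" '
        # Stop at the call graph section so its numeric rows are not
        # mistaken for flat-profile rows.
        /Call graph/ { exit }
        # Flat-profile data rows start with a decimal "% time" value;
        # the function name is the last field.
        NF >= 4 && $1 ~ /^[0-9]+\.[0-9]+$/ && $1 + 0 >= min + 0 {
            printf "%s %s\n", $NF, $1
        }
    ' "$1"
}
```

For example, after running aarch64-none-elf-gprof nuttx gmon.out -b > profile.txt, calling top_functions profile.txt 5 lists only the functions that took at least 5% of the sampled time.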
Real board example
Let's take esp32s3-devkit as an example.
Test the flat profile
Configure
./tools/configure.sh -E esp32s3-devkit:nsh
and make sure these items are enabled:
# for gprof
CONFIG_PROFILE_MINI=y
CONFIG_SYSTEM_GPROF=y
# save and transfer data
CONFIG_FS_TMPFS=y
CONFIG_SYSTEM_YMODEM=y
Build and flash
make flash ESPTOOL_PORT=/dev/ttyUSB0 -j
Run minicom -D /dev/ttyUSB0 -b 115200 to connect to the board.
Start profiling:
nsh> gprof start
# do some test here, such as ostest
nsh> gprof stop
nsh> gprof dump /tmp/gmon.out
nsh> sb /tmp/gmon.out
Receive the file on the PC and analyze the data on the host:
$ cp nuttx nuttx_prof
$ xtensa-esp32s3-elf-objcopy -I elf32-xtensa-le --rename-section .flash.text=.text nuttx_prof
$ xtensa-esp32s3-elf-gprof nuttx_prof gmon.out
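The configure step above depends on several options being set. A quick host-side check before building can catch a missing one. This is a minimal sketch, assuming the generated configuration is the .config file at the top of the NuttX tree. The function name check_gprof_config is hypothetical.

```shell
#!/bin/sh
# check_gprof_config FILE OPTION...
# Hypothetical helper: report every listed option that is not set to "y"
# in FILE. Returns non-zero if at least one option is missing.
check_gprof_config() {
    config_file=$1
    shift
    missing=0
    for opt in "$@"; do
        # An enabled boolean option appears as a line of the form OPTION=y.
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: ${opt}"
            missing=1
        fi
    done
    return "$missing"
}
```

For the esp32s3-devkit example this could be called as check_gprof_config .config CONFIG_PROFILE_MINI CONFIG_SYSTEM_GPROF CONFIG_FS_TMPFS CONFIG_SYSTEM_YMODEM. It prints nothing and returns 0 when all four options are enabled.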
Test the call graph profile
Add the compiler option -pg to the component, for example in the ostest Makefile:
CFLAGS += -pg
Enable the configuration item CONFIG_FRAME_POINTER.
The other steps are the same as the flat profile.
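Once call graph data has been collected, GNU gprof can print each report on its own: -p selects only the flat profile, -q selects only the call graph, and -b suppresses the explanatory text. The wrappers below are a small sketch. The function names and the GPROF variable are illustrative, and the default binary name is taken from the esp32s3-devkit example above.

```shell
#!/bin/sh
# GPROF names the cross gprof binary; override it for other toolchains,
# e.g. GPROF=aarch64-none-elf-gprof for the QEMU example.
GPROF=${GPROF:-xtensa-esp32s3-elf-gprof}

# flat_profile ELF GMON: print only the flat profile, in brief form.
flat_profile() {
    "$GPROF" -b -p "$1" "$2"
}

# call_graph ELF GMON: print only the call graph, in brief form.
call_graph() {
    "$GPROF" -b -q "$1" "$2"
}
```

For example, call_graph nuttx_prof gmon.out prints the call graph for the data received from the board.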