介绍 perf 工具的使用方式、局限性和问题。
计数 vs 采样 vs 追踪 #
计数 counting:统计特定事件的次数,如 perf top 和 stat,不产生 perf.data 文件。
采样 profing: 使用 perf record 按照特定频率或 CPU 事件(如指令 instrument 或 cycle)来记录 CPU 寄存器值,产生 perf.data 文件,然后在用户空间谁用 perf report 或 perf script 进行分析;
追踪 tracing:调用特定内核或用户函数(kprobe、uprobe、tracepoint、syscall)或返回时进行记录(event tracing),获得参数、返回值、延迟、执行频率等;
采样系统或进程的整体分析,对 CPU 和磁盘的消耗较大,同时对程序性能也有影响。所以,最好使用 perf probe/trace 来对特定函数进行追踪记录。
perf 是通过内核的 perf_event_output 系统调用来获取数据的。
查看 perf 构建参数 #
# perf version --build-options
perf version 4.19.91-007.ali4000.alios7.x86_64.debug
dwarf: [ on ] # HAVE_DWARF_SUPPORT
dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
glibc: [ on ] # HAVE_GLIBC_SUPPORT
gtk2: [ OFF ] # HAVE_GTK2_SUPPORT
syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
lzma: [ OFF ] # HAVE_LZMA_SUPPORT
get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
常用 perf 命令 #
perf list
to find events.perf stat
to count the events.perf record
to write events to a file.perf report
to browse the recorded file.perf script
to dump events after processing.
perf list 查看支持的事件 #
perf list 显示如下事件类型:
- hw: 硬件,CPU performance monitoring counters;
- sw:软件,基于 Kernel 的计数,如 CPU migrations, minor faults, major faults 等;
- cache
- tracepoint:追踪,基于 kernel 定义的静态追踪点;
- pmu:CPU PMU
- sdt
- event_glob: 事件名称 glob 匹配
- usdt:用户程序定义的静态追踪点;
- 动态追踪:如 kprobes、uprobes;
- 计时采样:按照特定频率产生中断事件,记录 CPU 执行情况;
[root@devops-110 ~]# perf list -h
Usage: perf list [<options>] [hw|sw|cache|tracepoint|pmu|sdt|event_glob]
-d, --desc Print extra event descriptions. --no-desc to not print.
-v, --long-desc Print longer event descriptions.
--debug Enable debugging output
--details Print information on the perf event names and expressions used internally by events.
[root@devops-110 ~]# perf list |wc -l
1172
[root@devops-110 ~]# perf list |& head
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
[root@devops-110 ~]# perf list 'sched:*'
[root@devops-110 ~]# perf list software # 只显示 software 事件类型
List of pre-defined events (to be used in -e):
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
[root@devops-110 ~]# perf list tracepoint |head
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]
block:block_bio_remap [Tracepoint event]
block:block_dirty_buffer [Tracepoint event]
block:block_getrq [Tracepoint event]
block:block_plug [Tracepoint event]
block:block_rq_abort [Tracepoint event]
[root@devops-110 ~]#
后续进行采样或追踪时,可以使用 -e event_type
来指定事件类型。
[root@devops-110 ~]# perf top -e net:netif_receive_skb
perf stat 事件计数 #
PMU(Performance Monitoring Unit)是 CPU 的一组专用寄存器,当执行每个 CPU 指令时递增,当 PMU 计数器溢出时,CPU 触发一个特殊中断,然后被 kernel 处理,kernel 进而获得当前真正执行的线程状态数据。
perf record -e instructions -c 1000000 -p 1234
perf stat 用于统计整个系统或特定命令、进程的 CPU performance counter 统计值,或指定事件类型(-e event)的执行次数,可以为后续采样频率提供参考。
-e 指定一些固定的 PMU event 类型,如 task-clock,context-Switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses。
采集整个系统:
[root@devops-110 ~]# perf stat -a sleep 5
Performance counter stats for 'system wide':
20,002.77 msec cpu-clock # 4.000 CPUs utilized
16,887 context-switches # 0.844 K/sec
157 cpu-migrations # 0.008 K/sec
181 page-faults # 0.009 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
5.000764361 seconds time elapsed
采集执行特定命令:
[root@devops-110 ~]# perf stat ls
Performance counter stats for 'ls':
0.69 msec task-clock # 0.734 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
265 page-faults # 0.386 M/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000934209 seconds time elapsed
0.000000000 seconds user
0.000978000 seconds sys
也可以指定 -p PID 来采集特定进程的 CPU 性能数据。
使用 -e event_glob 来指定采集指定事件类型的统计数据,如各系统调用的执行次数:
[root@devops-110 ~]# sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5 |& grep -E -v '^[[:blank:]]+0'
Performance counter stats for 'system wide':
1 syscalls:sys_enter_sendto
2 syscalls:sys_enter_epoll_ctl
4,981 syscalls:sys_enter_epoll_wait
9 syscalls:sys_enter_select
5 syscalls:sys_enter_ppoll
30 syscalls:sys_enter_getdents
1,226 syscalls:sys_enter_ioctl
...
3,220 syscalls:sys_enter_nanosleep
7 syscalls:sys_enter_times
11 syscalls:sys_enter_rt_sigprocmask
7 syscalls:sys_enter_rt_sigaction
1 syscalls:sys_enter_exit_group
1 syscalls:sys_enter_wait4
12 syscalls:sys_enter_mmap
5.006476383 seconds time elapsed
# Various basic CPU statistics, system wide, for 10 seconds:
$ sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Count syscalls per-second system-wide:
$ sudo perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
$ sudo perf stat -e 'syscalls:sys_enter_*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
$ sudo perf stat -e 'sched:*' -p PID sleep 10
perf top CPU 热点统计 #
和 linux top 命令类似,perf top 实时采样和显示 CPU 执行的函数,适用于分析 CPU 负载和热点的情况。
perf top 默认采集所有 CPU,可以使用 -C 来指定 CPU 核。
- 按 q 键退出;
- 按 h 显示帮助;
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
perf record 采样 #
perf record 采样是按照一定频率来记录内核及应用程序的执行状态以及调用栈,它底层使用内核提供的 perf_event_output 系统调用来记录各种事件。
选项说明:
- -a:采样所有 CPU 核;
- -C:采样指定的 CPU 核;
- -p PID:采样指定进程,否则是所有进程。后续一般展示特定进程的火焰图,所以在 perf record 时需要通过 -p 来指定进程 PID;
- -F 99:采样频率,perf 使用该频率暂停程序,获得它的调用栈等信息;
- -g 表示记录函数调用栈:
- 对于内函数调用,内核不使用 DWARF,而是使用 FP 或 ORC (4.14 及以后内核),perf record 会自动选择;
- 对于用户函数,默认使用
--call-graph=fp
,即使用 frame pointer 来进行 stack unwinding,其它可选值包括 dwarf 和 lbr。
对于用户函数,如果 ELF 没有使用 FP (frame pointer),则需要指定 --call-graph=dwarf
,它使用 ELF 的 .eh_frame
或 debuginfo 文件中的 .debug_frame
数据进行 stack unwinding。
- 各发行版一般在单独的 XX-debuginfo 或 XX-dbg/XX-dbgsym 包中包含 DWARF 调试信息,对于支持 FP 的二进制,需要先安装这些包。
gcc 默认生成 .eh_frame
而且不能被 strip,所以通常指定 --call-grap=dwarf
比 fp 更通用。
对于 --call-graph dwarf
方法,perf 实际是将函数调用信息从内核空间拷贝并保存到 perf.data 文件中,然后使用 libdw 库从 DWARF、FP 中解析出调用栈详情,所以性能低一些。
除了 stack unwinding 外,perf record 还使用符号表(.symtab、.dynsym)和 DWARF 符号表(.debug_xx)来对程序运行时符号地址进行解析,需要采样的进程二进制以及依赖的共享库(如 libc6-dbg,zlib1g-dbg)包含它们。
这两类符号表不是程序运行所必须的,同时为了减少二进制大小,各发行版一般把它们从 ELF 中 strip(dynsym 不能被 strip),然后保存到单独的 debuginfo 包中。
所以为了 perf、bpfrace 正常工作,需要安装对应的 XX-debuginfo 或 XX-dbg 或 XX-dbgsym 包,用于提供程序 ELF 文件对应的符号表和 DWARF 调试符号表:
- 使用 file 或 objedump 或 readelf 命令查看 ELF 是否包含 debuginfo 和符号表;
- 如果不能符号解析,则后续生成的火焰图会有大量的 unkonwn block;
- 内核符号表使用的是 /proc/kallsyms 或者 vmlinux ELF 文件;
如果未指定 -F 频率,则按照 CPU 指令级进行记录,可以用来评估系统整体的 CPU 繁忙情况。
$ sudo perf record -a -g sleep 5
$ sudo perf report
$ sudo perf report --stdio --comm fvUnisocket --percent-limit 0.5
如果指定了 -F 频率,则按照对应的频率进行采样(sampling):
- -F 指定的频率越高,对 CPU、磁盘 IO 的压力越大。
- 提高系统最大采集频率:
$ echo {rate} > /proc/sys/kernel/perf_event_max_sample_rate
$ sudo perf record -F 99 -g --call-graph=dwarf -p 35775 -- sleep 10
[ perf record: Woken up 584 times to write data ]
Warning:
2 out of order events recorded.
[ perf record: Captured and wrote 172.619 MB perf.data (1128041 samples) ]
$ du -sh perf.data
173M perf.data
采样的数据保存在 perf.data
文件中,后续通过 perf report/script
命令来读取该文件,进一步分析性能和调用栈。
bpftrace 也支持 profile 采样,底层也是用的 perf 数据源(perf_event_output 系统调用):
bpftrace -e 'profile:hz:99 {@[kstack]=count();}'
Attaching 1 probe...
^C@[
native_safe_halt+14
__cpuidle_text_start+28
do_idle+497
cpu_startup_entry+111
start_secondary+420
secondary_startup_64+164
]: 112
@[
native_safe_halt+14
__cpuidle_text_start+28
do_idle+497
cpu_startup_entry+111
start_kernel+1292
secondary_startup_64+164
]: 227
perf record 指定事件类型 #
使用 -e event_glob 来指定采样的事件类型,而非所有事件:
- cycles: CPU 硬件事件类型;
$ sudo perf record -e cycles find /
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.091 MB perf.data (2195 samples) ]
$ sudo perf record -e block:block_rq_issue -ag
可以为 -e 指定的事件指定过滤参数(可以通过 perf probe -V 来显示):
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
perf report 分析采样结果 #
分析 perf.data 文件中哪些函数占用的 CPU 最多:
- [k] 表示内核函数;
- [.] 表示用户函数;
$ perf report
Samples: 24K of event 'cycles', Event count (approx.): 4868947877
Children Self Command Shared Object Symbol
+ 17.08% 0.23% swapper [kernel.kallsyms] [k] do_idle
+ 5.38% 5.38% swapper [kernel.kallsyms] [k] intel_idle
+ 4.21% 0.02% kubelet [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 4.08% 0.00% kubelet kubelet [.] k8s.io/kubernetes/vendor/github.com/google/...
+ 4.06% 0.00% dockerd dockerd [.] net/http.(*conn).serve
+ 3.96% 0.00% dockerd dockerd [.] net/http.serverHandler.ServeHTTP
...
# perf record -e block:block_rq_issue -ag
^C
# ls -l perf.data
-rw------- 1 root root 3458162 Jan 26 03:03 perf.data
# perf report
[...]
# Samples: 2K of event 'block:block_rq_issue'
# Event count (approx.): 2216
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. ....................
#
32.13% dd [kernel.kallsyms] [k] blk_peek_request
|
--- blk_peek_request
virtblk_request
__blk_run_queue
|
|--98.31%-- queue_unplugged
| blk_flush_plug_list
| |
| |--91.00%-- blk_queue_bio
| | generic_make_request
| | submit_bio
| | ext4_io_submit
| | |
| | |--58.71%-- ext4_bio_write_page
| | | mpage_da_submit_io
| | | mpage_da_map_and_submit
| | | write_cache_pages_da
| | | ext4_da_writepages
| | | do_writepages
| | | __filemap_fdatawrite_range
| | | filemap_flush
| | | ext4_alloc_da_blocks
| | | ext4_release_file
| | | __fput
| | | ____fput
| | | task_work_run
| | | do_notify_resume
| | | int_signal
| | | close
| | | 0x0
| | |
| | --41.29%-- mpage_da_submit_io
[...]
这是一个 TUI 交互式的界面,可以选中具体函数展开查看详情。
添加 –stdio 参数后直接在 stdout 显示结果,使用 –sort sudo perf report -n --sort comm,symbol --stdio
demangle #
Rust 函数名 mangling 后的格式: https://rust-lang.github.io/rfcs/2603-rust-symbol-name-mangling-v0.html
_RNvNtCs1234_7mycrate3foo3bar
<>^^^^^<----><------><--><-->
|||||| | | | |
|||||| | | | +--- "bar" identifier
|||||| | | +------- "foo" identifier
|||||| | +------------- "mycrate" identifier
|||||| +-------------------- disambiguator for "mycrate"
|||||+------------------------ start-tag for "mycrate"
||||+------------------------- namespace tag for "foo"
|||+-------------------------- start-tag for "foo"
||+--------------------------- namespace tag for "bar"
|+---------------------------- start-tag for "bar"
+----------------------------- common Rust symbol prefix
https://stackoverflow.com/questions/34234354/perf-shows-mangled-function-names
perf report 默认支持和开启 C++ 函数名称 demangle:
- –demangle: Demangle symbol names to human readable form. It’s enabled by default, disable with –no-demangle.
- –demangle-kernel: Demangle kernel symbol names to human readable form (for C++ kernels).
当 perf report 显示的函数名是 _Z*, _ZN*, _ZL*
等,说明该 perf 版本在编译时不支持 demangling,而这个特性需要先安装 libbfd 后才会编译。
- 查看内核 tools/perf/Makefile.config 中的配置参数。
perf script 打印调用栈 #
按照时间顺序,打印 perf.data 中各命令程序的函数调用栈:
isc-socket 15964 [003] 1912865.267505: 699832 cycles:ppp:
ffffffffa0422fb2 _copy_from_user+0x2 ([kernel.kallsyms])
ffffffffaFailed to open /tmp/perf-38107.map, continuing without symbols
02efbbe __x64_sys_epoll_ctl+0x4e ([kernel.kallsyms])
ffffffffa00042e5 do_syscall_64+0x55 ([kernel.kallsyms])
ffffffffa0a00088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
7f460786349a epoll_ctl+0xa (/usr/lib64/libc-2.17.so)
swapper 0 [005] 1912865.267508: 20833 cycles:ppp:
ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
546f4db8 [unknown] (/tmp/perf-38107.map)
sqlonline_worke 38771 [006] 1912865.267542: 1 cycles:ppp:
ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
ffffffffa01d27d3 event_function+0x83 ([kernel.kallsyms])
ffffffffa01d3e69 remote_function+0x39 ([kernel.kallsyms])
ffffffffa0139f50 flush_smp_call_function_queue+0x70 ([kernel.kallsyms])
ffffffffa0a023b4 smp_call_function_single_interrupt+0x34 ([kernel.kallsyms])
ffffffffa0a01b5f call_function_single_interrupt+0xf ([kernel.kallsyms])
ffffffffc0362a76 ipt_do_table+0x2a6 ([kernel.kallsyms])
ffffffffa073abbd nf_hook_slow+0x3d ([kernel.kallsyms])
ffffffffa07471c3 ip_rcv+0xa3 ([kernel.kallsyms])
ffffffffa06e8b00 __netif_receive_skb_one_core+0x50 ([kernel.kallsyms])
ffffffffa06e7d02 netif_receive_skb_internal+0x42 ([kernel.kallsyms])
ffffffffa06e9cf5 napi_gro_receive+0xb5 ([kernel.kallsyms])
ffffffffc051da6b ixgbe_clean_rx_irq+0x49b ([kernel.kallsyms])
ffffffffc051ef1b ixgbe_poll+0x2ab ([kernel.kallsyms])
ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
546f4db8 [unknown] (/tmp/perf-38107.map)
perf script -c bash
:只显示 bash 命令的调用栈。
生成火焰图 #
一般是特定进程的火焰图,所以在 perf record 时需要通过 -p 来指定进程 PID.
生成火焰图依赖 debuginfo 数据。需要采样的进程二进制以及依赖库如 libc,包含 .debug_XX 调试符号表或系统安装有对应的 debuginfo 包。
将 perf script
的输出重定向到 perl 脚本做进一步处理,生成火焰图:
perf record -F 99 -g --call-graph=dwarf -p 35775 -g -- sleep 10
perf report
perf script > out.perf
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph/
# 使用 ./stackcollapse-perf.pl 工具将 perf 数据转换
./stackcollapse-perf.pl ../out.perf > out.folded
./flamegraph.pl out.folded > process.svg
open process.svg
对于 bpftrace 产生的采样数据,也可以使用 flamegraph 提供的转换工具进行可视化:
sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > trace.data
cd FlameGraph
# 使用 stackcollapse-bpftrace.pl 工具进行转换
./stackcollapse-bpftrace.pl trace.data > trace.folded
./flamegraph.pl --inverted trace.folded > traceflamegraph.svg
# 也可以对 uprobe 指定的函数调用栈生成 tarce 数据,然后生成火焰图
$ grep libwebp /proc/25760/maps
7f7bc6af3000-7f7bc6af6000 r--p 00000000 09:01 38281904 /usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2
$ sudo bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2:* {time("%H:%M:%S "); printf("%s %d\n",comm,pid);}' | tee /tmp/libwebp.trace
perf annotate 显示事件的源码和反汇编 #
sudo perf annotate --stdio -v
perf probe 跟踪函数调用栈 #
perf probe
用于对内核或用户程序进行 dynamic probe(trace),即 kprobe、tracepoint 和 uprobe。
示例函数:
# cat test.c
#
#include <stdio.h>
#include <unistd.h>
void func_d() {
int msec=1;
printf("%s","Hello world from D\n");
usleep(10000*msec);
}
void func_c() {
printf("%s","Hello from C\n");
func_d();
}
void func_b() {
printf("%s","Hello from B\n");
func_c();
}
void func_a() {
printf("%s","Hello from A\n");
func_b();
}
int main() {
func_a();
}
编译:
- 未开启优化,故 ELF 使用 FP;
- -g 表示生成 DWARF 格式的 .debug_xx 调试符号表;
- 编译器默认在 ELF 中生成符号表(.symtab 和 .dynsym) 和 .eh_frame;
# gcc test.c -g -o test
符号表、.eh_frame 和 .debug_xx 调试符号表是 perf probe 所必须的信息。如果 strip 了符号表、调试符号表,则 perf probe 将不能为函数添加 probe。
用户函数调用栈跟踪(uprobe):
# perf probe -x ./hello func_d
Added new event:
probe_hello:func_d (on func_d in /root/hello)
You can now use it in all perf tools, such as:
perf record -e probe_hello:func_d -aR sleep 1
# perf record -e probe_hello:func_d -aR -g ./hello # -g 默认使用 FP,也可以指定 -g --call-graph=dwarf
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.162 MB perf.data (1 samples) ]
# perf script
hello 14239 [002] 5516.587824: probe_hello:func_d: (561112f12169)
561112f12169 func_d+0x0 (/root/hello)
561112f121e3 func_b+0x21 (/root/hello)
561112f12207 func_a+0x21 (/root/hello)
561112f1221c main+0x12 (/root/hello)
7fe3e67e1d90 __libc_start_call_main+0x80 (/usr/lib/x86_64-linux-gnu/libc.so.6)
内核函数跟踪(kprobe、tracepoint、syscall):
# 查看内核可以动态 probe 的函数:
# perf probe -F |grep ^block |head
block_ce_count_show
block_class
block_commit_write
block_depr
block_devnode
block_group_cache_tree_search
block_group_err
block_invalidatepage
block_ioctl
block_is_partially_uptodate
# 查看用户程序可以动态 probe 的函数:
# perf probe -F -x ./hello |head
completed.0
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
register_tm_clones
参考:
perf probe -V 显示函数参数 #
对于内核函数,需要通过 –vmlinux 参数来指定包含内核 .debug_xx 调试符号表的 vmlinux 文件:
- vmlinux 一般是 kernel-debuginfo 包提供的,需要先安装。
# readelf -S /usr/lib/debug/lib/modules/4.19.91-007.ali4000.alios7.x86_64/vmlinux |grep -E 'debug|symtab|eh_'
[11] __ksymtab PROGBITS ffffffff8215e8b0 0135e8b0
[12] .rela__ksymtab RELA 0000000000000000 0f4cb330
[13] __ksymtab_gpl PROGBITS ffffffff82167f88 01367f88
[14] .rela__ksymtab_gp RELA 0000000000000000 0f503c40
[19] __ksymtab_strings PROGBITS ffffffff82179a48 01379a48
[61] .debug_aranges PROGBITS 0000000000000000 01c76030
[62] .rela.debug_arang RELA 0000000000000000 10097c28
[63] .debug_info PROGBITS 0000000000000000 01c9b9d0
[64] .rela.debug_info RELA 0000000000000000 100c36f8
[65] .debug_abbrev PROGBITS 0000000000000000 0c330c02
[66] .debug_line PROGBITS 0000000000000000 0c6a64cb
[67] .rela.debug_line RELA 0000000000000000 1f785c90
[68] .debug_frame PROGBITS 0000000000000000 0d05c6c0
[69] .rela.debug_frame RELA 0000000000000000 1f79a3c0
[70] .debug_str PROGBITS 0000000000000000 0d27d7a0
[71] .debug_loc PROGBITS 0000000000000000 0d53a983
[72] .rela.debug_loc RELA 0000000000000000 1f980c30
[73] .debug_ranges PROGBITS 0000000000000000 0e03c570
[74] .rela.debug_range RELA 0000000000000000 2085ff90
[77] .symtab SYMTAB 0000000000000000 0e975500
# ls -l /usr/lib/debug/lib/modules/4.19.91-007.ali4000.alios7.x86_64/vdso/
total 212
-r--r--r-- 1 root root 93452 Apr 8 2020 vdso32.so.debug
-r--r--r-- 1 root root 122824 Apr 8 2020 vdso64.so.debug
# perf probe -V tcp_sendmsg
Failed to find the path for kernel: Invalid ELF file
Error: Failed to show vars.
# perf probe --vmlinux /usr/lib/debug/lib/modules/4.19.91-007.ali4000.alios7.x86_64/vmlinux -V tcp_sendmsg
Available variables at tcp_sendmsg
@<tcp_sendmsg+0>
int ret
size_t size
struct msghdr* msg
struct sock* sk
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
对于用户函数,需要 ELF 中有 .debug_XX 调试符号表,或则有对应的 debuginfo 文件。
perf probe 跟踪所有函数 #
向 perf probe 添加用户程序的函数时,可以指定一个 glob pattern,如 ‘func_*’:
# perf probe -x ./test 'func_*'
Error: event "func_d" already exists.
Hint: Remove existing event by 'perf probe -d'
or force duplicates by 'perf probe -f'
or set 'force=yes' in BPF source.
Error: Failed to add events.
# perf probe -d probe_test:*
Removed event: probe_test:func_d
# perf probe -x ./test '*'
Added new events:
probe_test:func_d (on * in /root/zj/test)
probe_test:func_c (on * in /root/zj/test)
probe_test:func_b (on * in /root/zj/test)
probe_test:func_a (on * in /root/zj/test)
probe_test:main (on * in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:main -aR sleep 1
# perf record -e probe_test:* -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 12.568 MB perf.data (5 samples) ]
# perf script
test 104426 [083] 3189986.791451: probe_test:main: (4005f6)
4005f6 main+0x0 (/root/zj/test)
test 104426 [083] 3189986.791458: probe_test:func_a: (4005dc)
4005dc func_a+0x0 (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791492: probe_test:func_b: (4005c2)
4005c2 func_b+0x0 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791500: probe_test:func_c: (4005a8)
4005a8 func_c+0x0 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791504: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
perf probe 不支持非 C 程序特定函数追踪 #
go、rust 的函数名通常包含 package 路径,而 perf probe 不支持函数名包含特殊字符,即使使用函数地址,也报错。
- perf-probe: Improbe non-C language support: https://lwn.net/Articles/997329/
- https://lore.kernel.org/all/173073702882.2098439.13342508872190995896.stgit@mhiramat.roam.corp.google.com/
#perf probe -F -x /app/myagent |grep -i store |grep gitlab
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveProcessLocateResult
git.com/paas/myagent/pkg/storage.(*ProcessStore).SetRelistFunc
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed.func1
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupOldRecords
git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.NewStore
git.com/paas/myagent/pkg/storage.NewStore.func1
git.com/paas/myagent/pkg/storage.NewStore.func2
#perf probe -x /app/myagent 'SaveNetworkLocateInfo'
Failed to find symbol SaveNetworkLocateInfo in /app/myagent
Error: Failed to add events.
# perf probe -x /app/myagent 'git.com/paas/myagent/pkg/storage.(\*ProcessStore).SaveNetworkLocateInfo'
Failed to write event: Invalid argument
Error: Failed to add events.
# nm -n /app/myagent |grep SaveNetwork
00000000012e0220 T git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
# perf probe -x /app/myagent --add '0x12e0220'
Failed to write event: Invalid argument
Error: Failed to add events.
但 perf record 等还能正常使用。
解决办法:使用 bpftrace,它支持 rust、go 带 package path 的函数名。
perf trace 捕获事件 #
类似于 strace
各种事件:
- 系统调用函数;
- tracepoint 函数,一般是执行系统调用函数调用的其他内核函数触发的。
应用场景:例如想查那个进程再发送 ICMP 包。但是又不知道应该向哪个 内核函数 插桩。
perf 会自动显示每一个函数的参数:
root@lima-ebpf-dev:~# perf trace ping www.baidu.com -c1 > /dev/null
? ( ): ping/203535 ... [continued]: execve()) = 0
0.119 ( 0.046 ms): ping/203535 brk() = 0x55b19f84a000
0.606 ( 0.044 ms): ping/203535 arch_prctl(option: 0x3001, arg2: 0x7fffc98478d0) = -1 EINVAL (Invalid argument)
0.714 ( 0.049 ms): ping/203535 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7f36617ca000
0.816 ( 0.050 ms): ping/203535 access(filename: 0x617fbd90, mode: R) = -1 ENOENT (No such file or directory)
0.916 ( 0.050 ms): ping/203535 openat(dfd: CWD, filename: 0x617fa21b, flags: RDONLY|CLOEXEC) = 3
1.010 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846a20, flag: 4096) = 0
1.100 ( 0.158 ms): ping/203535 mmap(len: 35835, prot: READ, flags: PRIVATE, fd: 3) = 0x7f36617c1000
1.308 ( 0.044 ms): ping/203535 close(fd: 3) = 0
1.403 ( 0.049 ms): ping/203535 openat(dfd: CWD, filename: 0x617ca140, flags: RDONLY|CLOEXEC) = 3
1.498 ( 0.045 ms): ping/203535 read(fd: 3, buf: 0x7fffc9846c58, count: 832) = 832
1.588 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846af0, flag: 4096) = 0
1.697 ( 0.055 ms): ping/203535 mmap(len: 41016, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f36617b6000
1.797 ( 0.067 ms): ping/203535 mmap(addr: 0x7f36617b9000, len: 16384, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x3000) = 0x7f36617b9000
1.912 ( 0.050 ms): ping/203535 mmap(addr: 0x7f36617bd000, len: 8192, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x7000) = 0x7f36617bd000
2.007 ( 0.049 ms): ping/203535 mmap(addr: 0x7f36617bf000, len: 8192, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x8000) = 0x7f36617bf000
...
279.945 ( 0.049 ms): ping/203535 close(fd: 1</dev/null>) = 0
280.046 ( 0.065 ms): ping/203535 close(fd: 2</dev/pts/1>) = 0
280.185 ( ): ping/203535 exit_group() = ?
root@lima-ebpf-dev:~#
通过指定 -e glob,可以只捕获符合 glob 匹配的内核事件,如向捕获执行 ping 命令式,内核内部执行了那些 net tracepoint 函数:
root@lima-ebpf-dev:~# perf trace --no-syscalls --event 'net:*' ping www.baidu.com -c1 > /dev/null
0.000 ping/203626 net:net_dev_queue(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
0.061 ping/203626 net:net_dev_start_xmit(name: "lo", skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, len: 84, network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
0.099 ping/203626 net:netif_rx_entry(name: "lo", napi_id: 2, skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, hash: 190000638, l4_hash: 1, len: 70, truesize: 768, mac_header_valid: 1, mac_header: 4294967282)
0.139 ping/203626 net:netif_rx(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
0.173 ping/203626 net:netif_rx_exit(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
0.206 ping/203626 net:net_dev_xmit(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
...
root@lima-ebpf-dev:~#
比如,如果想通过 bpftrace 来捕获发送 icmp 包的进程,则可以从上面 perf trace 的输出结果中来看,对 net_dev_start_xmit 函数进行插桩是比较合适的选择:
- 任意内核函数插桩,建议使用 kfunc 类型,因为可以使用 -lv 来看 kfunc 的参数列表,后面使用 args.
来引用函数参数。 - 如果使用 kprobe 类型,则不支持查看函数列表,需要使用 argN 来引用每一个函数参数。
# 对内核函数关键字进行搜索,不一定完整匹配
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:net_dev_start_xmit'
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:*' |grep net_dev |grep xmit
kfunc:vmlinux:__traceiter_net_dev_start_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit_timeout
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:vmlinux:__traceiter_net_dev_start_xmit'
kfunc:vmlinux:__traceiter_net_dev_start_xmit
void * __data
const struct sk_buff * skb
const struct net_device * dev
int retval
内核对应的 net_dev_start_xmit trace event 定义:
TRACE_EVENT(net_dev_start_xmit,
TP_PROTO(const struct sk_buff *skb, const struct net_device *dev),
TP_ARGS(skb, dev),
TP_STRUCT__entry(
__string( name, dev->name )
__field( u16, queue_mapping )
__field( const void *, skbaddr )
__field( bool, vlan_tagged )
__field( u16, vlan_proto )
__field( u16, vlan_tci )
__field( u16, protocol )
__field( u8, ip_summed )
__field( unsigned int, len )
__field( unsigned int, data_len )
__field( int, network_offset )
__field( bool, transport_offset_valid)
__field( int, transport_offset)
__field( u8, tx_flags )
__field( u16, gso_size )
__field( u16, gso_segs )
__field( u16, gso_type )
),
TP_fast_assign(
__assign_str(name, dev->name);
__entry->queue_mapping = skb->queue_mapping;
__entry->skbaddr = skb;
__entry->vlan_tagged = skb_vlan_tag_present(skb);
__entry->vlan_proto = ntohs(skb->vlan_proto);
__entry->vlan_tci = skb_vlan_tag_get(skb);
__entry->protocol = ntohs(skb->protocol);
__entry->ip_summed = skb->ip_summed;
__entry->len = skb->len;
__entry->data_len = skb->data_len;
__entry->network_offset = skb_network_offset(skb);
__entry->transport_offset_valid =
skb_transport_header_was_set(skb);
__entry->transport_offset = skb_transport_offset(skb);
__entry->tx_flags = skb_shinfo(skb)->tx_flags;
__entry->gso_size = skb_shinfo(skb)->gso_size;
__entry->gso_segs = skb_shinfo(skb)->gso_segs;
__entry->gso_type = skb_shinfo(skb)->gso_type;
),
TP_printk("dev=%s queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d len=%u data_len=%u network_offset=%d transport_offset_valid=%d transport_offset=%d tx_flags=%d gso_size=%d gso_segs=%d gso_type=%#x",
__get_str(name), __entry->queue_mapping, __entry->skbaddr,
__entry->vlan_tagged, __entry->vlan_proto, __entry->vlan_tci,
__entry->protocol, __entry->ip_summed, __entry->len,
__entry->data_len,
__entry->network_offset, __entry->transport_offset_valid,
__entry->transport_offset, __entry->tx_flags,
__entry->gso_size, __entry->gso_segs, __entry->gso_type)
)
另一个例子:
$ sudo bpftrace -e '
> tracepoint:syscalls:sys_enter_open,
> tracepoint:syscalls:sys_enter_openat
> {
> $mysecret = str(args->filename);
> if ( $mysecret == "homework" ) {
> printf("Somebody touch my file ! Call the police !\n");
> }
> }
> '
Attaching 2 probes...Somebody touch my file ! Call the police !
^C
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
printf("%-10u %-5d ", elapsed / 1000000, pid);
join(args->argv);
}
'
perf ftrace 追踪函数 #
https://mp.weixin.qq.com/s/kZfmtbbcTwumyJ7I551Xpw
为了测试方便,将 eth1 的网卡队列设置为 1,并且将网卡中断绑定到 CPU0。
ethtool -L eth1 combined 1
echo 0 > /proc/irq/$(cat /proc/interrupts | awk -F ':' '/eth1/ {gsub(/ /,""); print $1}')/smp_affinity_list
- 收包路径:perf ftrace -G ‘_netif_receive_skb_list_core’ -g ‘smp*’
- 发包路径:perf ftrace -G ‘_netif_receive_skb_core’ -g ‘smp*’
参考:
- Debugging the kernel using Ftrace - part 1: https://lwn.net/Articles/365835/
- Debugging the kernel using Ftrace - part 2: https://lwn.net/Articles/366796/
采样和追踪容器进程 #
容器进程在独立的 mount ns 中,即根文件系统和 Host 是独立的,需要在 Host 上上到容器内使用的二进制和库文件,然后做符号解析和函数追踪。
在 Host 上查找容器内二进制文件,由两种方式:
- /proc/
/root - 查找容器 mergedDir
# ls /proc/102366/root/
apsara bin boot dev entrypoint.sh etc home lib lib64 lost+found media mnt nsenter opt proc root run sbin srv sys tmp usr var
# ls -l /proc/102366/exe
lrwxrwxrwx 1 root root 0 Jan 9 16:11 /proc/102366/exe -> /usr/bin/plugin.csi.cloud.com
# ls -l /proc/102366/root/usr/bin/plugin.csi.cloud.com
-rwxr-xr-x 1 root root 82455166 Jan 9 13:23 /proc/102366/root/usr/bin/plugin.csi.cloud.com
另外也可以根据容器的 .GraphDriver.Data.MergedDir 来在 Host 上查找容器内路径的二进制路径。例如,如果想跟踪 cilium-agent 进程(本身是用 docker 容器部署的),首先需要找到 cilium-agent 文件在宿主机上的绝对路径,可以通过 container ID 或 name 找:
- merged path 是容器使用的 overlay 根文件系统。
# Check cilium-agent container
$ docker ps | grep cilium-agent
0eb2e76384b3 cilium:test "/usr/bin/cilium-agent ..." 4 hours ago Up 4 hours cilium-agent
# Find the merged path for cilium-agent container
$ docker inspect --format "{{.GraphDriver.Data.MergedDir}}" 0eb2e76384b3
/var/lib/docker/overlay2/a17f868d/merged # a17f868d.. is shortened for better viewing
# The object file we are going to trace
$ ls -ahl /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
-rwxr-xr-x 1 root root 86M /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
perf 命令感知进程对应的 cgroup,如果发现它是其它 mount 或 pid ns 中的进程,则自动到对应的 cgruop rootfs 下查找二进制和对应的 debuginfo 信息。
Host System
├── /proc/<container_pid>/
│ ├── root/ (容器根文件系统)
│ ├── exe (二进制链接)
│ └── maps (内存映射)
└── /usr/lib/debug/ (调试符号)
判断是否是容器进程的依据是:NSpid 字段:
$ cat /proc/1229/status | grep NSpid
NSpid: 1229
$ cat /proc/11459/status | grep NSpid
NSpid: 11459 1
11459 是在宿主机的 pid ns 内的进程 ID,1 是在容器自己的 pid ns 的进程 ID
如果要对指定 cgroup 进行 syscall tracing/probe,则需要使用 –cgroup/-G 来指定 /sys/fs/cgroup/perf_event 下的 cgrup 路径。
# ls /sys/fs/cgroup/perf_event/docker/
59a016887a1b563799d1f5d942f420ab37f9e99be68996e80ffc68fdc96e06ff a38dc1db0ad27317bbe8914e0e25930e0b7650d2b3c6bb10860dbf797cfba770 cgroup.procs tasks
9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058 cgroup.clone_children notify_on_release
# docker exec -it 9c2281e2be09 sh
/bin # cat /proc/self/cgroup
12:freezer:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
11:pids:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
10:devices:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
9:blkio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
8:cpuset:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
7:rdma:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
6:hugetlb:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
5:perf_event:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
4:net_cls,net_prio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
3:cpu,cpuacct:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
2:memory:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
1:name=systemd:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
0::/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
/bin #
# perf record -F 99 -e cpu-clock --cgroup=docker/9c2281e2be099 -a -- sleep 10
另一例子:
# docker exec 879e9059287c cat /proc/self/cgroup
12:pids:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
11:devices:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
10:memory:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
9:cpu,cpuacct:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
8:blkio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
7:hugetlb:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
6:perf_event:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
5:freezer:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
4:rdma:/
3:net_cls,net_prio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
2:cpuset:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
1:name=systemd:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
# perf stat -e "syscalls:*" -G my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope -a
perf 实验 #
# gcc test.c -o test
# file test
test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=cf5aa8fa3e8950c6e5c71b2588d8b2e8bf40f175, not stripped
# readelf -S ./test |grep -E 'sym|debug|eh_'
[ 5] .dynsym DYNSYM 00000000004002b8 000002b8
[16] .eh_frame_hdr PROGBITS 00000000004006dc 000006dc
[17] .eh_frame PROGBITS 0000000000400730 00000730
[27] .symtab SYMTAB 0000000000000000 00001098
# nm test |grep func
00000000004005dc T func_a
00000000004005c2 T func_b
00000000004005a8 T func_c
000000000040057d T func_d
# strings test |grep func
func_b
func_d
func_a
func_c
# 没有加 -g,默认不捕获函数调用栈,
# 没有加 -e,默认使用 cycles:ppp
# perf record ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ]
# perf report 并不显示函数调用栈
# perf report --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 9 of event 'cycles:ppp'
# Event count (approx.): 1717500
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ............................
#
63.50% test [kernel.kallsyms] [k] sync_regs
32.86% test [kernel.kallsyms] [k] vma_interval_tree_insert
3.29% perf [kernel.kallsyms] [k] do_nmi
0.32% perf [kernel.kallsyms] [k] acpi_os_read_memory
0.03% perf [kernel.kallsyms] [k] native_write_msr
# perf script
perf 41565 3185673.474644: 1 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474657: 1 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474659: 4 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474663: 47 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474666: 506 cycles:ppp: ffffffffaf067b76 native_write_msr+0x6 ([kernel.kallsyms])
perf 41565 3185673.474669: 5450 cycles:ppp: ffffffffaf49bdc5 acpi_os_read_memory+0x35 ([kernel.kallsyms])
perf 41565 3185673.474672: 56442 cycles:ppp: ffffffffaf029f59 do_nmi+0x99 ([kernel.kallsyms])
test 41565 3185673.474702: 564369 cycles:ppp: ffffffffaf2268dc vma_interval_tree_insert+0x2c ([kernel.kallsyms])
test 41565 3185673.474986: 1090680 cycles:ppp: ffffffffaf02538e sync_regs+0x1e ([kernel.kallsyms])
# 加 -g 时捕获函数调用栈
# perf record -g ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (11 samples) ]
# perf report --stdio |head -20
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11 of event 'cycles:ppp'
# Event count (approx.): 1310813
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. .......................................
#
75.09% 0.00% test [kernel.kallsyms] [k] page_fault
|
---page_fault
do_page_fault
__do_page_fault
handle_mm_fault
|
--35.97%--__handle_mm_fault
do_anonymous_page
# 但是并没有捕获到用户函数的调用栈
# perf report --stdio |grep func_
# 如果要捕获用户函数,则需要使用 perf probe
# 查看 ./test 中可以 probe 的用户函数列表
# perf probe -F -x ./test
completed.6355
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
puts@plt
register_tm_clones
usleep@plt
# perf probe -x ./test 'func_d'
Added new event:
probe_test:func_d (on func_d in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:func_d -aR sleep 1
# 然后执行 perf record 时需要加 -g 才能捕获函数调用栈,
# 或者使用 dwarf 来 unwinding(使用默认生成的 .eh_frame)
# perf record -e probe_test:func_d -aR ./test # perf record -e probe_test:func_d --call-graph=dwarf -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.330 MB perf.data (1 samples) ]
# perf script
test 104460 [076] 3186931.797081: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005bf func_c+0x17 (/root/zj/test)
4005d9 func_b+0x17 (/root/zj/test)
4005f3 func_a+0x17 (/root/zj/test)
400603 main+0xd (/root/zj/test)
7fb599014444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
4004b8 _start+0x28 (/root/zj/test)
# perf report --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1 of event 'probe_test:func_d'
# Event count (approx.): 1
#
# Children Self Trace output
# ........ ........ ............
#
100.00% 100.00% (40057d)
|
---__libc_start_main
main
func_a
func_b
func_d
# perf record -e probe_test:func_d_1 --proc-map-timeout 100 -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.845 MB perf.data (1 samples) ]
# perf script
test 129241 [084] 3187206.870347: probe_test:func_d_1: (40057d)
# perf record -e probe_test:func_d_1 --proc-map-timeout 500 -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.404 MB perf.data (1 samples) ]
# perf script
test 76042 [073] 3187258.344951: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f8f5e361445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
# perf record -e probe_test:func_d_1 --proc-map-timeout 500 --call-graph fp -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.476 MB perf.data (1 samples) ]
# perf script
test 103396 [068] 3187282.462500: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7feeb9598445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
# gcc ./test.c -o test -fomit-frame-pointer
# perf probe -l
kprobes:aegis___x64_sys_execve (on __x64_sys_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 ar
kprobes:aegis_do_execve (on do_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 arg6 arg7 arg8 a
kprobes:aegis_do_sys_open (on do_sys_open with dfd file flag)
kprobes:aegis_proc_fork_connector (on proc_fork_connector with tgid pid parent_tgid parent_pid)
kprobes:aegis_tcp_connect (on tcp_connect with saddr daddr sport dport)
probe_test:func_d (on func_d in /root/zj/test)
probe_test:func_d_1 (on func_d in /root/zj/test)
# perf probe -d probe_test:*
Removed event: probe_test:func_d
Removed event: probe_test:func_d_1
# perf probe -x ./test 'func_d'
Added new event:
probe_test:func_d (on func_d in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:func_d -aR sleep 1
# perf record -e probe_test:func_d -aR -g ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.281 MB perf.data (1 samples) ]
# perf script
test 85054 [093] 3187435.293375: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.288 MB perf.data (1 samples) ]
# perf script
test 106253 [094] 3187527.044076: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005c0 func_c+0x17 (/root/zj/test)
4005dd func_b+0x17 (/root/zj/test)
4005fa func_a+0x17 (/root/zj/test)
40060d main+0xd (/root/zj/test)
7f15c70ca444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
4004b8 _start+0x28 (/root/zj/test)
# perf annotate --stdio2
Samples: 1 of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
func_d() /root/zj/test
Percent
Disassembly of section .text:
000000000040057d <func_d>:
func_d():
100.00 push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
movl $0x1,-0x4(%rbp)
mov $0x4006a0,%edi
→ callq puts@plt
mov -0x4(%rbp),%eax
imul $0x2710,%eax,%eax
mov %eax,%edi
→ callq usleep@plt
leaveq
← retq
# gcc ./test.c -g -o test
# readelf -S ./test |grep debug
[27] .debug_aranges PROGBITS 0000000000000000 00001096
[28] .debug_info PROGBITS 0000000000000000 000010c6
[29] .debug_abbrev PROGBITS 0000000000000000 000011ce
[30] .debug_line PROGBITS 0000000000000000 0000124d
[31] .debug_str PROGBITS 0000000000000000 0000129a
# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 27.890 MB perf.data (1 samples) ]
# perf annotate --stdio2
Samples: 1 of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
func_d() /root/zj/test
Percent
Disassembly of section .text:
000000000040057d <func_d>:
func_d():
#include <stdio.h>
#include <unistd.h>
void func_d() {
100.00 push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
int msec=1;
movl $0x1,-0x4(%rbp)
printf("%s","Hello world from D\n");
mov $0x4006a0,%edi
→ callq puts@plt
usleep(10000*msec);
mov -0x4(%rbp),%eax
imul $0x2710,%eax,%eax
mov %eax,%edi
→ callq usleep@plt
}
leaveq
← retq
报错:
# perf record -e probe_test:func_d_1 --call-graph=dwarf -aR ./test
Reading /proc/101522/task/101522/maps time out. You may want to increase the time limit by --proc-map-timeout
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
Warning:
1 map information files for pre-existing threads were
not processed, if there are samples for addresses they
will not be resolved, you may find out which are these
threads by running with -v and redirecting the output
to a file.
The time limit to process proc map is too short?
Increase it by --proc-map-timeout
[ perf record: Captured and wrote 28.158 MB perf.data (1 samples) ]
# perf record -e probe_test:func_d_1 --call-graph=dwarf --proc-map-timeout 100 -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.081 MB perf.data (1 samples) ]