介绍 perf 工具的使用方式、局限性和问题。
查看 perf 构建参数 #
# perf version --build-options
perf version 4.19.91-007.ali4000.alios7.x86_64.debug
dwarf: [ on ] # HAVE_DWARF_SUPPORT
dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
glibc: [ on ] # HAVE_GLIBC_SUPPORT
gtk2: [ OFF ] # HAVE_GTK2_SUPPORT
syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
lzma: [ OFF ] # HAVE_LZMA_SUPPORT
get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
系统整体 profiling,分析 CPU 繁忙情况 #
$ sudo perf record -a -g sleep 5
$ sudo perf report
$ sudo perf report --stdio --comm fvUnisocket --percent-limit 0.5
一次 probe trace 程序中的所有函数 #
# perf probe -x ./test 'func_*'
Error: event "func_d" already exists.
Hint: Remove existing event by 'perf probe -d'
or force duplicates by 'perf probe -f'
or set 'force=yes' in BPF source.
Error: Failed to add events.
# perf probe -d probe_test:*
Removed event: probe_test:func_d
# perf probe -x ./test '*'
Added new events:
probe_test:func_d (on * in /root/zj/test)
probe_test:func_c (on * in /root/zj/test)
probe_test:func_b (on * in /root/zj/test)
probe_test:func_a (on * in /root/zj/test)
probe_test:main (on * in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:main -aR sleep 1
# perf record -e probe_test:* -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 12.568 MB perf.data (5 samples) ]
# perf script
test 104426 [083] 3189986.791451: probe_test:main: (4005f6)
4005f6 main+0x0 (/root/zj/test)
test 104426 [083] 3189986.791458: probe_test:func_a: (4005dc)
4005dc func_a+0x0 (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791492: probe_test:func_b: (4005c2)
4005c2 func_b+0x0 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791500: probe_test:func_c: (4005a8)
4005a8 func_c+0x0 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
test 104426 [083] 3189986.791504: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
# grep libwebp /proc/25760/maps
7f7bc6af3000-7f7bc6af6000 r--p 00000000 09:01 38281904 /usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2
# bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2:* {time("%H:%M:%S "); printf("%s %d\n",comm,pid);}' | tee /tmp/libwebp.trace
perf probe 不支持 go、rust 等非 C 程序 #
但 perf record 等还能正常使用:
- 因为 go、rust 的函数名通常包含 package 路径,具有 perf 不支持的函数名特殊字符。
- 即使使用函数地址,也报错;
- perf-probe: Improbe non-C language support: https://lwn.net/Articles/997329/
- https://lore.kernel.org/all/173073702882.2098439.13342508872190995896.stgit@mhiramat.roam.corp.google.com/
解决办法:使用 bpftrace
#perf probe -F -x /app/myagent |grep -i store |grep gitlab
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveProcessLocateResult
git.com/paas/myagent/pkg/storage.(*ProcessStore).SetRelistFunc
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed.func1
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupOldRecords
git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.NewStore
git.com/paas/myagent/pkg/storage.NewStore.func1
git.com/paas/myagent/pkg/storage.NewStore.func2
#perf probe -x /app/myagent 'SaveNetworkLocateInfo'
Failed to find symbol SaveNetworkLocateInfo in /app/myagent
Error: Failed to add events.
# perf probe -x /app/myagent 'git.com/paas/myagent/pkg/storage.(\*ProcessStore).SaveNetworkLocateInfo'
Failed to write event: Invalid argument
Error: Failed to add events.
# nm -n /app/myagent |grep SaveNetwork
00000000012e0220 T git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
# perf probe -x /app/myagent --add '0x12e0220'
Failed to write event: Invalid argument
Error: Failed to add events.
采样:perf record #
perf record 跟踪记录内核及应用程序的执行状态以及调用栈。
-g 表示记录函数调用栈,等效于 --call-graph=fp
,fp 表示使用 frame pointer 来进行 stack unwinding。如果 ELF 或共享库没有开启 frame pointer,则需要指定 --call-graph=dwarf
参数,它使用 ELF 中的 .eh_frame 或 debuginfo 文件中的 .debug_frame 信息来对用户空间函数进行 stack unwinding。
- 内核函数栈不使用 DWARF,它使用 FP 或 ORC(4.14 及以后内核),perf record 自动选择。
除了 stack unwinding 外,perf record 还使用符号表(.symtab、.dynsym)和 DWARF 符号表(.debug_xx)来对地址进行符号解析(symbol),需要采样的进程二进制以及依赖的共享库(如 libc6-dbg,zlib1g-dbg)包含它们。但是这两类符号表不是程序运行所必须的,同时为了减少二进制大小,一般把它们从 ELF 中 strip(dynsym 不能被 strip),然后保存到单独的 debuginfo 包中。
所以为了 perf、bpfrace 等正常工作,需要安装对应的 XX-debuginfo 或 XX-dbg 或 XX-dbgsym 包,用于提供符号表和 DWARF 调试符号表:
- 使用 file 或 objedump 或 readelf 命令查看 ELF 是否包含 debuginfo 和符号表;
- 如果不能符号解析,则后续生成的火焰图会有大量的 unkonwn block;
- 内核符号表使用的是 /proc/kallsyms 或者 vmlinux ELF 文件;
# perf record -F 99 -g --call-graph=dwarf -p 35775 -g -- sleep 10
# perf record -a -g -- sleep 5
[ perf record: Woken up 584 times to write data ]
Warning:
2 out of order events recorded.
[ perf record: Captured and wrote 172.619 MB perf.data (1128041 samples) ]
# du -sh perf.data
173M perf.data
选项说明:
- -a: 指定采样所有 CPU 核;
- -p PID:指定采样指定进程,否则是所有进程。后续一般展示特定进程的火焰图,所以在 perf record 时需要通过 -p 来指定进程 PID;
- -F 99:采样频率
生成的信息保存在 perf.data
中,然后通过 perf report/script
分析性能和调用栈。
提高采集频率:$ echo {rate} > /proc/sys/kernel/perf_event_max_sample_rate
注意,这将对 CPU、磁盘 IO 等有显著影响。
bpftrace 也支持 profiling: bpftrace -e 'profile:hz:99 {@[kstack]=count();}'
, 底层还是用的 perf 数据源(perf_event_output()) 。
查看函数 CPU 占用量:perf report #
perf report
查看 perf report
记录的 perf.data 文件中哪些函数占用的 CPU 最多:
$ perf report
Samples: 24K of event 'cycles', Event count (approx.): 4868947877
Children Self Command Shared Object Symbol
+ 17.08% 0.23% swapper [kernel.kallsyms] [k] do_idle
+ 5.38% 5.38% swapper [kernel.kallsyms] [k] intel_idle
+ 4.21% 0.02% kubelet [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 4.08% 0.00% kubelet kubelet [.] k8s.io/kubernetes/vendor/github.com/google/...
+ 4.06% 0.00% dockerd dockerd [.] net/http.(*conn).serve
+ 3.96% 0.00% dockerd dockerd [.] net/http.serverHandler.ServeHTTP
...
这是一个交互式的窗口,可以选中具体函数展开查看详情。
打印调用栈:perf script #
展示采集到的事件及其调用栈:
isc-socket 15964 [003] 1912865.267505: 699832 cycles:ppp:
ffffffffa0422fb2 _copy_from_user+0x2 ([kernel.kallsyms])
ffffffffaFailed to open /tmp/perf-38107.map, continuing without symbols
02efbbe __x64_sys_epoll_ctl+0x4e ([kernel.kallsyms])
ffffffffa00042e5 do_syscall_64+0x55 ([kernel.kallsyms])
ffffffffa0a00088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
7f460786349a epoll_ctl+0xa (/usr/lib64/libc-2.17.so)
swapper 0 [005] 1912865.267508: 20833 cycles:ppp:
ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
546f4db8 [unknown] (/tmp/perf-38107.map)
sqlonline_worke 38771 [006] 1912865.267542: 1 cycles:ppp:
ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
ffffffffa01d27d3 event_function+0x83 ([kernel.kallsyms])
ffffffffa01d3e69 remote_function+0x39 ([kernel.kallsyms])
ffffffffa0139f50 flush_smp_call_function_queue+0x70 ([kernel.kallsyms])
ffffffffa0a023b4 smp_call_function_single_interrupt+0x34 ([kernel.kallsyms])
ffffffffa0a01b5f call_function_single_interrupt+0xf ([kernel.kallsyms])
ffffffffc0362a76 ipt_do_table+0x2a6 ([kernel.kallsyms])
ffffffffa073abbd nf_hook_slow+0x3d ([kernel.kallsyms])
ffffffffa07471c3 ip_rcv+0xa3 ([kernel.kallsyms])
ffffffffa06e8b00 __netif_receive_skb_one_core+0x50 ([kernel.kallsyms])
ffffffffa06e7d02 netif_receive_skb_internal+0x42 ([kernel.kallsyms])
ffffffffa06e9cf5 napi_gro_receive+0xb5 ([kernel.kallsyms])
ffffffffc051da6b ixgbe_clean_rx_irq+0x49b ([kernel.kallsyms])
ffffffffc051ef1b ixgbe_poll+0x2ab ([kernel.kallsyms])
ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
546f4db8 [unknown] (/tmp/perf-38107.map)
可以指定一些参数,如 perf script -c bash
来只显示 bash 命令的调用栈。
生成火焰图 #
将 perf script 的输出重定向到 perl 脚本做进一步处理,就得到了火焰图:
# perf record -F 99 -g --call-graph=dwarf -p 35775 -g -- sleep 10
# perf report
# perf script > out.perf
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph/
# 使用 ./stackcollapse-perf.pl 工具将 perf 数据转换
# ./stackcollapse-perf.pl ../out.perf > out.folded
# ./flamegraph.pl out.folded > process.svg
# open process.svg
对于 bpftrace 产生的 profiling 数据,也可以使用 flamegraph 提供的转换工具进行可视化:
sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > trace.data
cd FlameGraph
# 使用 stackcollapse-bpftrace.pl 工具进行转换
./stackcollapse-bpftrace.pl trace.data > trace.folded
./flamegraph.pl --inverted trace.folded > traceflamegraph.svg
跟踪特定函数调用栈:perf probe #
示例函数:
# cat test.c
#
#include <stdio.h>
#include <unistd.h>
void func_d() {
int msec=1;
printf("%s","Hello world from D\n");
usleep(10000*msec);
}
void func_c() {
printf("%s","Hello from C\n");
func_d();
}
void func_b() {
printf("%s","Hello from B\n");
func_c();
}
void func_a() {
printf("%s","Hello from A\n");
func_b();
}
int main() {
func_a();
}
编译:
- 未开启优化,故 ELF 使用 FP。
- -g 表示生成 DWARF 格式的 .debug_xx 调试符号表。
gcc 默认也在 test ELF 中包含符号表(.symtab 和 .dynsym) 和 .eh_frame 信息:
# gcc test.c -g -o test
符号表、.eh_frame 和 .debug_xx 调试符号表是 perf probe 所必须的信息:
- 如果 strip 了符号表、调试符号表,则 perf probe 将不能为函数添加 probe;
perf probe:
# perf probe -x ./hello func_d
Added new event:
probe_hello:func_d (on func_d in /root/hello)
You can now use it in all perf tools, such as:
perf record -e probe_hello:func_d -aR sleep 1
# perf record -e probe_hello:func_d -aR -g ./hello # -g 默认使用 FP,也可以指定 -g --call-graph=dwarf
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.162 MB perf.data (1 samples) ]
# perf script
hello 14239 [002] 5516.587824: probe_hello:func_d: (561112f12169)
561112f12169 func_d+0x0 (/root/hello)
561112f121e3 func_b+0x21 (/root/hello)
561112f12207 func_a+0x21 (/root/hello)
561112f1221c main+0x12 (/root/hello)
7fe3e67e1d90 __libc_start_call_main+0x80 (/usr/lib/x86_64-linux-gnu/libc.so.6)
perf probe 支持内核和用户程序的 dynamic probe:
# 查看内核可以动态 probe 的函数:
# perf probe -F |grep ^block |head
block_ce_count_show
block_class
block_commit_write
block_depr
block_devnode
block_group_cache_tree_search
block_group_err
block_invalidatepage
block_ioctl
block_is_partially_uptodate
# 查看用户程序可以动态 probe 的函数:
# perf probe -F -x ./hello |head
completed.0
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
register_tm_clones
perf trace 捕获执行进程时的各种事件 #
各种事件:
- 系统调用函数;
- tracepoint 函数,一般是执行系统调用函数调用的其他内核函数触发的。
应用场景:例如想查那个进程再发送 ICMP 包。但是又不知道应该向哪个 内核函数 插桩。
perf 会自动显示每一个函数的参数:
root@lima-ebpf-dev:~# perf trace ping www.baidu.com -c1 > /dev/null
? ( ): ping/203535 ... [continued]: execve()) = 0
0.119 ( 0.046 ms): ping/203535 brk() = 0x55b19f84a000
0.606 ( 0.044 ms): ping/203535 arch_prctl(option: 0x3001, arg2: 0x7fffc98478d0) = -1 EINVAL (Invalid argument)
0.714 ( 0.049 ms): ping/203535 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7f36617ca000
0.816 ( 0.050 ms): ping/203535 access(filename: 0x617fbd90, mode: R) = -1 ENOENT (No such file or directory)
0.916 ( 0.050 ms): ping/203535 openat(dfd: CWD, filename: 0x617fa21b, flags: RDONLY|CLOEXEC) = 3
1.010 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846a20, flag: 4096) = 0
1.100 ( 0.158 ms): ping/203535 mmap(len: 35835, prot: READ, flags: PRIVATE, fd: 3) = 0x7f36617c1000
1.308 ( 0.044 ms): ping/203535 close(fd: 3) = 0
1.403 ( 0.049 ms): ping/203535 openat(dfd: CWD, filename: 0x617ca140, flags: RDONLY|CLOEXEC) = 3
1.498 ( 0.045 ms): ping/203535 read(fd: 3, buf: 0x7fffc9846c58, count: 832) = 832
1.588 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846af0, flag: 4096) = 0
1.697 ( 0.055 ms): ping/203535 mmap(len: 41016, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f36617b6000
1.797 ( 0.067 ms): ping/203535 mmap(addr: 0x7f36617b9000, len: 16384, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x3000) = 0x7f36617b9000
1.912 ( 0.050 ms): ping/203535 mmap(addr: 0x7f36617bd000, len: 8192, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x7000) = 0x7f36617bd000
2.007 ( 0.049 ms): ping/203535 mmap(addr: 0x7f36617bf000, len: 8192, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x8000) = 0x7f36617bf000
...
279.945 ( 0.049 ms): ping/203535 close(fd: 1</dev/null>) = 0
280.046 ( 0.065 ms): ping/203535 close(fd: 2</dev/pts/1>) = 0
280.185 ( ): ping/203535 exit_group() = ?
root@lima-ebpf-dev:~#
通过指定 -e glob,可以只捕获符合 glob 匹配的内核事件,如向捕获执行 ping 命令式,内核内部执行了那些 net tracepoint 函数:
root@lima-ebpf-dev:~# perf trace --no-syscalls --event 'net:*' ping www.baidu.com -c1 > /dev/null
0.000 ping/203626 net:net_dev_queue(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
0.061 ping/203626 net:net_dev_start_xmit(name: "lo", skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, len: 84, network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
0.099 ping/203626 net:netif_rx_entry(name: "lo", napi_id: 2, skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, hash: 190000638, l4_hash: 1, len: 70, truesize: 768, mac_header_valid: 1, mac_header: 4294967282)
0.139 ping/203626 net:netif_rx(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
0.173 ping/203626 net:netif_rx_exit(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
0.206 ping/203626 net:net_dev_xmit(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
...
root@lima-ebpf-dev:~#
比如,如果想通过 bpftrace 来捕获发送 icmp 包的进程,则可以从上面 perf trace 的输出结果中来看,对 net_dev_start_xmit 函数进行插桩是比较合适的选择:
- 任意内核函数插桩,建议使用 kfunc 类型,因为可以使用 -lv 来看 kfunc 的参数列表,后面使用 args.
来引用函数参数。 - 如果使用 kprobe 类型,则不支持查看函数列表,需要使用 argN 来引用每一个函数参数。
# 对内核函数关键字进行搜索,不一定完整匹配
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:net_dev_start_xmit'
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:*' |grep net_dev |grep xmit
kfunc:vmlinux:__traceiter_net_dev_start_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit_timeout
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:vmlinux:__traceiter_net_dev_start_xmit'
kfunc:vmlinux:__traceiter_net_dev_start_xmit
void * __data
const struct sk_buff * skb
const struct net_device * dev
int retval
内核对应的 net_dev_start_xmit trace event 定义:
TRACE_EVENT(net_dev_start_xmit,
TP_PROTO(const struct sk_buff *skb, const struct net_device *dev),
TP_ARGS(skb, dev),
TP_STRUCT__entry(
__string( name, dev->name )
__field( u16, queue_mapping )
__field( const void *, skbaddr )
__field( bool, vlan_tagged )
__field( u16, vlan_proto )
__field( u16, vlan_tci )
__field( u16, protocol )
__field( u8, ip_summed )
__field( unsigned int, len )
__field( unsigned int, data_len )
__field( int, network_offset )
__field( bool, transport_offset_valid)
__field( int, transport_offset)
__field( u8, tx_flags )
__field( u16, gso_size )
__field( u16, gso_segs )
__field( u16, gso_type )
),
TP_fast_assign(
__assign_str(name, dev->name);
__entry->queue_mapping = skb->queue_mapping;
__entry->skbaddr = skb;
__entry->vlan_tagged = skb_vlan_tag_present(skb);
__entry->vlan_proto = ntohs(skb->vlan_proto);
__entry->vlan_tci = skb_vlan_tag_get(skb);
__entry->protocol = ntohs(skb->protocol);
__entry->ip_summed = skb->ip_summed;
__entry->len = skb->len;
__entry->data_len = skb->data_len;
__entry->network_offset = skb_network_offset(skb);
__entry->transport_offset_valid =
skb_transport_header_was_set(skb);
__entry->transport_offset = skb_transport_offset(skb);
__entry->tx_flags = skb_shinfo(skb)->tx_flags;
__entry->gso_size = skb_shinfo(skb)->gso_size;
__entry->gso_segs = skb_shinfo(skb)->gso_segs;
__entry->gso_type = skb_shinfo(skb)->gso_type;
),
TP_printk("dev=%s queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d len=%u data_len=%u network_offset=%d transport_offset_valid=%d transport_offset=%d tx_flags=%d gso_size=%d gso_segs=%d gso_type=%#x",
__get_str(name), __entry->queue_mapping, __entry->skbaddr,
__entry->vlan_tagged, __entry->vlan_proto, __entry->vlan_tci,
__entry->protocol, __entry->ip_summed, __entry->len,
__entry->data_len,
__entry->network_offset, __entry->transport_offset_valid,
__entry->transport_offset, __entry->tx_flags,
__entry->gso_size, __entry->gso_segs, __entry->gso_type)
)
perf ftrace 追踪函数 #
https://mp.weixin.qq.com/s/kZfmtbbcTwumyJ7I551Xpw
为了测试方便,将 eth1 的网卡队列设置为 1,并且将网卡中断绑定到 CPU0。
ethtool -L eth1 combined 1
echo 0 > /proc/irq/$(cat /proc/interrupts | awk -F ':' '/eth1/ {gsub(/ /,""); print $1}')/smp_affinity_list
- 收包路径:perf ftrace -G ‘_netif_receive_skb_list_core’ -g ‘smp*’
- 发包路径:perf ftrace -G ‘_netif_receive_skb_core’ -g ‘smp*’
参考:
- Debugging the kernel using Ftrace - part 1: https://lwn.net/Articles/365835/
- Debugging the kernel using Ftrace - part 2: https://lwn.net/Articles/366796/
perf 对指定 cgroup 进行追踪 #
在 Host 上查找容器内二进制文件 #
两种方式:
- /proc/
/root - 查找容器 mergedDir
# ls /proc/102366/root/
apsara bin boot dev entrypoint.sh etc home lib lib64 lost+found media mnt nsenter opt proc root run sbin srv sys tmp usr var
# ls -l /proc/102366/exe
lrwxrwxrwx 1 root root 0 Jan 9 16:11 /proc/102366/exe -> /usr/bin/plugin.csi.cloud.com
# ls -l /proc/102366/root/usr/bin/plugin.csi.cloud.com
-rwxr-xr-x 1 root root 82455166 Jan 9 13:23 /proc/102366/root/usr/bin/plugin.csi.cloud.com
另外也可以根据容器的 .GraphDriver.Data.MergedDir 来在 Host 上查找容器内路径的二进制路径。例如,如果想跟踪 cilium-agent 进程(本身是用 docker 容器部署的),首先需要找到 cilium-agent 文件在宿主机上的绝对路径,可以通过 container ID 或 name 找:
- merged path 是容器使用的 overlay 根文件系统。
# Check cilium-agent container
$ docker ps | grep cilium-agent
0eb2e76384b3 cilium:test "/usr/bin/cilium-agent ..." 4 hours ago Up 4 hours cilium-agent
# Find the merged path for cilium-agent container
$ docker inspect --format "{{.GraphDriver.Data.MergedDir}}" 0eb2e76384b3
/var/lib/docker/overlay2/a17f868d/merged # a17f868d.. is shortened for better viewing
# The object file we are going to trace
$ ls -ahl /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
-rwxr-xr-x 1 root root 86M /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
perf 感知进程 cgroup #
perf 命令感知进程对应的 cgroup,并自动到对应的 cgruop rootfs 下查找二进制和对应的 debuginfo 信息。
Host System
├── /proc/<container_pid>/
│ ├── root/ (容器根文件系统)
│ ├── exe (二进制链接)
│ └── maps (内存映射)
└── /usr/lib/debug/ (调试符号)
根据 /proc//status 的 NSPid 来判断是否是容器进程
#
$ sudo cat /proc/200/status
Name: ksoftirqd/37
Umask: 0000
State: S (sleeping)
...
Cpus_allowed: 0020,00000000
Cpus_allowed_list: 37
voluntary_ctxt_switches: 27251
nonvoluntary_ctxt_switches: 350
其中,根据 NSpid 字段可以判断这个进程是不是容器
:
$ cat /proc/1229/status | grep NSpid
NSpid: 1229
$ cat /proc/11459/status | grep NSpid
NSpid: 11459 1
11459 是在宿主机的 pid ns 内的进程 ID,1 是在容器自己的 pid ns 的进程 ID
如果要对指定 cgroup 进行 syscall tracing/probe,则需要使用 –cgroup/-G 来指定 /sys/fs/cgroup/perf_event 下的 cgrup 路径。
# ls /sys/fs/cgroup/perf_event/docker/
59a016887a1b563799d1f5d942f420ab37f9e99be68996e80ffc68fdc96e06ff a38dc1db0ad27317bbe8914e0e25930e0b7650d2b3c6bb10860dbf797cfba770 cgroup.procs tasks
9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058 cgroup.clone_children notify_on_release
# docker exec -it 9c2281e2be09 sh
/bin # cat /proc/self/cgroup
12:freezer:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
11:pids:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
10:devices:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
9:blkio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
8:cpuset:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
7:rdma:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
6:hugetlb:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
5:perf_event:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
4:net_cls,net_prio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
3:cpu,cpuacct:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
2:memory:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
1:name=systemd:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
0::/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
/bin #
# perf record -F 99 -e cpu-clock --cgroup=docker/9c2281e2be099 -a -- sleep 10
另一例子:
# docker exec 879e9059287c cat /proc/self/cgroup
12:pids:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
11:devices:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
10:memory:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
9:cpu,cpuacct:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
8:blkio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
7:hugetlb:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
6:perf_event:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
5:freezer:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
4:rdma:/
3:net_cls,net_prio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
2:cpuset:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
1:name=systemd:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
# perf stat -e "syscalls:*" -G my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope -a
perf 实验 #
# gcc test.c -o test
# file test
test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=cf5aa8fa3e8950c6e5c71b2588d8b2e8bf40f175, not stripped
# readelf -S ./test |grep -E 'sym|debug|eh_'
[ 5] .dynsym DYNSYM 00000000004002b8 000002b8
[16] .eh_frame_hdr PROGBITS 00000000004006dc 000006dc
[17] .eh_frame PROGBITS 0000000000400730 00000730
[27] .symtab SYMTAB 0000000000000000 00001098
# nm test |grep func
00000000004005dc T func_a
00000000004005c2 T func_b
00000000004005a8 T func_c
000000000040057d T func_d
# strings test |grep func
func_b
func_d
func_a
func_c
# 没有加 -g,默认不捕获函数调用栈,
# 没有加 -e,默认使用 cycles:ppp
# perf record ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ]
# perf report 并不显示函数调用栈
# perf report --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 9 of event 'cycles:ppp'
# Event count (approx.): 1717500
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ............................
#
63.50% test [kernel.kallsyms] [k] sync_regs
32.86% test [kernel.kallsyms] [k] vma_interval_tree_insert
3.29% perf [kernel.kallsyms] [k] do_nmi
0.32% perf [kernel.kallsyms] [k] acpi_os_read_memory
0.03% perf [kernel.kallsyms] [k] native_write_msr
# perf script
perf 41565 3185673.474644: 1 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474657: 1 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474659: 4 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474663: 47 cycles:ppp: ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
perf 41565 3185673.474666: 506 cycles:ppp: ffffffffaf067b76 native_write_msr+0x6 ([kernel.kallsyms])
perf 41565 3185673.474669: 5450 cycles:ppp: ffffffffaf49bdc5 acpi_os_read_memory+0x35 ([kernel.kallsyms])
perf 41565 3185673.474672: 56442 cycles:ppp: ffffffffaf029f59 do_nmi+0x99 ([kernel.kallsyms])
test 41565 3185673.474702: 564369 cycles:ppp: ffffffffaf2268dc vma_interval_tree_insert+0x2c ([kernel.kallsyms])
test 41565 3185673.474986: 1090680 cycles:ppp: ffffffffaf02538e sync_regs+0x1e ([kernel.kallsyms])
# 加 -g 时捕获函数调用栈
# perf record -g ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (11 samples) ]
# perf report --stdio |head -20
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11 of event 'cycles:ppp'
# Event count (approx.): 1310813
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. .......................................
#
75.09% 0.00% test [kernel.kallsyms] [k] page_fault
|
---page_fault
do_page_fault
__do_page_fault
handle_mm_fault
|
--35.97%--__handle_mm_fault
do_anonymous_page
# 但是并没有捕获到用户函数的调用栈
# perf report --stdio |grep func_
# 如果要捕获用户函数,则需要使用 perf probe
# 查看 ./test 中可以 probe 的用户函数列表
# perf probe -F -x ./test
completed.6355
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
puts@plt
register_tm_clones
usleep@plt
# perf probe -x ./test 'func_d'
Added new event:
probe_test:func_d (on func_d in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:func_d -aR sleep 1
# 然后执行 perf record 时需要加 -g 才能捕获函数调用栈,
# 或者使用 dwarf 来 unwinding(使用默认生成的 .eh_frame)
# perf record -e probe_test:func_d -aR ./test # perf record -e probe_test:func_d --call-graph=dwarf -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.330 MB perf.data (1 samples) ]
# perf script
test 104460 [076] 3186931.797081: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005bf func_c+0x17 (/root/zj/test)
4005d9 func_b+0x17 (/root/zj/test)
4005f3 func_a+0x17 (/root/zj/test)
400603 main+0xd (/root/zj/test)
7fb599014444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
4004b8 _start+0x28 (/root/zj/test)
# perf report --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1 of event 'probe_test:func_d'
# Event count (approx.): 1
#
# Children Self Trace output
# ........ ........ ............
#
100.00% 100.00% (40057d)
|
---__libc_start_main
main
func_a
func_b
func_d
# perf record -e probe_test:func_d_1 --proc-map-timeout 100 -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.845 MB perf.data (1 samples) ]
# perf script
test 129241 [084] 3187206.870347: probe_test:func_d_1: (40057d)
# perf record -e probe_test:func_d_1 --proc-map-timeout 500 -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.404 MB perf.data (1 samples) ]
# perf script
test 76042 [073] 3187258.344951: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7f8f5e361445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
# perf record -e probe_test:func_d_1 --proc-map-timeout 500 --call-graph fp -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.476 MB perf.data (1 samples) ]
# perf script
test 103396 [068] 3187282.462500: probe_test:func_d_1: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005da func_b+0x18 (/root/zj/test)
4005f4 func_a+0x18 (/root/zj/test)
400604 main+0xe (/root/zj/test)
7feeb9598445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)
# gcc ./test.c -o test -fomit-frame-pointer
# perf probe -l
kprobes:aegis___x64_sys_execve (on __x64_sys_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 ar
kprobes:aegis_do_execve (on do_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 arg6 arg7 arg8 a
kprobes:aegis_do_sys_open (on do_sys_open with dfd file flag)
kprobes:aegis_proc_fork_connector (on proc_fork_connector with tgid pid parent_tgid parent_pid)
kprobes:aegis_tcp_connect (on tcp_connect with saddr daddr sport dport)
probe_test:func_d (on func_d in /root/zj/test)
probe_test:func_d_1 (on func_d in /root/zj/test)
# perf probe -d probe_test:*
Removed event: probe_test:func_d
Removed event: probe_test:func_d_1
# perf probe -x ./test 'func_d'
Added new event:
probe_test:func_d (on func_d in /root/zj/test)
You can now use it in all perf tools, such as:
perf record -e probe_test:func_d -aR sleep 1
# perf record -e probe_test:func_d -aR -g ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.281 MB perf.data (1 samples) ]
# perf script
test 85054 [093] 3187435.293375: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.288 MB perf.data (1 samples) ]
# perf script
test 106253 [094] 3187527.044076: probe_test:func_d: (40057d)
40057d func_d+0x0 (/root/zj/test)
4005c0 func_c+0x17 (/root/zj/test)
4005dd func_b+0x17 (/root/zj/test)
4005fa func_a+0x17 (/root/zj/test)
40060d main+0xd (/root/zj/test)
7f15c70ca444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
4004b8 _start+0x28 (/root/zj/test)
# perf annotate --stdio2
Samples: 1 of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
func_d() /root/zj/test
Percent
Disassembly of section .text:
000000000040057d <func_d>:
func_d():
100.00 push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
movl $0x1,-0x4(%rbp)
mov $0x4006a0,%edi
→ callq puts@plt
mov -0x4(%rbp),%eax
imul $0x2710,%eax,%eax
mov %eax,%edi
→ callq usleep@plt
leaveq
← retq
# gcc ./test.c -g -o test
# readelf -S ./test |grep debug
[27] .debug_aranges PROGBITS 0000000000000000 00001096
[28] .debug_info PROGBITS 0000000000000000 000010c6
[29] .debug_abbrev PROGBITS 0000000000000000 000011ce
[30] .debug_line PROGBITS 0000000000000000 0000124d
[31] .debug_str PROGBITS 0000000000000000 0000129a
# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 27.890 MB perf.data (1 samples) ]
# perf annotate --stdio2
Samples: 1 of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
func_d() /root/zj/test
Percent
Disassembly of section .text:
000000000040057d <func_d>:
func_d():
#include <stdio.h>
#include <unistd.h>
void func_d() {
100.00 push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
int msec=1;
movl $0x1,-0x4(%rbp)
printf("%s","Hello world from D\n");
mov $0x4006a0,%edi
→ callq puts@plt
usleep(10000*msec);
mov -0x4(%rbp),%eax
imul $0x2710,%eax,%eax
mov %eax,%edi
→ callq usleep@plt
}
leaveq
← retq
报错:
# perf record -e probe_test:func_d_1 --call-graph=dwarf -aR ./test
Reading /proc/101522/task/101522/maps time out. You may want to increase the time limit by --proc-map-timeout
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
Warning:
1 map information files for pre-existing threads were
not processed, if there are samples for addresses they
will not be resolved, you may find out which are these
threads by running with -v and redirecting the output
to a file.
The time limit to process proc map is too short?
Increase it by --proc-map-timeout
[ perf record: Captured and wrote 28.158 MB perf.data (1 samples) ]
# perf record -e probe_test:func_d_1 --call-graph=dwarf --proc-map-timeout 100 -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.081 MB perf.data (1 samples) ]