跳过正文

使用 linux perf 进行内核和应用性能分析

·5365 字
Perf Kernel Performance Tool
目录

介绍 perf 工具的使用方式、局限性和问题。

查看 perf 构建参数
#

# perf version --build-options
perf version 4.19.91-007.ali4000.alios7.x86_64.debug
                 dwarf: [ on  ]  # HAVE_DWARF_SUPPORT
    dwarf_getlocations: [ on  ]  # HAVE_DWARF_GETLOCATIONS_SUPPORT
                 glibc: [ on  ]  # HAVE_GLIBC_SUPPORT
                  gtk2: [ OFF ]  # HAVE_GTK2_SUPPORT
         syscall_table: [ on  ]  # HAVE_SYSCALL_TABLE_SUPPORT
                libbfd: [ on  ]  # HAVE_LIBBFD_SUPPORT
                libelf: [ on  ]  # HAVE_LIBELF_SUPPORT
               libnuma: [ on  ]  # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on  ]  # HAVE_LIBNUMA_SUPPORT
               libperl: [ on  ]  # HAVE_LIBPERL_SUPPORT
             libpython: [ on  ]  # HAVE_LIBPYTHON_SUPPORT
              libslang: [ on  ]  # HAVE_SLANG_SUPPORT
             libcrypto: [ on  ]  # HAVE_LIBCRYPTO_SUPPORT
             libunwind: [ OFF ]  # HAVE_LIBUNWIND_SUPPORT
    libdw-dwarf-unwind: [ on  ]  # HAVE_DWARF_SUPPORT
                  zlib: [ on  ]  # HAVE_ZLIB_SUPPORT
                  lzma: [ OFF ]  # HAVE_LZMA_SUPPORT
             get_cpuid: [ on  ]  # HAVE_AUXTRACE_SUPPORT
                   bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT

系统整体 profiling,分析 CPU 繁忙情况
#

$ sudo perf record -a -g sleep 5
$ sudo perf report

$ sudo perf report --stdio --comm fvUnisocket --percent-limit 0.5

一次 probe trace 程序中的所有函数
#

# perf probe -x ./test 'func_*'
Error: event "func_d" already exists.
 Hint: Remove existing event by 'perf probe -d'
       or force duplicates by 'perf probe -f'
       or set 'force=yes' in BPF source.
  Error: Failed to add events.

# perf probe -d probe_test:*
Removed event: probe_test:func_d

# perf probe -x ./test '*'
Added new events:
  probe_test:func_d    (on * in /root/zj/test)
  probe_test:func_c    (on * in /root/zj/test)
  probe_test:func_b    (on * in /root/zj/test)
  probe_test:func_a    (on * in /root/zj/test)
  probe_test:main      (on * in /root/zj/test)

You can now use it in all perf tools, such as:

	perf record -e probe_test:main -aR sleep 1

# perf record -e probe_test:* -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 12.568 MB perf.data (5 samples) ]

# perf script
test 104426 [083] 3189986.791451:   probe_test:main: (4005f6)
                  4005f6 main+0x0 (/root/zj/test)

test 104426 [083] 3189986.791458: probe_test:func_a: (4005dc)
                  4005dc func_a+0x0 (/root/zj/test)
            7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

test 104426 [083] 3189986.791492: probe_test:func_b: (4005c2)
                  4005c2 func_b+0x0 (/root/zj/test)
                  400604 main+0xe (/root/zj/test)
            7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

test 104426 [083] 3189986.791500: probe_test:func_c: (4005a8)
                  4005a8 func_c+0x0 (/root/zj/test)
                  4005f4 func_a+0x18 (/root/zj/test)
                  400604 main+0xe (/root/zj/test)
            7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

test 104426 [083] 3189986.791504: probe_test:func_d: (40057d)
                  40057d func_d+0x0 (/root/zj/test)
                  4005da func_b+0x18 (/root/zj/test)
                  4005f4 func_a+0x18 (/root/zj/test)
                  400604 main+0xe (/root/zj/test)
            7f4e708de555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

# grep libwebp /proc/25760/maps
7f7bc6af3000-7f7bc6af6000 r--p 00000000 09:01 38281904                   /usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2

# bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2:* {time("%H:%M:%S "); printf("%s %d\n",comm,pid);}' | tee /tmp/libwebp.trace

perf probe 不支持 go、rust 等非 C 程序
#

但 perf record 等还能正常使用:

解决办法:使用 bpftrace

#perf probe -F -x /app/myagent |grep -i store |grep gitlab
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveProcessLocateResult
git.com/paas/myagent/pkg/storage.(*ProcessStore).SetRelistFunc
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupByDiskUsed.func1
git.com/paas/myagent/pkg/storage.(*ProcessStore).cleanupOldRecords
git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
git.com/paas/myagent/pkg/storage.NewStore
git.com/paas/myagent/pkg/storage.NewStore.func1
git.com/paas/myagent/pkg/storage.NewStore.func2

#perf probe -x /app/myagent 'SaveNetworkLocateInfo'
Failed to find symbol SaveNetworkLocateInfo in /app/myagent
  Error: Failed to add events.

# perf probe -x /app/myagent 'git.com/paas/myagent/pkg/storage.(\*ProcessStore).SaveNetworkLocateInfo'
Failed to write event: Invalid argument
  Error: Failed to add events.

# nm -n /app/myagent |grep SaveNetwork
00000000012e0220 T git.com/paas/myagent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/paas/myagent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

# perf probe -x /app/myagent --add '0x12e0220'
Failed to write event: Invalid argument
  Error: Failed to add events.

采样:perf record
#

perf record 跟踪记录内核及应用程序的执行状态以及调用栈。

-g 表示记录函数调用栈,等效于 --call-graph=fp,fp 表示使用 frame pointer 来进行 stack unwinding。如果 ELF 或共享库没有开启 frame pointer,则需要指定 --call-graph=dwarf 参数,它使用 ELF 中的 .eh_frame 或 debuginfo 文件中的 .debug_frame 信息来对用户空间函数进行 stack unwinding。

  • 内核函数栈不使用 DWARF,它使用 FP 或 ORC(4.14 及以后内核),perf record 自动选择。

除了 stack unwinding 外,perf record 还使用符号表(.symtab、.dynsym)和 DWARF 符号表(.debug_xx)来对地址进行符号解析(symbol),需要采样的进程二进制以及依赖的共享库(如 libc6-dbg,zlib1g-dbg)包含它们。但是这两类符号表不是程序运行所必须的,同时为了减少二进制大小,一般把它们从 ELF 中 strip(dynsym 不能被 strip),然后保存到单独的 debuginfo 包中。

所以为了 perf、bpfrace 等正常工作,需要安装对应的 XX-debuginfo 或 XX-dbg 或 XX-dbgsym 包,用于提供符号表和 DWARF 调试符号表:

  • 使用 file 或 objedump 或 readelf 命令查看 ELF 是否包含 debuginfo 和符号表;
  • 如果不能符号解析,则后续生成的火焰图会有大量的 unkonwn block;
  • 内核符号表使用的是 /proc/kallsyms 或者 vmlinux ELF 文件;
# perf record -F 99 -g --call-graph=dwarf -p 35775 -g -- sleep 10
# perf record -a -g -- sleep 5
[ perf record: Woken up 584 times to write data ]
Warning:
2 out of order events recorded.
[ perf record: Captured and wrote 172.619 MB perf.data (1128041 samples) ]

# du -sh perf.data
173M    perf.data

选项说明:

  1. -a: 指定采样所有 CPU 核;
  2. -p PID:指定采样指定进程,否则是所有进程。后续一般展示特定进程的火焰图,所以在 perf record 时需要通过 -p 来指定进程 PID;
  3. -F 99:采样频率

生成的信息保存在 perf.data 中,然后通过 perf report/script 分析性能和调用栈。

提高采集频率:$ echo {rate} > /proc/sys/kernel/perf_event_max_sample_rate

注意,这将对 CPU、磁盘 IO 等有显著影响。

bpftrace 也支持 profiling: bpftrace -e 'profile:hz:99 {@[kstack]=count();}' , 底层还是用的 perf 数据源(perf_event_output()) 。

查看函数 CPU 占用量:perf report
#

perf report 查看 perf report 记录的 perf.data 文件中哪些函数占用的 CPU 最多:

$ perf report
Samples: 24K of event 'cycles', Event count (approx.): 4868947877
  Children      Self  Command   Shared Object        Symbol
+   17.08%     0.23%  swapper   [kernel.kallsyms]    [k] do_idle
+    5.38%     5.38%  swapper   [kernel.kallsyms]    [k] intel_idle
+    4.21%     0.02%  kubelet   [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe
+    4.08%     0.00%  kubelet   kubelet              [.] k8s.io/kubernetes/vendor/github.com/google/...
+    4.06%     0.00%  dockerd   dockerd              [.] net/http.(*conn).serve
+    3.96%     0.00%  dockerd   dockerd              [.] net/http.serverHandler.ServeHTTP
...

这是一个交互式的窗口,可以选中具体函数展开查看详情。

打印调用栈:perf script
#

展示采集到的事件及其调用栈:

isc-socket 15964 [003] 1912865.267505:     699832 cycles:ppp:
        ffffffffa0422fb2 _copy_from_user+0x2 ([kernel.kallsyms])
        ffffffffaFailed to open /tmp/perf-38107.map, continuing without symbols
02efbbe __x64_sys_epoll_ctl+0x4e ([kernel.kallsyms])
        ffffffffa00042e5 do_syscall_64+0x55 ([kernel.kallsyms])
        ffffffffa0a00088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
            7f460786349a epoll_ctl+0xa (/usr/lib64/libc-2.17.so)

swapper     0 [005] 1912865.267508:      20833 cycles:ppp:
        ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
        ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
        ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
        ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
        ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
        ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
        ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
            7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
            7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
                546f4db8 [unknown] (/tmp/perf-38107.map)

sqlonline_worke 38771 [006] 1912865.267542:          1 cycles:ppp:
        ffffffffa0067b76 native_write_msr+0x6 ([kernel.kallsyms])
        ffffffffa000c987 __intel_pmu_enable_all.constprop.26+0x47 ([kernel.kallsyms])
        ffffffffa01d27d3 event_function+0x83 ([kernel.kallsyms])
        ffffffffa01d3e69 remote_function+0x39 ([kernel.kallsyms])
        ffffffffa0139f50 flush_smp_call_function_queue+0x70 ([kernel.kallsyms])
        ffffffffa0a023b4 smp_call_function_single_interrupt+0x34 ([kernel.kallsyms])
        ffffffffa0a01b5f call_function_single_interrupt+0xf ([kernel.kallsyms])
        ffffffffc0362a76 ipt_do_table+0x2a6 ([kernel.kallsyms])
        ffffffffa073abbd nf_hook_slow+0x3d ([kernel.kallsyms])
        ffffffffa07471c3 ip_rcv+0xa3 ([kernel.kallsyms])
        ffffffffa06e8b00 __netif_receive_skb_one_core+0x50 ([kernel.kallsyms])
        ffffffffa06e7d02 netif_receive_skb_internal+0x42 ([kernel.kallsyms])
        ffffffffa06e9cf5 napi_gro_receive+0xb5 ([kernel.kallsyms])
        ffffffffc051da6b ixgbe_clean_rx_irq+0x49b ([kernel.kallsyms])
        ffffffffc051ef1b ixgbe_poll+0x2ab ([kernel.kallsyms])
        ffffffffa06e9492 net_rx_action+0x292 ([kernel.kallsyms])
        ffffffffa0c00108 __softirqentry_text_start+0x108 ([kernel.kallsyms])
        ffffffffa009f594 irq_exit+0xf4 ([kernel.kallsyms])
        ffffffffa0a01cb2 do_IRQ+0x52 ([kernel.kallsyms])
        ffffffffa0a009cf ret_from_intr+0x0 ([kernel.kallsyms])
            7f67d9be12e7 vfprintf+0x3df7 (/usr/lib64/libc-2.17.so)
            7f67d9c0cfa9 _IO_vsnprintf+0x79 (/usr/lib64/libc-2.17.so)
                546f4db8 [unknown] (/tmp/perf-38107.map)

可以指定一些参数,如 perf script -c bash 来只显示 bash 命令的调用栈。

生成火焰图
#

将 perf script 的输出重定向到 perl 脚本做进一步处理,就得到了火焰图:

# perf record -F 99 -g --call-graph=dwarf -p 35775 -g -- sleep 10
# perf report
# perf script > out.perf
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph/
# 使用 ./stackcollapse-perf.pl 工具将 perf 数据转换
# ./stackcollapse-perf.pl  ../out.perf > out.folded
# ./flamegraph.pl out.folded > process.svg
# open process.svg

对于 bpftrace 产生的 profiling 数据,也可以使用 flamegraph 提供的转换工具进行可视化:

sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > trace.data
cd FlameGraph
# 使用 stackcollapse-bpftrace.pl 工具进行转换
./stackcollapse-bpftrace.pl trace.data > trace.folded
./flamegraph.pl --inverted trace.folded > traceflamegraph.svg

跟踪特定函数调用栈:perf probe
#

示例函数:

# cat test.c
#
#include <stdio.h>
#include <unistd.h>

void func_d() {
                int msec=1;
                printf("%s","Hello world from D\n");
                usleep(10000*msec);
}
void func_c() {
                printf("%s","Hello from C\n");
                func_d();
}
void func_b() {
                printf("%s","Hello from B\n");
        func_c();
}
void func_a() {
                printf("%s","Hello from A\n");
                func_b();
}
int main() {
        func_a();
}

编译:

  • 未开启优化,故 ELF 使用 FP。
  • -g 表示生成 DWARF 格式的 .debug_xx 调试符号表。

gcc 默认也在 test ELF 中包含符号表(.symtab 和 .dynsym) 和 .eh_frame 信息:

# gcc test.c -g -o test

符号表、.eh_frame 和 .debug_xx 调试符号表是 perf probe 所必须的信息:

  • 如果 strip 了符号表、调试符号表,则 perf probe 将不能为函数添加 probe;

perf probe:

# perf probe -x ./hello func_d
Added new event:
  probe_hello:func_d   (on func_d in /root/hello)

You can now use it in all perf tools, such as:

        perf record -e probe_hello:func_d -aR sleep 1

# perf record -e probe_hello:func_d -aR -g ./hello  # -g 默认使用 FP,也可以指定 -g --call-graph=dwarf
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.162 MB perf.data (1 samples) ]

# perf script
hello 14239 [002]  5516.587824: probe_hello:func_d: (561112f12169)
            561112f12169 func_d+0x0 (/root/hello)
            561112f121e3 func_b+0x21 (/root/hello)
            561112f12207 func_a+0x21 (/root/hello)
            561112f1221c main+0x12 (/root/hello)
            7fe3e67e1d90 __libc_start_call_main+0x80 (/usr/lib/x86_64-linux-gnu/libc.so.6)

perf probe 支持内核和用户程序的 dynamic probe:

# 查看内核可以动态 probe 的函数:
# perf probe -F |grep ^block |head
block_ce_count_show
block_class
block_commit_write
block_depr
block_devnode
block_group_cache_tree_search
block_group_err
block_invalidatepage
block_ioctl
block_is_partially_uptodate

# 查看用户程序可以动态 probe 的函数:
# perf probe -F -x ./hello |head
completed.0
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
register_tm_clones

perf trace 捕获执行进程时的各种事件
#

各种事件:

  1. 系统调用函数;
  2. tracepoint 函数,一般是执行系统调用函数调用的其他内核函数触发的。

应用场景:例如想查那个进程再发送 ICMP 包。但是又不知道应该向哪个 内核函数 插桩。

perf 会自动显示每一个函数的参数:

root@lima-ebpf-dev:~# perf trace ping www.baidu.com -c1 > /dev/null
         ? (         ): ping/203535  ... [continued]: execve())                                           = 0
     0.119 ( 0.046 ms): ping/203535 brk()                                                                 = 0x55b19f84a000
     0.606 ( 0.044 ms): ping/203535 arch_prctl(option: 0x3001, arg2: 0x7fffc98478d0)                      = -1 EINVAL (Invalid argument)
     0.714 ( 0.049 ms): ping/203535 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS)           = 0x7f36617ca000
     0.816 ( 0.050 ms): ping/203535 access(filename: 0x617fbd90, mode: R)                                 = -1 ENOENT (No such file or directory)
     0.916 ( 0.050 ms): ping/203535 openat(dfd: CWD, filename: 0x617fa21b, flags: RDONLY|CLOEXEC)         = 3
     1.010 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846a20, flag: 4096) = 0
     1.100 ( 0.158 ms): ping/203535 mmap(len: 35835, prot: READ, flags: PRIVATE, fd: 3)                   = 0x7f36617c1000
     1.308 ( 0.044 ms): ping/203535 close(fd: 3)                                                          = 0
     1.403 ( 0.049 ms): ping/203535 openat(dfd: CWD, filename: 0x617ca140, flags: RDONLY|CLOEXEC)         = 3
     1.498 ( 0.045 ms): ping/203535 read(fd: 3, buf: 0x7fffc9846c58, count: 832)                          = 832
     1.588 ( 0.045 ms): ping/203535 newfstatat(dfd: 3, filename: 0x617faee9, statbuf: 0x7fffc9846af0, flag: 4096) = 0
     1.697 ( 0.055 ms): ping/203535 mmap(len: 41016, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3)         = 0x7f36617b6000
     1.797 ( 0.067 ms): ping/203535 mmap(addr: 0x7f36617b9000, len: 16384, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x3000) = 0x7f36617b9000
     1.912 ( 0.050 ms): ping/203535 mmap(addr: 0x7f36617bd000, len: 8192, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x7000) = 0x7f36617bd000
     2.007 ( 0.049 ms): ping/203535 mmap(addr: 0x7f36617bf000, len: 8192, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x8000) = 0x7f36617bf000
...
   279.945 ( 0.049 ms): ping/203535 close(fd: 1</dev/null>)                                               = 0
   280.046 ( 0.065 ms): ping/203535 close(fd: 2</dev/pts/1>)                                              = 0
   280.185 (         ): ping/203535 exit_group()                                                          = ?
root@lima-ebpf-dev:~#

通过指定 -e glob,可以只捕获符合 glob 匹配的内核事件,如向捕获执行 ping 命令式,内核内部执行了那些 net tracepoint 函数:

root@lima-ebpf-dev:~# perf trace --no-syscalls --event 'net:*' ping www.baidu.com -c1 > /dev/null
     0.000 ping/203626 net:net_dev_queue(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
     0.061 ping/203626 net:net_dev_start_xmit(name: "lo", skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, len: 84, network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
     0.099 ping/203626 net:netif_rx_entry(name: "lo", napi_id: 2, skbaddr: 0xffff99fac46fc400, protocol: 2048, ip_summed: 3, hash: 190000638, l4_hash: 1, len: 70, truesize: 768, mac_header_valid: 1, mac_header: 4294967282)
     0.139 ping/203626 net:netif_rx(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
     0.173 ping/203626 net:netif_rx_exit(skbaddr: 0xffff99fac46fc400, len: 70, name: "lo")
     0.206 ping/203626 net:net_dev_xmit(skbaddr: 0xffff99fac46fc400, len: 84, name: "lo")
...
root@lima-ebpf-dev:~#

比如,如果想通过 bpftrace 来捕获发送 icmp 包的进程,则可以从上面 perf trace 的输出结果中来看,对 net_dev_start_xmit 函数进行插桩是比较合适的选择:

  • 任意内核函数插桩,建议使用 kfunc 类型,因为可以使用 -lv 来看 kfunc 的参数列表,后面使用 args. 来引用函数参数。
  • 如果使用 kprobe 类型,则不支持查看函数列表,需要使用 argN 来引用每一个函数参数。
# 对内核函数关键字进行搜索,不一定完整匹配
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:net_dev_start_xmit'
root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:*' |grep net_dev |grep xmit
kfunc:vmlinux:__traceiter_net_dev_start_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit
kfunc:vmlinux:__traceiter_net_dev_xmit_timeout

root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:vmlinux:__traceiter_net_dev_start_xmit'
kfunc:vmlinux:__traceiter_net_dev_start_xmit
    void * __data
    const struct sk_buff * skb
    const struct net_device * dev
    int retval

内核对应的 net_dev_start_xmit trace event 定义:

TRACE_EVENT(net_dev_start_xmit,

	TP_PROTO(const struct sk_buff *skb, const struct net_device *dev),

	TP_ARGS(skb, dev),

	TP_STRUCT__entry(
		__string(	name,			dev->name	)
		__field(	u16,			queue_mapping	)
		__field(	const void *,		skbaddr		)
		__field(	bool,			vlan_tagged	)
		__field(	u16,			vlan_proto	)
		__field(	u16,			vlan_tci	)
		__field(	u16,			protocol	)
		__field(	u8,			ip_summed	)
		__field(	unsigned int,		len		)
		__field(	unsigned int,		data_len	)
		__field(	int,			network_offset	)
		__field(	bool,			transport_offset_valid)
		__field(	int,			transport_offset)
		__field(	u8,			tx_flags	)
		__field(	u16,			gso_size	)
		__field(	u16,			gso_segs	)
		__field(	u16,			gso_type	)
	),

	TP_fast_assign(
		__assign_str(name, dev->name);
		__entry->queue_mapping = skb->queue_mapping;
		__entry->skbaddr = skb;
		__entry->vlan_tagged = skb_vlan_tag_present(skb);
		__entry->vlan_proto = ntohs(skb->vlan_proto);
		__entry->vlan_tci = skb_vlan_tag_get(skb);
		__entry->protocol = ntohs(skb->protocol);
		__entry->ip_summed = skb->ip_summed;
		__entry->len = skb->len;
		__entry->data_len = skb->data_len;
		__entry->network_offset = skb_network_offset(skb);
		__entry->transport_offset_valid =
			skb_transport_header_was_set(skb);
		__entry->transport_offset = skb_transport_offset(skb);
		__entry->tx_flags = skb_shinfo(skb)->tx_flags;
		__entry->gso_size = skb_shinfo(skb)->gso_size;
		__entry->gso_segs = skb_shinfo(skb)->gso_segs;
		__entry->gso_type = skb_shinfo(skb)->gso_type;
	),

	TP_printk("dev=%s queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d len=%u data_len=%u network_offset=%d transport_offset_valid=%d transport_offset=%d tx_flags=%d gso_size=%d gso_segs=%d gso_type=%#x",
		  __get_str(name), __entry->queue_mapping, __entry->skbaddr,
		  __entry->vlan_tagged, __entry->vlan_proto, __entry->vlan_tci,
		  __entry->protocol, __entry->ip_summed, __entry->len,
		  __entry->data_len,
		  __entry->network_offset, __entry->transport_offset_valid,
		  __entry->transport_offset, __entry->tx_flags,
		  __entry->gso_size, __entry->gso_segs, __entry->gso_type)
)

perf ftrace 追踪函数
#

https://mp.weixin.qq.com/s/kZfmtbbcTwumyJ7I551Xpw

为了测试方便,将 eth1 的网卡队列设置为 1,并且将网卡中断绑定到 CPU0。

ethtool -L eth1 combined 1
echo 0 > /proc/irq/$(cat /proc/interrupts | awk -F ':' '/eth1/ {gsub(/ /,""); print $1}')/smp_affinity_list
  • 收包路径:perf ftrace -G ‘_netif_receive_skb_list_core’ -g ‘smp*’
  • 发包路径:perf ftrace -G ‘_netif_receive_skb_core’ -g ‘smp*’

参考:

  1. Debugging the kernel using Ftrace - part 1: https://lwn.net/Articles/365835/
  2. Debugging the kernel using Ftrace - part 2: https://lwn.net/Articles/366796/

perf 对指定 cgroup 进行追踪
#

在 Host 上查找容器内二进制文件
#

两种方式:

  1. /proc//root
  2. 查找容器 mergedDir
# ls  /proc/102366/root/
apsara  bin  boot  dev  entrypoint.sh  etc  home  lib  lib64  lost+found  media  mnt  nsenter  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# ls -l  /proc/102366/exe
lrwxrwxrwx 1 root root 0 Jan  9 16:11 /proc/102366/exe -> /usr/bin/plugin.csi.cloud.com

# ls -l  /proc/102366/root/usr/bin/plugin.csi.cloud.com
-rwxr-xr-x 1 root root 82455166 Jan  9 13:23 /proc/102366/root/usr/bin/plugin.csi.cloud.com

另外也可以根据容器的 .GraphDriver.Data.MergedDir 来在 Host 上查找容器内路径的二进制路径。例如,如果想跟踪 cilium-agent 进程(本身是用 docker 容器部署的),首先需要找到 cilium-agent 文件在宿主机上的绝对路径,可以通过 container ID 或 name 找:

  • merged path 是容器使用的 overlay 根文件系统。
# Check cilium-agent container
$ docker ps | grep cilium-agent
0eb2e76384b3        cilium:test   "/usr/bin/cilium-agent ..."   4 hours ago    Up 4 hours   cilium-agent

# Find the merged path for cilium-agent container
$ docker inspect --format "{{.GraphDriver.Data.MergedDir}}" 0eb2e76384b3
/var/lib/docker/overlay2/a17f868d/merged # a17f868d.. is shortened for better viewing

# The object file we are going to trace
$ ls -ahl /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
-rwxr-xr-x 1 root root 86M /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent

perf 感知进程 cgroup
#

perf 命令感知进程对应的 cgroup,并自动到对应的 cgruop rootfs 下查找二进制和对应的 debuginfo 信息。

Host System
├── /proc/<container_pid>/
│   ├── root/ (容器根文件系统)
│   ├── exe (二进制链接)
│   └── maps (内存映射)
└── /usr/lib/debug/ (调试符号)

根据 /proc//status 的 NSPid 来判断是否是容器进程
#

$ sudo cat /proc/200/status
Name:   ksoftirqd/37
Umask:  0000
State:  S (sleeping)
...
Cpus_allowed:   0020,00000000
Cpus_allowed_list:      37
voluntary_ctxt_switches:        27251
nonvoluntary_ctxt_switches:     350

其中,根据 NSpid 字段可以判断这个进程是不是容器

$ cat /proc/1229/status | grep NSpid
NSpid: 1229
$ cat /proc/11459/status | grep NSpid
NSpid: 11459 1

11459 是在宿主机的 pid ns 内的进程 ID,1 是在容器自己的 pid ns 的进程 ID

如果要对指定 cgroup 进行 syscall tracing/probe,则需要使用 –cgroup/-G 来指定 /sys/fs/cgroup/perf_event 下的 cgrup 路径。

# ls /sys/fs/cgroup/perf_event/docker/
59a016887a1b563799d1f5d942f420ab37f9e99be68996e80ffc68fdc96e06ff  a38dc1db0ad27317bbe8914e0e25930e0b7650d2b3c6bb10860dbf797cfba770  cgroup.procs       tasks
9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058  cgroup.clone_children                                             notify_on_release

# docker exec -it 9c2281e2be09 sh
/bin # cat /proc/self/cgroup
12:freezer:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
11:pids:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
10:devices:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
9:blkio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
8:cpuset:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
7:rdma:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
6:hugetlb:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
5:perf_event:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
4:net_cls,net_prio:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
3:cpu,cpuacct:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
2:memory:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
1:name=systemd:/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
0::/docker/9c2281e2be0997da155140b9316c3703f1e1f85838219552b0a4a08741190058
/bin #

# perf record -F 99 -e cpu-clock --cgroup=docker/9c2281e2be099 -a -- sleep 10

另一例子:

# docker exec 879e9059287c cat /proc/self/cgroup
12:pids:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
11:devices:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
10:memory:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
9:cpu,cpuacct:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
8:blkio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
7:hugetlb:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
6:perf_event:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
5:freezer:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
4:rdma:/
3:net_cls,net_prio:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
2:cpuset:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope
1:name=systemd:/my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope

# perf stat -e "syscalls:*" -G  my_private.slice/docker-879e9059287c1d7a4092ed4682ee2471834f6e1ef4399c73b26277e1a29658eb.scope -a

perf 实验
#

# gcc test.c  -o test

# file test
test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=cf5aa8fa3e8950c6e5c71b2588d8b2e8bf40f175, not stripped

# readelf -S ./test |grep -E 'sym|debug|eh_'
  [ 5] .dynsym           DYNSYM           00000000004002b8  000002b8
  [16] .eh_frame_hdr     PROGBITS         00000000004006dc  000006dc
  [17] .eh_frame         PROGBITS         0000000000400730  00000730
  [27] .symtab           SYMTAB           0000000000000000  00001098

# nm test |grep func
00000000004005dc T func_a
00000000004005c2 T func_b
00000000004005a8 T func_c
000000000040057d T func_d

# strings test |grep func
func_b
func_d
func_a
func_c

# 没有加 -g,默认不捕获函数调用栈,
# 没有加 -e,默认使用 cycles:ppp
# perf record ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ]

# perf report 并不显示函数调用栈
# perf report  --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 9  of event 'cycles:ppp'
# Event count (approx.): 1717500
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ............................
#
    63.50%  test     [kernel.kallsyms]  [k] sync_regs
    32.86%  test     [kernel.kallsyms]  [k] vma_interval_tree_insert
     3.29%  perf     [kernel.kallsyms]  [k] do_nmi
     0.32%  perf     [kernel.kallsyms]  [k] acpi_os_read_memory
     0.03%  perf     [kernel.kallsyms]  [k] native_write_msr

# perf script
            perf 41565 3185673.474644:          1 cycles:ppp:  ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
            perf 41565 3185673.474657:          1 cycles:ppp:  ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
            perf 41565 3185673.474659:          4 cycles:ppp:  ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
            perf 41565 3185673.474663:         47 cycles:ppp:  ffffffffaf067b74 native_write_msr+0x4 ([kernel.kallsyms])
            perf 41565 3185673.474666:        506 cycles:ppp:  ffffffffaf067b76 native_write_msr+0x6 ([kernel.kallsyms])
            perf 41565 3185673.474669:       5450 cycles:ppp:  ffffffffaf49bdc5 acpi_os_read_memory+0x35 ([kernel.kallsyms])
            perf 41565 3185673.474672:      56442 cycles:ppp:  ffffffffaf029f59 do_nmi+0x99 ([kernel.kallsyms])
            test 41565 3185673.474702:     564369 cycles:ppp:  ffffffffaf2268dc vma_interval_tree_insert+0x2c ([kernel.kallsyms])
            test 41565 3185673.474986:    1090680 cycles:ppp:  ffffffffaf02538e sync_regs+0x1e ([kernel.kallsyms])

# 加 -g 时捕获函数调用栈
# perf record -g  ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (11 samples) ]
# perf report  --stdio |head -20
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11  of event 'cycles:ppp'
# Event count (approx.): 1310813
#
# Children      Self  Command  Shared Object      Symbol
# ........  ........  .......  .................  .......................................
#
    75.09%     0.00%  test     [kernel.kallsyms]  [k] page_fault
            |
            ---page_fault
               do_page_fault
               __do_page_fault
               handle_mm_fault
               |
                --35.97%--__handle_mm_fault
                          do_anonymous_page
# 但是并没有捕获到用户函数的调用栈
# perf report  --stdio |grep func_

# 如果要捕获用户函数,则需要使用 perf probe
# 查看 ./test 中可以 probe 的用户函数列表
# perf probe -F -x ./test
completed.6355
data_start
deregister_tm_clones
frame_dummy
func_a
func_b
func_c
func_d
main
puts@plt
register_tm_clones
usleep@plt

# perf probe -x ./test 'func_d'
Added new event:
  probe_test:func_d    (on func_d in /root/zj/test)

You can now use it in all perf tools, such as:

	perf record -e probe_test:func_d -aR sleep 1

# 然后执行 perf record 时需要加 -g 才能捕获函数调用栈,
# 或者使用 dwarf 来 unwinding(使用默认生成的 .eh_frame)
# perf record -e probe_test:func_d -aR ./test # perf record -e probe_test:func_d --call-graph=dwarf -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.330 MB perf.data (1 samples) ]

# perf script
test 104460 [076] 3186931.797081: probe_test:func_d_1: (40057d)
                  40057d func_d+0x0 (/root/zj/test)
                  4005bf func_c+0x17 (/root/zj/test)
                  4005d9 func_b+0x17 (/root/zj/test)
                  4005f3 func_a+0x17 (/root/zj/test)
                  400603 main+0xd (/root/zj/test)
            7fb599014444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
                  4004b8 _start+0x28 (/root/zj/test)

# perf report  --stdio |cat -
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1  of event 'probe_test:func_d'
# Event count (approx.): 1
#
# Children      Self  Trace output
# ........  ........  ............
#
   100.00%   100.00%  (40057d)
            |
            ---__libc_start_main
               main
               func_a
               func_b
               func_d

# perf record -e probe_test:func_d_1  --proc-map-timeout 100  -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.845 MB perf.data (1 samples) ]
# perf script
            test 129241 [084] 3187206.870347: probe_test:func_d_1: (40057d)

# perf record -e probe_test:func_d_1  --proc-map-timeout 500 -g -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.404 MB perf.data (1 samples) ]

# perf script
test 76042 [073] 3187258.344951: probe_test:func_d_1: (40057d)
                  40057d func_d+0x0 (/root/zj/test)
                  4005da func_b+0x18 (/root/zj/test)
                  4005f4 func_a+0x18 (/root/zj/test)
                  400604 main+0xe (/root/zj/test)
            7f8f5e361445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

# perf record -e probe_test:func_d_1  --proc-map-timeout 500 --call-graph fp -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.476 MB perf.data (1 samples) ]

# perf script
test 103396 [068] 3187282.462500: probe_test:func_d_1: (40057d)
                    40057d func_d+0x0 (/root/zj/test)
                    4005da func_b+0x18 (/root/zj/test)
                    4005f4 func_a+0x18 (/root/zj/test)
                    400604 main+0xe (/root/zj/test)
            7feeb9598445 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)




# gcc ./test.c  -o test -fomit-frame-pointer
# perf probe -l
  kprobes:aegis___x64_sys_execve (on __x64_sys_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 ar
  kprobes:aegis_do_execve (on do_execve with s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 file arg1 arg2 arg3 arg4 arg5 arg6 arg7 arg8 a
  kprobes:aegis_do_sys_open (on do_sys_open with dfd file flag)
  kprobes:aegis_proc_fork_connector (on proc_fork_connector with tgid pid parent_tgid parent_pid)
  kprobes:aegis_tcp_connect (on tcp_connect with saddr daddr sport dport)
  probe_test:func_d    (on func_d in /root/zj/test)
  probe_test:func_d_1  (on func_d in /root/zj/test)

# perf probe -d probe_test:*
Removed event: probe_test:func_d
Removed event: probe_test:func_d_1

# perf probe -x ./test 'func_d'
Added new event:
  probe_test:func_d    (on func_d in /root/zj/test)

You can now use it in all perf tools, such as:

	perf record -e probe_test:func_d -aR sleep 1

# perf record -e probe_test:func_d -aR -g ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 11.281 MB perf.data (1 samples) ]

# perf script
test 85054 [093] 3187435.293375: probe_test:func_d: (40057d)
                  40057d func_d+0x0 (/root/zj/test)



# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.288 MB perf.data (1 samples) ]

# perf script
test 106253 [094] 3187527.044076: probe_test:func_d: (40057d)
                  40057d func_d+0x0 (/root/zj/test)
                  4005c0 func_c+0x17 (/root/zj/test)
                  4005dd func_b+0x17 (/root/zj/test)
                  4005fa func_a+0x17 (/root/zj/test)
                  40060d main+0xd (/root/zj/test)
            7f15c70ca444 __libc_start_main+0xf4 (/usr/lib64/libc-2.17.so)
                  4004b8 _start+0x28 (/root/zj/test)


# perf annotate --stdio2
Samples: 1  of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
func_d() /root/zj/test
Percent


            Disassembly of section .text:

            000000000040057d <func_d>:
            func_d():
100.00        push   %rbp
              mov    %rsp,%rbp
              sub    $0x10,%rsp
              movl   $0x1,-0x4(%rbp)
              mov    $0x4006a0,%edi
            → callq  puts@plt
              mov    -0x4(%rbp),%eax
              imul   $0x2710,%eax,%eax
              mov    %eax,%edi
            → callq  usleep@plt
              leaveq
            ← retq

# gcc ./test.c  -g -o test
# readelf -S ./test  |grep debug
  [27] .debug_aranges    PROGBITS         0000000000000000  00001096
  [28] .debug_info       PROGBITS         0000000000000000  000010c6
  [29] .debug_abbrev     PROGBITS         0000000000000000  000011ce
  [30] .debug_line       PROGBITS         0000000000000000  0000124d
  [31] .debug_str        PROGBITS         0000000000000000  0000129a
# perf record -e probe_test:func_d -aR --call-graph dwarf ./test
  Hello from A
  Hello from B
  Hello from C
  Hello world from D
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 27.890 MB perf.data (1 samples) ]
# perf annotate --stdio2
  Samples: 1  of event 'probe_test:func_d', 1 Hz, Event count (approx.): 1, [percent: local period]
  func_d() /root/zj/test
  Percent


              Disassembly of section .text:

              000000000040057d <func_d>:
              func_d():
              #include <stdio.h>
              #include <unistd.h>

              void func_d() {
  100.00        push   %rbp
                mov    %rsp,%rbp
                sub    $0x10,%rsp
                                  int msec=1;
                movl   $0x1,-0x4(%rbp)
                                                  printf("%s","Hello world from D\n");
                mov    $0x4006a0,%edi
              → callq  puts@plt
                                                                  usleep(10000*msec);
                mov    -0x4(%rbp),%eax
                imul   $0x2710,%eax,%eax
                mov    %eax,%edi
              → callq  usleep@plt
              }
                leaveq
              ← retq

报错:

# perf record -e probe_test:func_d_1 --call-graph=dwarf -aR ./test
Reading /proc/101522/task/101522/maps time out. You may want to increase the time limit by --proc-map-timeout
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
Warning:
1 map information files for pre-existing threads were
not processed, if there are samples for addresses they
will not be resolved, you may find out which are these
threads by running with -v and redirecting the output
to a file.
The time limit to process proc map is too short?
Increase it by --proc-map-timeout
[ perf record: Captured and wrote 28.158 MB perf.data (1 samples) ]

# perf record -e probe_test:func_d_1 --call-graph=dwarf --proc-map-timeout 100 -aR ./test
Hello from A
Hello from B
Hello from C
Hello world from D
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 29.081 MB perf.data (1 samples) ]

相关文章

perf_event_open() 系统调用分析
·2476 字
Ebpf Perf
perf_event_open() 系统调用分析
magit
··8481 字
Emacs Tool
magit 是最好用的 git 客户端!
objdump
··4718 字
Elf Debug Tool
objdump 是通用的 ELF 文件反汇编工具,同时还具有 readelf 类似的打印文件中各 Sections 内容的功能,是 readelf 的功能超集。
readelf
··2630 字
Elf Debug Tool
readelf 是显示 ELF 格式的二进制文件(可执行程序或动态库等)中各 Section 内容的重要工具。 显示符号表 Sections,如 .dnysym 和 .symtab 中的符号名称和地址; 显示 DWARF 格式的 Sections,如各种 .debug_xx,.eh_frame 等; 显示查找 debuginfo 文件所需的 .gnu_debuglink 和 .note.gnu.build-id ;