Kernel and Application Performance Analysis with Linux bpftrace

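This article covers how to use bpftrace, along with its limitations and known problems.

The kernel does not support DWARF; tools such as perf, gdb, and systemtap unwind call stacks with DWARF in user space.

bpftrace/bcc, however, currently cannot use DWARF to unwind the call stacks of user-space programs and can only rely on frame pointers (FP):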

  • Comparing SystemTap and bpftrace:https://lwn.net/Articles/852112/
  • User-space backtrace support for programs built without frame pointers #1744 :https://github.com/iovisor/bpftrace/issues/1744

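For ustack to work in a bpftrace program, the traced program must be compiled with frame pointers enabled:

  1. Build without optimization: pass no -O option at all, or pass -O0;
  2. Or pass -fno-omit-frame-pointer explicitly (--enable-frame-pointer is the corresponding option when configuring GCC itself);

Although bpftrace cannot use DWARF for unwinding, it can use DWARF to parse the parameters of user functions. When bpftrace -lv 'uprobe:/bin/bash:readline' lists the parameters of readline, the function name and parameter information come from the debug symbols. If bpftrace cannot find them, it reports: No DWARF found for XX, cannot show parameter info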

  • Reference: https://github.com/iovisor/bpftrace/blob/master/src/dwarf_parser.cpp
root@lima-ebpf-dev:~# apt install bash-dbgsym bash-static-dbgsym
root@lima-ebpf-dev:~# bpftrace -e 'uprobe:/usr/bin/bash:readline {printf("%s", ustack)}' # -p 12446

Installing bpftrace

Install from RPM/Deb packages:

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" |  sudo tee -a /etc/apt/sources.list.d/ddebs.list

# On Ubuntu, the bpftrace-dbgsym package is needed:
apt install bpftrace-dbgsym
bpftrace -e 'BEGIN { printf("hello world\n"); }'

bpftrace can #include <xx> kernel headers to obtain kernel struct definitions, so the kernel headers must be installed.

bpftrace Cheat Sheet: https://www.brendangregg.com/BPF/bpftrace-cheat-sheet.html

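Viewing bpftrace information

  • Build: build-time features, such as whether libdw is supported. Only with libdw can -lv show the parameter list of user functions (which comes from DWARF);
  • Kernel helpers: the eBPF kernel helpers supported by the kernel;
  • Kernel features: the eBPF features supported by the kernel;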
root@lima-ebpf-dev:~# bpftrace --info
System
  OS: Linux 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023
  Arch: x86_64

Build
  version: v0.18.0-97-ge010d
  LLVM: 14.0.0
  unsafe probe: no
  bfd: yes
  libdw (DWARF support): yes

Kernel helpers
  probe_read: yes
  probe_read_str: yes
  probe_read_user: yes
  probe_read_user_str: yes
  probe_read_kernel: yes
  probe_read_kernel_str: yes
  get_current_cgroup_id: yes
  send_signal: yes
  override_return: yes
  get_boot_ns: yes
  dpath: yes
  skboutput: yes
  get_tai_ns: no
  get_func_ip: yes

Kernel features
  Instruction limit: 1000000
  Loop support: yes
  btf: yes
  module btf: yes
  map batch: yes
  uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
  hash: yes
  percpu hash: yes
  array: yes
  percpu array: yes
  stack_trace: yes
  perf_event_array: yes
  ringbuf: yes

Probe types
  kprobe: yes
  tracepoint: yes
  perf_event: yes
  kfunc: yes
  kprobe_multi: no
  raw_tp_special: yes
  iter: yes

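Listing probes and function arguments

  1. bpftrace -l 'tracepoint:*': list the probes whose names match the given glob pattern.
  2. bpftrace -lv 'tracepoint:syscalls:sys_enter_execve': show the argument list of a tracepoint/syscall/kfunc/uprobe function.

For user functions (uprobe), -lv parses the arguments from DWARF data, so the ELF file must contain the .debug_XX sections, or the matching debuginfo package must be installed. On Ubuntu the package is usually named XX-dbgsym.

  • kprobe and similar probe types do not support listing arguments with -lv.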
# apt install bash-dbgsym

# bpftrace -lv 'uprobe:/bin/bash:readline'
uprobe:/bin/bash:readline
    const char* prompt

# bpftrace -lv 'tracepoint:syscalls:sys_enter_write'
tracepoint:syscalls:sys_enter_write
    int __syscall_nr
    unsigned int fd
    const char * buf
    size_t count

Tracing kernel functions

Use -e to specify kprobe, syscall, tracepoint, and other event types, and print kstack:

# bpftrace -e 'kprobe:nf_conntrack_in {printf("%s\n", kstack); }'
        nf_conntrack_in+1
        nf_hook_slow+61
        __ip_local_out+214
        ip_local_out+23
        ip_send_skb+21
        udp_send_skb.isra.43+277
        udp_sendmsg+1544
        sock_sendmsg+48
        ___sys_sendmsg+688
        __sys_sendmsg+99
        do_syscall_64+85
        entry_SYSCALL_64_after_hwframe+68

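Tracing user functions

bpftrace does not support DWARF-based unwinding of user stacks, so the user program must be compiled with frame pointers.

The symbol table for the ELF file must also be available, either embedded in the ELF file itself or provided by an installed debuginfo package.

  1. Launch the program from bpftrace: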
root@lima-ebpf-dev:~# cat test.c
#include <stdio.h>
#include <unistd.h>

void func_d() {
                int msec=1;
                printf("%s","Hello world from D\n");
                usleep(10000*msec);
}
void func_c() {
                printf("%s","Hello from C\n");
                func_d();
}
void func_b() {
                printf("%s","Hello from B\n");
        func_c();
}
void func_a() {
                printf("%s","Hello from A\n");
                func_b();
}
int main() {
        func_a();
}
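# No -O optimization option is given, so frame pointers are kept
root@lima-ebpf-dev:~# gcc  test.c -o hello

# Confirm that gcc emits the instruction that saves the FP at the start of each function.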
root@lima-ebpf-dev:~# objdump -S hello |grep -A 4 func_c
000000000000119e <func_c>:
    119e:       f3 0f 1e fa             endbr64
    11a2:       55                      push   %rbp  # save the FP
    11a3:       48 89 e5                mov    %rsp,%rbp
    11a6:       48 8d 05 6a 0e 00 00    lea    0xe6a(%rip),%rax        # 2017 <_IO_stdin_used+0x17>
--
    11de:       e8 bb ff ff ff          call   119e <func_c>
    11e3:       90                      nop
    11e4:       5d                      pop    %rbp
    11e5:       c3                      ret

# Print the user call stack at func_c
root@lima-ebpf-dev:~# bpftrace -e 'uprobe:./hello:func_c {printf("%s", ustack)}' -c ./hello
Attaching 1 probe...
Hello from A
Hello from B
Hello from C
Hello world from D

        func_c+0
        func_a+33
        main+18
        __libc_start_call_main+128
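
Note that func_b is missing from the stack above. A plausible explanation (an assumption, not something the output itself proves) is that the uprobe fires at func_c+0, before func_c has run its push %rbp, so rbp still points at func_b's frame and a frame-pointer walk picks up func_b's return address into func_a first. The Python sketch below simulates such a walk over a made-up stack. The addresses, the "func_b+N" label, and the unwind helper are all hypothetical and only illustrate the technique.

```python
# Simulated frame-pointer unwinding. All stack addresses are made up.
# In each frame, [rbp] holds the caller's rbp and [rbp + 8] the return address.

def unwind(memory, ip, rbp, max_depth=16):
    """Follow the saved-rbp chain and collect code locations, innermost first."""
    frames = [ip]
    while rbp != 0 and len(frames) < max_depth:
        frames.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]               # move up to the caller's frame
    return frames

stack = {
    0x7000: 0,      0x7008: "__libc_start_call_main+128",  # main's frame
    0x6FF0: 0x7000, 0x6FF8: "main+18",                      # func_a's frame
    0x6FE0: 0x6FF0, 0x6FE8: "func_a+33",                    # func_b's frame
    0x6FD0: 0x6FE0, 0x6FD8: "func_b+N",                     # func_c's frame, valid only after its prologue
}

# At func_c+0 the prologue has not run yet: rbp is still func_b's frame pointer.
at_entry = unwind(stack, "func_c+0", rbp=0x6FE0)
# After "push %rbp; mov %rsp,%rbp" (offset 8 in the objdump above) rbp is func_c's own.
after_prologue = unwind(stack, "func_c+8", rbp=0x6FD0)

print(at_entry)
print(after_prologue)
```

In this model the walk started at function entry reproduces the stack printed above, while the walk started after the prologue also contains the func_b frame.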

Trace an already running program by pid (this also requires FP and a symbol table):

root@lima-ebpf-dev:~# apt install bash-dbgsym bash-static-dbgsym
root@lima-ebpf-dev:~# bpftrace -e 'uprobe:/usr/bin/bash:readline {printf("%s", ustack)}' # -p 12446

When printing ustack or kstack you can pass arguments, for example ustack(perf, 3), where perf selects the stack format and 3 limits the number of user-space stack frames.

[ku]stack([bpftrace|perf|raw]):https://github.com/bpftrace/bpftrace/issues/430#issuecomment-2580126066

# bpftrace -e 'uprobe:/cloud/my-agent:*doSaveNetworkLocateInfo {printf("%s\n", ustack(perf, 2));}'
	12e0480 git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo+0 (/cloud/my-agent)
	12e557d git.com/my-agent/pkg/network/processor/pidricher.(*pidEnricher).Process+1405 (/cloud/my-agent)

Measuring function latency

Version 1: uses a global variable, so concurrent calls interfere with each other

#!/usr/bin/bpftrace
uprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    @start = nsecs;
}
uretprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    printf("getNetworksList took %d ms\n", (nsecs - @start) / 1000000);
}

Version 2: correct, uses a per-thread variable

#!/usr/bin/bpftrace
uprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    @start[tid] = nsecs;
}
uretprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    if (@start[tid] != 0) {
        printf("getNetworksList took %d ms\n", (nsecs - @start[tid]) / 1000000);
        delete(@start[tid]);
    }
}
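
To see why the global variable in version 1 misbehaves, the following Python sketch replays a small, invented event sequence in which two threads overlap inside the traced function. The event list and both helper functions are hypothetical models of the two scripts, not bpftrace code.

```python
# Each event is (timestamp_ns, tid, kind), with kind "entry" or "return".

def version1_global(events):
    """Model of version 1: a single @start shared by every thread."""
    start = 0
    results = []
    for ts, tid, kind in events:
        if kind == "entry":
            start = ts                        # @start = nsecs;
        else:
            results.append((tid, ts - start)) # nsecs - @start
    return results

def version2_per_thread(events):
    """Model of version 2: @start[tid], deleted after use."""
    start = {}
    results = []
    for ts, tid, kind in events:
        if kind == "entry":
            start[tid] = ts                              # @start[tid] = nsecs;
        elif tid in start:                               # if (@start[tid] != 0)
            results.append((tid, ts - start.pop(tid)))   # delete(@start[tid]);
    return results

events = [
    (0,         1, "entry"),
    (2_000_000, 2, "entry"),   # thread 2 enters while thread 1 is still inside
    (5_000_000, 1, "return"),
    (6_000_000, 2, "return"),
]

print(version1_global(events))
print(version2_per_thread(events))
```

Thread 1 really spent 5 ms in the function. The global-variable model reports 3 ms for it because thread 2 overwrote the start time, while the per-thread model reports 5 ms and 4 ms.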

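Tracing container processes

A container process lives in its own mount namespace, so its root filesystem is separate from the host's. To resolve symbols and trace functions, first locate on the host the binaries and libraries that the container uses.

There are two ways to find a container's binaries on the host:

  1. /proc/<pid>/root
  2. Look up the container's MergedDir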
# ls  /proc/102366/root/
apsara  bin  boot  dev  entrypoint.sh  etc  home  lib  lib64  lost+found  media  mnt  nsenter  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# ls -l  /proc/102366/exe
lrwxrwxrwx 1 root root 0 Jan  9 16:11 /proc/102366/exe -> /usr/bin/plugin.csi.cloud.com

# ls -l  /proc/102366/root/usr/bin/plugin.csi.cloud.com
-rwxr-xr-x 1 root root 82455166 Jan  9 13:23 /proc/102366/root/usr/bin/plugin.csi.cloud.com

$ sudo docker inspect -f '{{.State.Pid}}' cilium-agent
109997

$ bpftrace -e 'uprobe:/proc/109997/root/usr/bin/cilium-agent:"github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate" {printf("%s\n", ustack); }'

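Alternatively, the container's .GraphDriver.Data.MergedDir gives the host-side location of paths inside the container. For example, to trace the cilium-agent process (itself deployed as a docker container), first find the absolute path of the cilium-agent file on the host, looking it up by container ID or name:

  • The merged path is the overlay root filesystem used by the container.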
# Check cilium-agent container
$ docker ps | grep cilium-agent
0eb2e76384b3        cilium:test   "/usr/bin/cilium-agent ..."   4 hours ago    Up 4 hours   cilium-agent

# Find the merged path for cilium-agent container
$ docker inspect --format "{{.GraphDriver.Data.MergedDir}}" 0eb2e76384b3
/var/lib/docker/overlay2/a17f868d/merged # a17f868d.. is shortened for better viewing

# The object file we are going to trace
$ ls -ahl /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
-rwxr-xr-x 1 root root 86M /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent

Then pass that absolute path to uprobe. A Go function must be given as its full package-path string, for example "github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate"

(node) $ bpftrace -e 'uprobe:/var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent:"github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate" {printf("%s\n", ustack); }'
Attaching 1 probe...

        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate+0
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1+363
        sync.(*Once).doSlow+236
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run+101
        runtime.goexit+1
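
Both lookups above boil down to prefixing the in-container path with a host-side root directory. A minimal Python sketch of that path arithmetic follows. The helper names are hypothetical, and the pid and MergedDir values are simply the ones shown in the examples above.

```python
import posixpath

def host_path_via_proc(pid, container_path):
    """Host-side path of a container file, reached through /proc/<pid>/root."""
    return posixpath.join("/proc", str(pid), "root", container_path.lstrip("/"))

def host_path_via_merged_dir(merged_dir, container_path):
    """Host-side path of a container file, reached through the overlay MergedDir."""
    return posixpath.join(merged_dir, container_path.lstrip("/"))

print(host_path_via_proc(109997, "/usr/bin/cilium-agent"))
print(host_path_via_merged_dir("/var/lib/docker/overlay2/a17f868d/merged",
                               "/usr/bin/cilium-agent"))
```

The /proc/<pid>/root form only works while the process is alive, whereas the MergedDir form depends on the container's storage driver exposing a merged directory.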

Use nm or bpftrace to list the symbols (functions) in a Go binary that can be traced:

$ nm cilium-agent
000000000427d1d0 B bufio.ErrBufferFull
000000000427d1e0 B bufio.ErrFinalToken
0000000001d3e940 T type..hash.github.com/cilium/cilium/pkg/k8s.ServiceID
0000000001f32300 T type..hash.github.com/cilium/cilium/pkg/node/types.Identity
0000000001d05620 T type..hash.github.com/cilium/cilium/pkg/policy/api.FQDNSelector
0000000001d05e80 T type..hash.github.com/cilium/cilium/pkg/policy.PortProto
...

# bpftrace -l 'uprobe:./exec:*'|tail
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.lookupInfoNFC
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.lookupInfoNFKC
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextCGJCompose
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextCGJDecompose
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextComposed
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextDecomposed
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextDone
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextHangul
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextMulti
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextMultiNorm

Whether a process runs in a container can be told from the NSpid field:

$ cat /proc/1229/status | grep NSpid
NSpid: 1229

$ cat /proc/11459/status | grep NSpid
NSpid: 11459 1

11459 is the process ID in the host's pid namespace, and 1 is the process ID in the container's own pid namespace.
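
The NSpid check can be scripted. The Python sketch below parses the text of a /proc/<pid>/status file; the function names are hypothetical. It is only a heuristic: a container that shares the host's pid namespace would still show a single NSpid entry.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split(":", 1)[1].split()]
    return []

def in_nested_pid_namespace(status_text):
    """True when the process has more than one PID, i.e. it lives in a child pid namespace."""
    return len(parse_nspid(status_text)) > 1

host_status = "Name:\tsshd\nNSpid:\t1229\n"
container_status = "Name:\tcilium-agent\nNSpid:\t11459\t1\n"

print(parse_nspid(host_status), in_nested_pid_namespace(host_status))
print(parse_nspid(container_status), in_nested_pid_namespace(container_status))
```

On a real system the input would come from reading /proc/<pid>/status for the pid of interest.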

Sampling and flame graphs

Like perf record, bpftrace can periodically sample the whole system or a specific process:

# bpftrace -v   -e 'profile:hz:100 /pid == 1/ { @[ustack(1)] = count(); }'

The profiling data produced by bpftrace can be visualized with the conversion tools that ship with FlameGraph:

sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > trace.data
cd FlameGraph
# Convert with the stackcollapse-bpftrace.pl tool
./stackcollapse-bpftrace.pl trace.data > trace.folded
./flamegraph.pl --inverted trace.folded > traceflamegraph.svg
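
To give a feel for what the conversion step produces, here is a simplified Python model of folding bpftrace map output into the "frame;frame;frame count" lines that flamegraph.pl consumes. It is an approximation written for illustration, not the logic of stackcollapse-bpftrace.pl itself, and the sample input is invented from the kstack shown earlier.

```python
import re

def fold_bpftrace_stacks(text):
    """Turn '@[ ...frames... ]: N' blocks into 'root;...;leaf N' lines."""
    folded = []
    frames = None                      # None means we are outside a stack block
    for raw in text.splitlines():
        line = raw.strip()
        if line == "@[":
            frames = []
        elif frames is not None:
            end = re.match(r"^\]: (\d+)$", line)
            if end:
                # bpftrace prints the leaf first; folded format wants the root first
                folded.append(";".join(reversed(frames)) + " " + end.group(1))
                frames = None
            elif line:
                frames.append(re.sub(r"\+\d+$", "", line))  # drop the +offset
    return folded

sample = """Attaching 1 probe...

@[
    nf_conntrack_in+1
    nf_hook_slow+61
    __ip_local_out+214
]: 3
"""

print(fold_bpftrace_stacks(sample))
```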

join

join reads at most 16 arguments, each at most 1024 bytes long.

// https://github.com/iovisor/bpftrace/blob/0b3392baa881f501ce684637acbd4136f8a29ed3/src/bpftrace.h#L190C1-L191C37
  unsigned int join_argnum_ = 16;
  unsigned int join_argsize_ = 1024;

// https://github.com/iovisor/bpftrace/blob/0b3392baa881f501ce684637acbd4136f8a29ed3/src/bpftrace.cpp#L463C1-L478C4
  else if (printf_id == asyncactionint(AsyncAction::join))
  {
    uint64_t join_id = (uint64_t) * (static_cast<uint64_t *>(data) + 1);
    auto delim = bpftrace->resources.join_args[join_id].c_str();
    std::stringstream joined;
    for (unsigned int i = 0; i < bpftrace->join_argnum_; i++) {
      auto *arg = arg_data + 2*sizeof(uint64_t) + i * bpftrace->join_argsize_;
      if (arg[0] == 0)
        break;
      if (i)
        joined << delim;
      joined << arg;
    }
    bpftrace->out_->message(MessageType::join, joined.str());
    return;
  }
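
The practical effect of these two limits is that long argument lists are silently cut short. The Python sketch below models the loop above. The function model_join is hypothetical, and treating each slot as holding at most argsize - 1 bytes plus a terminator is an assumption about how the buffer is filled.

```python
def model_join(args, delim=" ", argnum=16, argsize=1024):
    """Model of join(): keep at most argnum arguments, each cut to fit an argsize-byte slot."""
    joined = []
    for arg in args[:argnum]:
        if arg == "":                      # an empty slot ends the list, like arg[0] == 0 above
            break
        joined.append(arg[:argsize - 1])   # assumption: one byte is reserved for the NUL terminator
    return delim.join(joined)

print(model_join(["ls", "-l", "/tmp"]))
print(len(model_join(["x"] * 20).split(" ")))   # only 16 of the 20 arguments survive
print(len(model_join(["a" * 2000])))            # a 2000-byte argument is truncated
```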

argN/sargN/reg/args/retval

N starts at 0: arg0 is the function's first argument, arg1 the second, and so on.

  • arg0, arg1, …: Arguments to the traced function; assumed to be 64 bits wide
    • Applies to: kprobes, uprobes, usdt
  • sarg0, sarg1, …: Arguments to the traced function (for programs that store arguments on the stack); assumed to be 64 bits wide
    • Applies to: kprobes, uprobes

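If an argument does not occupy exactly one 64-bit slot (for example a struct passed by value rather than a struct pointer), it may take several registers. Kernel functions conventionally take struct pointers, though, so argN almost always corresponds to the Nth argument.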

# bpftrace -e 'uprobe:/home/bgregg/func:main.add { printf("%d %d\n", arg0, arg1); }'
Attaching 1 probe...
42 13

# bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Attaching 1 probe...
opening: /proc/cpuinfo
opening: /proc/stat
opening: /proc/diskstats
opening: /proc/stat
opening: /proc/vmstat
[...]

reg(const string name): reg is a bpftrace builtin function that returns the value of the named register, for example ax/bx/cx/sp on amd64 (without the r prefix);

  • Applies to: kprobe, uprobe
# bpftrace -e 'uprobe:/home/bgregg/Lang/go/func:main*add { printf("%d %d\n", *(reg("sp") + 8), *(reg("sp") + 16)); }'
Attaching 1 probe...
42 13

args:The struct with all arguments of the traced function. Available in tracepoint, kfunc, and uprobe (with DWARF) probes. Use args.x to access argument x or args to get a record with all arguments.

root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:vmlinux:__traceiter_net_dev_start_xmit'
kfunc:vmlinux:__traceiter_net_dev_start_xmit
    void * __data
    const struct sk_buff * skb
    const struct net_device * dev
    int retval

root@lima-ebpf-dev:~# bpftrace -e 'kfunc:vmlinux:__traceiter_net_dev_start_xmit {printf("%x\n", args.skb->protocol);}'
Attaching 1 probe...

retval: Value returned by the function being traced (kretprobe, uretprobe, fexit). For kretprobe and uretprobe, its type is uint64, but for fexit it depends. You can look up the type using bpftrace -lv

  • Applies to: kretprobe, uretprobe, fexit
# bpftrace -e 'kretprobe:do_sys_open { printf("returned: %d\n", retval); }'
Attaching 1 probe...
returned: 8
returned: 21
returned: -2
returned: 21
[...]

Printing struct fields

  1. First provide the struct definition and cast the argument to a struct xx pointer; only then can print() print it
bpftrace -v -e 'struct Foo { int m; int n; } uprobe:./testprogs/simple_struct:func { $f = *((struct Foo *) arg0); print($f); exit(); }'

bpftrace -v -e 'struct Foo { int m; int n; } u:./testprogs/simple_struct:func { @s = *((struct Foo *)arg0); exit(); }'

bpftrace -v -e 'struct Foo { struct { int m[1] } y; struct { int n } a; } u:./testprogs/simple_struct:func { @s = *((struct Foo *)arg0); exit(); }'

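For reference, the struct definition can come from two places:

  1. Include the header that defines the struct, after which each field can be resolved;

    bpftrace can read the system's header files, such as the kernel headers below.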

    # cat path.bt
    #include <linux/path.h>
    #include <linux/dcache.h>
    
    kprobe:vfs_open
    {
    printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
    }
    
    # bpftrace path.bt
    Attaching 1 probe...
    open path: dev
    open path: if_inet6
    open path: retrans_time_ms
    [...]
    
  2. Or, if the kernel has built-in BTF, resolve the fields directly without including any header;

    # bpftrace -e 'kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); }'
    Attaching 1 probe...
    open path: cmdline
    open path: interrupts
    [...]
    

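Argument-passing differences between C and Go

C/C++ follow the AMD64 ABI and pass function arguments in registers: rdi, rsi, rdx, rcx, r8, r9, stack, stack …

bpftrace uses arg0, arg1, arg2, … argN to read the values passed in these registers.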

# bpftrace -e 'uprobe:/home/bgregg/func:main.add { printf("%d %d\n", arg0, arg1); }'
Attaching 1 probe...
42 13

Go versions before 1.17 pass arguments on the stack:

sarg0 == *(reg("sp") + 8), sarg1 == *(reg("sp") + 16)

# bpftrace -e 'uprobe:/home/bgregg/Lang/go/func:main*add { printf("%d %d\n", *(reg("sp") + 8), *(reg("sp") + 16)); }'
Attaching 1 probe...
42 13

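From Go 1.17 on, arguments are passed mainly in registers, with the stack as a fallback (disassemble the binary to be sure). The register order, however, is rax, rbx, rcx, rdi, rsi, r8, r9, r10, r11, stack, which differs from the order of the C AMD64 ABI, so bpftrace's argN does not work for newer Go versions.

A further problem is that some Go types, such as string, actually consist of an address plus a length (uintptr + int64). A single Go argument of such a type is passed in two registers, so argN no longer necessarily corresponds to the Nth argument.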
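
The mismatch between the two register orders can be made concrete with a small Python sketch. It is a deliberately simplified model of Go's register assignment for integer and pointer words only: floating-point arguments, arrays, and other special cases are ignored, and all helper names are hypothetical. Disassembling the function remains the only reliable check.

```python
SYSV_INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # read by bpftrace arg0..arg5
GO_INT_REGS = ["rax", "rbx", "rcx", "rdi", "rsi", "r8", "r9", "r10", "r11"]

def bpftrace_reg_name(reg):
    """Name to pass to reg(): rax -> ax, rdi -> di; r8..r11 stay unchanged."""
    return reg if reg[1:].isdigit() else reg[1:]

def assign_go_registers(params):
    """params is a list of (name, words). Returns {name: [registers]} or {name: "stack"}."""
    result, next_reg = {}, 0
    for name, words in params:
        if next_reg + words <= len(GO_INT_REGS):
            result[name] = GO_INT_REGS[next_reg:next_reg + words]
            next_reg += words
        else:
            result[name] = "stack"   # simplified: the whole argument goes to the stack
    return result

def go_slot_read_by_argn(n):
    """Index of the Go register slot that bpftrace argN really reads, or None."""
    reg = SYSV_INT_REGS[n]
    return GO_INT_REGS.index(reg) if reg in GO_INT_REGS else None

# A method shaped like doSaveNetworkLocateInfo(ps, ni, pid string, isFailed bool):
layout = assign_go_registers([("ps", 1), ("ni", 1), ("pid", 2), ("isFailed", 1)])
print(layout)
print([bpftrace_reg_name(r) for r in layout["pid"]])   # data pointer, then length
print([go_slot_read_by_argn(n) for n in range(6)])
```

Under this model arg0 reads rdi, which is the fourth Go register slot rather than the first argument, and arg2 reads rdx, which Go does not use for integer arguments at all.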

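For both C and Go functions, an argument whose type is a struct rather than a pointer causes the same problem: argN and the function's Nth argument are no longer in one-to-one correspondence!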

# https://godbolt.org/z/67a4Yde5e

#include <stdint.h>

struct Foo {
    uint64_t a;
    uint64_t b;
};

void byval(Foo f);

void bar() {
    Foo f = {
        .a = 1,
        .b = 2,
    };
    byval(f);
}

### Disassembly
bar():
        mov     edi, 1
        mov     esi, 2
        jmp     byval(Foo)

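If an argument is an address or a struct, cast the address to the matching data structure before parsing it. A golang string, for example, is internally a struct, so when a function takes a string argument it has to be parsed like this: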

struct GoString {
     char * str;
    int len;
};

uprobe:./string:main.join
{
    $p1 = (struct GoString*) sarg0;
    printf("arg1[%d]:%s\n", $p1->len, str($p1->str, $p1->len));
    $p2 = (struct GoString*) sarg1;
    printf("arg2[%d]:%s\n", $p2->len, str($p2->str, $p2->len));
}

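So, to get function arguments accurately in bpftrace, the safest approach is to disassemble the function and check how its arguments are passed and which registers or stack slots are used.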

Debugging Go programs with bpftrace

Go does not use .eh_frame; it uses FP or .debug_frame instead. A binary produced by go build contains the .debug_xx debug sections and FP by default:

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ GOOS=linux GOARCH=amd64 go build -o my-agent-amd64 ./cmd/

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ file my-agent-amd64
my-agent-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=CwNfhKa02_E_0xJyBXTh/YtKaW6xl9F8fuPNB2FmQ/GTJ4i0gXCea-djzz28sm/8HQk6LYzhP01Evpb6hA5, with debug_info, not stripped

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ go version
go version go1.21.1 linux/arm64
alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ readelf -S ./my-agent-amd64 |grep -E 'debug|eh'
  [13] .debug_abbrev     PROGBITS         0000000000000000  02e61000
  [14] .debug_line       PROGBITS         0000000000000000  02e61135
  [15] .debug_frame      PROGBITS         0000000000000000  0311f672
  [16] .debug_gdb_s[...] PROGBITS         0000000000000000  031c7ed2
  [17] .debug_info       PROGBITS         0000000000000000  031c7eff
  [18] .debug_loc        PROGBITS         0000000000000000  0368e961
  [19] .debug_ranges     PROGBITS         0000000000000000  03a55bbf

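go run links with -ldflags '-s -w' by default, which strips the symbol table (-s) and the DWARF debug info (-w), so the result cannot be used for debugging.

To make a Go program easier to debug, build it with go build -gcflags=all="-N -l", which disables optimizations and inlining.

Use bpftrace -l to list the functions in a Go binary: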

# bpftrace -l uprobe:myagentt/current/bin/my-agent/my-agent |grep SaveNet
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

If the binary contains the DWARF .debug_xx information, newer bpftrace versions can show the argument list of uprobe functions with -lv:

  • The binary must contain DWARF information;
#readelf -S /tmp/my-agent |grep -E 'eh|debug'
  [13] .debug_abbrev     PROGBITS         0000000000000000  02e5b000
  [14] .debug_line       PROGBITS         0000000000000000  02e5b135
  [15] .debug_frame      PROGBITS         0000000000000000  03118b86
  [16] .debug_gdb_script PROGBITS         0000000000000000  031c13e6
  [17] .debug_info       PROGBITS         0000000000000000  031c1413
  [18] .debug_loc        PROGBITS         0000000000000000  03687edb
  [19] .debug_ranges     PROGBITS         0000000000000000  03a4f139

#/tmp/bpftrace4 --version
/tmp/bpftrace4: stat /static-python: No such file or directory
bpftrace v0.20.4

#/tmp/bpftrace4 -lv 'uprobe:/tmp/my-agent:*SaveNetworkLocateInfo'
/tmp/bpftrace4: stat /static-python: No such file or directory
uprobe:/tmp/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
    git.com/my-agent/pkg/storage.ProcessStore* ps
    git.com/my-agent/pkg/storage.NetworkLocateInfo* n
    error ~r0
uprobe:/tmp/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
    git.com/my-agent/pkg/storage.ProcessStore* ps
    git.com/my-agent/pkg/storage.NetworkInfo* ni
    struct string pid
    bool isFailed
    error ~r0

Demangling C++/Rust/Go function names

bpftrace also supports demangling C++, Rust, and Go function names:

  • Add support for rust demangling #3688 :https://github.com/bpftrace/bpftrace/pull/3688/files
// Note that legacy rust programs use the same C++ mangling convention,
// and therefore will always start with _Z. This is fine, but they also
// include a symbol hash at the end which would normally be stripped
// off. If users have legacy rust programs, they can just use a
// wildcard match against the hash component, and the rust specific
// bits will match against the newer v0 rust mangling convention.

// The legacy mangling scheme for rust actually uses the C++
// demangler with an extra hash at the end. We use the same scheme,
// and users will need to explicitly wildcard against this hash.

// We may choose to parse the v0 mangled symbols defined by:
// https://rust-lang.github.io/rfcs/2603-rust-symbol-name-mangling-v0.html
//
// Or may vendor/link an alternate library to do so.

COMMAND ${CMAKE_COMMAND} -E env "RUSTFLAGS=-C symbol_mangling_version=v0" ${CARGO_EXECUTABLE} build --target-dir ${CMAKE_CURRENT_BINARY_DIR}

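Compatibility problems between Go and bpftrace

Go function names in uprobes

Function names in a bpftrace uprobe do not support special characters such as dots and parentheses.

  • Newer bpftrace versions support dots: https://github.com/bpftrace/bpftrace/issues/548
  • But even the latest version still does not support parentheses

A golang function name usually contains the full Go package path, for example git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo, which makes bpftrace fail when parsing the function name: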

#bpftrace  --version
bpftrace v0.11.2

#bpftrace -l uprobe:/cloud/my-agent |grep SaveNet
uprobe:myagentt/current/bin/my-agent/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
uprobe:myagentt/current/bin/my-agent/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

#bpftrace -e 'uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo { @[ustack] = count(); }'
stdin:1:1-97: ERROR: syntax error, unexpected -, expecting {
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo { @[ustack] = count(); }
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

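Workarounds: use 1. the function address, 2. a wildcard in the function name path, or 3. the function name path as a quoted string

  • The function name path after uprobe supports wildcards, so * can be used to match or skip the special characters.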
# OK
#bpftrace -e 'uprobe:/cloud/my-agent:*SaveNetworkLocateInfo { @[ustack] = count(); }'

# OK
#bpftrace -e 'uprobe:/cloud/my-agent:gitlab.comp*SaveNetworkLocateInfo { @[ustack] = count(); }'
Attaching 2 probes...

# OK
#bpftrace -e 'uprobe:/cloud/my-agent:gitlab.comp*SaveNetwork*Info { @[ustack] = count(); }'

# Wrap the function name path in double quotes: OK
#bpftrace -e 'uprobe:/cloud/my-agent:"git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo" { @[ustack] = count(); }'

#nm -n /cloud/my-agent |grep SaveNetwork
00000000012e0220 T git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

# Use the address: OK
# bpftrace -e 'uprobe:/cloud/my-agent:0x12e0220 { @[ustack] = count(); }'
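
When many functions need address-based or quoted probes, the probe specs can be generated from nm output. The Python sketch below does this for the two symbols listed above. The helper names are hypothetical, and the specs are only strings to paste into bpftrace -e.

```python
def address_probes_from_nm(binary, nm_output, needle):
    """Build uprobe:<binary>:0x<addr> specs for text symbols whose name contains needle."""
    specs = []
    for line in nm_output.splitlines():
        parts = line.split(None, 2)          # address, symbol type, symbol name
        if len(parts) == 3 and parts[1] in ("T", "t") and needle in parts[2]:
            specs.append("uprobe:{}:0x{:x}".format(binary, int(parts[0], 16)))
    return specs

def quoted_probe(binary, symbol):
    """Build a uprobe spec with the function name wrapped in double quotes."""
    return 'uprobe:{}:"{}"'.format(binary, symbol)

nm_output = """\
00000000012e0220 T git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
"""

print(address_probes_from_nm("/cloud/my-agent", nm_output, "SaveNetwork"))
print(quoted_probe("/cloud/my-agent",
                   "git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo"))
```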

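uretprobe is incompatible with golang

Inspecting the return value of a golang function through uretprobe can be risky, because uretprobe installs its probe by modifying the stack, which may conflict with how golang itself manages stacks.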

// example.go
func myprint(s string) {
  fmt.Printf("Input: %s\n", s)
}

func main() {
  ss := []string{"a", "b", "c"}
  for _, s := range ss {
    go myprint(s)
  }
  time.Sleep(1*time.Second)
}

The bpftrace uretprobe fails:

# bpftrace -e 'uretprobe:./test:main.myprint { @=count(); }' -c ./test
runtime: unexpected return pc for main.myprint called from 0x7fffffffe000
stack: frame={sp:0xc00008cf60, fp:0xc00008cfd0} stack=[0xc00008c000,0xc00008d000)
fatal error: unknown caller pc

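Although uretprobe is unsafe in golang programs, uprobe can still be used with confidence. Put differently, even without uretprobe there are still ways to get the return value: for example, place a uprobe at the point where the function returns, or at the start of another function, and read the return value there.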

Parsing skb with bpftrace

If $ipheader->daddr is 192.168.2.44, convert the four numbers to hex bytes, which gives c0 a8 02 2c. Reversed, the bytes are 2c 02 a8 c0, and that is the value to compare against: https://stackoverflow.com/questions/75172893/comparing-ip-addresses-in-bpftrace
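
The byte reversal described above can be computed rather than done by hand. This Python sketch assumes a little-endian host such as x86_64, where the four network-order bytes of the address are read as one 32-bit integer. The helper name is hypothetical.

```python
def ipv4_as_le_u32(addr):
    """Value seen when the network-order bytes of addr are read as a little-endian u32."""
    octets = [int(part) for part in addr.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: " + addr)
    return int.from_bytes(bytes(octets), "little")

value = ipv4_as_le_u32("192.168.2.44")
print(hex(value))   # the bytes c0 a8 02 2c, reversed
```

A filter in a bpftrace script would then compare the daddr field against this constant; the exact expression depends on the script and is not shown here.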

root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:ip_finish_output'
kfunc:vmlinux:ip_finish_output
struct net * net
struct sock * sk
struct sk_buff * skb
int retval

# bpftrace can read the system's kernel headers
root@lima-ebpf-dev:~# cat /Users/zhangjun/skb.bt
#include <linux/skbuff.h>
#include <linux/icmp.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>

kfunc:ip_finish_output {
  $skb = (struct sk_buff *)args.skb;
  $dev = $skb->dev;
  $name = $dev->name;
  $ipheader = ((struct iphdr *) ($skb->head + $skb->network_header));
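  // version is already a 4-bit field, so the extra ">> 4" always yields 0 (hence the "[0]" in the output below)
  $version = ($ipheader->version) >>4;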

  if($ipheader->protocol == IPPROTO_ICMP) {
    // get ICMP header; see skb_transport_header():
     $icmph = (struct icmphdr *)($skb->head + $skb->transport_header);
     if ($icmph->type == ICMP_ECHO) {
       $id = $icmph->un.echo.id;
       $seq = $icmph->un.echo.sequence;

     printf("icmp: pid %d, comm: %s, [%d] %d\t%s > %s\n, id: %d, seq: %d, dev: %s\n", pid, comm, $version, $ipheader->protocol,
                 ntop($ipheader->saddr), ntop($ipheader->daddr), $id, $seq, $name);
       }
  }
}

root@lima-ebpf-dev:~# bpftrace /Users/zhangjun/skb.bt
Attaching 1 probe...
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 48640, dev: eth0
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 48896, dev: eth0
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 49152, dev: eth0

Another example: https://lwn.net/Articles/793749/

There is an important capability missing from those one-liners: struct navigation. Here is the function prototype again: int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);

bpftrace provides arg0-argN for kprobe function arguments, simply mapping them to the registers for the calling convention (arg2 becomes %rdx on x86_64, for example). Since bpftrace can read kernel headers, which are often installed on production systems, accessing struct data is possible by including the right header and casting the arguments:

#include <net/sock.h>
[...]
        $sk = (struct sock *)arg0;

Here’s an example of a bpftrace tool that prints the address information, size, and return value from tcp_sendmsg(). Example output:

# ./tcp_sendmsg.bt
Attaching 2 probes...
10.0.0.65       49978 -> 52.37.243.173   443  : 63 bytes, retval 63
127.0.0.1       58566 -> 127.0.0.1       22   : 36 bytes, retval 36
127.0.0.1       22    -> 127.0.0.1       58566: 36 bytes, retval 36
[...]

The source of tcp_sendmsg.bt:

#!/usr/local/bin/bpftrace

#include <net/sock.h>

k:tcp_sendmsg
{
  @sk[tid] = arg0;
  @size[tid] = arg2;
}

kr:tcp_sendmsg
/@sk[tid]/
{
  $sk = (struct sock *)@sk[tid];
  $size = @size[tid];
  $af = $sk->__sk_common.skc_family;
  if ($af == AF_INET) {
    $daddr = ntop($af, $sk->__sk_common.skc_daddr);
    $saddr = ntop($af, $sk->__sk_common.skc_rcv_saddr);
    $lport = $sk->__sk_common.skc_num;

    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);

    printf("%-15s %-5d -> %-15s %-5d: %d bytes, retval %d\n",
        $saddr, $lport, $daddr, $dport, $size, retval);
  } else {
    printf("IPv6...\n");
  }
  delete(@sk[tid]);
  delete(@size[tid]);
}

In the kprobe, sk and size are saved in per-thread-ID maps, so they can be retrieved in the kretprobe when tcp_sendmsg() returns. The kretprobe casts sk and, if it is an IPv4 message, prints out the details, using the bpftrace function ntop() to convert the address to a string. The destination port is flipped from network to host order. To keep this short I skipped IPv6, but you can add code to handle it too (ntop() does support IPv6 addresses).
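
The port flip in tcp_sendmsg.bt is a plain 16-bit byte swap. The Python sketch below reuses the same expression in a hypothetical helper, swap16. Applying it to the id and seq values printed by skb.bt earlier suggests that those fields were also printed in network byte order, which is an interpretation of that output rather than something the source states.

```python
def swap16(value):
    """Swap the two bytes of a 16-bit value, as tcp_sendmsg.bt does for $dport."""
    return ((value >> 8) | ((value << 8) & 0xFF00)) & 0xFFFF

# Port 443 is 0x01bb; its network-order bytes read on a little-endian host give 0xbb01.
print(swap16(0xBB01))

# id and seq values from the skb.bt output above, swapped to host order.
print(swap16(5376))
print([swap16(seq) for seq in (48640, 48896, 49152)])
```

After swapping, the three sequence numbers become consecutive integers, which is what one would expect from successive ping packets.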

There is work underway for bpftrace to use BPF Type Format (BTF) information as well, which brings various advantages including struct definitions that are missing from kernel headers.
