An introduction to how bpftrace is used, together with its limitations and known problems.

The kernel does not support DWARF; tools such as perf, gdb, and SystemTap use DWARF in user space to unwind call stacks.

bpftrace and bcc currently cannot use DWARF to unwind user-space stacks; they can only use frame pointers (FP):

Comparing SystemTap and bpftrace: https://lwn.net/Articles/852112/
User-space backtrace support for programs built without frame pointers #1744: https://github.com/iovisor/bpftrace/issues/1744

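For ustack() in a BPF program to work, the traced binary must be compiled with frame pointers (FP):

Build without optimization: pass no -O option, or pass -O0;
or enable frame pointers explicitly with the compiler flag -fno-omit-frame-pointer (some projects also offer a build-time switch such as --enable-frame-pointer).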

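Although bpftrace cannot unwind with DWARF, it does use DWARF to resolve the parameters of user-space functions. When bpftrace -lv 'uprobe:/bin/bash:readline' lists the parameters of readline, the function name and parameter information are read from the debug symbols. If bpftrace cannot find the debug symbols, it reports: No DWARF found for XX, cannot show parameter info

Reference: https://github.com/iovisor/bpftrace/blob/master/src/dwarf_parser.cpp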

root@lima-ebpf-dev:~# apt install bash-dbgsym bash-static-dbgsym
root@lima-ebpf-dev:~# bpftrace -e 'uprobe:/usr/bin/bash:readline {printf("%s", ustack)}' # -p 12446

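Installing debug symbol packages on Ubuntu
#

Official documentation: https://documentation.ubuntu.com/server/reference/debugging/debug-symbol-packages/index.html

Ubuntu debug symbol packages come in two naming schemes:

*-dbg.deb
*-dbgsym.ddeb

The former is the legacy scheme. It relies on the package maintainer publishing the debug package in the same repository as the regular package; because this is not enforced, a debug package may not exist.

The latter is the currently recommended (mainstream) scheme: the Debian/Ubuntu build infrastructure generates the debug symbol packages automatically and stores them in the separate ddebs repository.

To use the latter, configure the ddebs repository first: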

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" |  sudo tee -a /etc/apt/sources.list.d/ddebs.list

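# Install the signing key for the ddebs repository
apt install ubuntu-dbgsym-keyring

# Refresh the package lists
sudo apt-get update

Then install the commonly used debug symbol packages:

# Install the kernel debug symbols
sudo apt-get install linux-image-$(uname -r)-dbgsym

# Install the kernel headers
apt install linux-headers-$(uname -r)

# Install the bash debug symbols
apt install bash-dbgsym

# Install the libc debug symbols
apt install libc-bin-dbgsym libc-dev-bin-dbgsym libc-devtools-dbgsym libc6-dev-dbgsym

After installation, the debug symbols are usually found under /usr/lib/debug.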

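Installing bpftrace
#

apt install bpftrace

# Install the bpftrace-dbgsym package:
apt install bpftrace-dbgsym
bpftrace -e 'BEGIN { printf("hello world\n"); }'

bpftrace scripts obtain kernel struct definitions by including kernel headers (#include <xx>), so the kernel headers must be installed.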

apt install linux-headers-$(uname -r)

bpftrace Cheat Sheet: https://www.brendangregg.com/BPF/bpftrace-cheat-sheet.html

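Inspecting bpftrace information
#

Build: for example, whether libdw is supported; only with libdw can -lv show the parameter list of user-space functions (taken from DWARF);
Kernel helpers: the eBPF kernel helpers supported by the kernel;
Kernel features: the eBPF features supported by the kernel;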

root@lima-ebpf-dev:~# bpftrace --info
System
  OS: Linux 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023
  Arch: x86_64

Build
  version: v0.18.0-97-ge010d
  LLVM: 14.0.0
  unsafe probe: no
  bfd: yes
  libdw (DWARF support): yes

Kernel helpers
  probe_read: yes
  probe_read_str: yes
  probe_read_user: yes
  probe_read_user_str: yes
  probe_read_kernel: yes
  probe_read_kernel_str: yes
  get_current_cgroup_id: yes
  send_signal: yes
  override_return: yes
  get_boot_ns: yes
  dpath: yes
  skboutput: yes
  get_tai_ns: no
  get_func_ip: yes

Kernel features
  Instruction limit: 1000000
  Loop support: yes
  btf: yes
  module btf: yes
  map batch: yes
  uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
  hash: yes
  percpu hash: yes
  array: yes
  percpu array: yes
  stack_trace: yes
  perf_event_array: yes
  ringbuf: yes

Probe types
  kprobe: yes
  tracepoint: yes
  perf_event: yes
  kfunc: yes
  kprobe_multi: no
  raw_tp_special: yes
  iter: yes

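Listing probes and function parameters
#

bpftrace -l "tracepoint:*": lists the probes whose names match the given glob pattern.
bpftrace -lv "tracepoint:syscalls:sys_enter_execve": lists the parameters of a tracepoint/syscall/kfunc/uprobe function.

For user-space functions (uprobe), -lv resolves the parameters from DWARF data, so the ELF must contain the .debug_XX sections, or the matching debuginfo package must be installed. On Ubuntu this is usually XX-dbgsym.

kprobe and similar probe types do not support listing parameters with -lv.

# Install the bash debug symbol package
apt install bash-dbgsym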

# bpftrace -lv 'uprobe:/bin/bash:readline'
uprobe:/bin/bash:readline
    const char* prompt

# bpftrace -lv 'tracepoint:syscalls:sys_enter_write'
tracepoint:syscalls:sys_enter_write
    int __syscall_nr
    unsigned int fd
    const char * buf
    size_t count

Tracing kernel functions
#

Use -e to attach to events such as kprobes, syscalls, and tracepoints, and print kstack:

# bpftrace -e 'kprobe:nf_conntrack_in {printf("%s\n", kstack); }'
        nf_conntrack_in+1
        nf_hook_slow+61
        __ip_local_out+214
        ip_local_out+23
        ip_send_skb+21
        udp_send_skb.isra.43+277
        udp_sendmsg+1544
        sock_sendmsg+48
        ___sys_sendmsg+688
        __sys_sendmsg+99
        do_syscall_64+85
        entry_SYSCALL_64_after_hwframe+68

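Tracing user-space functions
#

bpftrace does not support DWARF-based user stack unwinding; the user program must be compiled with frame pointers.

The symbol table for the ELF must also be available, either embedded in the ELF itself or supplied by the matching debuginfo package.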

root@idev2-x86:~# cat func_call.c
#include <stdio.h>
#include <unistd.h>

void func_d(char * id) {
                int msec=1;
                printf("Hello world from %s\n", id);
                usleep(1000000*msec);
}

void func_c(char * id) {
                printf("Hello from %s\n", id);
                func_d("D");
}

void func_b(char * id) {
                printf("Hello from %s\n", id);
                func_c("C");
}

void func_a(char * id) {
                printf("Hello from %s\n", id);
                func_b("B");
}

int main() {
        func_a("A");
}

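# Compile; no -O optimization option is given, so frame pointers are kept
root@idev2-x86:~# gcc func_call.c -g -o func_call

# Confirm that gcc emits the instructions that save the frame pointer at function entry.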
root@idev2-x86:~# objdump -S func_call |grep -A 4 func_c
func_call:     file format elf64-x86-64


Disassembly of section .init:

--
00000000000011ae <func_c>:

void func_c(char * id) {
    11ae:	f3 0f 1e fa          	endbr64
    11b2:	55                   	push   %rbp
    11b3:	48 89 e5             	mov    %rsp,%rbp
    11b6:	48 83 ec 10          	sub    $0x10,%rsp
--
                func_c("C");
    1216:	48 8d 05 0d 0e 00 00 	lea    0xe0d(%rip),%rax        # 202a <_IO_stdin_used+0x2a>
    121d:	48 89 c7             	mov    %rax,%rdi
    1220:	e8 89 ff ff ff       	call   11ae <func_c>
}
    1225:	90                   	nop
    1226:	c9                   	leave
    1227:	c3                   	ret

# List the traceable functions in the user program
root@idev2-x86:~# bpftrace -lv 'uprobe:./func_call:*'
uprobe:./func_call:__do_global_dtors_aux
uprobe:./func_call:_fini
uprobe:./func_call:_init
uprobe:./func_call:_start
uprobe:./func_call:deregister_tm_clones
uprobe:./func_call:frame_dummy
uprobe:./func_call:func_a
    char* id
uprobe:./func_call:func_b
    char* id
uprobe:./func_call:func_c
    char* id
uprobe:./func_call:func_d
    char* id
uprobe:./func_call:main
uprobe:./func_call:register_tm_clones

# Trace func_d, printing its argument and the user stack
root@idev2-x86:~# bpftrace -e 'uprobe:./func_call:func_d {printf("\nid: %s, stack: %s", str(arg0), ustack)}' -c ./func_call
Attaching 1 probe...
Hello from A
Hello from B
Hello from C
Hello world from D

id: D, stack:
        func_d+0
        func_b+58
        func_a+58
        main+23
        0x74216aa2a1ca
        __libc_start_main+139
        _start+37

Tracing a running program by pid (requires FP and a symbol table):

root@lima-ebpf-dev:~# apt install linux-headers-`uname -r` linux-libc-dev
root@lima-ebpf-dev:~# apt install bash-dbgsym bash-static-dbgsym
root@idev2-x86:~# bpftrace -l 'uprobe:/usr/bin/bash:*' |grep readline
uprobe:/usr/bin/bash:initialize_readline
uprobe:/usr/bin/bash:pcomp_set_readline_variables
uprobe:/usr/bin/bash:posix_readline_initialize
uprobe:/usr/bin/bash:readline
uprobe:/usr/bin/bash:readline_internal_char
uprobe:/usr/bin/bash:readline_internal_setup
uprobe:/usr/bin/bash:readline_internal_teardown
uprobe:/usr/bin/bash:readline_set_char_offset
uprobe:/usr/bin/bash:yy_readline_get
uprobe:/usr/bin/bash:yy_readline_unget

root@idev2-x86:~# bpftrace -lv 'uprobe:/usr/bin/bash:readline'
uprobe:/usr/bin/bash:readline
    const char* prompt

root@idev2-x86:~# bpftrace -e 'uprobe:/usr/bin/bash:readline {printf("pid: %d, cmd: %s, stack: \n%s\n prompt: %s\n", pid, comm, ustack, str(args->prompt))}' # -p 12446
Attaching 1 probe...

pid: 153181, cmd: bash, stack:

        readline+0
        shell_getc.lto_priv.0+631
        read_token.constprop.0+123
        yyparse+1279
        parse_command+80
        read_command+137
        reader_loop+359
        main+6411
        0x7f981362a1ca
        __libc_start_main+139
        _start+37

 prompt: root@idev2-x86:~/FlameGraph#..
pid: 153181, cmd: bash, stack:

When printing ustack or kstack, parameters can be given, as in ustack(perf, 3), where perf selects the stack format and 3 limits the number of user-space stack frames.

[ku]stack([bpftrace|perf|raw])：https://github.com/bpftrace/bpftrace/issues/430#issuecomment-2580126066

# bpftrace -e 'uprobe:/cloud/my-agent:*doSaveNetworkLocateInfo {printf("%s\n", ustack(perf, 2));}'
	12e0480 git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo+0 (/cloud/my-agent)
	12e557d git.com/my-agent/pkg/network/processor/pidricher.(*pidEnricher).Process+1405 (/cloud/my-agent)

Measuring function latency
#

Version 1: uses a global variable, so concurrent calls interfere with each other

#!/usr/bin/bpftrace
uprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    @start = nsecs;
}
uretprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    printf("getNetworksList took %d ms\n", (nsecs - @start) / 1000000);
}

Version 2: OK, uses a per-thread variable

#!/usr/bin/bpftrace
uprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    @start[tid] = nsecs;
}
uretprobe:/usr/bin/dockerd:"github.com/docker/docker/api/server/router/network.(*networkRouter).getNetworksList" {
    if (@start[tid] != 0) {
        printf("getNetworksList took %d ms\n", (nsecs - @start[tid]) / 1000000);
        delete(@start[tid]);
    }
}
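
The difference between the two versions can be replayed offline. The sketch below is plain Python, not bpftrace; the event list, thread IDs, and timestamps are made up for illustration. It feeds the same interleaved entry and return events through a single shared start value (version 1) and through a map keyed by thread ID (version 2).

```python
# Simulated probe events: (timestamp_ns, thread_id, kind). All values are hypothetical.
events = [
    (1_000_000, 1, "entry"),   # thread 1 enters
    (3_000_000, 2, "entry"),   # thread 2 enters before thread 1 returns
    (6_000_000, 1, "return"),  # thread 1 really took 5 ms
    (7_000_000, 2, "return"),  # thread 2 really took 4 ms
]


def latencies_global(events):
    """Version 1: one @start value shared by every thread."""
    start = 0
    out = []
    for ts, tid, kind in events:
        if kind == "entry":
            start = ts  # overwrites whatever another thread stored
        else:
            out.append((tid, (ts - start) // 1_000_000))
    return out


def latencies_per_thread(events):
    """Version 2: @start[tid], deleted once the return has been seen."""
    start = {}
    out = []
    for ts, tid, kind in events:
        if kind == "entry":
            start[tid] = ts
        elif tid in start:
            out.append((tid, (ts - start.pop(tid)) // 1_000_000))
    return out


print(latencies_global(events))
print(latencies_per_thread(events))
```

With the shared variable, thread 1 is reported as 3 ms although it ran for 5 ms, because thread 2 overwrote the start timestamp in between.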

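Tracing container processes
#

A container process runs in its own mount namespace, so its root filesystem is separate from the host's. To resolve symbols and trace functions, first locate on the host the binaries and libraries that the container uses.

There are two ways to find a container's binaries from the host:

/proc/<pid>/root
Look up the container's MergedDir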

# ls  /proc/102366/root/
apsara  bin  boot  dev  entrypoint.sh  etc  home  lib  lib64  lost+found  media  mnt  nsenter  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# ls -l  /proc/102366/exe
lrwxrwxrwx 1 root root 0 Jan  9 16:11 /proc/102366/exe -> /usr/bin/plugin.csi.cloud.com

# ls -l  /proc/102366/root/usr/bin/plugin.csi.cloud.com
-rwxr-xr-x 1 root root 82455166 Jan  9 13:23 /proc/102366/root/usr/bin/plugin.csi.cloud.com

$ sudo docker inspect -f '{{.State.Pid}}' cilium-agent
109997

$ bpftrace -e 'uprobe:/proc/109997/root/usr/bin/cilium-agent:"github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate" {printf("%s\n", ustack); }'
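
Building the probe specification by hand is error-prone because the path and the Go symbol both contain special characters. The following Python sketch assembles it from the pieces used above; host_path and uprobe_spec are hypothetical helper names, not part of bpftrace.

```python
def host_path(pid, path_in_container):
    """Path of a container file as seen from the host through /proc/<pid>/root."""
    return f"/proc/{pid}/root/{path_in_container.lstrip('/')}"


def uprobe_spec(pid, path_in_container, symbol):
    """uprobe attach point with the Go symbol wrapped in double quotes."""
    return f'uprobe:{host_path(pid, path_in_container)}:"{symbol}"'


spec = uprobe_spec(
    109997,
    "/usr/bin/cilium-agent",
    "github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate",
)
print(spec)
```

The printed string can be pasted in front of the action block of a bpftrace -e program.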

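Alternatively, use the container's .GraphDriver.Data.MergedDir to find the host path of a binary inside the container. For example, to trace the cilium-agent process (itself deployed as a docker container), first find the absolute path of the cilium-agent file on the host, using the container ID or name:

The merged path is the overlay root filesystem used by the container.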

# Check cilium-agent container
$ docker ps | grep cilium-agent
0eb2e76384b3        cilium:test   "/usr/bin/cilium-agent ..."   4 hours ago    Up 4 hours   cilium-agent

# Find the merged path for cilium-agent container
$ docker inspect --format "{{.GraphDriver.Data.MergedDir}}" 0eb2e76384b3
/var/lib/docker/overlay2/a17f868d/merged # a17f868d.. is shortened for better viewing

# The object file we are going to trace
$ ls -ahl /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent
-rwxr-xr-x 1 root root 86M /var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent

Then give the absolute path to uprobe. A Go function must be given with its full path as a string, for example "github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate"

(node) $ bpftrace -e 'uprobe:/var/lib/docker/overlay2/a17f868d/merged/usr/bin/cilium-agent:"github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate" {printf("%s\n", ustack); }'
Attaching 1 probe...

        github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate+0
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1+363
        sync.(*Once).doSlow+236
        github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run+101
        runtime.goexit+1

Use nm or bpftrace to list the traceable symbols (functions) in a Go binary:

$ nm cilium-agent
000000000427d1d0 B bufio.ErrBufferFull
000000000427d1e0 B bufio.ErrFinalToken
0000000001d3e940 T type..hash.github.com/cilium/cilium/pkg/k8s.ServiceID
0000000001f32300 T type..hash.github.com/cilium/cilium/pkg/node/types.Identity
0000000001d05620 T type..hash.github.com/cilium/cilium/pkg/policy/api.FQDNSelector
0000000001d05e80 T type..hash.github.com/cilium/cilium/pkg/policy.PortProto
...

# bpftrace -l 'uprobe:./exec:*'|tail
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.lookupInfoNFC
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.lookupInfoNFKC
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextCGJCompose
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextCGJDecompose
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextComposed
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextDecomposed
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextDone
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextHangul
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextMulti
uprobe:./exec:vendor/golang.org/x/text/unicode/norm.nextMultiNorm

Whether a process is a container process can be judged from the NSpid field:

$ cat /proc/1229/status | grep NSpid
NSpid: 1229

$ cat /proc/11459/status | grep NSpid
NSpid: 11459 1

11459 is the process ID in the host's PID namespace, and 1 is the process ID in the container's own PID namespace.
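
The NSpid check is easy to script. The Python sketch below parses the text of a /proc/<pid>/status file; the function names are illustrative. Note that more than one NSpid value only shows that the process lives in a nested PID namespace, which is typical of containers but is not proof of one.

```python
def nspid_chain(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def in_nested_pid_namespace(status_text):
    """True when the process has a PID in more than one PID namespace."""
    return len(nspid_chain(status_text)) > 1


host_status = "Name:\tbash\nNSpid:\t1229\n"
container_status = "Name:\tnginx\nNSpid:\t11459\t1\n"

print(in_nested_pid_namespace(host_status))
print(nspid_chain(container_status))
```

On a live system, pass in the contents read from /proc/<pid>/status.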

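Sampling and flame graphs
#

As with perf record, bpftrace can periodically sample the whole system or a specific process:

# bpftrace -v   -e 'profile:hz:100 /pid == 1/ { @[ustack(1)] = count(); }'

The profiling data produced by bpftrace can be visualized with the conversion tools shipped with FlameGraph: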

# Install stress-ng and its debug symbol package
sudo apt-get install stress-ng stress-ng-dbgsym
# Run CPU, memory, I/O, and filesystem stress tests
stress-ng --cpu $(nproc)  --io 2 --vm 2 --vm-bytes 256M --hdd 1 --timeout 60s

# Download the flame graph tools
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Kernel-space flame graph
sudo bpftrace -e 'profile:hz:999 { @[kstack] = count(); }' > trace.data
./stackcollapse-bpftrace.pl trace.data > trace.folded
./flamegraph.pl --inverted trace.folded > traceflamegraph.svg

# User-space flame graph
bpftrace -v   -e 'profile:hz:999  { @[ustack] = count(); }' >user.trace.data
./stackcollapse-bpftrace.pl user.trace.data > user.trace.folded
./flamegraph.pl --inverted user.trace.folded > user.traceflamegraph.svg
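
To make the "folded" intermediate format less opaque, here is a simplified Python sketch of what the collapse step does conceptually. It is not a port of stackcollapse-bpftrace.pl and only handles the multi-line map output shown in the sample; the sample text and the fold_stacks name are assumptions for illustration.

```python
import re


def fold_stacks(text):
    """Turn bpftrace '@[ frames ]: count' blocks into {"root;...;leaf": count}."""
    folded = {}
    frames = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@["):
            frames = []  # a new stack begins
        elif line.startswith("]:") and frames is not None:
            key = ";".join(reversed(frames))  # flame graphs want the root frame first
            folded[key] = folded.get(key, 0) + int(line[2:].strip())
            frames = None
        elif frames is not None and line:
            frames.append(re.sub(r"\+\d+$", "", line))  # drop the +offset suffix
    return folded


sample = """Attaching 1 probe...

@[
    func_d+0
    main+23
]: 3
@[
    func_d+12
    main+23
]: 2
"""

for stack, count in fold_stacks(sample).items():
    print(stack, count)
```

Stacks that differ only in instruction offsets collapse into one line, which is why the folded file is much smaller than the raw trace.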

View the generated SVG files:

file:///Users/alizj/Desktop/eBFP/traceflamegraph.svg
file:///Users/alizj/Desktop/eBFP/user.traceflamegraph.svg

Filtering network packets
#

root@idev2-x86:~# cat network_filter.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Monitoring traffic to port 2223 (HTTPS)\n");
    printf("%-20s %-20s %-10s %-8s %-8s %s\n", "SRC IP", "DST IP", "SPORT", "DPORT", "PROTO", "LEN");
}

// Use a generic network tracepoint to capture incoming traffic
tracepoint:net:netif_receive_skb
{
    $skb = (struct sk_buff *)args->skbaddr;
    if ($skb != 0) {     // Make sure the skb pointer is valid
        if ($skb->protocol == 0x08) {  // Filter on the protocol: ETH_P_IP is 0x0800 in network byte order, which a little-endian host reads as 0x0008
            $iph = (struct iphdr*)($skb->head + $skb->network_header);             // Locate the IP header
            if ($iph->protocol == 6) {  // TCP packets only, IPPROTO_TCP = 6
                $tcph = (struct tcphdr*)($skb->head + $skb->transport_header);   // Locate the TCP header
                // Read the source and destination ports (swapped to host byte order)
                $sport = ($tcph->source >> 8) | (($tcph->source & 0xff) << 8);
                $dport = ($tcph->dest >> 8) | (($tcph->dest & 0xff) << 8);
                if ($dport == 2223 ) { // Match destination port 2223
                    // Format the IP addresses for display
                    $saddr = ntop($iph->saddr);
                    $daddr = ntop($iph->daddr);
                    // Print the matching packet
                    printf("%-20s %-20s %-10d %-8d %-8s %d\n", $saddr, $daddr, $sport, $dport, "TCP", $skb->len);
                    // Count packets and bytes
                    @packet_count++;
                    @total_bytes += $skb->len;
                }
            }
        }
    }
}

// Use a generic network tracepoint to capture outgoing traffic
tracepoint:net:net_dev_start_xmit
{
    $skb = (struct sk_buff *)args->skbaddr;
    if ($skb != 0) { // Make sure the skb pointer is valid
        if ($skb->protocol == 0x08) {
            $iph = (struct iphdr*)($skb->head + $skb->network_header);
            if ($iph->protocol == 6) {
                $tcph = (struct tcphdr*)($skb->head + $skb->transport_header);
                $sport = ($tcph->source >> 8) | (($tcph->source & 0xff) << 8);
                $dport = ($tcph->dest >> 8) | (($tcph->dest & 0xff) << 8);
                if ($dport == 2223 ){
                    $saddr = ntop($iph->saddr);
                    $daddr = ntop($iph->daddr);
                    printf("%-20s %-20s %-10d %-8d %-8s %d (sent)\n", $saddr, $daddr, $sport, $dport, "TCP", $skb->len);
                    @packet_count_tx++;
                    @total_bytes_tx += $skb->len;
                }
            }
        }
    }
}

interval:s:1 // Print the statistics once per second
{
    time("%H:%M:%S ");
    printf("received: %d packets, %d bytes | sent: %d packets, %d bytes\n",  @packet_count, @total_bytes, @packet_count_tx, @total_bytes_tx);
}

END
{
    printf("\nMonitoring finished, statistics:\n");
    printf("Packets received: %d\n", @packet_count);
    printf("Bytes received: %d\n", @total_bytes);
    printf("Packets sent: %d\n", @packet_count_tx);
    printf("Bytes sent: %d\n", @total_bytes_tx);
}
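
The magic constant 0x08 and the port swap in the script both come from reading big-endian wire values on a little-endian host. The Python sketch below reproduces that arithmetic outside the kernel; as_little_endian_host and swap16 are illustrative names, and a little-endian CPU such as x86_64 is assumed.

```python
import struct

ETH_P_IP = 0x0800  # EtherType for IPv4, as written in the kernel headers


def as_little_endian_host(value_be16):
    """Value a little-endian host sees when it loads a big-endian 16-bit field."""
    return struct.unpack("<H", struct.pack(">H", value_be16))[0]


def swap16(raw):
    """The same byte swap the script applies to $tcph->source and $tcph->dest."""
    return (raw >> 8) | ((raw & 0xff) << 8)


raw_protocol = as_little_endian_host(ETH_P_IP)  # what $skb->protocol holds
raw_dport = as_little_endian_host(2223)         # what $tcph->dest holds

print(hex(raw_protocol))
print(raw_dport, swap16(raw_dport))
```

This is why the script compares $skb->protocol with 0x08 directly but has to swap the port bytes before comparing with 2223.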

join
#

join() reads at most 16 arguments, each at most 1024 bytes long.

// https://github.com/iovisor/bpftrace/blob/0b3392baa881f501ce684637acbd4136f8a29ed3/src/bpftrace.h#L190C1-L191C37
  unsigned int join_argnum_ = 16;
  unsigned int join_argsize_ = 1024;

// https://github.com/iovisor/bpftrace/blob/0b3392baa881f501ce684637acbd4136f8a29ed3/src/bpftrace.cpp#L463C1-L478C4
  else if (printf_id == asyncactionint(AsyncAction::join))
  {
    uint64_t join_id = (uint64_t) * (static_cast<uint64_t *>(data) + 1);
    auto delim = bpftrace->resources.join_args[join_id].c_str();
    std::stringstream joined;
    for (unsigned int i = 0; i < bpftrace->join_argnum_; i++) {
      auto *arg = arg_data + 2*sizeof(uint64_t) + i * bpftrace->join_argsize_;
      if (arg[0] == 0)
        break;
      if (i)
        joined << delim;
      joined << arg;
    }
    bpftrace->out_->message(MessageType::join, joined.str());
    return;
  }
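
The loop above implies a fixed buffer layout: a 16-byte header followed by 16 slots of 1024 bytes, read until the first empty slot. The Python sketch below models that layout to show where the limits come from. pack_join_buffer is a hypothetical stand-in for the kernel-side writer, and truncating each argument to leave room for a NUL terminator is an assumption of this sketch.

```python
JOIN_ARGNUM = 16     # join_argnum_ in the quoted source
JOIN_ARGSIZE = 1024  # join_argsize_ in the quoted source
HEADER = 2 * 8       # two uint64 fields precede the argument slots


def pack_join_buffer(args):
    """Hypothetical writer: place each argument in its own fixed-size slot."""
    buf = bytearray(HEADER + JOIN_ARGNUM * JOIN_ARGSIZE)
    for i, arg in enumerate(args[:JOIN_ARGNUM]):   # extra arguments are dropped
        data = arg.encode()[:JOIN_ARGSIZE - 1]     # keep one byte for the NUL
        off = HEADER + i * JOIN_ARGSIZE
        buf[off:off + len(data)] = data
    return bytes(buf)


def join_from_buffer(buf, delim=" "):
    """Mirror of the C++ loop: stop at the first slot that starts with NUL."""
    parts = []
    for i in range(JOIN_ARGNUM):
        off = HEADER + i * JOIN_ARGSIZE
        slot = buf[off:off + JOIN_ARGSIZE]
        if slot[0] == 0:
            break
        parts.append(slot.split(b"\0", 1)[0].decode())
    return delim.join(parts)


print(join_from_buffer(pack_join_buffer(["ls", "-l", "/tmp"])))
```

A command line with more than 16 arguments, or with a very long single argument, is therefore silently shortened in join() output.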

argN/sargN/reg/args/retval
#

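N starts at 0: arg0 is the first argument of the function, arg1 the second, and so on.

arg0, arg1, …: Arguments to the traced function; assumed to be 64 bits wide
- Available in: kprobes, uprobes, usdt
sarg0, sarg1, …: Arguments to the traced function (for programs that store arguments on the stack); assumed to be 64 bits wide
- Available in: kprobes, uprobes

If a parameter does not occupy exactly one 64-bit slot (for example a struct rather than a struct pointer), it may use several registers. Kernel functions conventionally take struct pointers, though, so argN basically corresponds to the Nth parameter.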

# bpftrace -e 'uprobe:/home/bgregg/func:main.add { printf("%d %d\n", arg0, arg1); }'
Attaching 1 probe...
42 13

# bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Attaching 1 probe...
opening: /proc/cpuinfo
opening: /proc/stat
opening: /proc/diskstats
opening: /proc/stat
opening: /proc/vmstat
[...]

reg(const string name): reg is a bpftrace builtin that returns the value of the named register, for example ax/bx/cx/sp on amd64 (without the r prefix);

Available in: kprobe, uprobe

# bpftrace -e 'uprobe:/home/bgregg/Lang/go/func:main*add { printf("%d %d\n", *(reg("sp") + 8), *(reg("sp") + 16)); }'
Attaching 1 probe...
42 13

args：The struct with all arguments of the traced function. Available in tracepoint, kfunc, and uprobe (with DWARF) probes. Use args.x to access argument x or args to get a record with all arguments.

https://github.com/iovisor/bpftrace/commit/7e77f6896b1285a6b6eba044e16880c88faa2f44
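Kernel functions (tracepoint, kfunc) require BTF support. User-space functions (uprobe) require the binary to carry DWARF information;
args.<name> accesses a parameter by name and supports dereferencing struct types;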

root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:vmlinux:__traceiter_net_dev_start_xmit'
kfunc:vmlinux:__traceiter_net_dev_start_xmit
    void * __data
    const struct sk_buff * skb
    const struct net_device * dev
    int retval

root@lima-ebpf-dev:~# bpftrace -e 'kfunc:vmlinux:__traceiter_net_dev_start_xmit {printf("%x\n", args.skb->protocol);}'
Attaching 1 probe...

retval: Value returned by the function being traced (kretprobe, uretprobe, fexit). For kretprobe and uretprobe, its type is uint64, but for fexit it depends. You can look up the type using bpftrace -lv

Available in: kretprobe, uretprobe, fexit

# bpftrace -e 'kretprobe:do_sys_open { printf("returned: %d\n", retval); }'
Attaching 1 probe...
returned: 8
returned: 21
returned: -2
returned: 21
[...]

Printing struct fields
#

Import the struct definition first and cast the argument to a struct xx pointer; only then can print() print it

bpftrace -v -e 'struct Foo { int m; int n; } uprobe:./testprogs/simple_struct:func { $f = *((struct Foo *) arg0); print($f); exit(); }'

bpftrace -v -e 'struct Foo { int m; int n; } u:./testprogs/simple_struct:func { @s = *((struct Foo *)arg0); exit(); }'

bpftrace -v -e 'struct Foo { struct { int m[1] } y; struct { int n } a; } u:./testprogs/simple_struct:func { @s = *((struct Foo *)arg0); exit(); }'

Include the header that defines the struct, and its fields can then be resolved;

bpftrace can read system headers, such as the kernel headers below.

# cat path.bt
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:vfs_open
{
printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
}

# bpftrace path.bt
Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms
[...]

Alternatively, if the kernel ships built-in BTF, the fields can be resolved directly without including any header;

# bpftrace -e 'kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); }'
Attaching 1 probe...
open path: cmdline
open path: interrupts
[...]

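Differences in argument passing between C and Go
#

C/C++ follow the System V AMD64 ABI and pass function arguments in registers: rdi, rsi, rdx, rcx, r8, r9, stack, stack …

bpftrace uses arg0, arg1, arg2, … argN to read the values passed in these registers.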

# bpftrace -e 'uprobe:/home/bgregg/func:main.add { printf("%d %d\n", arg0, arg1); }'
Attaching 1 probe...
42 13

Go versions before 1.17 pass arguments on the stack,

https://github.com/bpftrace/bpftrace/issues/740 bpftrace uses sarg0, sarg1, sarg2, … sargN to read stack-passed arguments:

sarg0 == *(reg("sp") + 8)
sarg1 == *(reg("sp") + 16)

# bpftrace -e 'uprobe:/home/bgregg/Lang/go/func:main*add { printf("%d %d\n", *(reg("sp") + 8), *(reg("sp") + 16)); }'
Attaching 1 probe...
42 13

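Go 1.17 and later pass arguments mainly in registers, with the stack as a fallback (disassemble the binary to be sure). The register order (rax, rbx, rcx, rdi, rsi, r8, r9, r10, r11, then the stack) differs from the order of the C AMD64 ABI, so bpftrace's argN does not work for newer Go versions.

Another problem is that some Go types, such as string, are really an address plus a length (uintptr + int). Such a parameter is passed in two registers, so argN no longer necessarily corresponds to the Nth parameter.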

https://github.com/bpftrace/bpftrace/issues/2547#issuecomment-1743593518
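
The register order above can be turned into a small lookup aid. The Python sketch below is a deliberately simplified model: it covers integer-class values only, the word counts in WORDS are an assumed subset of Go types, and the real assignment for a given binary should still be confirmed by disassembly. A value that does not fit in the remaining registers is placed on the stack as a whole.

```python
GO_INT_REGS = ["rax", "rbx", "rcx", "rdi", "rsi", "r8", "r9", "r10", "r11"]

# Machine words per parameter kind (assumed subset, 64-bit platform).
WORDS = {"int": 1, "pointer": 1, "string": 2, "interface": 2, "slice": 3}


def assign_go_args(kinds):
    """Map each parameter to the registers holding it, or to "stack"."""
    free = list(GO_INT_REGS)
    placement = []
    for kind in kinds:
        n = WORDS[kind]
        if n <= len(free):
            placement.append(free[:n])
            free = free[n:]
        else:
            placement.append("stack")  # does not fit: the whole value is spilled
    return placement


def bpftrace_reg(name):
    """Name to pass to reg(): legacy registers lose the r prefix (rax -> ax)."""
    return name if name[1:].isdigit() else name[1:]


# A method with a pointer receiver, a string, and an int parameter.
for kind, regs in zip(["pointer", "string", "int"],
                      assign_go_args(["pointer", "string", "int"])):
    print(kind, regs)
```

In this example the int is the third parameter but lands in rdi, the register that bpftrace exposes as arg0 under the C convention, so reading arg2 would return something unrelated.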

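For both C and Go functions, the same problem arises when a parameter is a struct passed by value rather than a pointer: argN and the Nth parameter of the function no longer map one to one!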

# https://godbolt.org/z/67a4Yde5e

#include <stdint.h>

struct Foo {
    uint64_t a;
    uint64_t b;
};

void byval(Foo f);

void bar() {
    Foo f = {
        .a = 1,
        .b = 2,
    };
    byval(f);
}

### Disassembly
bar():
        mov     edi, 1
        mov     esi, 2
        jmp     byval(Foo)

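If a parameter is an address or a struct, the address must be cast to the corresponding data structure before it can be read correctly. A Go string, for example, is internally a struct, so when a function parameter has type string it has to be read as follows: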

struct GoString {
     char * str;
    int len;
};

uprobe:./string:main.join
{
    $p1 = (struct GoString*) sarg0;
    printf("arg1[%d]:%s\n", $p1->len, str($p1->str, $p1->len));
    $p2 = (struct GoString*) sarg1;
    printf("arg2[%d]:%s\n", $p2->len, str($p2->str, $p2->len));
}

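So, to read function arguments accurately in bpftrace, the safest approach is to disassemble the function and check how the arguments are passed and which registers or stack slots are used.

To deal with argument passing and offset stability, the opentelemetry-go-instrumentation project imports the source of the matching version and derives the offset of each field within its struct: https://github.com/open-telemetry/opentelemetry-go-instrumentation/tree/main/internal/pkg/instrumentation/bpf

Debugging Go programs with bpftrace
#

Go does not use .eh_frame; it uses FP or .debug_frame. A binary produced by go build contains the .debug_xx debug sections and frame pointers by default: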

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ GOOS=linux GOARCH=amd64 go build -o my-agent-amd64 ./cmd/

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ file my-agent-amd64
my-agent-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=CwNfhKa02_E_0xJyBXTh/YtKaW6xl9F8fuPNB2FmQ/GTJ4i0gXCea-djzz28sm/8HQk6LYzhP01Evpb6hA5, with debug_info, not stripped

alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ go version
go version go1.21.1 linux/arm64
alizj@lima-dev2:/Users/alizj/go/src/git.com/my-agent$ readelf -S ./my-agent-amd64 |grep -E 'debug|eh'
  [13] .debug_abbrev     PROGBITS         0000000000000000  02e61000
  [14] .debug_line       PROGBITS         0000000000000000  02e61135
  [15] .debug_frame      PROGBITS         0000000000000000  0311f672
  [16] .debug_gdb_s[...] PROGBITS         0000000000000000  031c7ed2
  [17] .debug_info       PROGBITS         0000000000000000  031c7eff
  [18] .debug_loc        PROGBITS         0000000000000000  0368e961
  [19] .debug_ranges     PROGBITS         0000000000000000  03a55bbf

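go run builds as if with -ldflags '-s -w' by default, which strips the symbol table (-s) and the DWARF debug info (-w), so the result cannot be used for debugging.

For a better debugging experience, build with go build -gcflags=all="-N -l" (-N disables optimizations, -l disables inlining).

Use bpftrace -l to list the functions in a Go binary: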

# bpftrace -l uprobe:myagentt/current/bin/my-agent/my-agent |grep SaveNet
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

If the binary contains the DWARF .debug_xx sections, newer bpftrace versions can list the parameters of uprobe functions with -lv:

The binary must contain DWARF information;

#readelf -S /tmp/my-agent |grep -E 'eh|debug'
  [13] .debug_abbrev     PROGBITS         0000000000000000  02e5b000
  [14] .debug_line       PROGBITS         0000000000000000  02e5b135
  [15] .debug_frame      PROGBITS         0000000000000000  03118b86
  [16] .debug_gdb_script PROGBITS         0000000000000000  031c13e6
  [17] .debug_info       PROGBITS         0000000000000000  031c1413
  [18] .debug_loc        PROGBITS         0000000000000000  03687edb
  [19] .debug_ranges     PROGBITS         0000000000000000  03a4f139

#/tmp/bpftrace4 --version
/tmp/bpftrace4: stat /static-python: No such file or directory
bpftrace v0.20.4

#/tmp/bpftrace4 -lv 'uprobe:/tmp/my-agent:*SaveNetworkLocateInfo'
/tmp/bpftrace4: stat /static-python: No such file or directory
uprobe:/tmp/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
    git.com/my-agent/pkg/storage.ProcessStore* ps
    git.com/my-agent/pkg/storage.NetworkLocateInfo* n
    error ~r0
uprobe:/tmp/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo
    git.com/my-agent/pkg/storage.ProcessStore* ps
    git.com/my-agent/pkg/storage.NetworkInfo* ni
    struct string pid
    bool isFailed
    error ~r0

C++/Rust/Go function name demangling
#

bpftrace also supports demangling of C++, Rust, and Go function names:

Add support for rust demangling #3688 :https://github.com/bpftrace/bpftrace/pull/3688/files

// Note that legacy rust programs use the same C++ mangling convention,
// and therefore will always start with _Z. This is fine, but they also
// include a symbol hash at the end which would normally be stripped
// off. If users have legacy rust programs, they can just use a
// wildcard match against the hash component, and the rust specific
// bits will match against the newer v0 rust mangling convention.

// The legacy mangling scheme for rust actually uses the C++
// demangler with an extra hash at the end. We use the same scheme,
// and users will need to explicitly wildcard against this hash.

// We may choose to parse the v0 mangled symbols defined by:
// https://rust-lang.github.io/rfcs/2603-rust-symbol-name-mangling-v0.html
//
// Or may vendor/link an alternate library to do so.

COMMAND ${CMAKE_COMMAND} -E env "RUSTFLAGS=-C symbol_mangling_version=v0" ${CARGO_EXECUTABLE} build --target-dir ${CMAKE_CURRENT_BINARY_DIR}

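Compatibility issues between Go and bpftrace
#

Go function names in uprobes
#

bpftrace uprobe function names do not support special characters such as dots and parentheses.

Newer bpftrace versions support dots: https://github.com/bpftrace/bpftrace/issues/548
The latest version still does not support parentheses.

Go function names usually contain the full Go package path, for example git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo, which makes bpftrace fail when it parses the function name:

If a function name is too long, bpftrace also reports an error; the limit can be raised through the BPFTRACE_MAX_STRLEN environment variable, up to a maximum of 32k
https://github.com/bpftrace/bpftrace/issues/3617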

#bpftrace  --version
bpftrace v0.11.2

#bpftrace -l uprobe:/cloud/my-agent |grep SaveNet
uprobe:myagentt/current/bin/my-agent/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
uprobe:myagentt/current/bin/my-agent/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

#bpftrace -e 'uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo { @[ustack] = count(); }'
stdin:1:1-97: ERROR: syntax error, unexpected -, expecting {
uprobe:/cloud/my-agent:git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo { @[ustack] = count(); }
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

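Workarounds: use 1. the function address, 2. a wildcard in the function name path, or 3. the function name path as a quoted string

The function name path after uprobe supports wildcards, so * can be used to match or skip the special characters.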

# OK
#bpftrace -e 'uprobe:/cloud/my-agent:*SaveNetworkLocateInfo { @[ustack] = count(); }'

# OK
#bpftrace -e 'uprobe:/cloud/my-agent:gitlab.comp*SaveNetworkLocateInfo { @[ustack] = count(); }'
Attaching 2 probes...

# OK
#bpftrace -e 'uprobe:/cloud/my-agent:gitlab.comp*SaveNetwork*Info { @[ustack] = count(); }'

# Wrap the function name path in double quotes, OK
#bpftrace -e 'uprobe:/cloud/my-agent:"git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo" { @[ustack] = count(); }'

#nm -n /cloud/my-agent |grep SaveNetwork
00000000012e0220 T git.com/my-agent/pkg/storage.(*ProcessStore).SaveNetworkLocateInfo
00000000012e0480 T git.com/my-agent/pkg/storage.(*ProcessStore).doSaveNetworkLocateInfo

# Use the address, OK
# bpftrace -e 'uprobe:/cloud/my-agent:0x12e0220 { @[ustack] = count(); }'

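uretprobe is incompatible with Go
#

Inspecting the return value of a Go method with uretprobe is risky. uretprobe works by modifying the stack (it replaces the return address with a trampoline), which can conflict with the Go runtime's own stack management.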

// example.go
package main

import (
  "fmt"
  "time"
)

// myprint runs in its own goroutine for every input string.
func myprint(s string) {
  fmt.Printf("Input: %s\n", s)
}

func main() {
  ss := []string{"a", "b", "c"}
  for _, s := range ss {
    go myprint(s)
  }
  time.Sleep(1*time.Second)
}

The bpftrace uretprobe makes the program crash:

# bpftrace -e 'uretprobe:./test:main.myprint { @=count(); }' -c ./test
runtime: unexpected return pc for main.myprint called from 0x7fffffffe000
stack: frame={sp:0xc00008cf60, fp:0xc00008cfd0} stack=[0xc00008c000,0xc00008d000)
fatal error: unknown caller pc

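Although uretprobe is unsafe with Go programs, uprobe can still be used with confidence. Looked at from another angle, even without uretprobe there are still ways to obtain the return value: for example, place a uprobe at the point where the method returns, or at the entry of a method.

Reference: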

https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc#uprobe-uretprobe

Parsing skb in bpftrace
#

If $ipheader->daddr is 192.168.2.44, convert the four numbers to hex bytes: c0 a8 02 2c. Reversed, the bytes are 2c 02 a8 c0, and that reversed order is the one to use when writing the constant to compare against: https://stackoverflow.com/questions/75172893/comparing-ip-addresses-in-bpftrace
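
The byte reversal described above can be computed rather than done by hand. A minimal Python sketch, assuming a little-endian host; ipv4_as_le_int is an illustrative name.

```python
import socket


def ipv4_as_le_int(addr):
    """Integer a little-endian host sees when it loads the 4 network-order bytes."""
    return int.from_bytes(socket.inet_aton(addr), "little")


print(hex(ipv4_as_le_int("192.168.2.44")))
```

The printed constant is the value that the raw daddr field holds on such a host, so it can be compared with the field directly.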

root@lima-ebpf-dev:~# bpftrace -lv 'kfunc:ip_finish_output'
kfunc:vmlinux:ip_finish_output
struct net * net
struct sock * sk
struct sk_buff * skb
int retval

# bpftrace can read the system's kernel headers
root@lima-ebpf-dev:~# cat /Users/zhangjun/skb.bt
#include <linux/skbuff.h>
#include <linux/icmp.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>

kfunc:ip_finish_output {
  $skb = (struct sk_buff *)args.skb;
  $dev = $skb->dev;
  $name = $dev->name;
  $ipheader = ((struct iphdr *) ($skb->head + $skb->network_header));
  $version = ($ipheader->version) >>4;

  if($ipheader->protocol == IPPROTO_ICMP) {
    // get ICMP header; see skb_transport_header():
     $icmph = (struct icmphdr *)($skb->head + $skb->transport_header);
     if ($icmph->type == ICMP_ECHO) {
       $id = $icmph->un.echo.id;
       $seq = $icmph->un.echo.sequence;

     printf("icmp: pid %d, comm: %s, [%d] %d\t%s > %s\n, id: %d, seq: %d, dev: %s\n", pid, comm, $version, $ipheader->protocol,
                 ntop($ipheader->saddr), ntop($ipheader->daddr), $id, $seq, $name);
       }
  }
}

root@lima-ebpf-dev:~# bpftrace /Users/zhangjun/skb.bt
Attaching 1 probe...
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 48640, dev: eth0
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 48896, dev: eth0
icmp: pid 206295, comm: ping, [0] 1     192.168.5.1 > 114.114.114.114
, id: 5376, seq: 49152, dev: eth0

Another example: https://lwn.net/Articles/793749/

There is an important capability missing from those one-liners: struct navigation. Here is the function prototype again: int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);

bpftrace provides arg0-argN for kprobe function arguments, simply mapping them to the registers for the calling convention (arg2 becomes %rdx on x86_64, for example). Since bpftrace can read kernel headers, which are often installed on production systems, accessing struct data is possible by including the right header and casting the arguments:

#include <net/sock.h>
[...]
        $sk = (struct sock *)arg0;

Here’s an example of a bpftrace tool that prints the address information, size, and return value from tcp_sendmsg(). Example output:

# ./tcp_sendmsg.bt
Attaching 2 probes...
10.0.0.65       49978 -> 52.37.243.173   443  : 63 bytes, retval 63
127.0.0.1       58566 -> 127.0.0.1       22   : 36 bytes, retval 36
127.0.0.1       22    -> 127.0.0.1       58566: 36 bytes, retval 36
[...]

The source of tcp_sendmsg.bt:

#!/usr/local/bin/bpftrace

#include <net/sock.h>

k:tcp_sendmsg
{
  @sk[tid] = arg0;
  @size[tid] = arg2;
}

kr:tcp_sendmsg
/@sk[tid]/
{
  $sk = (struct sock *)@sk[tid];
  $size = @size[tid];
  $af = $sk->__sk_common.skc_family;
  if ($af == AF_INET) {
    $daddr = ntop($af, $sk->__sk_common.skc_daddr);
    $saddr = ntop($af, $sk->__sk_common.skc_rcv_saddr);
    $lport = $sk->__sk_common.skc_num;

    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);

    printf("%-15s %-5d -> %-15s %-5d: %d bytes, retval %d\n",
        $saddr, $lport, $daddr, $dport, $size, retval);
  } else {
    printf("IPv6...\n");
  }
  delete(@sk[tid]);
  delete(@size[tid]);
}

In the kprobe, sk and size are saved in per-thread-ID maps, so they can be retrieved in the kretprobe when tcp_sendmsg() returns. The kretprobe casts sk and prints out details, if it is an IPv4 message, using the bpftrace function ntop() to convert the address to a string. The destination port is flipped from network to host order. To keep this short I skipped IPv6, but you can add code to handle it too (ntop() does support IPv6 addresses).

There is work underway for bpftrace to use BPF Type Format (BTF) information as well, which brings various advantages including struct definitions that are missing from kernel headers.


2025-01-12