Garden | SystemTap

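SystemTap is a very useful debugging tool for Linux. It provides static and dynamic tracing at both user level and kernel level: for example, it can read the runtime values of variables inside a function, capture call stacks, and even modify variable values directly, which is a great help when diagnosing performance or functional problems. SystemTap uses other kernel frameworks as its data sources: tracepoints for static probes, kprobes for dynamic probes, and uprobes for user-level probes. It offers a simple command-line interface, a concise scripting language, and a rich collection of tapsets and examples. This article describes how it works and how to use it; all code in the article can be found on GitHub.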

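How It Works

SystemTap works as shown in the figure below. The main steps are:

The developer writes an stp script, probe.stp, in the SystemTap language.
The stp script is parsed, which is mainly lexical and syntax analysis.
The stp code is elaborated (semantic analysis).
The script is translated into C, producing the intermediate code probe.c.
probe.c is compiled into a kernel module.
After the kernel module is loaded, every probed event is attached to the kernel as a hook; when an event occurs on any processor, the handler on the corresponding hook is executed.
When the SystemTap session ends, the hooks are detached from the kernel and the module is removed.

Below is a simple hello world example. Running stap hello.stp carries out all of the steps above.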

# cat hello.stp
probe begin {
    print("Hello World\n")
    exit()
}

probe end {
    print("Goodbye World\n")
}

# stap hello.stp
Hello World
Goodbye World

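Syntax Rules

The core idea of SystemTap is to define an event and supply a handler for it. When a given event occurs, the kernel runs the handler as if it were a quick subroutine call, and then resumes its original execution. Two concepts are involved:

Event: SystemTap defines many kinds of events, such as entering or leaving a kernel function, a timer expiring, or the SystemTap session starting or ending.
Handler: a sequence of script statements describing what to do when the event occurs, typically extracting data from the event context, storing it in internal variables, or printing it.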
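
The event/handler pairing can be made concrete with a small sketch. It assumes nothing beyond a working stap installation; the variable name ticks is chosen here purely for illustration. The event is a one-second timer, and the handler counts the ticks and ends the session after the third one.

```systemtap
global ticks

# Event: a timer that fires once per second.
probe timer.s(1) {
    # Handler: update a global counter and print some context.
    ticks++
    printf("tick %d on cpu %d\n", ticks, cpu())
    if (ticks >= 3)
        exit()
}
```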

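probe point

probe is the keyword SystemTap uses to collect data.
A probe point is the moment at which the probe acts: the event that the probe monitors. Once that event is triggered, the probe is inserted into the kernel or the user process at that point.
A probe handler is the action performed once the probe has been inserted into the kernel or the user process.

The general form of a probe is shown below. In the Hello World example, begin and end are probe points and statement is the handling logic for the probe point. There the statement is little more than a single print call, but a statement can be an arbitrarily complex block of code.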

probe probe-point { statement }

Probe point syntax:

kernel.function(PATTERN)
kernel.function(PATTERN).call
kernel.function(PATTERN).return
kernel.function(PATTERN).return.maxactive(VALUE)
kernel.function(PATTERN).inline
kernel.function(PATTERN).label(LPATTERN)
module(MPATTERN).function(PATTERN)
module(MPATTERN).function(PATTERN).call
module(MPATTERN).function(PATTERN).return.maxactive(VALUE)
module(MPATTERN).function(PATTERN).inline
kernel.statement(PATTERN)
kernel.statement(ADDRESS).absolute
module(MPATTERN).statement(PATTERN)
process(PROCESSPATH).function(PATTERN)
process(PROCESSPATH).function(PATTERN).call
process(PROCESSPATH).function(PATTERN).return
process(PROCESSPATH).function(PATTERN).inline
process(PROCESSPATH).statement(PATTERN)

The PATTERN syntax is:

func[@file]
func@file:linenumber

For example:

kernel.function("*init*")
module("ext3").function("*")
kernel.statement("*@kernel/time.c:296")
process("/home/admin/bin/nginx").function("ngx_http_process_request")

In a .return probe, $return gives the function's return value. An inline function cannot have a .return probe installed, so $return cannot be used to obtain its return value.
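
As a rough illustration of $return, the sketch below prints the value returned by a kernel function for five seconds. It assumes that the running kernel has a non-inlined vfs_read function and that matching kernel debuginfo is installed; both are assumptions about the target system rather than something stated in this article.

```systemtap
# Print the return value of vfs_read (assumed to exist in the running kernel).
probe kernel.function("vfs_read").return {
    printf("%s(%d) vfs_read returned %d\n", execname(), pid(), $return)
}

# End the session after five seconds.
probe timer.s(5) { exit() }
```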

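SystemTap supports many built-in probe points. They are defined in scripts written in advance by the SystemTap project, which are called tapsets. See the official tapset reference manual for how to use these library functions. After installation they are usually located in /usr/share/systemtap/tapset; to use stap scripts from another directory, add the -I option.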
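
Much of a tapset consists of probe aliases, which give a friendly name to an underlying probe point and prepare variables for the handlers that use it. The sketch below shows the idea in a single file. The alias name file_open and the variable who are hypothetical, and it assumes the syscall.openat tapset probe on the target system exposes a filename variable.

```systemtap
# Hypothetical alias "file_open": wraps syscall.openat and exports "who".
probe file_open = syscall.openat {
    who = sprintf("%s(%d)", execname(), pid())
}

# A handler attached to the alias sees both "who" and the wrapped
# probe's own variables, such as filename.
probe file_open {
    printf("%s opened %s\n", who, filename)
}

probe timer.s(5) { exit() }
```

In a real tapset the alias definition would live in its own .stp file inside a directory passed with -I, and the script would contain only the second and third probes.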

Commonly used probe points include:

                                     begin  # The startup of the systemtap session.
                                       end  # The end of the systemtap session.
               kernel.function("sys_open")  # The entry to the function named sys_open in the kernel.
                      syscall.close.return  # The return from the close system call.
      module("ext3").statement(0xdeadbeef)  # The addressed instruction in the ext3 filesystem driver.
                             timer.ms(200)  # A timer that fires every 200 milliseconds.
                             timer.profile  # A timer that fires periodically on every CPU.
                      perf.hw.cache_misses  # A particular number of CPU cache misses have occurred.
                     procfs("status").read  # A process trying to read a synthetic file.
process("a.out").statement("*@main.c:200")  # Line 200 of the a.out program.
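
Two of the probe points listed above can be combined into a very small sampling profiler. This is only a sketch: the array name samples and the five-second window are arbitrary choices made for illustration.

```systemtap
global samples

# Fires periodically on every CPU; count which process was running.
probe timer.profile {
    samples[execname()]++
}

# After five seconds, print the five most frequently sampled processes.
probe timer.s(5) {
    foreach (name in samples- limit 5)
        printf("%-16s %d\n", name, samples[name])
    exit()
}
```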

A number of commonly used printable values are also provided, for example:

             tid()  # The id of the current thread.
             pid()  # The process (task group) id of the current thread.
             uid()  # The id of the current user.
        execname()  # The name of the current process.
             cpu()  # The current cpu number.
  gettimeofday_s()  # Number of seconds since epoch.
      get_cycles()  # Snapshot of hardware cycle counter.
              pp()  # A string describing the probe point being currently handled.
          ppfunc()  # If known, the function name in which this probe was placed.
            $$vars  # If available, a pretty-printed listing of all local variables in scope.
 print_backtrace()  # If possible, print a kernel backtrace.
print_ubacktrace()  # If possible, print a user-space backtrace.
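
Several of these values are typically combined in one printf. The following sketch assumes the syscall.openat tapset probe is available on the target system and simply reports who triggered it for three seconds.

```systemtap
# Print context information each time an openat system call is entered.
probe syscall.openat {
    printf("%d %s[pid=%d tid=%d uid=%d] cpu=%d at %s\n",
           gettimeofday_s(), execname(), pid(), tid(), uid(), cpu(), pp())
}

probe timer.s(3) { exit() }
```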

The following example shows how to probe a user program with SystemTap:

#include <stdio.h>
int main(int argc, char *argv[])
{
    int a;
    a = 1;
    printf("a:%d\n", a);
    a = 2;
    printf("a:%d\n", a);
    return 0;
}

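The following SystemTap script probes lines 8 and 10 of cc_stap_test.c and prints the value of the variable a at each point. The line numbers refer to the source file as it was compiled; judging from the output below, line 8 is reached while a is still 1 and line 10 after a has become 2. The program must be compiled with debug information (for example with gcc -g) so that the statement probes and $a can be resolved: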

probe process("./a.out").statement("main@./cc_stap_test.c:8")
{
    printf("systemtap probe line 8 a:%d\n", $a);
}
probe process("./a.out").statement("main@./cc_stap_test.c:10")
{
    printf("systemtap probe line 10 a:%d\n", $a);
}

The output is:

$ sudo stap cc_stap_test.stp -c ./a.out
a:1
a:2
systemtap probe line 8 a:1
systemtap probe line 10 a:2
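
Function probes work on the same program. The sketch below assumes the same ./a.out binary, built with debug information, and uses $$parms (the pretty-printed function parameters) and $return. It can be run the same way as above, with sudo stap and -c ./a.out.

```systemtap
# On entry to main, print the function name and its parameters.
probe process("./a.out").function("main") {
    printf("entering %s, parameters: %s\n", ppfunc(), $$parms)
}

# On return from main, print the return value.
probe process("./a.out").function("main").return {
    printf("main returned %d\n", $return)
}
```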

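Basic Syntax

Script Naming

Any valid Linux file name works as a script name. By convention scripts use the .stp suffix so that they are recognizable as SystemTap scripts, for example memory.stp.

Comments

Scripts support several comment styles: #, //, and /* */ all work, so use whichever you prefer. Like other scripts, a SystemTap script that is meant to be executed directly should name its interpreter on the first line, usually #!/usr/bin/stap. If you are not sure of the path, whereis stap will locate the interpreter.

Variables

Variable names consist of letters and digits and may also contain dollar signs and underscores; they must not start with a digit. A variable can be declared anywhere in a function or simply used without a declaration, in which case its type is inferred from its first use. By default a variable is local to the probe handler or function in which it appears. To define a global variable, declare it with global anywhere outside a probe or function.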

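global data1    # global: declared outside any probe or function
probe begin {
    data1 = 1   # visible in every probe and function
    data2 = 2   # local: visible only inside this handler
    printf("%d %d\n", data1, data2)
    exit()
}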

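Arrays

Arrays must be declared as global variables. By default an array holds at most 2048 entries (MAXMAPENTRIES). The size can be omitted from the declaration unless you need an array larger than 2048 entries: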

global mybigarr[20000]
global myarr
probe begin
{
	myarr[0] = 1
	myarr[1] = 2
	myarr[3] = 4
	foreach(x in myarr) {
		printf("%d\n",myarr[x])
	}
}

SystemTap arrays are associative arrays: an index (key) consists of one or more string or integer values separated by commas:

# the key is the index
arr1["foo"] = 14
arr2["coords",3,42,7] = "test"
# delete the whole array
delete myarr
# delete a single array element
delete myarr[tid()]
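
The fragments above can be put together in a self-contained sketch that also uses the in operator to test whether a key is present. The array name ports and its contents are invented for illustration.

```systemtap
global ports

probe begin {
    # Two-part keys: a string and an integer.
    ports["nginx", 80] = 1
    ports["sshd", 22] = 1

    # Membership test with "in".
    if (["nginx", 80] in ports)
        printf("nginx:80 is present\n")

    # Remove one element, then test again.
    delete ports["nginx", 80]
    if (!(["nginx", 80] in ports))
        printf("nginx:80 was removed\n")

    exit()
}
```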

Conditionals

Usage is the same as in C:

if (xxx)
    xxx
else
    xxx
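
A concrete instance of the template above might look like this; the value 7 is arbitrary and the script needs no probes other than begin.

```systemtap
probe begin {
    n = 7
    if (n % 2 == 0)
        printf("%d is even\n", n)
    else if (n > 5)
        printf("%d is odd and greater than 5\n", n)
    else
        printf("%d is odd\n", n)
    exit()
}
```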

Loops

Basic usage is the same as in C, for example:

for (i = 0; i < 10; i++) { ... }
while (i<10) { ... }

In addition, there is a special loop for arrays, foreach:

global myarr
probe begin
{
	myarr[0] = 1
	myarr[1] = 2
	myarr[3] = 4
	foreach(x in myarr) {
		printf("%d\n",myarr[x])
	}
}
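
foreach can also sort and truncate its iteration: a - (or +) after the array name sorts by value in descending (or ascending) order, and limit caps the number of iterations. A minimal sketch with made-up data:

```systemtap
global hits

probe begin {
    hits["a"] = 3
    hits["b"] = 10
    hits["c"] = 7

    # Visit the two largest values first: b=10, then c=7.
    foreach (key in hits- limit 2)
        printf("%s=%d\n", key, hits[key])

    exit()
}
```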

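Functions

Ordinary functions are declared with function. The return type follows the function name after a colon, and each parameter's type follows the parameter name after a colon. Multiple parameters are separated by commas, for example: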

# Both the return value and the parameter are of type long
function is_open_creating:long (flag:long)
{
      CREAT_FLAG = 4
      if (flag & CREAT_FLAG){
          return 1
      }
      return 0
}
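
Functions can return strings as well as integers. The sketch below declares a hypothetical helper, sign_name, with a string return type and calls it from a begin probe.

```systemtap
# Hypothetical helper: the return type is string, the parameter is long.
function sign_name:string (n:long)
{
    if (n < 0)
        return "negative"
    if (n == 0)
        return "zero"
    return "positive"
}

probe begin {
    printf("%d is %s\n", -5, sign_name(-5))
    printf("%d is %s\n", 0, sign_name(0))
    exit()
}
```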

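The other kind of function is the probe. The following uses kernel functions and module functions as examples to introduce several common usages.

General format for probing kernel and module functions: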

# kernel
probe kernel.function("kernel_function_name") { ... }
# module:
probe module("module_name").function("module_function_name") { ... }

Function names support wildcards, for example:

# all functions with the ext3_get prefix
probe module("ext3").function("ext3_get*") { ... }

Probes that share the same handler can be combined in a single definition:

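# Several probe points can be listed, separated by commas (the probe keyword
# appears only once). If a probed function does not exist, compilation fails.
probe module("ext3").function("ext3_get*"),
      module("ext3").function("ext3_set*")
{
    print("getting or setting something here\n")
}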

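Function names sometimes differ between kernel versions, and some functions do not exist in every version. SystemTap provides optional probes and conditional probes to handle this flexibly: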

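# Optional probe (?): takes effect only if the function exists
probe kernel.function("may_not_exist") ? { ... }

# Tried in order (!): once an earlier probe point resolves, the later ones are ignored
probe kernel.function("this_might_exist") !,
      kernel.function("if_not_then_this_should") !,
      kernel.function("if_all_else_fails") { ... }

# Conditional probe: for dynamic conditions known only at run time (someval is a global variable)
probe kernel.function("some_func") if (someval > 10) { ... }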
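
As a concrete sketch of the ! form, the probe below prefers one kernel function and falls back to another. The function names do_sys_openat2 and do_sys_open are assumptions about what particular kernel versions provide, not something taken from this article; substitute whatever names apply to the kernels you target.

```systemtap
# Use do_sys_openat2 if it resolves; otherwise fall back to do_sys_open.
# Both names are assumed here and depend on the kernel version.
probe kernel.function("do_sys_openat2") !,
      kernel.function("do_sys_open") {
    printf("%s hit %s\n", execname(), ppfunc())
}

probe timer.s(3) { exit() }
```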

You can also append .call or .return to a function probe point to probe when the function is called or when it returns, respectively:

# Probe when build_open_flags is called; the handler prints the value of the rbp register
probe kernel.function("build_open_flags").call {
    printf("rbp=%p\n", register("rbp"));
}
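
A common use of a .call and .return pair is measuring how long a function takes. The sketch below is one way to do it, assuming a non-inlined vfs_read in the running kernel and installed debuginfo. The names start and lat are illustrative, and <<< adds a sample to a statistics aggregate.

```systemtap
global start, lat

# Record the entry time per thread.
probe kernel.function("vfs_read").call {
    start[tid()] = gettimeofday_us()
}

# On return, add the elapsed time to the aggregate.
probe kernel.function("vfs_read").return {
    t = start[tid()]
    if (t) {
        lat <<< gettimeofday_us() - t
        delete start[tid()]
    }
}

# Summarize after five seconds; skip the report if nothing was recorded.
probe timer.s(5) {
    if (@count(lat))
        printf("vfs_read: %d calls, avg %d us, max %d us\n",
               @count(lat), @avg(lat), @max(lat))
    exit()
}
```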

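Passing Arguments on the Command Line

As with shell scripts, a script can refer to arguments passed on the command line. An stp script must know each argument's type in advance, however, because integer and string arguments are referenced differently.

Integer arguments are referenced as $N, where N is the position of the argument (starting at 1).
String arguments are referenced as @N, where N is the position of the argument (starting at 1). If the string contains spaces, enclose it in double quotes on the command line; otherwise it is treated as two arguments.

Example: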

Command line:
    stap script.stp sometext 42
Reference:
    printf("arg1: %s, arg2: %d\n", @1, $2)

Command line:
    stap script.stp "sometext nexttxt" 42
Reference:
    printf("arg1: %s, arg2: %d\n", @1, $2)
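
Arguments can be used in probe points as well as in handlers. In the sketch below, which assumes a hypothetical file name timed.stp, the first argument is an integer number of seconds and the second is a string label.

```systemtap
# Usage (hypothetical file name): stap timed.stp 5 "my label"
probe begin {
    printf("%s: will run for %d seconds\n", @2, $1)
}

# $1 is substituted as an integer literal, so it can parameterize the timer.
probe timer.s($1) {
    printf("%s: done\n", @2)
    exit()
}
```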

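Hands-on Demo

After SystemTap is installed, many example scripts are available, covering mainly network, io, interrupt, locks, memory, process, and virtualization.

The following example monitors the packets received and transmitted by every process: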

global recv, xmit

probe begin {
	printf("Starting network capture...Press ^C to terminate\n")
}

probe netdev.receive {
	recv[dev_name, pid(), execname()] <<< length
}

probe netdev.transmit {
	xmit[dev_name, pid(), execname()] <<< length
}

probe end {
	printf("\nCapture terminated\n\n")
	printf("%-5s %-15s %-10s %-10s %-10s\n",
		"If", "Process", "Pid", "RcvPktCnt", "XmtPktCnt")

	foreach([dev, pid, name] in recv) {
		recvcnt = @count(recv[dev, pid, name])
		xmtcnt =  @count(xmit[dev, pid, name])
		printf("%-5s %-15s %-10d %-10d %-10d\n", dev, name, pid, recvcnt, xmtcnt)
	}
}
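
The <<< operator used above adds a sample to a statistics aggregate, and @count is only one of the extractors available. The following standalone sketch, with made-up packet sizes, shows what the same aggregate can report.

```systemtap
global sizes

probe begin {
    # Record three samples under one key.
    sizes["eth0"] <<< 100
    sizes["eth0"] <<< 300
    sizes["eth0"] <<< 200

    # Prints: count=3 sum=600 avg=200 min=100 max=300
    printf("count=%d sum=%d avg=%d min=%d max=%d\n",
           @count(sizes["eth0"]), @sum(sizes["eth0"]), @avg(sizes["eth0"]),
           @min(sizes["eth0"]), @max(sizes["eth0"]))
    exit()
}
```

In the network script, replacing @count with @sum would report bytes instead of packet counts, since length is the value being recorded.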

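If running the script above fails with semantic error: while resolving probe point: identifier 'kernel', the system has no kernel symbol information. The debuginfo RPMs that match the running kernel must be installed manually; they can be found at http://debuginfo.centos.org: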

kernel-debuginfo-common-$(uname -r)
kernel-debuginfo-$(uname -r)

Once they are downloaded, install them with the rpm command:

rpm -ivh kernel-debuginfo-*.rpm

With that problem solved, the script prints output like the following:

# stap net.stp
Starting network capture...Press ^C to terminate
^C
Capture terminated

If    Process         Pid        RcvPktCnt  XmtPktCnt
cbr0  coredns         7459       11         3
cbr0  kubelet         5445       10         13
eth0  kubelet         5445       7          36
eth0  swapper/0       0          177        67
eth0  dockerd         5328       5          2
eth0  ksoftirqd/0     9          1          0
cbr0  swapper/1       0          27         32
eth0  sssd_nss        662        1          1
eth0  YDService       41592      4          40
eth0  docker-containe 5336       5          0
cbr0  dockerd         5328       1          1
eth0  iptables        45087      1          0
eth0  swapper/1       0          235        239