Garden | Kprobes

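Linux kprobes is a lightweight kernel debugging facility that kernel developers designed specifically for tracing the execution of kernel functions. With kprobes, a developer can dynamically insert probe points into the vast majority of kernel functions to collect debugging state, with almost no effect on the kernel's normal execution flow. kprobes currently offers three kinds of probes: kprobe, jprobe, and kretprobe. jprobe and kretprobe are both built on top of kprobe, and each targets a different probing scenario. This article first briefly describes the principles of these three probes and how they differ, then concentrates on kprobe, with a simple example showing how to probe a kernel function, and finally analyzes how kprobe is implemented. All sample code in this article can be found on GitHub.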

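Background

While debugging the kernel or a module, developers often need to know whether certain functions are called, when they are called, whether they execute correctly, and what their arguments and return values are. The simplest approach is to add log statements to the relevant functions, but that usually means rebuilding the kernel or module and rebooting the device. The procedure is cumbersome and may even disturb the original code's execution.

With kprobes, users define their own callback functions and then dynamically insert probe points into almost any function in the kernel or a module (some functions cannot be probed, such as the functions that implement kprobes itself; this is covered in detail later). When execution reaches the probed location, the callback is invoked so the user can collect whatever information is needed, and the kernel then returns to its normal execution flow. Once enough information has been collected and probing is no longer needed, the probe point can be removed just as dynamically. kprobes therefore has little impact on kernel execution and is convenient to use.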

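Basic components

Linux kprobes comprises three kinds of probes:

kprobe: the most basic probe and the foundation for the other two. It can be placed at any location (even at an individual instruction inside a function) and provides three callbacks, invoked before the probed instruction, after it, and on a memory access fault:
- pre_handler: called before the probed instruction executes
- post_handler: called after the probed instruction finishes executing (note: the instruction, not the probed function)
- fault_handler: called when a memory access fault occurs
jprobe: built on kprobe; used to obtain the arguments of the probed function
kretprobe: as its name suggests, used to obtain the return value of the probed function; it is also built on kprobe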

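kprobes is not a pure software solution; it also relies on support from the hardware architecture, so not every architecture supports it. It relies on:

CPU exception handling: diverts the execution flow into the callbacks registered by the user
Single-stepping: used to execute the probed instruction one step at a time

kprobes currently supports many architectures, including i386, x86_64, ppc64, ia64, sparc64, arm, ppc, and mips. See the kprobes page in the Linux kernel documentation for more.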

Characteristics

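kprobes allows multiple kprobes to be registered at the same probed location, but there cannot currently be multiple jprobes on the same function at the same time. In addition, a probe point that has a jprobe or a post_handler cannot be optimized;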
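In general, any function in the kernel can be probed, including interrupt handlers. The exceptions are the functions that implement kprobes itself in kernel/kprobes.c and arch/*/kernel/kprobes.c, as well as do_page_fault and notifier_call_chain;
If an inline function is used as a probe point, kprobes may not be able to place probes on every instance of that function. Because gcc may automatically inline some functions, the probe may not behave as the user expects;
A probe handler can modify the context in which the probed function runs, for example by changing kernel data structures or by changing the register values saved in struct pt_regs before the probe was hit. kprobes can therefore be used to install bug fixes or to inject faults for testing;
kprobes avoids invoking another probe's handlers while a probe is already being handled. For example, if a probe is registered on printk() and its handler calls printk() again, the printk probe's handlers are not triggered a second time; only the nmissed field of the kprobe structure is incremented;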
kprobes does not take mutexes or allocate memory dynamically, except during registration and unregistration;
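kprobes handlers run with kernel preemption disabled, and depending on the CPU architecture they may also run with interrupts disabled. In any case, a handler must not call anything that may give up the CPU (for example, acquiring a semaphore or a mutex);
kretprobe works by replacing the return address with the address of a predefined trampoline, so stack backtraces and the gcc builtin __builtin_return_address() return the trampoline's address rather than the real return address of the probed function;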
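If the number of times a function is called does not match the number of times it returns, a kretprobe registered on it may not behave as expected. do_exit() is such a function (the kernel documentation notes that this case is handled), while do_execve() and do_fork() are not a problem;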
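If the CPU is running on a stack other than the current task's when a function is entered or exited, registering a kretprobe on that function may have unpredictable results. For this reason kprobes does not support kretprobes on __switch_to() on x86_64, and registration returns -EINVAL directly.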

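How it works

The following describes how a kprobe works; the flow is shown in the figure below:

When a user registers a probe point, kprobe first backs up the instruction at the probed location and then replaces the start of that instruction with a breakpoint instruction. The breakpoint is architecture specific: int3 on i386 and x86_64, and an undefined instruction on arm. (x86_64 currently also supports jump optimization: with CONFIG_OPTPROBES enabled, a jump instruction is used in place of the breakpoint);
When the CPU reaches the breakpoint at the probe point, a trap is raised. The trap path saves the current CPU registers and calls the corresponding trap handler, which sets the kprobe's state and calls the pre_handler registered by the user, passing it the address of the registered struct kprobe and the saved CPU registers;
kprobe then single-steps the copy of the probed instruction made earlier. How this is done varies by architecture: arm executes it with an emulation function inside the exception path, whereas x86_64 sets the single-step flag and returns to the context that raised the exception;
After the single step completes, kprobe runs the post_handler registered by the user;
Finally, execution resumes normally at the instruction following the probed one.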
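
The five steps above can be sketched as a small userspace model. This is not kernel code: a plain byte array stands in for kernel text, single-stepping is reduced to recording an event, and every name here (model_probe, model_arm, model_hit, model_disarm, sample_pre, sample_post) is hypothetical. Only the value 0xCC, the x86 int3 opcode, is taken from the real architecture.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_BREAKPOINT 0xCC  /* x86 int3 opcode */

/* Hypothetical, heavily simplified stand-in for struct kprobe. */
struct model_probe {
    uint8_t *addr;                        /* probed location */
    uint8_t saved_opcode;                 /* original first byte */
    void (*pre)(struct model_probe *p);   /* runs before the instruction */
    void (*post)(struct model_probe *p);  /* runs after the instruction */
    char trace[16];                       /* order of events, for inspection */
    size_t n;
};

static void record(struct model_probe *p, char c)
{
    if (p->n + 1 < sizeof(p->trace))
        p->trace[p->n++] = c;
}

static void sample_pre(struct model_probe *p)  { record(p, 'P'); }
static void sample_post(struct model_probe *p) { record(p, 'O'); }

/* Step 1: back up the original byte and plant the breakpoint. */
static void model_arm(struct model_probe *p)
{
    p->saved_opcode = *p->addr;
    *p->addr = MODEL_BREAKPOINT;
}

/* Steps 2-5: what happens when execution reaches the probed address. */
static int model_hit(struct model_probe *p)
{
    if (*p->addr != MODEL_BREAKPOINT)
        return 0;       /* not armed: nothing to handle */
    if (p->pre)
        p->pre(p);      /* step 2: pre_handler */
    record(p, 'S');     /* step 3: single-step the saved instruction */
    if (p->post)
        p->post(p);     /* step 4: post_handler */
    return 1;           /* step 5: resume after the probed instruction */
}

/* Removing the probe restores the original byte. */
static void model_disarm(struct model_probe *p)
{
    *p->addr = p->saved_opcode;
}
```

Arming a probe on a byte array and calling model_hit() leaves "PSO" in the trace, which is the pre_handler, single-step, post_handler ordering described above.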

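Usage example

There are currently two ways to probe a function with kprobe; the first one is demonstrated below.

Write a kernel module that registers the probe point with the kernel. The probe handlers can be tailored as needed, which makes this approach flexible and convenient;
Use kprobes on ftrace, which combines kprobe with ftrace; that is, kprobe is used to enhance ftrace when tracing function calls

The figure below shows the structure of KProbes:

Most of the probe logic runs in the context of the breakpoint and debug exception handlers, which make up the KProbes architecture dependent layer
The KProbes Manager is the architecture independent layer; it registers and unregisters probes
The probe handlers that users provide in a kernel module are registered through the KProbes Manager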

kprobe interface

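The kernel provides struct kprobe and a set of API functions. Users implement their probe callbacks, fill in a struct kprobe, and register it with the kernel's kprobes subsystem to start probing. The kernel's samples/kprobes directory contains an example, kprobe_example.c, that shows the simplest form of a kprobe module; developers can use it as a template for their own probe modules.

struct kprobe is defined as follows: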

struct kprobe {
    struct hlist_node hlist;                    /* Internal */
    struct list_head list;                      /* list of kprobes for multi-handler support */
    kprobe_opcode_t *addr;                      /* location of the probe point */
    kprobe_pre_handler_t pre_handler;           /* called before addr is executed. */
    kprobe_post_handler_t post_handler;         /* called after addr is executed, unless... */
    kprobe_fault_handler_t fault_handler;       /* called if executing addr causes a fault (eg. page fault). */
    kprobe_break_handler_t break_handler;       /* called if breakpoint trap occurs in probe handler. */
    kprobe_opcode_t opcode;                     /* saved opcode (which has been replaced with breakpoint) */
    const char *symbol_name;                    /* allow user to indicate symbol name of the probe point */
    unsigned int offset;                        /* offset into the symbol */
    unsigned long nmissed;                      /* count the number of times this probe was temporarily disarmed */
    struct arch_specific_insn ainsn;            /* copy of the original instruction */
    u32 flags;                                  /* Indicates various status flags */
};

The fields have the following meanings:

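struct hlist_node hlist: links the kprobe into the global hash table, indexed by the address of the probe point;
struct list_head list: links the different kprobes registered at the same probe point;
kprobe_opcode_t *addr: address of the probe point;
const char *symbol_name: name of the probed function;
unsigned int offset: offset of the probe point within the function, used to probe an instruction inside the function; 0 means the function entry;
kprobe_pre_handler_t pre_handler: callback invoked before the probed instruction executes;
kprobe_post_handler_t post_handler: callback invoked after the probed instruction executes;
kprobe_fault_handler_t fault_handler: callback invoked if a memory fault occurs while running pre_handler or post_handler, or while single-stepping the probed instruction;
kprobe_break_handler_t break_handler: callback invoked when a breakpoint is hit while a kprobe is being handled; used to implement jprobe;
kprobe_opcode_t opcode: the saved original opcode at the probe point;
struct arch_specific_insn ainsn: copy of the original instruction at the probe point, used for single-stepping; highly architecture specific (may include an instruction emulation function);
u32 flags: status flags

The related API functions are: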

int register_kprobe(struct kprobe *kp);                 /* register a kprobe with the kernel */
void unregister_kprobe(struct kprobe *kp);              /* remove a kprobe */
int register_kprobes(struct kprobe **kps, int num);     /* register an array of kprobes */
void unregister_kprobes(struct kprobe **kps, int num);  /* remove an array of kprobes */
int disable_kprobe(struct kprobe *kp);                  /* temporarily disable a probe */
int enable_kprobe(struct kprobe *kp);                   /* re-enable a disabled probe */

The handler prototypes are defined as follows:

typedef int (*kprobe_pre_handler_t)(struct kprobe*, struct pt_regs*);
typedef void (*kprobe_post_handler_t)(struct kprobe*, struct pt_regs*, unsigned long flags);
typedef int (*kprobe_fault_handler_t)(struct kprobe*, struct pt_regs*, int trapnr);

kprobes manager

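The KProbes Manager is responsible for registering and unregistering KProbes and JProbes, and is implemented in kernel/kprobes.c. Each probe is represented by a struct kprobe and stored in a hash table keyed by the probe's target address. The kprobe_lock spinlock serializes access to the hash table: it is held while a new probe is registered, while an existing probe is unregistered, and while a probe hit is handled, which prevents these operations from running in parallel on SMP machines. Whenever a probe is hit, its handlers are called with interrupts disabled. Interrupts are disabled because handling a probe is a multi-step process involving breakpoint handling and single-stepping of the probed instruction; there is no simple way to save state between these steps, so interrupts stay disabled for the whole time the probe is being handled.

The Manager consists of the functions below, each with a short description. These functions are architecture independent. Reading the code in kernel/kprobes.c alongside these notes will make the whole implementation clear.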

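void lock_kprobes(void): locks KProbes and records the CPU that took the lock
void unlock_kprobes(void): unlocks KProbes and resets the recorded CPU
struct kprobe *get_kprobe(void *addr): given the address of a probed instruction, retrieves the probe from the hash table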
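int register_kprobe(struct kprobe *p): registers a probe at the given address. Registration copies the instruction at the probe address into a buffer that belongs to the probe. On x86 up to 16 bytes are copied, which is enough for the longest possible instruction (the architectural limit is 15 bytes). The instruction at the probed address is then replaced with a breakpoint instruction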
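void unregister_kprobe(struct kprobe *p): unregisters a probe. The original instruction is restored at the given address and the probe structure is removed from the hash table
int register_jprobe(struct jprobe *jp): registers a JProbe at the address of a function. JProbes use the KProbes mechanism: the KProbe's pre_handler is set to the JProbe's own setjmp_pre_handler function and its break_handler is set to longjmp_break_handler. register_kprobe() is then called to register the kprobe structure jp->kp
void unregister_jprobe(struct jprobe *jp): unregisters the kprobe structure used by the JProbe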

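The steps involved in handling a probe are architecture specific and are handled by the functions defined in arch/i386/kernel/kprobes.c.

After a probe is registered, the active probe addresses contain the breakpoint instruction (int3 on x86)
When execution reaches a probed address, int3 is executed and control transfers to do_int3() in arch/i386/kernel/traps.c
do_int3() is invoked through an interrupt gate, so interrupts are disabled when control arrives there
This function notifies KProbes that a breakpoint occurred, and KProbes checks whether the breakpoint was set by the KProbes registration function
If there is no probe at the address that was hit, it simply returns 0. Otherwise, it calls the registered probe handlers.

A JProbe has to transfer control to another function whose prototype matches that of the probed function, and then hand control back to the original function in the same state as before the JProbe ran. A JProbe builds on the KProbe mechanism. Instead of calling a user-defined pre-handler, it installs its own pre-handler, setjmp_pre_handler(), and uses another handler called break_handler. The process has three steps: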

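When the breakpoint is hit, control reaches kprobe_handler(), which calls the JProbe's pre-handler, setjmp_pre_handler(). This function saves the stack and registers before changing eip to the address of the user-defined function. It then returns 1, which tells kprobe_handler() to return immediately instead of setting up single-stepping as it would for a KProbe. On return, control transfers to the user-defined function, which can access the arguments of the original function. When the user-defined function is done, it must call jprobe_return() rather than doing a normal return
jprobe_return() truncates the current stack frame and generates a breakpoint, which transfers control to kprobe_handler() through do_int3(). kprobe_handler() finds that no probe is registered at the address of the generated breakpoint (the address of the int3 instruction inside jprobe_return()), but that KProbes is active on the current CPU. It assumes the breakpoint must have been generated by JProbes and therefore calls the break_handler of the current_kprobe it saved earlier. The break_handler restores the stack and the registers that were saved before control was transferred to the user-defined function, and returns
kprobe_handler() then sets up single-stepping of the instruction at which the JProbe was placed, and the remaining steps are the same as for a KProbe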

kprobe demo

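Below is a simple kprobe demo. The module is very small: it probes the kernel function _do_fork, which is called when the fork system call or the kernel's kernel_thread function creates a process, so it fires very frequently. Let us walk through the code: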

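#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

/* Handlers are defined in the listings that follow. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs);
static void handler_post(struct kprobe *p, struct pt_regs *regs,
                unsigned long flags);
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr);

/* For each probe you need to allocate a kprobe structure */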
static struct kprobe kp = {
    .symbol_name    = "_do_fork",
};

static int __init kprobe_init(void)
{
    int ret;
    kp.pre_handler = handler_pre;
    kp.post_handler = handler_post;
    kp.fault_handler = handler_fault;

    ret = register_kprobe(&kp);
    if (ret < 0) {
        printk(KERN_INFO "register_kprobe failed, returned %d\n", ret);
        return ret;
    }
    printk(KERN_INFO "Planted kprobe at %p\n", kp.addr);
    return 0;
}

static void __exit kprobe_exit(void)
{
    unregister_kprobe(&kp);
    printk(KERN_INFO "kprobe at %p unregistered\n", kp.addr);
}

module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");

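The program defines a struct kprobe instance kp and initializes its symbol_name field to _do_fork, indicating that it will probe the _do_fork function.
The module's init function sets the pre_handler, post_handler, and fault_handler callbacks to handler_pre, handler_post, and handler_fault respectively, and then calls register_kprobe to register the probe.
The module's exit function calls unregister_kprobe to remove the kp probe point.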

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
#ifdef CONFIG_X86
    printk(KERN_INFO "pre_handler: p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
        p->addr, regs->ip, regs->flags);
#endif
#ifdef CONFIG_PPC
    printk(KERN_INFO "pre_handler: p->addr = 0x%p, nip = 0x%lx, msr = 0x%lx\n",
        p->addr, regs->nip, regs->msr);
#endif
#ifdef CONFIG_MIPS
    printk(KERN_INFO "pre_handler: p->addr = 0x%p, epc = 0x%lx, status = 0x%lx\n",
        p->addr, regs->cp0_epc, regs->cp0_status);
#endif
#ifdef CONFIG_TILEGX
    printk(KERN_INFO "pre_handler: p->addr = 0x%p, pc = 0x%lx, ex1 = 0x%lx\n",
        p->addr, regs->pc, regs->ex1);
#endif

    /* A dump_stack() here will give a stack backtrace */
    return 0;
}

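The first argument of handler_pre is the registered struct kprobe instance, and the second is the register state saved when the breakpoint was hit.
handler_pre is called before the probed instruction at the entry of _do_fork executes. It simply prints the address of the probe point and a few of the saved registers.
Because the register layout depends on the CPU architecture, the code is split by architecture with preprocessor conditionals (there is no arm branch here, but arm is supported and one can be added);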

/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
                unsigned long flags)
{
#ifdef CONFIG_X86
    printk(KERN_INFO "post_handler: p->addr = 0x%p, flags = 0x%lx\n",
        p->addr, regs->flags);
#endif
#ifdef CONFIG_PPC
    printk(KERN_INFO "post_handler: p->addr = 0x%p, msr = 0x%lx\n",
        p->addr, regs->msr);
#endif
#ifdef CONFIG_MIPS
    printk(KERN_INFO "post_handler: p->addr = 0x%p, status = 0x%lx\n",
        p->addr, regs->cp0_status);
#endif
#ifdef CONFIG_TILEGX
    printk(KERN_INFO "post_handler: p->addr = 0x%p, ex1 = 0x%lx\n",
        p->addr, regs->ex1);
#endif
}

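The first two arguments of handler_post are the same as for handler_pre; the third is currently unused and is always 0
The function is called after the probed instruction at the entry of _do_fork has been single-stepped (not after _do_fork returns), and what it prints is similar to handler_pre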

/*
 * fault_handler: this is called if an exception is generated for any
 * instruction within the pre- or post-handler, or when Kprobes
 * single-steps the probed instruction.
 */
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
    printk(KERN_INFO "fault_handler: p->addr = 0x%p, trap #%d\n",
        p->addr, trapnr);
    /* Return 0 because we don't handle the fault. */
    return 0;
}

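handler_fault is called if an error occurs while running handler_pre or handler_post, or while single-stepping the probed instruction of _do_fork
The third argument is the trap number of the fault, which is architecture specific; for example, a page fault on i386 is trap 14.

Build the kprobe.ko module out of tree with the following Makefile: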

obj-m := kprobe.o

CROSS_COMPILE=''
KDIR := /lib/modules/$(shell uname -r)/build
all:
        make -C $(KDIR) M=$(PWD) modules
clean:
        rm -f *.ko *.o *.mod.o *.mod.c .*.cmd *.symvers  modul*

After the module is loaded, dmesg shows the following:

[120203.998653] Planted kprobe at 0000000047eac6a1
[120206.440309] pre_handler:  p->addr = 0x0000000047eac6a1, ip = ffffffff94aa0451, flags = 0x287
[120206.440311] post_handler: p->addr = 0x0000000047eac6a1, flags = 0x287
[120202.581030] pre_handler: p->addr = 0x0000000047eac6a1, ip = ffffffff94aa0451, flags = 0x287
[120202.581030] post_handler: p->addr = 0x0000000047eac6a1, flags = 0x287
//...
[120202.581030] kprobe at 0000000047eac6a1 unregistered

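The probe point is printed as 0000000047eac6a1, while the saved ip is ffffffff94aa0451. The two differ because %p prints a hashed value rather than the real kernel address, and because ip already points one byte past the int3 breakpoint. The following command confirms that ffffffff94aa0450, the byte just before ip, is the entry address of _do_fork.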

 $ sudo cat /proc/kallsyms | grep do_fork
 ffffffff94aa0450 T _do_fork

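Source code analysis

This section analyzes the implementation of the kprobes framework in the Linux kernel from the source code: initialization of the framework, registration of a kprobe instance, invocation of kprobe callbacks, and single-stepping.

kprobes initialization

kprobes is initialized by init_kprobes, which operates on the following global data: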

static int kprobes_initialized;
static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];

It first initializes the list heads of the hash tables, which will hold the struct kprobe instances registered later through register_kprobe, and initializes the spinlocks used by kretprobe.

static int __init init_kprobes(void)
{
    int i, err = 0;

    /* FIXME allocate the probe table, currently defined statically */
    /* initialize all list heads */
    for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
        INIT_HLIST_HEAD(&kprobe_table[i]);
        INIT_HLIST_HEAD(&kretprobe_inst_table[i]);
        raw_spin_lock_init(&(kretprobe_table_locks[i].lock));
    }

    err = populate_kprobe_blacklist(__start_kprobe_blacklist,
                    __stop_kprobe_blacklist);
    if (err) {
        pr_err("kprobes: failed to populate blacklist: %d\n", err);
        pr_err("Please take care of using kprobes.\n");
    }

  //...

    err = arch_init_kprobes();
    if (!err)
        err = register_die_notifier(&kprobe_exceptions_nb);
    if (!err)
        err = register_module_notifier(&kprobe_module_nb);

    kprobes_initialized = (err == 0);

  //...
}

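Next, populate_kprobe_blacklist is called to record the functions that implement kprobes in the kprobe_blacklist list, which is consulted later when probe points are registered. kprobe_blacklist holds the critical code paths of the kprobes implementation; this code must not be probed by kprobes itself. In the source, such functions are placed in this section with the NOKPROBE_SYMBOL macro, as with get_kprobe: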

struct kprobe *get_kprobe(void *addr)
{
  //...
}
NOKPROBE_SYMBOL(get_kprobe);
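
The blacklist check used at registration time boils down to a range lookup. The sketch below is a userspace model under stated assumptions: the entry layout, the addresses, and the names model_blacklist_entry and model_within_blacklist are invented for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blacklist entry: a half-open address range [start, end). */
struct model_blacklist_entry {
    uintptr_t start;
    uintptr_t end;
};

/* Made-up ranges standing in for functions marked with NOKPROBE_SYMBOL. */
static const struct model_blacklist_entry model_blacklist[] = {
    { 0x1000, 0x1040 },
    { 0x2000, 0x2100 },
};

/* Returns true if addr falls inside any blacklisted range. */
static bool model_within_blacklist(uintptr_t addr)
{
    size_t n = sizeof(model_blacklist) / sizeof(model_blacklist[0]);

    for (size_t i = 0; i < n; i++)
        if (addr >= model_blacklist[i].start && addr < model_blacklist[i].end)
            return true;
    return false;
}
```

A registration attempt at any address for which this predicate is true would be rejected, which is how the functions that implement kprobes are kept from being probed.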

arch_init_kprobes is then called for architecture-specific initialization. The x86 implementation is empty; the arm implementation is as follows:

int __init arch_init_kprobes()
{
    arm_probes_decode_init();
#ifdef CONFIG_THUMB2_KERNEL
    register_undef_hook(&kprobes_thumb16_break_hook);
    register_undef_hook(&kprobes_thumb32_break_hook);
#else
    register_undef_hook(&kprobes_arm_break_hook);
#endif
    return 0;
}

Back in init_kprobes, the die and module notifier blocks kprobe_exceptions_nb and kprobe_module_nb are registered next:

static struct notifier_block kprobe_exceptions_nb = {
    .notifier_call = kprobe_exceptions_notify,
    .priority = 0x7fffffff /* we need to be notified first */
};

static struct notifier_block kprobe_module_nb = {
    .notifier_call = kprobes_module_callback,
    .priority = 0
};

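kprobe_exceptions_nb has a very high priority, so if a memory fault occurs while a handler is running or while the probed instruction is being single-stepped, kprobe_exceptions_notify is called first (this is architecture specific: x86 calls the kprobe's fault handler, while on arm the function is empty)
kprobes_module_callback checks for and removes probe points registered on a module's functions when that module is unloaded.

Finally, init_kprobes sets the kprobes_initialized flag and initialization is complete. In summary: initialize the hash tables, populate the blacklist, run the architecture-specific initialization, and register the die and module notifiers.

Registering a kprobe instance

A probe module calls register_kprobe to register a kprobe instance with the kprobes subsystem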

int register_kprobe(struct kprobe *p)
{
    int ret;
    struct kprobe *old_p;
    struct module *probed_mod;
    kprobe_opcode_t *addr;

    /* Adjust probe address from symbol */
    addr = kprobe_addr(p);
    if (IS_ERR(addr))
        return PTR_ERR(addr);
    p->addr = addr;

    /* Prevent the same kprobe instance from being registered twice */
    ret = check_kprobe_rereg(p);
    if (ret)
        return ret;

    /* User can pass only KPROBE_FLAG_DISABLED to register_kprobe */
    p->flags &= KPROBE_FLAG_DISABLED;
    p->nmissed = 0;
    INIT_LIST_HEAD(&p->list);

    ret = check_kprobe_address_safe(p, &probed_mod);
    if (ret)
        return ret;

    mutex_lock(&kprobe_mutex);

    old_p = get_kprobe(p->addr);
    if (old_p) {
        /* Since this may unoptimize old_p, locking text_mutex. */
        ret = register_aggr_kprobe(old_p, p);
        goto out;
    }

    mutex_lock(&text_mutex);    /* Avoiding text modification */
    ret = prepare_kprobe(p);
    mutex_unlock(&text_mutex);
    if (ret)
        goto out;

    INIT_HLIST_NODE(&p->hlist);
    hlist_add_head_rcu(&p->hlist,
               &kprobe_table[hash_ptr(p->addr, KPROBE_HASH_BITS)]);

    /*
     * If kprobes_all_disarmed is false and the kprobe is not disabled,
     * call arm_kprobe(), which replaces the original instruction at the
     * probe point with the instruction that triggers the trap.
     */
    if (!kprobes_all_disarmed && !kprobe_disabled(p))
        arm_kprobe(p);

    /* Try to optimize kprobe */
    try_to_optimize_kprobe(p);

out:
    mutex_unlock(&kprobe_mutex);

    if (probed_mod)
        module_put(probed_mod);

    return ret;
}
EXPORT_SYMBOL_GPL(register_kprobe);

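Obtaining the probe address

The function first calls kprobe_addr to initialize the probe address p->addr. As the demo above shows, developers usually specify the function to probe by name rather than by address; kprobe_addr converts the function name into the final probe address: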

#define kprobe_lookup_name(name, addr) \
    addr = ((kprobe_opcode_t *)(kallsyms_lookup_name(name)))

static kprobe_opcode_t *kprobe_addr(struct kprobe *p)
{
    kprobe_opcode_t *addr = p->addr;

    /* Reject the case where both the symbol name and the address are set, or neither is */
    if ((p->symbol_name && p->addr) || (!p->symbol_name && !p->addr))
        goto invalid;

    /* If a symbol name is given, look up its virtual address with kallsyms_lookup_name() */
    if (p->symbol_name) {
        kprobe_lookup_name(p->symbol_name, addr);
        if (!addr)
            return ERR_PTR(-ENOENT);
    }

    /*
     * The final probe address is the function address plus offset.
     * offset is 0 in most cases, i.e. the function entry is probed.
     */
    addr = (kprobe_opcode_t *)(((char *)addr) + p->offset);
    if (addr)
        return addr;

invalid:
    return ERR_PTR(-EINVAL);
}
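
The validation rules of kprobe_addr can be exercised in userspace with a small model. This is a sketch under assumptions: the symbol table, its addresses, and the names model_sym, model_lookup, and model_kprobe_addr are hypothetical; the real function resolves names through kallsyms and returns an error pointer rather than an int.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical symbol table standing in for kallsyms. */
struct model_sym {
    const char *name;
    uintptr_t addr;
};

static const struct model_sym model_syms[] = {
    { "_do_fork", 0x1000 },   /* made-up addresses */
    { "do_exit",  0x2000 },
};

/* Returns the address of a known symbol, or 0 if it is not found. */
static uintptr_t model_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(model_syms) / sizeof(model_syms[0]); i++)
        if (strcmp(model_syms[i].name, name) == 0)
            return model_syms[i].addr;
    return 0;
}

/*
 * Stores the probe address in *out and returns 0, or returns a negative
 * errno. As in kprobe_addr(), exactly one of name and addr must be set.
 */
static int model_kprobe_addr(const char *name, uintptr_t addr,
                             unsigned int offset, uintptr_t *out)
{
    if ((name && addr) || (!name && !addr))
        return -EINVAL;
    if (name) {
        addr = model_lookup(name);
        if (!addr)
            return -ENOENT;
    }
    *out = addr + offset;
    return 0;
}
```

Passing a name with a non-zero offset models probing an instruction inside the function, while passing only an address models the case where the caller already knows where to probe.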

Checking that the probe address is valid

check_kprobe_address_safe is then called to check whether the address can be probed:

static int check_kprobe_address_safe(struct kprobe *p,
                     struct module **probed_mod)
{
    int ret;

    ret = arch_check_ftrace_location(p);
    if (ret)
        return ret;
    jump_label_lock();
    preempt_disable();

    /* Ensure it is not in reserved area nor out of text */
    if (!kernel_text_address((unsigned long) p->addr) ||
        within_kprobe_blacklist((unsigned long) p->addr) ||
        jump_label_text_reserved(p->addr, p->addr)) {
        ret = -EINVAL;
        goto out;
    }

    /* Check if are we probing a module */
    *probed_mod = __module_text_address((unsigned long) p->addr);
    if (*probed_mod) {
        /*
         * We must hold a refcount of the probed module while updating
         * its code to prohibit unexpected unloading.
         */
        if (unlikely(!try_module_get(*probed_mod))) {
            ret = -ENOENT;
            goto out;
        }

        /*
         * If the module freed .init.text, we couldn't insert
         * kprobes in there.
         */
        if (within_module_init((unsigned long)p->addr, *probed_mod) &&
            (*probed_mod)->state != MODULE_STATE_COMING) {
            module_put(*probed_mod);
            *probed_mod = NULL;
            ret = -ENOENT;
        }
    }
out:
    preempt_enable();
    jump_label_unlock();

    return ret;
}

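For the probe address to be valid, three main conditions must hold:

kernel_text_address confirms that the address lies in kernel text
within_kprobe_blacklist confirms that the address is not in the kprobe blacklist
jump_label_text_reserved confirms that the address is not in a range reserved by jump labels

Once these three conditions are met, the function checks whether the address belongs to the init_text or core_text section of a kernel module. If it does, the module's reference count is incremented so that the module cannot be unloaded unexpectedly. Registering a kprobe on a function in the init_text section of a module that has already finished loading is not allowed.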

struct module *__module_text_address(unsigned long addr)
{
    struct module *mod = __module_address(addr);
    if (mod) {
        /* Make sure it's within the text section. */
        if (!within(addr, mod->module_init, mod->init_text_size)
            && !within(addr, mod->module_core, mod->core_text_size))
            mod = NULL;
    }
    return mod;
}

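Once all of these checks pass, kernel preemption is re-enabled, the lock is released, and control returns to register_kprobe to continue the registration.

Replacing the probed instruction with the trap instruction

The global hash table is searched to see whether a kprobe is already registered at the same address. There are two cases:

If one is already registered, register_aggr_kprobe is called to continue the registration; see the section on aggregated kprobes
If this is the first registration, prepare_kprobe is called. It takes different paths depending on whether the address is an ftrace location; assuming ftrace is not in use here, it calls arch_prepare_kprobe directly to enter the architecture-specific registration path

Taking x86 as an example: ainsn stands for arch specific instruction. It is the copy of the original instruction at the probe point, used for single-stepping, and is highly architecture specific.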

int arch_prepare_kprobe(struct kprobe *p)
{
    if (alternatives_text_reserved(p->addr, p->addr))
        return -EINVAL;

    if (!can_probe((unsigned long)p->addr))
        return -EILSEQ;
    /* insn: must be on special executable page on x86. */
    p->ainsn.insn = get_insn_slot();
    if (!p->ainsn.insn)
        return -ENOMEM;

    return arch_copy_kprobe(p);
}

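arch_prepare_kprobe mainly does the following:

On SMP systems, the probe address must not be in a range used by smp-alternatives; non-SMP systems have no such restriction
It checks that the instruction at the probe address can be probed and calls get_insn_slot to allocate an instruction slot for the copy of the original instruction
It calls arch_copy_kprobe to copy the instruction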

static int arch_copy_kprobe(struct kprobe *p)
{
    int ret;

    /* Copy an instruction with recovering if other optprobe modifies it.*/
    ret = __copy_instruction(p->ainsn.insn, p->addr);
    if (!ret)
        return -EINVAL;

    /*
     * __copy_instruction can modify the displacement of the instruction,
     * but it doesn't affect boostable check.
     */
    if (can_boost(p->ainsn.insn))
        p->ainsn.boostable = 0;
    else
        p->ainsn.boostable = -1;

    /* Check whether the instruction modifies Interrupt Flag or not */
    p->ainsn.if_modifier = is_IF_modifier(p->ainsn.insn);

    /* Also, displacement change doesn't affect the first byte */
    p->opcode = p->ainsn.insn[0];

    return 0;
}

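arch_copy_kprobe mainly does the following:

Calls __copy_instruction to copy the instruction at the probe address kprobe->addr into kprobe->ainsn.insn (the instruction may be adjusted in the process)
Initializes the kprobe->ainsn structure
Saves the first byte of the instruction in kprobe->opcode (kprobe_opcode_t is u8 on x86)

After these steps the probed instruction has been copied and saved. Once the architecture-specific preparation is complete, register_kprobe initializes the kprobe's hlist field and adds it to the global hash table. If kprobes_all_disarmed is false and the kprobe is not disabled, arm_kprobe is called to replace the original instruction at the probe point with the instruction that triggers the trap.

Aggregated kprobes

That completes the analysis of the normal registration path. Let us now go back and look at what happens when a new kprobe is registered at a probe point that already has one, that is, the register_aggr_kprobe function: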
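
The global hash table keyed by probe address can be modeled in a few lines of userspace C. This is only a sketch: the 6-bit table size, the multiplicative hash (a stand-in for the kernel's hash_ptr()), and the names model_kprobe, model_add, and model_get are assumptions made for illustration, and the real table uses RCU-protected hlists rather than a plain linked list.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_HASH_BITS  6
#define MODEL_TABLE_SIZE (1u << MODEL_HASH_BITS)

/* Hypothetical probe record: only the fields needed for the lookup. */
struct model_kprobe {
    uintptr_t addr;
    struct model_kprobe *next;   /* chain within one bucket */
};

static struct model_kprobe *model_table[MODEL_TABLE_SIZE];

/* Simple multiplicative hash; not the kernel's actual hash function. */
static unsigned int model_hash(uintptr_t addr)
{
    uint32_t v = (uint32_t)(addr ^ ((uint64_t)addr >> 32));

    return (v * 0x61C88647u) >> (32 - MODEL_HASH_BITS);
}

/* Registration: link the probe at the head of its bucket. */
static void model_add(struct model_kprobe *p)
{
    unsigned int b = model_hash(p->addr);

    p->next = model_table[b];
    model_table[b] = p;
}

/* Lookup on a breakpoint hit: find the probe registered at addr, if any. */
static struct model_kprobe *model_get(uintptr_t addr)
{
    struct model_kprobe *p;

    for (p = model_table[model_hash(addr)]; p; p = p->next)
        if (p->addr == addr)
            return p;
    return NULL;
}
```

Keying the table by address is what lets the breakpoint handler map the trapping address back to its probe with a single bucket walk.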

/*
 * This is the second or subsequent kprobe at the address - handle
 * the intricacies
 */
static int register_aggr_kprobe(struct kprobe *orig_p, struct kprobe *p)
{
    int ret = 0;
    struct kprobe *ap = orig_p;

    /* For preparing optimization, jump_label_text_reserved() is called */
    jump_label_lock();
    /*
     * Get online CPUs to avoid text_mutex deadlock.with stop machine,
     * which is invoked by unoptimize_kprobe() in add_new_kprobe()
     */
    get_online_cpus();
    mutex_lock(&text_mutex);

    if (!kprobe_aggrprobe(orig_p)) {
        /* If orig_p is not an aggr_kprobe, create new aggr_kprobe. */
        ap = alloc_aggr_kprobe(orig_p);
        if (!ap) {
            ret = -ENOMEM;
            goto out;
        }
        init_aggr_kprobe(ap, orig_p);
    } else if (kprobe_unused(ap))
        /* This probe is going to die. Rescue it */
        reuse_unused_kprobe(ap);

    if (kprobe_gone(ap)) {
        /*
         * Attempting to insert new probe at the same location that
         * had a probe in the module vaddr area which already
         * freed. So, the instruction slot has already been
         * released. We need a new slot for the new probe.
         */
        ret = arch_prepare_kprobe(ap);
        if (ret)
            /*
             * Even if fail to allocate new slot, don't need to
             * free aggr_probe. It will be used next time, or
             * freed by unregister_kprobe.
             */
            goto out;

        /* Prepare optimized instructions if possible. */
        prepare_optimized_kprobe(ap);

        /*
         * Clear gone flag to prevent allocating new slot again, and
         * set disabled flag because it is not armed yet.
         */
        ap->flags = (ap->flags & ~KPROBE_FLAG_GONE)
                | KPROBE_FLAG_DISABLED;
    }

    /* Copy ap's insn slot to p */
    copy_kprobe(ap, p);
    ret = add_new_kprobe(ap, p);

out:
    mutex_unlock(&text_mutex);
    put_online_cpus();
    jump_label_unlock();

    if (ret == 0 && kprobe_disabled(ap) && !kprobe_disabled(p)) {
        ap->flags &= ~KPROBE_FLAG_DISABLED;
        if (!kprobes_all_disarmed)
            /* Arm the breakpoint again. */
            arm_kprobe(ap);
    }
    return ret;
}

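As seen earlier, this function is called when more than one kprobe instance is registered at the same address. It introduces the concept of a kprobe aggregator: a single kprobe instance that takes over all the kprobes registered at that address.

That completes the analysis of kprobe registration. The following sections look at how the registered callbacks are invoked and how the probed instruction is single-stepped.

Triggering kprobes and invoking callbacks

As the registration flow in register_kprobe showed, the instruction at the entry of the probed function has been replaced with the architecture-specific BREAKPOINT_INSTRUCTION. When normal execution reaches it, an exception is raised and the architecture-specific exception handler is entered; the registered callbacks and the single-stepping of the probed instruction all happen on this path. This section analyzes how probing and callbacks work on x86.

On x86_64, executing the BREAKPOINT_INSTRUCTION planted earlier raises the INT3 exception, which leads to do_int3.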
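
The aggregator idea can be illustrated with a userspace sketch: one object owns the breakpoint and fans each hit out to every child probe. The fixed-size array and the names model_child, model_aggr, model_add_child, model_aggr_pre_handler, and model_count_hit are hypothetical simplifications; the kernel links the children through the list field of struct kprobe instead.

```c
#include <stddef.h>

#define MODEL_MAX_CHILDREN 8

/* Hypothetical child probe: a handler plus a hit counter. */
struct model_child {
    int (*pre_handler)(struct model_child *self);
    int hits;
};

/* Hypothetical aggregator: the one probe that owns the breakpoint. */
struct model_aggr {
    struct model_child *children[MODEL_MAX_CHILDREN];
    size_t count;
};

/* Attach another probe to an address that is already probed. */
static int model_add_child(struct model_aggr *ap, struct model_child *p)
{
    if (ap->count == MODEL_MAX_CHILDREN)
        return -1;
    ap->children[ap->count++] = p;
    return 0;
}

/* Aggregate pre-handler: call every child's pre_handler in order. */
static void model_aggr_pre_handler(struct model_aggr *ap)
{
    for (size_t i = 0; i < ap->count; i++)
        if (ap->children[i]->pre_handler)
            ap->children[i]->pre_handler(ap->children[i]);
}

/* Example child handler that just counts hits. */
static int model_count_hit(struct model_child *self)
{
    self->hits++;
    return 0;
}
```

With this arrangement only one breakpoint and one copied instruction exist per address, no matter how many probes are attached to it.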

/* May run on IST stack. */
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
  // ...

#ifdef CONFIG_KPROBES
    if (kprobe_int3_handler(regs))
        goto exit;
#endif
    // ...
}
NOKPROBE_SYMBOL(do_int3);

do_int3 does quite a lot, but the only part related to kprobes is the one shown above. Next, kprobe_int3_handler:

/*
 * Interrupts are disabled on entry as trap3 is an interrupt gate and they
 * remain disabled throughout this function.
 */
int kprobe_int3_handler(struct pt_regs *regs)
{
    kprobe_opcode_t *addr;
    struct kprobe *p;
    struct kprobe_ctlblk *kcb;

    if (user_mode(regs))
        return 0;

    addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t));
    /*
     * We don't want to be preempted for the entire
     * duration of kprobe processing. We conditionally
     * re-enable preemption at the end of this function,
     * and also in reenter_kprobe() and setup_singlestep().
     */
    preempt_disable();

    kcb = get_kprobe_ctlblk();
    p = get_kprobe(addr);

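Local interrupts remain disabled while the kprobe is handled. user_mode is called to make sure that the int3 being handled was raised while executing in kernel mode (kprobes are never triggered from user mode). The check is needed because, unlike arm, where a dedicated undefined-instruction hook is registered, do_int3 here is far more general and is not designed solely for kprobes.
The address of the probed instruction is then stored in addr (Intel defines int3 as a trap, so when the exception occurs the saved EIP points to the instruction after the one that raised it), and kernel preemption is disabled. The comment notes that preemption is conditionally re-enabled in reenter_kprobe and when single-stepping is set up.

Next, as on arm, the function obtains the current CPU's kprobe_ctlblk control structure and the kprobe instance p to handle, and then branches according to the situation. Before continuing, here is the definition of kprobe_ctlblk on x86_64: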
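
The address arithmetic is small enough to check by hand. A minimal sketch, with model_opcode_t and model_probe_addr_from_ip as hypothetical names (the one-byte opcode type follows the note earlier in this article that kprobe_opcode_t is u8 on x86):

```c
#include <stdint.h>

typedef uint8_t model_opcode_t;  /* one byte, like kprobe_opcode_t on x86 */

/* After int3 traps, the saved ip points just past the one-byte breakpoint. */
static uint64_t model_probe_addr_from_ip(uint64_t ip)
{
    return ip - sizeof(model_opcode_t);
}
```

Applied to the demo log earlier in this article, ip = ffffffff94aa0451 gives ffffffff94aa0450, which is the _do_fork address reported by /proc/kallsyms.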

/* per-cpu kprobe control block */
struct kprobe_ctlblk {
    unsigned long kprobe_status;
    unsigned long kprobe_old_flags;
    unsigned long kprobe_saved_flags;
    unsigned long *jprobe_saved_sp;
    struct pt_regs jprobe_saved_regs;
    kprobe_opcode_t jprobes_stack[MAX_STACK_SIZE];
    struct prev_kprobe prev_kprobe;
};

It holds some kprobe state along with fields used by jprobe. The fields of interest here are kprobe_status and prev_kprobe. kprobe_status is the current handling state of the kprobe and takes one of the following values:

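#define KPROBE_HIT_ACTIVE    0x00000001      /* kprobe handling has started */
#define KPROBE_HIT_SS        0x00000002      /* single-stepping the probed instruction */
#define KPROBE_REENTER       0x00000004      /* a kprobe was hit while handling another */
#define KPROBE_HIT_SSDONE    0x00000008      /* single-stepping has finished */

prev_kprobe saves the kprobe instance and state that were being handled when a kprobe is re-entered. The kernel defines one global variable of this control block type for each CPU.

kprobe_running is then called to get the kprobe currently being handled on this CPU: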

/* kprobe_running() will just return the current_kprobe on this CPU */
static inline struct kprobe *kprobe_running(void)
{
    return (__this_cpu_read(current_kprobe));
}

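current_kprobe is also a per-CPU variable. It holds the kprobe instance that the current CPU is handling, or NULL if there is none.

p exists and current_kprobe exists

This is the kprobe re-entry case, which is handled separately by reenter_kprobe: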

    if (kprobe_running()) {
        if (reenter_kprobe(p, regs, kcb))
            return 1;

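This flow is very similar to the arm implementation. The function finally returns 1 and do_int3 returns immediately, meaning that the exception was intercepted and handled by kprobes and no other branch needs to process it.

The difference is that the KPROBE_HIT_SS state does not raise a BUG here. Like KPROBE_HIT_SSDONE and KPROBE_HIT_ACTIVE, it increments nmissed and calls setup_singlestep to enter the single-step path (the last argument is 1 here; for the re-entry case the function sets kprobe_status to KPROBE_REENTER and calls save_previous_kprobe to save the current kprobe). The KPROBE_REENTER state still raises a BUG.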

/*
 * We have reentered the kprobe_handler(), since another probe was hit while
 * within the handler. We save the original kprobes variables and just single
 * step on the instruction of the new probe without calling any user handlers.
 */
static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
              struct kprobe_ctlblk *kcb)
{
    switch (kcb->kprobe_status) {
    case KPROBE_HIT_SSDONE:
    case KPROBE_HIT_ACTIVE:
    case KPROBE_HIT_SS:
        kprobes_inc_nmissed_count(p);
        setup_singlestep(p, regs, kcb, 1);
        break;
    case KPROBE_REENTER:
        /* A probe has been hit in the codepath leading up to, or just
         * after, single-stepping of a probed instruction. This entire
         * codepath should strictly reside in .kprobes.text section.
         * Raise a BUG or we'll continue in an endless reentering loop
         * and eventually a stack overflow.
         */
        printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
               p->addr);
        dump_kprobe(p);
        BUG();
    default:
        /* impossible cases */
        WARN_ON(1);
        return 0;
    }

    return 1;
}
NOKPROBE_SYMBOL(reenter_kprobe);
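
The re-entry rules can be restated as a pure function over the status value. This is a userspace model with hypothetical names (model_status, model_reenter_result, model_reenter); it ignores the optimized-probe shortcut in setup_singlestep and reports the BUG() case as a flag instead of halting.

```c
#include <stdbool.h>

/* Same numeric values as the KPROBE_* status macros shown earlier. */
enum model_status {
    MODEL_HIT_ACTIVE = 0x1,
    MODEL_HIT_SS     = 0x2,
    MODEL_REENTER    = 0x4,
    MODEL_HIT_SSDONE = 0x8,
};

struct model_reenter_result {
    bool handled;                   /* would reenter_kprobe() return 1? */
    bool fatal;                     /* would the kernel reach BUG()? */
    enum model_status next_status;  /* status after the call */
};

/* Models what happens when a second probe is hit while one is in progress. */
static struct model_reenter_result
model_reenter(enum model_status current, unsigned long *nmissed)
{
    struct model_reenter_result r = { false, false, current };

    switch (current) {
    case MODEL_HIT_SSDONE:
    case MODEL_HIT_ACTIVE:
    case MODEL_HIT_SS:
        (*nmissed)++;                   /* the new probe's handlers are skipped */
        r.handled = true;
        r.next_status = MODEL_REENTER;  /* set by setup_singlestep(..., 1) */
        break;
    case MODEL_REENTER:
        r.fatal = true;                 /* nested a second time: unrecoverable */
        break;
    }
    return r;
}
```

The model makes the asymmetry visible: one level of nesting is tolerated at the cost of a missed hit, while a second level is treated as unrecoverable.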

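p exists but current_kprobe does not

This is the most common kprobe execution path:

Call set_current_kprobe to bind p as the kprobe currently being handled
Run the pre_handler callback if one is registered
Call setup_singlestep to start single-stepping, then return 1 directly

Note that, unlike the arm implementation, post_handler is not called here and the kprobe binding is not released, because x86_64 invokes post_handler in a different way, described later.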

} else {
    set_current_kprobe(p, regs, kcb);
    kcb->kprobe_status = KPROBE_HIT_ACTIVE;

    /*
     * If we have no pre-handler or it returned 0, we
     * continue with normal processing.  If we have a
     * pre-handler and it returned non-zero, it prepped
     * for calling the break_handler below on re-entry
     * for jprobe processing, so get out doing nothing
     * more here.
     */
    if (!p->pre_handler || !p->pre_handler(p, regs))
        setup_singlestep(p, regs, kcb, 0);
    return 1;
}

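p does not exist and the instruction at the probed address is not BREAKPOINT_INSTRUCTION

This means the kprobe has probably been unregistered by another CPU, so the original instruction should simply be executed. regs->ip is set to addr, kernel preemption is re-enabled, and 1 is returned.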

    } else if (*addr != BREAKPOINT_INSTRUCTION) {
        /*
         * The breakpoint instruction was removed right
         * after we hit it.  Another cpu has removed
         * either a probepoint or a debugger breakpoint
         * at this address.  In either case, no further
         * handling of this interrupt is appropriate.
         * Back up over the (now missing) int3 and run
         * the original instruction.
         */
        regs->ip = (unsigned long)addr;
        preempt_enable_no_resched();
        return 1;

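p does not exist but current_kprobe does

This case is generally used to implement jprobe. The handler calls the break_handler callback of current_kprobe and, if break_handler returns non-zero, sets up single-stepping (unless skip_singlestep has already dealt with the instruction) and finally returns 1.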

} else if (kprobe_running()) {
    p = __this_cpu_read(current_kprobe);
    if (p->break_handler && p->break_handler(p, regs)) {
        if (!skip_singlestep(p, regs, kcb))
            setup_singlestep(p, regs, kcb, 0);
        return 1;
    }
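
The branches of the int3 handler discussed above can be summarized as a small decision function. The following is only a user-space sketch, not kernel code: the enum, its labels, and classify_int3 are hypothetical names introduced for illustration, and the final fall-through case (the breakpoint does not belong to kprobes at all) is not shown in the excerpts here and is included as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the outcomes discussed in the text. */
enum int3_action {
    ACT_REENTER,        /* p found, another kprobe is already running   */
    ACT_NORMAL_HIT,     /* p found, no kprobe running: the common path  */
    ACT_RUN_ORIGINAL,   /* no p, breakpoint at addr was already removed */
    ACT_BREAK_HANDLER,  /* no p, a kprobe is running: jprobe path       */
    ACT_NOT_OURS        /* assumed fall-through: not a kprobe hit       */
};

/* Mirror the order in which the handler tests its conditions. */
enum int3_action classify_int3(bool p_found, bool kprobe_running,
                               bool addr_is_breakpoint)
{
    if (p_found)
        return kprobe_running ? ACT_REENTER : ACT_NORMAL_HIT;
    if (!addr_is_breakpoint)
        return ACT_RUN_ORIGINAL;
    if (kprobe_running)
        return ACT_BREAK_HANDLER;
    return ACT_NOT_OURS;
}
```

For example, classify_int3(false, true, true) yields ACT_BREAK_HANDLER, which corresponds to the jprobe case above. Note that the order of the checks matters: the "breakpoint already removed" test comes before the kprobe_running() test.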

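Single-stepping

Single-stepping simply means executing the original instruction at the probe point. It corresponds to the setup_singlestep function mentioned earlier and involves a fair amount of CPU-architecture-specific knowledge.

In principle it works as follows:

When execution reaches an instruction that should be single-stepped, a CPU exception is raised before the instruction executes
In the register state that will be restored on exception return, the TF (trap flag) bit of EFLAGS is set to 1 and the IF (interrupt enable flag) bit is cleared to 0, and EIP is pointed at the instruction to single-step
Once the single-stepped instruction has executed, the CPU automatically raises a debug exception (because TF is set)
Kprobes uses this debug exception to run post_handler()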
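
The flag manipulation in the second step can be sketched in user space as follows. This is a minimal model under stated assumptions: struct fake_regs and arm_single_step are hypothetical stand-ins for the saved register frame and the kernel logic, while the bit values are the architectural ones (TF is bit 8 of EFLAGS, IF is bit 9).

```c
#include <assert.h>

/* Architectural bit positions in x86 EFLAGS. */
#define EFLAGS_TF 0x00000100UL  /* trap flag: trap after one instruction */
#define EFLAGS_IF 0x00000200UL  /* interrupt enable flag                 */

/* Hypothetical stand-in for the two saved-register fields used here. */
struct fake_regs {
    unsigned long flags;
    unsigned long ip;
};

/* Arm a single step of the instruction located at insn_addr. */
void arm_single_step(struct fake_regs *regs, unsigned long insn_addr)
{
    regs->flags |= EFLAGS_TF;    /* raise a debug exception after one instruction */
    regs->flags &= ~EFLAGS_IF;   /* keep maskable interrupts disabled meanwhile   */
    regs->ip = insn_addr;        /* resume execution at the instruction to step   */
}
```

Because the changes are made to the saved register frame rather than to the live registers, they only take effect when the exception handler returns.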

Here is a brief look at the code:

static void setup_singlestep(struct kprobe *p, struct pt_regs *regs,
                 struct kprobe_ctlblk *kcb, int reenter)
{
    if (setup_detour_execution(p, regs, reenter))
        return;

  // ...

    if (reenter) {
        save_previous_kprobe(kcb);
        set_current_kprobe(p, regs, kcb);
        kcb->kprobe_status = KPROBE_REENTER;
    } else
        kcb->kprobe_status = KPROBE_HIT_SS;

    /* Prepare real single stepping */
    clear_btf();
    regs->flags |= X86_EFLAGS_TF;
    regs->flags &= ~X86_EFLAGS_IF;
    /* single step inline if the instruction is an int3 */
    if (p->opcode == BREAKPOINT_INSTRUCTION)
        regs->ip = (unsigned long)p->addr;
    else
        regs->ip = (unsigned long)p->ainsn.insn;
}

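As described earlier, the last parameter of the function, reenter, indicates whether this is a reentrant hit:

For a reentrant hit:
- Call save_previous_kprobe to save the kprobe that is currently running
- Bind p as current_kprobe
- Set kprobe_status to KPROBE_REENTER
For a non-reentrant hit, set kprobe_status to KPROBE_HIT_SS

Then single-stepping is prepared:

Set the TF bit and clear the IF bit in regs->flags
Change the instruction pointer that the int3 exception returns to so that it points at the saved copy of the probed instruction, p->ainsn.insn (or at p->addr itself if the probed instruction is an int3). These settings take effect when the int3 exception returns, so the saved original instruction is executed immediately. Note that it runs in the original context from before int3 was triggered, so the original instruction can be executed directly and no special emulation is needed
After this function returns, do_int3 returns immediately. Because the CPU flags register has been modified, a debug exception is raised right after the probed instruction has been single-stepped, and the debug exception handler do_debug is entered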

dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
{
  // ...
#ifdef CONFIG_KPROBES
    if (kprobe_debug_handler(regs))
        goto exit;
#endif
  // ...

exit:
    ist_exit(regs, prev_state);
}

/*
 * Interrupts are disabled on entry as trap1 is an interrupt gate and they
 * remain disabled throughout this function.
 */
int kprobe_debug_handler(struct pt_regs *regs)
{
    struct kprobe *cur = kprobe_running();
    struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();

    if (!cur)
        return 0;

    resume_execution(cur, regs, kcb);
    regs->flags |= kcb->kprobe_saved_flags;

    if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
        kcb->kprobe_status = KPROBE_HIT_SSDONE;
        cur->post_handler(cur, regs, 0);
    }

    /* Restore back the original saved kprobes variables and continue. */
    if (kcb->kprobe_status == KPROBE_REENTER) {
        restore_previous_kprobe(kcb);
        goto out;
    }
    reset_current_kprobe();
out:
    preempt_enable_no_resched();

    /*
     * if somebody else is singlestepping across a probe point, flags
     * will have TF set, in which case, continue the remaining processing
     * of do_debug, as if this is not a probe hit.
     */
    if (regs->flags & X86_EFLAGS_TF)
        return 0;

    return 1;
}
NOKPROBE_SYMBOL(kprobe_debug_handler);

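Call resume_execution to set the instruction that the debug exception returns to as the one following the probed instruction, so that the program continues along its normal path after the exception returns;
Restore the flags saved before the kprobe ran;
If the kprobe is not reentrant and a post_handler callback is registered, set kprobe_status to KPROBE_HIT_SSDONE and call post_handler;
If the kprobe is reentrant, call restore_previous_kprobe to restore the previously saved kprobe; otherwise call reset_current_kprobe to unbind this kprobe from current_kprobe;
Re-enable kernel preemption. If TF is still set in the restored flags, someone else was single-stepping across the probe point, so do_debug still has work to do and the function returns 0; otherwise it returns 1.

On x86_64 the CPU's single-step facility is used so that the original instruction executes in its normal original context, while the two callbacks run in the int3 and debug exception handlers respectively.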
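
The control flow of kprobe_debug_handler can be modeled in user space as a small state machine. The sketch below is an approximation for illustration only: struct dbg_model and debug_handler_model are hypothetical names, the effect of resume_execution is assumed to be already reflected in the flags field, and post_handler is reduced to a call counter.

```c
#include <assert.h>
#include <stdbool.h>

#define EFLAGS_TF 0x100UL

enum dbg_status { DBG_HIT_ACTIVE, DBG_HIT_SS, DBG_REENTER, DBG_HIT_SSDONE };

/* Hypothetical model of the per-CPU kprobe state used by the handler. */
struct dbg_model {
    bool running;               /* is a kprobe bound on this CPU?      */
    enum dbg_status status;
    unsigned long flags;        /* flags after resume_execution        */
    unsigned long saved_flags;  /* flags saved when the probe was hit  */
    bool has_post_handler;
    int post_calls;             /* how many times post_handler ran     */
    bool restored_previous;     /* restore_previous_kprobe was called  */
};

int debug_handler_model(struct dbg_model *m)
{
    if (!m->running)
        return 0;                       /* not a kprobe single step */

    m->flags |= m->saved_flags;

    if (m->status != DBG_REENTER && m->has_post_handler) {
        m->status = DBG_HIT_SSDONE;
        m->post_calls++;                /* stands in for post_handler() */
    }

    if (m->status == DBG_REENTER)
        m->restored_previous = true;    /* previous kprobe stays bound  */
    else
        m->running = false;             /* reset_current_kprobe         */

    /* TF still set: someone else is single-stepping, let do_debug go on. */
    return (m->flags & EFLAGS_TF) ? 0 : 1;
}
```

The model makes two properties easy to see: post_handler is skipped for a reentrant hit, and the return value depends only on whether TF is present in the restored flags.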

This completes the analysis of the general kprobe processing flow. The one remaining callback, fault_handler, is covered next.

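Fault callback

The fault callback fault_handler is called when a memory exception is triggered while pre_handler, the single step, or post_handler is executing. The function that invokes it is kprobe_fault_handler, which is also architecture-specific. The fault callback path is described below using x86 as the example:

do_page_fault -> __do_page_fault -> kprobes_fault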

static nokprobe_inline int kprobes_fault(struct pt_regs *regs)
{
    int ret = 0;

    /* kprobe_running() needs smp_processor_id() */
    if (kprobes_built_in() && !user_mode(regs)) {
        preempt_disable();
        if (kprobe_running() && kprobe_fault_handler(regs, 14))
            ret = 1;
        preempt_enable();
    }

    return ret;
}

So when a page fault occurs while a kprobe is being handled, kprobe_fault_handler is called to deal with it.

do_general_protection -> notify_die -> kprobe_exceptions_notify

int kprobe_exceptions_notify(struct notifier_block *self, unsigned long val,
                 void *data)
{
    struct die_args *args = data;
    int ret = NOTIFY_DONE;

    if (args->regs && user_mode(args->regs))
        return ret;

    if (val == DIE_GPF) {
        /*
         * To be potentially processing a kprobe fault and to
         * trust the result from kprobe_running(), we have
         * be non-preemptible.
         */
        if (!preemptible() && kprobe_running() &&
            kprobe_fault_handler(args->regs, args->trapnr))
            ret = NOTIFY_STOP;
    }
    return ret;
}

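As described earlier, init_kprobes registers kprobe_exceptions_nb on the die notifier chain during initialization, with kprobe_exceptions_notify as its callback. When the kernel issues a notify_die of type DIE_GPF, this function calls kprobe_fault_handler to handle it.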
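
The difference between returning NOTIFY_DONE and NOTIFY_STOP can be illustrated with a toy notifier chain in user space. This is a simplified sketch, not the kernel's implementation: call_chain, kprobe_like_cb, other_cb, and DIE_GPF_MODEL are hypothetical names, and only the NOTIFY_* constants are taken from the kernel's notifier header.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as defined in the kernel's include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

#define DIE_GPF_MODEL 1UL   /* hypothetical event number for this sketch */

typedef int (*notifier_fn)(unsigned long val, void *data);

/* Call each callback in order; stop once one sets NOTIFY_STOP_MASK. */
int call_chain(notifier_fn *chain, size_t n, unsigned long val,
               void *data, size_t *visited)
{
    int ret = NOTIFY_DONE;

    *visited = 0;
    for (size_t i = 0; i < n; i++) {
        ret = chain[i](val, data);
        (*visited)++;
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;
}

/* Stops the chain only for a GPF event that the fault path handled. */
int kprobe_like_cb(unsigned long val, void *data)
{
    int *fault_handled = data;

    if (val == DIE_GPF_MODEL && *fault_handled)
        return NOTIFY_STOP;
    return NOTIFY_DONE;
}

/* A second subscriber that never claims the event. */
int other_cb(unsigned long val, void *data)
{
    (void)val;
    (void)data;
    return NOTIFY_DONE;
}
```

When the kprobe-like callback reports the fault as handled, later subscribers on the chain are not called; otherwise the event is passed on unchanged.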

Here is a brief look at the x86_64 implementation of kprobe_fault_handler:

int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
{
    struct kprobe *cur = kprobe_running();
    struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();

    if (unlikely(regs->ip == (unsigned long)cur->ainsn.insn)) {
        /* This must happen on single-stepping */
        WARN_ON(kcb->kprobe_status != KPROBE_HIT_SS &&
            kcb->kprobe_status != KPROBE_REENTER);
        /*
         * We are here because the instruction being single
         * stepped caused a page fault. We reset the current
         * kprobe and the ip points back to the probe address
         * and allow the page fault handler to continue as a
         * normal page fault.
         */
        regs->ip = (unsigned long)cur->addr;
        regs->flags |= kcb->kprobe_old_flags;
        if (kcb->kprobe_status == KPROBE_REENTER)
            restore_previous_kprobe(kcb);
        else
            reset_current_kprobe();
        preempt_enable_no_resched();
    } else if (kcb->kprobe_status == KPROBE_HIT_ACTIVE ||
           kcb->kprobe_status == KPROBE_HIT_SSDONE) {
        /*
         * We increment the nmissed count for accounting,
         * we can also use npre/npostfault count for accounting
         * these specific fault cases.
         */
        kprobes_inc_nmissed_count(cur);

        /*
         * We come here because instructions in the pre/post
         * handler caused the page_fault, this could happen
         * if handler tries to access user space by
         * copy_from_user(), get_user() etc. Let the
         * user-specified handler try to fix it first.
         */
        if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
            return 1;

        /*
         * In case the user-specified fault handler returned
         * zero, try to fix up.
         */
        if (fixup_exception(regs))
            return 1;

        /*
         * fixup routine could not handle it,
         * Let do_page_fault() fix it.
         */
    }

    return 0;
}
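
Reading the function above, a fault raised by the single-stepped instruction copy is rolled back and handed to the normal fault path, while a fault raised inside pre_handler or post_handler is first offered to the user's fault_handler and then to fixup_exception. The decision logic can be sketched in user space as below. All names (struct flt_input, fault_model, and the FLT_ labels) are hypothetical, and the outcomes of fault_handler and fixup_exception are reduced to boolean inputs.

```c
#include <assert.h>
#include <stdbool.h>

enum flt_status { FLT_HIT_ACTIVE, FLT_HIT_SS, FLT_REENTER, FLT_HIT_SSDONE };

enum flt_result {
    FLT_PASS_ON,   /* return 0: let the normal fault path handle it */
    FLT_HANDLED    /* return 1: fault consumed by the kprobe path   */
};

/* Hypothetical summary of the state the handler inspects. */
struct flt_input {
    bool ip_in_insn_copy;      /* fault came from the single-stepped copy   */
    enum flt_status status;
    bool user_handler_fixed;   /* fault_handler exists and returned non-zero */
    bool fixup_found;          /* fixup_exception() succeeded               */
};

enum flt_result fault_model(const struct flt_input *in, int *nmissed)
{
    if (in->ip_in_insn_copy)
        return FLT_PASS_ON;    /* probe state rolled back, normal fault continues */

    if (in->status == FLT_HIT_ACTIVE || in->status == FLT_HIT_SSDONE) {
        (*nmissed)++;                   /* accounting only */
        if (in->user_handler_fixed)
            return FLT_HANDLED;         /* user callback fixed it  */
        if (in->fixup_found)
            return FLT_HANDLED;         /* exception table fixed it */
    }
    return FLT_PASS_ON;
}
```

In this model the user's fault_handler is consulted only when the status is KPROBE_HIT_ACTIVE or KPROBE_HIT_SSDONE, that is, while a pre_handler or post_handler is running.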

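Summary

As a means of tracing and debugging kernel code, kprobes lets developers dynamically trace the execution of kernel functions. Compared with traditional approaches such as adding kernel log statements, it is simple to operate, flexible to use, and minimally disruptive to the original code. This article first introduced the technical background of kprobes, then described how to use kprobe, and finally analyzed its principles and implementation on x86_64 in detail through the source code.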