Garden | eBPF Map Operations

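An eBPF map is the bridge for exchanging data between user space and kernel space. It stores data in the kernel as key/value pairs and can be accessed by any BPF program that knows about it. A map is created through the bpf() system call, which returns a file descriptor; a program running in user space then uses that file descriptor to access and operate on the map. eBPF maps support many data structure types, which were briefly introduced in the previous post. This post shows how to use them through code examples; all of the code can be found on my GitHub.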

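Creating a BPF Map

Originally, BPF maps were always created with the bpf system call, passing BPF_MAP_CREATE as the first argument. That was covered in the previous post, so I won't go into detail here.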

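union bpf_attr my_map_attr = {
  .map_type = BPF_MAP_TYPE_ARRAY,
  .key_size = sizeof(int),
  .value_size = sizeof(int),
  .max_entries = 1024,
  // no BPF_F_NO_PREALLOC here: array maps are always preallocated and reject that flag
};

// bpf() stands for a thin wrapper around syscall(__NR_bpf, ...); glibc does not provide one
int fd = bpf(BPF_MAP_CREATE, &my_map_attr, sizeof(my_map_attr));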
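
Since glibc does not ship a bpf() function, a call like the one above is usually made through syscall(2). Here is a minimal sketch of such a wrapper; sys_bpf and create_array_map are names made up for this example and are not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: glibc does not export bpf(), so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: returns a map fd on success, or -1 with errno set. */
static int create_array_map(unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    assert(value_size > 0 && max_entries > 0);
    memset(&attr, 0, sizeof(attr));   /* fields we do not use must be zero */
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(int);   /* array keys are always 4 bytes */
    attr.value_size  = value_size;
    attr.max_entries = max_entries;

    return (int)sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

Calling create_array_map(sizeof(int), 1024) needs the usual BPF privileges; without them the call is expected to fail with -1 and errno set (typically EPERM).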

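Compared with calling the bpf system call directly, what is commonly used in practice is the SEC("maps") syntactic sugar, where declaring the map is enough to have it created: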

struct bpf_map_def SEC("maps") my_bpf_map = {
  .type       = BPF_MAP_TYPE_HASH, 
  .key_size   = sizeof(int),
  .value_size   = sizeof(int),
  .max_entries = 100,
  .map_flags   = BPF_F_NO_PREALLOC,
};
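
For comparison, newer libbpf releases declare maps in a BTF-described .maps section instead of struct bpf_map_def (the Prog Array sample later in this post already uses that style). Below is a minimal sketch of the same map in that form. The three macros are local stand-ins that, as far as I know, have the same shape as the ones in libbpf's bpf_helpers.h; they are defined here only so the snippet compiles with an ordinary C compiler. In a real BPF program you would include the libbpf header instead.

```c
#include <linux/bpf.h>

/* Local stand-ins for the libbpf macros (assumption: same shape as in
 * bpf_helpers.h). Integer values are encoded as array sizes and types as
 * pointer types, so the loader can recover them from type information. */
#define __uint(name, val) int (*name)[val]
#define __type(name, val) __typeof__(val) *name
#define SEC(name) __attribute__((section(name), used))

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, int);
    __type(value, int);
    __uint(max_entries, 100);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} my_bpf_map SEC(".maps");
```

Nothing is stored in the struct itself; for instance, max_entries is a pointer to an array of 100 ints, and the number 100 lives only in that type.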

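The key is SEC("maps"), an ELF convention. It works like this:

Declare the ELF section attribute SEC("maps").
bpf_load.c, a user-space loader in the kernel source tree under samples/bpf, scans all section information in the object file. Among the sections it looks for is SEC("maps"), which is used to create BPF maps. The comment in the related header says so: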

// https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.h#L41
/* parses elf file compiled by llvm .c->.o
 * . parses 'maps' section and creates maps via BPF syscall // this is the step that matters here
 * . parses 'license' section and passes it to syscall
 * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
 *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
 * . loads eBPF programs via BPF syscall
 *
 * One ELF file can contain multiple BPF programs which will be loaded
 * and their FDs stored stored in prog_fd array
 *
 * returns zero on success
 */
int load_bpf_file(char *path);

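After bpf_load.c finds SEC("maps"), the map handling is done by the load_maps function, in which bpf_create_map_node() and bpf_create_map_in_map_node() are the key functions that create the BPF maps.
Behind both is code defined in tools/lib/bpf/bpf.c in the kernel source tree, which issues the system call with the BPF_MAP_CREATE command mentioned above.
Finally, at build time, bpf_load.o is added as a dependency and linked into the final executable, so that when the program runs, declaring SEC("maps") is all it takes to create the BPF map.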

As the walkthrough above shows, the simplified version uses syntactic sugar, but in the end it still goes through the bpf() system call.

Data Structures

This section introduces several common eBPF map data structures, including their use cases and how to use them.

Hash Table

For an eBPF map of type BPF_MAP_TYPE_HASH, both the key and the value can be user-defined data structures. It is used as follows:

// define the struct for the key of bpf map
struct pair {
  __u32 src_ip;
  __u32 dest_ip;
};

struct stats {
  __u64 tx_cnt; // count of requests sent
  __u64 rx_cnt; // count of requests received
  __u64 tx_bytes; // bytes sent
  __u64 rx_bytes; // bytes received
};

struct bpf_map_def SEC("maps") tracker_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(struct pair),
    .value_size = sizeof(struct stats),
    .max_entries = 2048,
};


struct stats *stats, newstats = {0, 0, 0, 0};

stats = bpf_map_lookup_elem(&tracker_map, pair);
if (stats)
{
    stats->rx_cnt++;
    stats->rx_bytes += bytes;
} else {
    newstats.rx_cnt = 1;
    newstats.rx_bytes = bytes;
    bpf_map_update_elem(&tracker_map, pair, &newstats, BPF_NOEXIST);
}

Array

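An eBPF map of type BPF_MAP_TYPE_ARRAY has the following properties:

The key is the array index and must be exactly 4 bytes.
When the array is created, all of its elements are pre-allocated and initialized to 0.
map_delete_elem() returns EINVAL, because elements of an array cannot be deleted.
map_update_elem() replaces elements non-atomically, with no protection against concurrent access.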

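BPF_MAP_TYPE_ARRAY maps are mainly used in two scenarios:

Global variables: allocate an array with a single element, key = 0, whose value is a collection of global variables.
Aggregation of tracing events into a fixed set of buckets.

The following shows how to use BPF_MAP_TYPE_ARRAY for global variables: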

struct globals {
    u64 lat_ave;
    u64 lat_sum;
    u64 missed;
    u64 max_lat;
    int num_samples;
};

struct bpf_map_def SEC("maps") global_map = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(struct globals),
    .max_entries = 1,
};

int bpf_prog(struct bpf_context *ctx)
{
    ...
    int ind = 0;
    struct globals *g = bpf_map_lookup_elem(&global_map, &ind);
    if (!g)
            return 0;
    if (g->lat_ave == 0) {
            g->num_samples++;
            g->lat_sum += delta;
            if (g->num_samples >= 100) {
                    g->lat_ave = g->lat_sum / g->num_samples;
    ...

Prog Array

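An eBPF map of type BPF_MAP_TYPE_PROG_ARRAY is mainly used for tail calls. Executing a tail call involves two steps:

Set up a map of type BPF_MAP_TYPE_PROG_ARRAY, which user space populates through key/value operations. For now, such a program array is write-only from the user-space side.
Call the helper bpf_tail_call(), whose signature is shown below. The kernel inlines this helper call into a specialized BPF instruction. Besides the context, it takes:
- a reference to the program array
- the key used to look up the map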

long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)

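When user space updates a slot, the kernel looks up the BPF program from the file descriptor passed in and atomically replaces the program pointer at that map slot. If bpf_tail_call() finds no entry for the given key, the kernel falls through and continues executing the instructions after bpf_tail_call().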

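Tail calls are a powerful feature. They make it possible to:

Parse network headers in a structured way through tail calls.
Atomically add or replace functionality at runtime, that is, dynamically change the execution behavior of a BPF program.

An example of BPF_MAP_TYPE_PROG_ARRAY can be found in samples/bpf: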

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u32));
	__uint(max_entries, 8);
} jmp_table SEC(".maps");

#define PARSE_VLAN 1
#define PARSE_MPLS 2
#define PARSE_IP 3
#define PARSE_IPV6 4

/* Protocol dispatch routine. It tail-calls next BPF program depending
 * on eth proto. Note, we could have used ...
 *
 *   bpf_tail_call(skb, &jmp_table, proto);
 *
 * ... but it would need large prog_array and cannot be optimised given
 * the map key is not static.
 */
static inline void parse_eth_proto(struct __sk_buff *skb, u32 proto)
{
	switch (proto) {
	case ETH_P_8021Q:
	case ETH_P_8021AD:
		bpf_tail_call(skb, &jmp_table, PARSE_VLAN);
		break;
	case ETH_P_MPLS_UC:
	case ETH_P_MPLS_MC:
		bpf_tail_call(skb, &jmp_table, PARSE_MPLS);
		break;
	case ETH_P_IP:
		bpf_tail_call(skb, &jmp_table, PARSE_IP);
		break;
	case ETH_P_IPV6:
		bpf_tail_call(skb, &jmp_table, PARSE_IPV6);
		break;
	}
}

Map In Map

eBPF provides two special map types, BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS, that implement map-in-map: the value of each entry in the outer map is itself a map.

The difference between BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS is whether the outer map is an array or a hash table.

Create

The regular eBPF maps so far are created at load time. For map-in-map, we define an outer map, and the inner maps are created by user space at runtime and inserted into the outer map. The outer map is defined as follows:

struct bpf_map_def SEC("maps") outer_map = {
    .type = BPF_MAP_TYPE_HASH_OF_MAPS,
    .key_size = sizeof(__u32),
    .value_size = sizeof(__u32), // Must be u32 because it is the inner map id
    .max_entries = 1,
};

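Note the following:

The outer map's value_size must be sizeof(__u32), which is exactly the size of an inner map id.

Although you do not need to define the inner map in the BPF C program, the verifier needs to know the inner map's definition at load time. So before calling bpf_object__load, you must create a dummy inner map and set its fd on the outer map by calling bpf_map__set_inner_map_fd. The dummy inner map's fd should be closed after the load so that it is not leaked. The listing uses bpf_create_map() from older libbpf releases; libbpf 1.0 removed it in favor of bpf_map_create().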

const char* outer_map_name = "outer_map";
struct bpf_map* outer_map = bpf_object__find_map_by_name(obj, outer_map_name);
int inner_map_fd = bpf_create_map(
    BPF_MAP_TYPE_HASH,  // type
    sizeof(__u32),      // key_size
    sizeof(__u32),      // value_size
    8,                  // max_entries
    0);                 // flag
bpf_map__set_inner_map_fd(outer_map, inner_map_fd);
bpf_object__load(obj);
close(inner_map_fd); // Important

Insert

Insert Into Outer Map

Inserting into the outer map takes these steps:

Create a new inner map.
Insert the fd of the new inner map into the outer map as the value.
Close the inner map fd.

int inner_map_fd = bpf_create_map_name(
    BPF_MAP_TYPE_HASH,   // type
    "hechaol_inner_map", // name
    sizeof(__u32),       // key_size
    sizeof(__u32),       // value_size
    8,                   // max_entries
    0);                  // flag
__u32 outer_key = 42;
bpf_map_update_elem(outer_map_fd, &outer_key, &inner_map_fd, 0 /* flag */);
close(inner_map_fd); // Important!

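Notes:

The value of each outer map entry is the id of an inner map, but the argument passed to the bpf_map_update_elem API is the fd of the inner map.
You must close the inner map fd after the insert; otherwise the descriptor, and the reference it holds on the map, is leaked.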

Insert Into Inner Map

As mentioned, the value of each outer map entry is the id of an inner map, not its fd. Even though we pass the inner map fd to bpf_map_update_elem, the value we get back from bpf_map_lookup_elem is the inner map id. To obtain an fd for the inner map, call bpf_map_get_fd_by_id. With the inner map fd in hand, you can operate on the inner map just like any other map.

const __u32 outer_key = 42;
__u32 inner_map_id;
bpf_map_lookup_elem(outer_map_fd, &outer_key, &inner_map_id);
int inner_map_fd = bpf_map_get_fd_by_id(inner_map_id);
const __u32 inner_key = 12;
__u32 inner_value;
bpf_map_lookup_elem(inner_map_fd, &inner_key, &inner_value);
// ... Use inner_value;
close(inner_map_fd); // Important!

Note that every call to bpf_map_get_fd_by_id returns a new fd, which you must close after use to avoid a leak.

Delete

Removing an inner map from the outer map works the same as deleting from a regular map: call bpf_map_delete_elem on the outer map.

const __u32 outer_key = 42;
bpf_map_delete_elem(outer_map_fd, &outer_key);

Perf Event Array

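Sometimes we want an eBPF program to notify a user-space program that data is ready. Array and hash maps do not fit this use case, and this is where BPF_MAP_TYPE_PERF_EVENT_ARRAY comes in. Unlike ordinary hash and array maps, BPF programs do not read it with bpf_map_lookup_elem(); they use bpf_perf_event_output() to pass data to user space. Its value_size can only be sizeof(u32), representing a perf_event file descriptor, and max_entries is the number of perf_event file descriptors.

The relevant source code is as follows: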

struct msg {
	__s32 seq;
	__u64 cts;
	__u8 comm[MAX_LENGTH];
};

struct bpf_map_def SEC("maps") map = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(__u32),
	.max_entries = 0,
};

SEC("kprobe/vfs_read")
int hello(struct pt_regs *ctx) {
	unsigned long cts = bpf_ktime_get_ns();
	struct msg val = {0};
	static __u32 seq = 0;

	val.seq = seq = (seq + 1) % 4294967295U;
	val.cts = bpf_ktime_get_ns();
	bpf_get_current_comm(val.comm, sizeof(val.comm));

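	/* BPF_F_CURRENT_CPU selects the slot of the CPU we are running on; a literal 0 would always pick slot 0 */
	bpf_perf_event_output(ctx, &map, BPF_F_CURRENT_CPU, &val, sizeof(val));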

	return 0;
}

Note:

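Here seq is the message sequence number.

The data passed to user space does not live in the map itself; it is buffered in the perf ring buffers. The storage sized by max_entries holds perf_event file descriptors, one per slot. A user-space program uses bpf(BPF_MAP_UPDATE_ELEM) to store file descriptors obtained from sys_perf_event_open() into the map. The eBPF program then refers to them by index, either as the target of bpf_perf_event_output() or to read counter values with bpf_perf_event_{read, read_value}(). max_entries can be left at 0 in the definition when the object is loaded through libbpf, which then sizes the map to the number of possible CPUs. For the counter-reading usage, see samples/bpf/tracex6_{user,kern}.c in the Linux kernel tree.

libbpf provides a ready-to-use user-space API for PERF_EVENT_ARRAY maps. It wraps epoll, so calling perf_buffer__new() and perf_buffer__poll() is all you need: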

static void print_bpf_output(void *ctx, int cpu, void *data, __u32 size) {
	struct msg *msg = data;

	fprintf(stdout, "%.4f: @seq=%d @comm=%s\n",
		 (float)msg->cts/1000000000ul, msg->seq, msg->comm);
}

int main(int argc, char *argv[]) {
	struct perf_buffer_opts pb_opts = {};
	struct perf_buffer *pb;
	...

	pb_opts.sample_cb = print_bpf_output;
	/* pre-1.0 libbpf signature; libbpf 1.0 takes the callbacks as direct arguments */
	pb = perf_buffer__new(map_fd, 8, &pb_opts);

	while (true) {
		perf_buffer__poll(pb, 1000);
		if (stop)
			break;
	}
	...
}

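Hands-On Walkthrough

Now we can use a BPF map to collect network packet information in kernel space, mainly source and destination addresses, and display it in user space. The code has two parts:

One is the program that runs in kernel space. It defines a custom BPF map, collects the target information, and stores it in the map.
The other is the program that runs in user space. It reads the data from the BPF map set up above and prints it in a readable format, demonstrating how a BPF map passes data between the two sides.

Note that this program is built and run in the BPF samples environment of the Linux kernel source tree. If you are not familiar with it, see the previous post.

Kernel Space

Let's start with the example code that runs in kernel space: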

#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/if_packet.h>
#include <uapi/linux/if_vlan.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/in.h>
#include <uapi/linux/tcp.h>
#include <uapi/linux/udp.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"
#include "xdp_ip_tracker_common.h"

#define bpf_printk(fmt, ...)                       \
    ({                                             \
        char ____fmt[] = fmt;                      \
        bpf_trace_printk(____fmt, sizeof(____fmt), \
                         ##__VA_ARGS__);           \
    })

struct bpf_map_def SEC("maps") tracker_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(struct pair),
    .value_size = sizeof(struct stats),
    .max_entries = 2048,
};

static __always_inline bool parse_and_track(bool is_rx, void *data_begin, void *data_end, struct pair *pair)
{
    struct ethhdr *eth = data_begin;

    if ((void *)(eth + 1) > data_end)
        return false;

    if (eth->h_proto == bpf_htons(ETH_P_IP))
    {
        struct iphdr *iph = (struct iphdr *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return false;

        pair->src_ip = is_rx ? iph->daddr : iph->saddr;
        pair->dest_ip = is_rx ? iph->saddr : iph->daddr;

        // update the map for track
        struct stats *stats, newstats = {0, 0, 0, 0};
        long long bytes = data_end - data_begin;

        stats = bpf_map_lookup_elem(&tracker_map, pair);
        if (stats)
        {
            if (is_rx)
            {
                stats->rx_cnt++;
                stats->rx_bytes += bytes;
            }
            else
            {
                stats->tx_cnt++;
                stats->tx_bytes += bytes;
            }
        }
        else
        {
            if (is_rx)
            {
                newstats.rx_cnt = 1;
                newstats.rx_bytes = bytes;
            }
            else
            {
                newstats.tx_cnt = 1;
                newstats.tx_bytes = bytes;
            }
            bpf_map_update_elem(&tracker_map, pair, &newstats, BPF_NOEXIST);
        }
        return true;
    }
    return false;
}

SEC("xdp_ip_tracker")
int _xdp_ip_tracker(struct xdp_md *ctx)
{
    // the struct to store the ip address as the keys of bpf map
    struct pair pair;

    bpf_printk("starting xdp ip tracker...\n");

    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    // pass if the network packet is not ipv4
    if (!parse_and_track(true, data, data_end, &pair))
        return XDP_PASS;

    return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

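Let's go through the key points of the BPF program that runs in kernel space:

SEC("maps") declares and creates a BPF map named tracker_map. Its type is BPF_MAP_TYPE_HASH, and its key and value are both custom structs defined in the header xdp_ip_tracker_common.h: struct pair and struct stats, the same definitions shown in the Hash Table section above.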
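
The header itself is not reproduced in this post. Below is a minimal sketch of what it presumably contains, reconstructed from the structs in the Hash Table section. The include guard name and the size checks are my additions.

```c
#ifndef XDP_IP_TRACKER_COMMON_H
#define XDP_IP_TRACKER_COMMON_H

#include <linux/types.h>

/* Key of tracker_map: the two IPv4 addresses of a flow. */
struct pair {
    __u32 src_ip;
    __u32 dest_ip;
};

/* Value of tracker_map: per-flow packet and byte counters. */
struct stats {
    __u64 tx_cnt;   /* count of requests sent */
    __u64 rx_cnt;   /* count of requests received */
    __u64 tx_bytes; /* bytes sent */
    __u64 rx_bytes; /* bytes received */
};

/* Both sides must agree on the layout, so pin the sizes at compile time. */
_Static_assert(sizeof(struct pair) == 8, "unexpected key size");
_Static_assert(sizeof(struct stats) == 32, "unexpected value size");

#endif /* XDP_IP_TRACKER_COMMON_H */
```

Sharing one header between the kernel-side and user-space programs keeps key_size and value_size consistent; the 8-byte key and 32-byte value are also the sizes bpftool reports for tracker_map later in this post.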

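The function parse_and_track parses and filters network packets. It combines the source and destination addresses into the key of the BPF map, and records the size of the current packet in bytes which, together with a packet counter, forms the value. Across successive packets, if the generated key already exists, the value is accumulated; otherwise a new key-value pair is stored in the BPF map. bpf_map_lookup_elem() is used to look up an element and bpf_map_update_elem() to add one.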

User Space

Next is the example code that runs in user space:

#include <linux/bpf.h>
#include <linux/if_link.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <arpa/inet.h>
#include <netinet/ether.h>
#include <unistd.h>
#include <time.h>
#include "bpf_load.h"
#include <bpf/bpf.h>
#include "bpf_util.h"
#include "xdp_ip_tracker_common.h"

static int ifindex = 6; // target network interface to attach, you can find it via `ip a`
static __u32 xdp_flags = 0;

// unlink the xdp program and exit
static void int_exit(int sig)
{
    printf("stopping\n");
    set_link_xdp_fd(ifindex, -1, xdp_flags);
    exit(0);
}

// An XDP program which track packets with IP address
// Usage: ./xdp_ip_tracker
int main(int argc, char **argv)
{
    char *filename = "xdp_ip_tracker_kern.o";
    // change limits
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    if (setrlimit(RLIMIT_MEMLOCK, &r))
    {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    // load the kernel bpf object file
    if (load_bpf_file(filename))
    {
        printf("error - bpf_log_buf: %s", bpf_log_buf);
        return 1;
    }

    // confirm the bpf prog fd is available
    if (!prog_fd[0])
    {
        printf("load_bpf_file: %s\n", strerror(errno));
        return 1;
    }

    // add signal handlers
    signal(SIGINT, int_exit);
    signal(SIGTERM, int_exit);

    // link the xdp program to the network interface
    if (set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0)
    {
        printf("link set xdp fd failed\n");
        return 1;
    }

    int result;
    struct pair next_key, lookup_key = {0, 0};
    struct stats value = {};
    while (1)
    {
        sleep(2);
        // retrieve the bpf map of statistics
        while (bpf_map_get_next_key(map_fd[0], &lookup_key, &next_key) != -1)
        {
            //printf("The local ip of next key in the map is: '%d'\n", next_key.src_ip);
            //printf("The remote ip of next key in the map is: '%d'\n", next_key.dest_ip);
            struct in_addr local = {next_key.src_ip};
            struct in_addr remote = {next_key.dest_ip};
            printf("The local ip of next key in the map is: '%s'\n", inet_ntoa(local));
            printf("The remote ip of next key in the map is: '%s'\n", inet_ntoa(remote));
            
            // get the value via the key
            // TODO: change to assert
            // assert(bpf_map_lookup_elem(map_fd[0], &next_key, &value) == 0)
            result = bpf_map_lookup_elem(map_fd[0], &next_key, &value);
            if (result == 0)
            {
                // print the value
                printf("rx_cnt value read from the map: '%llu'\n", value.rx_cnt);
                printf("rx_bytes value read from the map: '%llu'\n", value.rx_bytes);
            }
            else
            {
                printf("Failed to read value from the map: %d (%s)\n", result, strerror(errno));
            }
            lookup_key = next_key;
            printf("\n\n");
        }
        printf("start a new loop...\n");
        // reset the lookup key for a fresh start
        lookup_key.src_ip = 0;
        lookup_key.dest_ip = 0;
    }

    printf("end\n");
    // unlink the xdp program
    set_link_xdp_fd(ifindex, -1, xdp_flags);
    return 0;
}

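The user-space code has the same structure as any ordinary C program, with main as the entry point. The basic flow is to call load_bpf_file() (which in essence issues the system call with the BPF_PROG_LOAD command) to load the .o file compiled from the kernel-space BPF program. Loading BPF programs programmatically like this is more flexible than the command-line tools we used before, and suits product distribution in real scenarios.
After the BPF program is loaded, set_link_xdp_fd() attaches it to the target hook; as the function name suggests, this is the XDP network hook. Its two main arguments are:
- ifindex, the index of the target network interface (see ip a). I use 6 here, which corresponds to the veth virtual network device of a Docker container;
- prog_fd[0], the file descriptor produced when the BPF program was loaded into the kernel.
Two magic variables, prog_fd and map_fd, need some explanation:
- both are global variables defined in bpf_load.c;
- prog_fd is an array; when the kernel-space BPF programs are loaded, each fd is added to this array as soon as it is created;
- map_fd is also an array; when the load_maps() function mentioned above runs, each fd returned by the map-creating system call is likewise added to it. Programs under the bpf samples directory can therefore use these two variables directly as references to the BPF programs and BPF maps.
The while (1) loop near the end of main() runs forever and fetches the data of the target BPF map every 2 seconds. The fetching is done with bpf_map_get_next_key(map_fd[0], &lookup_key, &next_key): map_fd[0] is the target BPF map; lookup_key is the key to start from, which the caller passes in; next_key is the key that follows it, which the function fills in. To traverse the map from the beginning, pass a key that is certain not to exist as lookup_key, and next_key is set to the first key in the map. Once the key is known, the corresponding value can be read. This continues until bpf_map_get_next_key() returns -1, meaning there is no further key to assign to next_key and the traversal is complete; the function behaves much like an iterator. With these two nested loops the program keeps traversing the BPF map and printing its contents, so new packets show up promptly.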

There is also a piece of code that probably looks unfamiliar:

struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
if (setrlimit(RLIMIT_MEMLOCK, &r))
{
   perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
   return 1;
}

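The struct is called rlimit, short for resource limit; as the name suggests, it controls how much of a resource a process may use.
The constant RLIM_INFINITY means unlimited, so the first line defines a resource quota with no upper bound.
The second line calls setrlimit(). The first argument is the resource name, RLIMIT_MEMLOCK, which is the amount of memory the process may lock into RAM; the second is the unlimited quota just defined. So this line removes the cap on locked memory.
Why lift this limit? On the kernels this sample targets, the memory used by BPF maps and programs is charged against RLIMIT_MEMLOCK, whose default is often small (64 KB is common), while the memory a BPF program needs varies: a BPF map with a large max_entries, for example, will use a lot of it. So to make sure the BPF program loads and runs, the limit is raised to unlimited.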
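
If lifting the limit entirely feels too blunt, a process can at least raise its soft limit up to the current hard limit, which needs no extra privileges. Here is a small sketch; raise_memlock_soft_limit is a hypothetical helper name, not something from bpf_load.c or libbpf.

```c
#include <sys/resource.h>

/* Raise the soft RLIMIT_MEMLOCK to whatever the hard limit currently allows.
 * Returns 0 on success, -1 with errno set on failure. */
static int raise_memlock_soft_limit(void)
{
    struct rlimit r;

    if (getrlimit(RLIMIT_MEMLOCK, &r) != 0)
        return -1;

    r.rlim_cur = r.rlim_max;  /* the hard limit itself is left untouched */
    return setrlimit(RLIMIT_MEMLOCK, &r);
}
```

Whether the hard limit is large enough for a given set of maps depends on the system configuration, which is why the samples simply ask for RLIM_INFINITY when running as root.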

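Anonymous inode

In the Unix/Linux world everything is a file, and BPF maps are no exception. As we saw above, the data in a BPF map is accessed through a file descriptor, so creating a BPF map follows the Linux file creation path. The function that implements the BPF_MAP_CREATE command is map_create(), the core function for creating a BPF map: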

static int map_create(union bpf_attr *attr)
{
  int numa_node = bpf_map_attr_numa_node(attr);
  struct bpf_map *map;
  int f_flags;
  int err;

  err = CHECK_ATTR(BPF_MAP_CREATE);
  if (err)
    return -EINVAL;

  f_flags = bpf_get_file_flag(attr->map_flags);
  if (f_flags < 0)
    return f_flags;

  if (numa_node != NUMA_NO_NODE &&
      ((unsigned int)numa_node >= nr_node_ids ||
       !node_online(numa_node)))
    return -EINVAL;

  /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
  map = find_and_alloc_map(attr);
  if (IS_ERR(map))
    return PTR_ERR(map);

  err = bpf_obj_name_cpy(map->name, attr->map_name);
  if (err)
    goto free_map_nouncharge;

  atomic_set(&map->refcnt, 1);
  atomic_set(&map->usercnt, 1);

  err = security_bpf_map_alloc(map);
  if (err)
    goto free_map_nouncharge;

  err = bpf_map_charge_memlock(map);
  if (err)
    goto free_map_sec;

  err = bpf_map_alloc_id(map);
  if (err)
    goto free_map;

  // assign a fd for bpf map
  err = bpf_map_new_fd(map, f_flags);
  if (err < 0) {
    /* failed to allocate fd.
     * bpf_map_put() is needed because the above
     * bpf_map_alloc_id() has published the map
     * to the userspace and the userspace may
     * have refcnt-ed it through BPF_MAP_GET_FD_BY_ID.
     */
    bpf_map_put(map);
    return err;
  }

  trace_bpf_map_create(map, err);
  return err;

free_map:
  bpf_map_uncharge_memlock(map);
free_map_sec:
  security_bpf_map_free(map);
free_map_nouncharge:
  map->ops->map_free(map);
  return err;
}

Here bpf_map_new_fd() is what allocates the fd for the BPF map. Its body is shown below:

// https://elixir.bootlin.com/linux/v4.15/source/kernel/bpf/syscall.c#L327
int bpf_map_new_fd(struct bpf_map *map, int flags)
{
  int ret;

  ret = security_bpf_map(map, OPEN_FMODE(flags));
  if (ret < 0)
    return ret;
/**
 * anon_inode_getfd - creates a new file instance by hooking it up to an
 *                    anonymous inode, and a dentry that describe the "class"
 *                    of the file
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 *
 * Creates a new file by hooking it on a single inode. This is useful for files
 * that do not need to have a full-fledged inode in order to operate correctly.
 * All the files created with anon_inode_getfd() will share a single inode,
 * hence saving memory and avoiding code duplication for the file/inode/dentry
 * setup.  Returns new descriptor or an error code.
 */
  return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
        flags | O_CLOEXEC);
}

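The function to look at is anon_inode_getfd(). It is not the usual way of allocating an fd but a special anonymous one: its inode is not bound to any file on disk and exists only in memory. Once the last reference is dropped (every fd closed, and no loaded program or pinned path still holding the map), the memory is freed and the associated data, our BPF map, is deleted. Its doc comment is very well written; read it for the details.

You can also use lsof or ls -l /proc/[pid]/fd to see that a BPF map shows up as an anon_inode (ordinary BPF programs have the same type).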
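
The same thing can be checked from C. The sketch below reads the /proc/self/fd symlink of a descriptor. To stay runnable without privileges, it is meant to be tried on an epoll descriptor, which is also backed by an anonymous inode; for a BPF map fd the link target should read anon_inode:bpf-map, matching the name passed to anon_inode_getfd() above. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the symlink target of /proc/self/fd/<fd> into buf, NUL-terminated.
 * Returns the target length, or -1 on error. */
static ssize_t fd_link_target(int fd, char *buf, size_t buflen)
{
    char path[64];
    ssize_t n;

    assert(buf != NULL && buflen > 1);
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, buflen - 1);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}

/* Returns 1 if fd is backed by an anonymous inode, 0 if not, -1 on error. */
static int is_anon_inode_fd(int fd)
{
    char target[128];

    if (fd_link_target(fd, target, sizeof(target)) < 0)
        return -1;
    return strncmp(target, "anon_inode:", 11) == 0;
}
```

For example, is_anon_inode_fd(epoll_create1(0)) is expected to return 1 on a system with /proc mounted, while a descriptor for a regular file returns 0.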

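Debugging BPF Maps

To check whether any BPF maps are in use on the current system, you can use bpftool, a command-line tool strongly recommended by the BPF community. It is dedicated to inspecting BPF programs and BPF maps and can also perform some simple operations on them. The bpftool source is maintained in the Linux kernel tree, so it is usually built with make, which is not much trouble: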

cd linux-source-code/tools
make -C  bpf/bpftool/
cd bpf/bpftool/
# the output is a binary named as `bpftool`
./bpftool [prog|map]

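Note that the bpftool code differs between kernel versions, and so does its functionality. In general, bpftool from a newer kernel has more features and also works on older kernels. The one I use was built from the 5.6.6 kernel source and runs smoothly on a system with a 4.15.0 kernel.

Next, a quick demonstration of how to inspect BPF map information with bpftool, mainly using two commands: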

# command #1, list all the bpf map in the current node
# you can find map id, map type, map name, key type, value type, the number of max entry and memory allocation in the output
> bpftool map 
29: hash  name tracker_map  flags 0x0
  key 8B  value 32B  max_entries 2048  memlock 217088B


# command #2, show the bpf map details including keys and value in hex-format
# the map id can be found in the output of command #1
# you can also find the element number
> bpftool map dump id [map id]
key:
c0 a8 3a 01 ac 11 00 02
value:
00 00 00 00 00 00 00 00  0a 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  e4 02 00 00 00 00 00 00
key:
ac 11 00 01 ac 11 00 02
value:
00 00 00 00 00 00 00 00  07 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  06 02 00 00 00 00 00 00
Found 2 elements
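
The hex dump can be decoded by hand against struct pair and struct stats. The sketch below does that for one entry, assuming the 64-bit counters were written by a little-endian machine (an assumption about the host that produced the dump). The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct pair_view  { uint8_t src_ip[4]; uint8_t dest_ip[4]; };
struct stats_view { uint64_t tx_cnt, rx_cnt, tx_bytes, rx_bytes; };

/* Read an unsigned 64-bit integer stored least significant byte first. */
static uint64_t le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Split the 8-byte key into the two addresses (kept in network byte order). */
static void decode_key(const uint8_t raw[8], struct pair_view *out)
{
    assert(raw != NULL && out != NULL);
    memcpy(out->src_ip, raw, 4);
    memcpy(out->dest_ip, raw + 4, 4);
}

/* Split the 32-byte value into the four counters of struct stats. */
static void decode_value(const uint8_t raw[32], struct stats_view *out)
{
    assert(raw != NULL && out != NULL);
    out->tx_cnt   = le64(raw);
    out->rx_cnt   = le64(raw + 8);
    out->tx_bytes = le64(raw + 16);
    out->rx_bytes = le64(raw + 24);
}

/* Format four address bytes as a dotted quad; out must hold 16 bytes. */
static void format_ipv4(const uint8_t ip[4], char out[16])
{
    snprintf(out, 16, "%u.%u.%u.%u", ip[0], ip[1], ip[2], ip[3]);
}
```

Applied to the first entry above, the key decodes to 192.168.58.1 and 172.17.0.2, and the value to rx_cnt = 10 and rx_bytes = 740 (0x02e4), with both tx counters at 0.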