在内核中经常会有需要用到队列来传递数据的时候，而在 Linux 内核中就有一个轻量而且实现非常巧妙的队列实现——kfifo。简单来说 kfifo 是一个有限定大小的环形 buffer，借用网络上的一个图片来说明一下是最清楚的：

kfifo本身并没有队列元素的概念，其内部只是一个 buffer。在使用的时候需要用户知道其内部存储的内容，所以最好是用来存储定长对象。

kfifo有一个重要的特性，就是当使用场景是单生产者单消费者(1 Producer 1 Consumer，以下简称 1P1C)的情况下，不需要加锁，所以在这种情况下的性能较高。

本文中的所有代码均来自 linux kernel 2.6.32，所以 License 也是 GPLv2 的。

定义及 API

kfifo 主要定义在 include/linux/kfifo.h里面：

struct kfifo {
  unsigned char *buffer;	/* the buffer holding the data */
  unsigned int size;  /* the size of the allocated buffer */
  unsigned int in;  /* data is added at offset (in % size) */
  unsigned int out;  /* data is extracted from off. (out % size) */
  spinlock_t *lock;  /* protects concurrent modifications */
};

extern struct kfifo *kfifo_init(
    unsigned char *buffer, unsigned int size,
    gfp_t gfp_mask, spinlock_t *lock);
extern struct kfifo *kfifo_alloc(
    unsigned int size, gfp_t gfp_mask,
    spinlock_t *lock);
extern void kfifo_free(struct kfifo *fifo);
extern unsigned int __kfifo_put(struct kfifo *fifo,
    const unsigned char *buffer, unsigned int len);
extern unsigned int __kfifo_get(struct kfifo *fifo,
    unsigned char *buffer, unsigned int len);

可以看到在 kfifo 本身的定义里面，有一个 spinlock_t，这是用来在多线程同时修改队列的时候加锁的。而其余的成员就很明显了，是用来表示队列的当前状态的。队列本身的内容存储在 buffer里面。

需要注意的是，kfifo 要求队列的 size 是 2 的幂(2^n)，这样在后面操作的时候求余操作可以通过与运算来完成，从而更高效。

初始化通过 kfifo_init和 kfifo_alloc完成。而对于队列操作的主要函数的是 kfifo_put和 kfifo_get。这两个函数会先加锁，然后调用 __kfifo_put或者 __kfifo_get。也就是说真正的逻辑是实现在这两个函数里。之前也说过 kfifo在 1P1C 的情况下是不需要加锁的，所以这里我们会着重看看这两个函数。

入队

__kfifo_put的定义很短：

unsigned int __kfifo_put(struct kfifo *fifo,
  const unsigned char *buffer, unsigned int len)
{
  unsigned int l;
  len = min(len, fifo->size - fifo->in + fifo->out);

  /*
   * Ensure that we sample the fifo->out index -before- we
   * start putting bytes into the kfifo.
   */
  smp_mb();

  /* first put the data starting from fifo->in to buffer end */
  l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
  memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);

  /* then put the rest (if any) at the beginning of the buffer */
  memcpy(fifo->buffer, buffer + l, len - l);

  /*
   * Ensure that we add the bytes to the kfifo -before-
   * we update the fifo->in index.
   */
  smp_wmb();
  fifo->in += len;

  return len;
}

可以看到里面加了一些 memory barrier 来确保 1P1C 场景的正确，这里我们可以暂时忽略。

主要的步骤如下：

计算 len 和队列余下容量的较小值，如果队列容量不足，则只会拷贝剩余容量的大小。
先拷贝一部分内容到队列的尾部。
如果队列尾部并不能容下所有的内容，则再在队列的头部空闲空间继续拷贝。
把队列内容长度加上 len
返回新增内容的长度 len

这里注意到 in 只有在 __kfifo_put里面才会修改，而这个函数里面只会对 in 增加，所以 in 的值只会增加，不会减少。而 in 本身是 unsigned int类型的，所以当 in 超出了 2^32 的时候，会自动从 0 开始继续。

同时前面也说过，kfifo的 size 是 2^n。所以当 in > 2^n的时候，(in & 2^n - 1) == (in % 2^n)，所以这里可以用与操作替代求余来获取 in 在队列中实际的位置。

出队

__kfifo_get的定义和 __kfifo_put长度差不多：

unsigned int __kfifo_get(struct kfifo *fifo,
  unsigned char *buffer, unsigned int len)
{
  unsigned int l;
  len = min(len, fifo->in - fifo->out);

  /*
   * Ensure that we sample the fifo->in index -before- we
   * start removing bytes from the kfifo.
   */
  smp_rmb();

  /* first get the data from fifo->out until the end of the buffer */
  l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
  memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);

  /* then get the rest (if any) from the beginning of the buffer */
  memcpy(buffer + l, fifo->buffer, len - l);

  /*
   * Ensure that we remove the bytes from the kfifo -before-
   * we update the fifo->out index.
   */
  smp_mb();
  fifo->out += len;

  return len;
}

忽略掉 memory barrier 之后，主要步骤如下：

计算 len 和队列长度的较小值，如果队列内容不够，则只拷贝较小值的大小。
拷贝队列尾部的内容到输出 buffer 里面。
如果仍然有部分内容没有拷贝的话，则从队列头部拷贝余下的内容。
队列内容长度减少 len(也就是 out += len)。
返回拷贝内容的长度。

其实基本就是 __kfifo_put的逆过程。

那这里就有一个问题了，其实队列的长度并不一定要用 in和 out两个变量来表示啊，也可以用一个 len变量来表示啊。那这里就涉及到了多线程的互斥问题了。

多线程互斥

这里我们只考虑最简单的多线程场景——1P1C。如果我们只用一个 len来表示队列长度的话，那么看看 __kfifo_put和 __kfifo_get里面对这个变量都需要做修改，而且一个是 +=操作，一个是 -=。如果在不加锁的情况下，这两个操作并不是原子操作，所以如果只用一个 len，我们必须用锁来保护，无论是多么简单的多线程场景。

如果我们用 in和 out来表示队列的读边界和写边界的话，那么队列的长度可以用 in - out来表示。而且就像我们看到的那样，in只会在 __kfifo_put里面修改，而 out也只会在 __kfifo_get里面修改，所以无论是 in或 out都只会有一个线程修改，所以不会有互斥的问题。

那是不是这样就线程安全了呢？并不是。

还记得之前忽略掉的那些 memory barrier 吗？如果没有了那些 barrier 的话，代码仍然是不安全的。因为在多线程里面，我们不单只需要确保原子性，还需要保证不会有乱序(可见性)。而在没有锁或者 memory barrier 的情况下，没有办法保证在所有 CPU 上都不会出现乱序。而上面代码里面的 memory barrier 就是为了确保不出现乱序而加入的。

简单介绍一下这几个 memory barrier 的作用：

smp_rmb保证读操作之间不会出现乱序
smp_wmb保证写操作之间不会出现乱序
smp_mb保证读写操作都不会出现乱序

接着我们可以把 kfifo 里面对 in、out和 buffer的读写操作归类一下，那么 __kfifo_put的是下面这样：

R(in), R(out)
R(in), W(buffer)
W(in)

而 __kfifo_get则是下面这样：

R(in), R(out)
R(out), R(buffer)
W(out)

我们先来看 __kfifo_put，有几个内存操作是不可以出现乱序的：

R(out)和 W(buffer)：因为我们需要知道 out的最新值，否则可能出现明明有队列有空间，但是我们仍写不进去数据的情况。这里因为是要保证读写操作之间的顺序，所以需要用 smp_mb。实际上在 x86/64 平台，连这个 barrier 也可以忽略，因为在 x86 上面，读后写是保证不会乱序的，不过 Linux 内核由于需要保证各个平台都能 work，所以仍然需要这里加上。
W(buffer)和 W(in)：这个顺序是必须要保证的，否则可能我们更新了 in之后，这个时候 buffer 的内容其实并没有 copy 进去，但是这时候来了一个 __kfifo_get，就把内容拷贝出去了，这个是不允许的。所以这里我们需要用 smp_wmb。

我们可以用下面这个图来表示 kfifo在 put 的时候的状态：

类似的，__kfifo_get也有几个内存操作不可以乱序：

R(in)和 R(buffer)：我们需要获取最新的 in值，否则可能会出现明明队列有内容，但是我们却读不到。这里需要用 smp_rmb。
R(buffer)和 W(out)：这个顺序也是必须保证的，因为如果我们在读 buffer 之前就更新的 out 的话，则可能出现正要读 buffer 之前，该内容已经被 __kfifo_put覆盖了，则读出来并不是我们想要的内容。这里需要用 smp_mb。

kfifo在 get 的时候的状态可以用下面的图来表示：

所以有了上面 kfifo 的实现，也就有了一个非常高效的 1P1C 队列。当然如果是在其他的多线程场景，我们仍然需要用 spinlock 来保护 kfifo。

性能比较

我建了一个 repo(kfifo-benchmark)来简单地比较了一下 kfifo 的性能。我把 kfifo port 到了 user space，同时简单地把 spinlock_t替换成了 pthread_mutex_t(pthread_spinlock_t默认并不在 pthread，需要另外配置)。

比较里面的三个 case(可以自行到main.cc里面去看)及性能如下(我用的是 real time/wall time，所以时间越短表示越快)：

使用 __kfifo_put和 __kfifo_get的 1P1C(无锁)：0m3.496s
使用 kfifo_put和 kfifo_get的 1P1C 场景(mutex)：0m13.291s
使用 tpool 里面的 BoundedBlockingQueue默认特化的 1P1C 场景(mutex+condition variable)：0m17.791s

可以看出来，在 1P1C 场景下，kfifo 的无锁版比加锁版本要快 3.8x。而就算是 kfifo 的加锁版本，也比 tpool 中的 BoundedBlockingQueue要快 33%。

参考资料

http://www.infoq.com/cn/articles/cache-coherency-primer http://blog.csdn.net/muxiqingyang/article/details/6615199 http://www.parallellabs.com/2010/03/06/why-should-programmer-care-about-sequential-consistency-rather-than-cache-coherence/ http://stackoverflow.com/questions/30684974/are-cache-line-ping-pong-and-false-sharing-the-same http://psy-lob-saw.blogspot.co.uk/2013/11/spsc-iv-look-at-bqueue.html http://psy-lob-saw.blogspot.com/p/lock-free-queues.html https://github.com/seL4/refos/blob/master/impl/apps/process_server/src/system/memserv/ringbuffer.c http://www.geekgrade.com/geeksheet/nitsanw/blogs http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer https://mechanical-sympathy.blogspot.com/ http://psy-lob-saw.blogspot.com/2013/10/spsc-revisited-part-iii-fastflow-sparse.html https://github.com/nitsanw/examples/blob/revisited/src/main/java/psy/lob/saw/ff3/FFBufferOrdered3.java#L42 https://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular#heading_references http://www.boost.org/doc/libs/1_64_0/doc/html/lockfree.html https://github.com/stv0g/c11-queues https://github.com/rigtorp/SPSCQueue https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads https://codereview.stackexchange.com/questions/49142/single-consumer-and-single-producer-lock-free-circular-buffer https://willnewton.name/uncategorized/simple-lock-free-ring-buffer-implementation/ http://courses.cs.washington.edu/courses/cse451/03wi/section/prodcons.htm http://joeduffyblog.com/2009/10/04/fast-synchronization-between-a-single-producer-and-single-consumer/ https://nativecoding.wordpress.com/2015/06/17/multithreading-lockless-thread-safe-spsc-ring-buffer-queue/ http://www.cgo.org/cgo2010/epic8/papers/jablin.pdf http://www.blogjava.net/yongboy/archive/2015/02/12/422893.html http://www.rossbencina.com/code/lockfree http://xiaorui.cc/2015/12/02/%E4%BD%BF%E7%94%A8socket-so_reuseport%E6%8F%90%E9%AB%98%E6%9C%8D%E5%8A%A1%E7%AB%AF%E6%80%A7%E8%83%BD/ http://m.blog.chinaunix.net/uid-10167808-id-3807060.html https://zhuanlan.zhihu.com/p/25533528 http://cpper.info/2016/07/30/Concurrency-Network-Server-Models.html#title0 http://www.jianshu.com/p/128352841aab https://code.lm7.fr/robotux/rt_benchs/src/master http://www-public.tem-tsp.eu/~thomas_g/research/biblio/2010/preudhomme10sbac-batchqueue.pdf http://www.linuxidc.com/Linux/2016-12/137936.htm http://www.linuxidc.com/Linux/2016-12/137937.htm http://blog.codingnow.com/2012/06/dev_note_21.html http://wudaijun.com/2015/02/cpp-disruptor/ https://github.com/fsaintjacques/disruptor-- http://ifeve.com/disruptor/ http://tech.meituan.com/disruptor.html https://lmax-exchange.github.io/disruptor/ http://lmax-exchange.github.io/disruptor/ https://zhuanlan.zhihu.com/p/23863915 http://blog.csdn.net/chen19870707/article/details/39994303 https://github.com/fsaintjacques/disruptor--/blob/master/test/ring_buffer_test.cc https://github.com/wudaijun/Code/blob/master/Demo/disruptor/main.cpp http://blog.jobbole.com/109648/ https://kukuruku.co/post/lock-free-data-structures-yet-another-treatise/ http://blog.csdn.net/zzulp/article/details/6259866 http://coolshell.cn/articles/8239.html https://en.wikipedia.org/wiki/ABA_problem http://aigo.iteye.com/blog/2292169 http://moodycamel.com/blog/2013/a-fast-lock-free-queue-for-c++ http://ju.outofmemory.cn/entry/213197 http://wizmann.tk/inspiration-from-zeromq.html https://taozj.org/201609/lockless-in-multi-thread.html http://www.parallellabs.com/2010/10/25/practical-concurrent-queue-algorithm/ http://purecpp.org/?tag=%E6%97%A0%E9%94%81%E9%98%9F%E5%88%97 https://github.com/LMAX-Exchange/disruptor http://biz.sse.com.cn/sseportal/cs/zhs/ywyy/hyb/ngtrade/new/itrdc/ITRDC_TechMag_008_201209_6.pdf http://blog.csdn.net/chen19870707/article/details/39896655 http://blog.codingnow.com/2007/12/fence_in_multi_core.html http://cration.rcstech.org/program/2014/01/22/linux-kfifo/ http://blog.csdn.net/linyt/article/details/53355355 https://www.ibm.com/developerworks/cn/linux/l-cn-lockfree/ http://wiki.jikexueyuan.com/project/disruptor-getting-started/storage-barrier.html http://linfengying.com/?p=2386 http://hugozhu.myalert.info/2013/03/28/22-memory-barriers-or-fences.html http://www.cnblogs.com/Anker/p/3481373.html http://www.cnblogs.com/dodng/p/4367791.html http://www.lai18.com/content/2459345.html http://blog.kongfy.com/2017/02/%E6%97%A0%E9%94%81%E9%98%9F%E5%88%97%E7%9A%84%E4%B8%80%E7%A7%8D%E5%AE%9E%E7%8E%B0/ https://zhuanlan.zhihu.com/p/23979167 http://www.nathanyan.com/2015/08/14/%E4%BC%AA%E5%85%B1%E4%BA%AB-False-Sharing/ https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/ http://m.blog.chinaunix.net/uid-21169302-id-446211.html http://www.cs.utexas.edu/~pingali/CS377P/2017sp/lectures/mesi.pdf http://www.cnblogs.com/my_life/articles/5386244.html http://blog.jobbole.com/54345/ https://software.intel.com/zh-cn/blogs/2010/01/14/cpucpu/ http://www.infoq.com/cn/articles/atomic-operation http://www.infoq.com/cn/articles/atomic-operations-and-contention http://hedengcheng.com/?p=803 https://www.ibm.com/developerworks/cn/linux/l-synch/part2/ https://www.ibm.com/developerworks/cn/linux/l-synch/part1/ https://zhuanlan.zhihu.com/p/24146167 http://preshing.com/20120913/acquire-and-release-semantics/ http://preshing.com/20120515/memory-reordering-caught-in-the-act/ https://github.com/MintCN/linux-insides-zh/tree/master/SyncPrim http://blog.csdn.net/wentianyao/article/details/51250781 http://www.infoq.com/cn/articles/High-Performance-Java-Inter-Thread-Communications https://www.zhihu.com/question/24301047 http://preshing.com/20120612/an-introduction-to-lock-free-programming/ http://www.valleytalk.org/wp-content/uploads/2011/07/lock-free-intro1.pdf http://blog.csdn.net/ithomer/article/details/8774006 https://booksite.elsevier.com/9780123705914/?ISBN=9780123705914