Abstract memory access model

What are memory barriers?

As the preceding description shows, a CPU may execute mutually independent memory accesses in any order it likes in pursuit of efficiency. This reordering, however, can cause problems for CPU-CPU interaction and for I/O. What is needed is some mechanism for telling the compiler and the CPU to restrict the reordering, at least in part.

That intervention mechanism is the memory barrier, which imposes a partial ordering on the memory accesses on either side of it. Such enforcement is necessary because the CPUs and other devices in a system can use a variety of tricks to improve performance, including:

  • reordering, deferral and combination of memory accesses
  • speculative loads
  • speculative branch prediction
  • various sorts of caching

Memory barriers let us partially override or suppress these tricks, so that interactions across multiple CPUs or with I/O devices proceed in the order the programmer intended.

Varieties of memory barrier

Memory barriers come in four basic varieties:

Write (or store) memory barriers

A write memory barrier gives a guarantee that all the STORE operations specified before the barrier will appear to happen before all the STORE operations specified after the barrier with respect to the other components of the system.

A write memory barrier guarantees that every STORE before the barrier will appear to happen before every STORE after it (happens-before). A write barrier constrains only stores; it places no ordering on the loads around it.

{% note info %}

Note that write barriers should normally be paired with read or data dependency barriers; see the "SMP barrier pairing" subsection.

{% endnote %}

Data dependency barriers

A data dependency barrier is a weaker form of read barrier. In the case where two loads are performed such that the second depends on the result of the first (eg: the first load retrieves the address to which the second load will be directed), a data dependency barrier would be required to make sure that the target of the second load is updated after the address obtained by the first load is accessed.

A data dependency barrier is needed between two loads where the second depends on the result of the first, so that the target of the second load is not read until the result of the first load is in hand. The typical case: the first load fetches a memory address, and the second load reads the data at that address.

The other CPUs in the system can be viewed as committing sequences of stores to the memory system that the CPU being considered can then perceive. A data dependency barrier issued by the CPU under consideration guarantees that for any load preceding it, if that load touches one of a sequence of stores from another CPU, then by the time the barrier completes, the effects of all the stores prior to that touched by the load will be perceptible to any loads issued after the data dependency barrier.

On a multi-CPU system, if a load preceding a CPU's data dependency barrier reads a value stored by another CPU, then once the barrier completes, the effects of all stores that the other CPU made prior to that store become visible to any loads issued after the barrier.

{% note info %}

Note that the first load really has to have a data dependency and not a control dependency. If the address for the second load is dependent on the first load, but the dependency is through a conditional rather than actually loading the address itself, then it’s a control dependency and a full read barrier or better is required. See the “Control dependencies” subsection for more information.

{% endnote %}

Read (or load) memory barriers

A read barrier is a data dependency barrier plus a guarantee that all the LOAD operations specified before the barrier will appear to happen before all the LOAD operations specified after the barrier with respect to the other components of the system.

Analogously to the write barrier, every LOAD before a read barrier will appear to happen before every LOAD after it (happens-before). A read barrier constrains only loads; it places no ordering on the stores around it.

A read barrier implies a data dependency barrier, so a read barrier can always substitute for one.

{% note info %}

Note that read barriers should normally be paired with write barriers; see the “SMP barrier pairing” subsection.

{% endnote %}

General memory barriers

A general memory barrier gives a guarantee that all the LOAD and STORE operations specified before the barrier will appear to happen before all the LOAD and STORE operations specified after the barrier with respect to the other components of the system.

A general memory barrier, or full barrier, orders both the loads and the stores on either side of it, combining the effects of a read barrier and a write barrier.

In addition there are two implicit varieties, ACQUIRE and RELEASE operations:

ACQUIRE operations

An ACQUIRE operation guarantees that all memory operations after it will appear to happen after it. Memory operations before the ACQUIRE, however, may still appear to happen after it.

This acts as a one-way permeable barrier. It guarantees that all memory operations after the ACQUIRE operation will appear to happen after the ACQUIRE operation with respect to the other components of the system.

ACQUIRE operations include LOCK operations and both smp_load_acquire() and smp_cond_load_acquire().

RELEASE operations

A RELEASE operation guarantees that all memory operations before it will appear to happen before it. Memory operations after the RELEASE, however, may still appear to happen before it.

This also acts as a one-way permeable barrier. It guarantees that all memory operations before the RELEASE operation will appear to happen before the RELEASE operation with respect to the other components of the system.

RELEASE operations include UNLOCK operations and smp_store_release().

The use of ACQUIRE and RELEASE operations generally precludes the need for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is -not- guaranteed to act as a full memory barrier. However, after an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible. In other words, within a given variable’s critical section, all accesses of all previous critical sections for that variable are guaranteed to have completed.

ACQUIRE and RELEASE are normally used in pairs, just like LOCK and UNLOCK. The code between an ACQUIRE and a RELEASE on the same variable forms a critical section, and while it runs, all accesses made in all previous critical sections for that variable are guaranteed to have completed.

What may not be assumed about memory barriers?

  • There is no guarantee that the memory accesses before a barrier instruction will have completed by the time the barrier instruction itself completes; the barrier merely draws a line in the stream of accesses, requiring them to complete in a certain order.
  • There is no guarantee that a barrier executed on one CPU has any direct effect on another CPU. The effect is indirect: it constrains the order in which the second CPU perceives the first CPU's accesses.
  • There is no guarantee that a CPU will see the correct effects of a second CPU's accesses, even if the second CPU uses a memory barrier, unless the first CPU also uses a matching barrier.
  • There is no guarantee that off-CPU hardware (such as DMA or PCI devices) will not reorder memory accesses. CPU cache coherency will propagate the indirect effects of a memory barrier between CPUs, though not necessarily in order.

Data dependency barriers (historical)

Machines with the DEC Alpha architecture raise additional data dependency barrier issues; see the original document for details, which are not repeated here.

Control dependencies

A load-load control dependency needs a full read memory barrier to work correctly, not a mere data dependency barrier:

q = READ_ONCE(a);
if (q) {
  <data dependency barrier>  /* BUG: No data dependency!!! */
  p = READ_ONCE(b);
}

In the code above the CPU may speculate on the value loaded from a and load b in advance, so the load of b can happen before the load of a. The following is needed instead:

q = READ_ONCE(a);
if (q) {
  <read barrier>
  p = READ_ONCE(b);
}

CPUs do not speculate stores, however, so for a load-store control dependency the original order is preserved, as in this code:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	}

Note that for control dependencies both READ_ONCE() and WRITE_ONCE() are required: without them the compiler is within its rights to combine the load from a with other loads from a, and the store to b with other stores to b.

Worse, if the compiler can prove that the value of a is nonzero, then without READ_ONCE() it may optimize the code above into:

	q = a;
	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */

Sometimes we want such ordering even when the two branches perform identical stores, as in the following code:

	q = READ_ONCE(a);
	if (q) {
		barrier();
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		barrier();
		WRITE_ONCE(b, 1);
		do_something_else();
	}

Unfortunately, current compilers will optimize the above into:

	q = READ_ONCE(a);
	barrier();
	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
	if (q) {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something();
	} else {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something_else();
	}

Now there is no conditional between the load from a and the store to b, so the CPU is free to reorder them, which changes the logic of the original code. In this case an explicit memory barrier is needed:

	q = READ_ONCE(a);
	if (q) {
		smp_store_release(&b, 1);
		do_something();
	} else {
		smp_store_release(&b, 1);
		do_something_else();
	}

{% note info %}

This part is not entirely clear to me. (The point appears to be that smp_store_release() still orders the load from a before the store to b even if the compiler hoists the two identical stores out of the conditional, whereas plain WRITE_ONCE() relies on the control dependency, which the hoisting destroys.)

{% endnote %}

By contrast, when the two stores differ, the ordering below is guaranteed even without an explicit memory barrier:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

The initial READ_ONCE() is still required to prevent the compiler from proving the value of ‘a’.

Care is also needed with the operations applied to the local variable q: otherwise the compiler may be able to guess its value and optimize the conditional away entirely. For example:

	q = READ_ONCE(a);
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

If MAX is 1 here, the compiler knows that q % MAX is always 0 and will transform the code into:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 2);
	do_something_else();

The CPU is then free to reorder the load from a and the store to b, so the code should be written like this instead:

	q = READ_ONCE(a);
	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

In addition, control dependencies apply only to the then-clause and else-clause of the if-statement in question. In particular, it does not necessarily apply to code following the if-statement:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	} else {
		WRITE_ONCE(b, 2);
	}
	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from a. */

In summary, for control dependencies:

  • Control dependencies can order prior loads against later stores. However, they do -not- guarantee any other sort of ordering: Not prior loads against later loads, nor prior stores against later anything. If you need these other forms of ordering, use smp_rmb(), smp_wmb(), or, in the case of prior stores and later loads, smp_mb().
  • See the original document for the remaining points.

SMP barrier pairing

When dealing with CPU-CPU interactions, certain types of memory barrier should always come in pairs: a write barrier on the storing CPU must be matched by a read or data dependency barrier on the loading CPU, otherwise neither barrier achieves its intended effect. Three such pairings:

	CPU 1		      CPU 2
	===============	      ===============
	WRITE_ONCE(a, 1);
	<write barrier>
	WRITE_ONCE(b, 2);     x = READ_ONCE(b);
			      <read barrier>
			      y = READ_ONCE(a);

Or:

	CPU 1		      CPU 2
	===============	      ===============================
	a = 1;
	<write barrier>
	WRITE_ONCE(b, &a);    x = READ_ONCE(b);
			      <data dependency barrier>
			      y = *x;

Or even:

	CPU 1		      CPU 2
	===============	      ===============================
	r1 = READ_ONCE(y);
	<general barrier>
	WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
			         <implicit control dependency>
			         WRITE_ONCE(y, 1);
			      }

	assert(r1 == 0 || r2 == 0);

Examples of memory barrier sequences

Firstly, write barriers act as partial orderings on store operations. Consider the following sequence:

	CPU 1
	=======================
	STORE A = 1
	STORE B = 2
	STORE C = 3
	<write barrier>
	STORE D = 4
	STORE E = 5

When this sequence of events is committed to the memory coherence system, the rest of the system might perceive it as:

the unordered set of { STORE A, STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E }

	+-------+       :      :
	|       |       +------+
	|       |------>| C=3  |     }     /\
	|       |  :    +------+     }-----  \  -----> Events perceptible to
	|       |  :    | A=1  |     }        \/       the rest of the system
	|       |  :    +------+     }
	| CPU 1 |  :    | B=2  |     }
	|       |       +------+     }
	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
	|       |       +------+     }        requires all stores prior to the
	|       |  :    | E=5  |     }        barrier to be committed before
	|       |  :    +------+     }        further stores may take place
	|       |------>| D=4  |     }
	|       |       +------+
	+-------+       :      :
	                   |
	                   | Sequence in which stores are committed to the
	                   | memory system by CPU 1
	                   V

Secondly, data dependency barriers act as partial orderings on dependent loads. Consider:

	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				LOAD *C (reads B)
	+-------+       :      :                :       :
	|       |       +------+                +-------+  | Sequence of update
	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
	|       |  :    +------+     \          +-------+  | CPU 2
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	    Apparently incorrect --->  |        | B->7  |------>|       |
	    perception of B (!)        |        +-------+       |       |
	                               |        :       :       |       |
	                               |        +-------+       |       |
	    The load of X holds --->    \       | X->9  |------>|       |
	    up the maintenance           \      +-------+       |       |
	    of coherence of B             ----->| B->2  |       +-------+
	                                        +-------+
	                                        :       :

Without intervention, CPU 2 perceives CPU 1's events in an effectively random order despite CPU 1's write barrier: in the diagram above, the load of *C reads the stale value of B (7).

If, however, a data dependency barrier is placed between CPU 2's load of C and its load of *C, then because the load of C obtained the value CPU 1 stored, everything CPU 1 did before that store is guaranteed to be visible to CPU 2's loads issued after the barrier:

	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				<data dependency barrier>
				LOAD *C (reads B)
	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| B=2  |-----       --->| Y->8  |
	|       |  :    +------+     \          +-------+
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	                               |        | X->9  |------>|       |
	                               |        +-------+       |       |
	  Makes sure all effects --->   \   ddddddddddddddddd   |       |
	  prior to the store of C        \      +-------+       |       |
	  are perceptible to              ----->| B->2  |------>|       |
	  subsequent loads                      +-------+       |       |
	                                        :       :       +-------+

Thirdly, read barriers act as partial orderings on loads. Consider:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A

With only the write barrier in place, CPU 2 might perceive the events like this, seeing the new B but the old A:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       | A->0  |------>|       |
	                                |       +-------+       |       |
	                                |       :       :       +-------+
	                                 \      :       :
	                                  \     +-------+
	                                   ---->| A->1  |
	                                        +-------+
	                                        :       :

If, however, a read barrier is added on CPU 2:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				<read barrier>
				LOAD A

then the partial ordering imposed by CPU 1 will be perceived correctly by CPU 2:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>|       |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+

To illustrate this more completely, consider what happens if the code contains a load of A on either side of the read barrier:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A [first load of A]
				<read barrier>
				LOAD A [second load of A]

Even though the two loads of A both occur after the load of B, they may come up with different values:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	                                |       +-------+       |       |
	                                |       | A->0  |------>| 1st   |
	                                |       +-------+       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+

But it may be that the update to A from CPU 1 becomes perceptible to CPU 2 before the read barrier completes anyway, in which case both loads see the new value:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                 \      :       :       |       |
	                                  \     +-------+       |       |
	                                   ---->| A->1  |------>| 1st   |
	                                        +-------+       |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	                                        | A->1  |------>| 2nd   |
	                                        +-------+       |       |
	                                        :       :       +-------+

Read memory barriers vs load speculation

Many CPUs speculate with loads: that is they see that they will need to load an item from memory, and they find a time where they’re not using the bus for any other loads, and so do the load in advance - even though they haven’t actually got to that point in the instruction execution flow yet. This permits the actual load instruction to potentially complete immediately because the CPU already has the value to hand.

Multicopy atomicity

Multicopy atomicity is a deeply intuitive notion about ordering that is not always provided by real computer systems: namely, that a given store becomes visible at the same time to all CPUs, or, alternatively, that all CPUs agree on the order in which all stores become visible. However, supporting full multicopy atomicity would rule out valuable hardware optimizations, so a weaker form called "other multicopy atomicity" instead guarantees only that a given store becomes visible at the same time to all -other- CPUs. The remainder of this document discusses this weaker form, but for brevity calls it simply "multicopy atomicity". To reiterate, if your code requires full ordering of all operations, use general barriers throughout.

Explicit kernel barriers

The Linux kernel has two kinds of explicit barriers: compiler barriers and CPU memory barriers.

Compiler barrier

CPU memory barriers

The Linux kernel has these basic CPU memory barriers:

	TYPE		MANDATORY		SMP CONDITIONAL
	===============	=======================	===========================
	GENERAL		mb()			smp_mb()
	WRITE		wmb()			smp_wmb()
	READ		rmb()			smp_rmb()
	DATA DEPENDENCY				READ_ONCE()

Implicit kernel memory barriers

Lock acquisition functions

The Linux kernel has a number of locking constructs:

  • spin locks
  • R/W spin locks
  • mutexes
  • semaphores
  • R/W semaphores

Interrupt disabling functions

Sleep and wake-up functions

Miscellaneous functions

Inter-CPU acquiring barrier effects

Acquires vs memory accesses

Where are memory barriers needed?

Interprocessor interaction

Atomic operations

Accessing devices

Interrupts

Kernel I/O barrier effects

Assumed minimum execution ordering model

The effects of the cpu cache

Cache coherency

Cache coherency vs DMA

Cache coherency vs MMIO

The things CPUs get up to

And then there’s the Alpha

Virtual Machine Guests

Example uses: Circular buffers

References