[PATCH] Document Linux's memory barriers

David Howells dhowells at redhat.com
Wed Mar 8 04:40:45 EST 2006


The attached patch documents the Linux kernel's memory barriers.

Signed-Off-By: David Howells <dhowells at redhat.com>
---
warthog>diffstat -p1 mb.diff 
 Documentation/memory-barriers.txt |  359 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 359 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..c2fc51b
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,359 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implied kernel memory barriers.
+
+ (*) i386 and x86_64 arch specific notes.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose a
+partial ordering between the memory access operations specified either side of
+the barrier.
+
+Older and less complex CPUs will perform memory accesses in exactly the order
+specified, so if one is given the following piece of code:
+
+	a = *A;
+	*B = b;
+	c = *C;
+	d = *D;
+	*E = e;
+
+It can be guaranteed that it will complete the memory access for each
+instruction before moving on to the next line, leading to a definite sequence
+of operations on the bus:
+
+	read *A, write *B, read *C, read *D, write *E.
+
+However, with newer and more complex CPUs, this isn't always true because:
+
+ (*) they can rearrange the order of the memory accesses to promote better use
+     of the CPU buses and caches;
+
+ (*) reads are synchronous and may need to be done immediately to permit
+     progress, whereas writes can often be deferred without a problem;
+
+ (*) and they are able to combine reads and writes to improve performance when
+     talking to the SDRAM (modern SDRAM chips can do batched accesses of
+     adjacent locations, cutting down on transaction setup costs).
+
+So what you might actually get from the above piece of code is:
+
+	read *A, read *C+*D, write *E, write *B
+
+Under normal operation, this is probably not going to be a problem; however,
+there are two circumstances where it definitely _can_ be a problem:
+
+ (1) I/O
+
+     Many I/O devices can be memory mapped, and so appear to the CPU as if
+     they're just memory locations. However, to control the device, the driver
+     has to make the right accesses in exactly the right order.
+
+     Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+     presents to the CPU an "address register" and a bunch of "data registers".
+     The way it's accessed is to write the index of the internal register you
+     want to access to the address register, and then read or write the
+     appropriate data register to access the chip's internal register:
+
+	*ADR = ctl_reg_3;
+	reg = *DATA;
+
+     The problem with a clever CPU or a clever compiler is that the write to
+     the address register isn't guaranteed to happen before the access to the
+     data register, if the CPU or the compiler thinks it is more efficient to
+     defer the address write:
+
+	read *DATA, write *ADR
+
+     then things will break.
+
+     The way to deal with this is to insert an I/O memory barrier between the
+     two accesses:
+
+	*ADR = ctl_reg_3;
+	mb();
+	reg = *DATA;
+
+     In this case, the barrier makes a guarantee that all memory accesses
+     before the barrier will happen before all the memory accesses after the
+     barrier. It does _not_ guarantee that all memory accesses before the
+     barrier will be complete by the time the barrier is complete.
+
+ (2) Multiprocessor interaction
+
+     When there's a system with more than one processor, these may be working
+     on the same set of data, but attempting not to use locks as locks are
+     quite expensive. This means that accesses that affect both CPUs may have
+     to be carefully ordered to prevent error.
+
+     Consider the R/W semaphore slow path. In that, a waiting process is
+     queued on the semaphore, as noted by it having a record on its stack
+     linked to the semaphore's list:
+
+	struct rw_semaphore {
+		...
+		struct list_head waiters;
+	};
+
+	struct rwsem_waiter {
+		struct list_head list;
+		struct task_struct *task;
+	};
+
+     To wake up the waiter, the up_read() or up_write() functions have to read
+     the pointer from this record to know as to where the next waiter record
+     is, clear the task pointer, call wake_up_process() on the task, and
+     release the task struct reference held:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+     If any of these steps occur out of order, then the whole thing may fail.
+
+     Note that the waiter does not get the semaphore lock again - it just waits
+     for its task pointer to be cleared. Since the record is on its stack, this
+     means that if the task pointer is cleared _before_ the next pointer in the
+     list is read, then another CPU might start processing the waiter and it
+     might clobber its stack before up*() functions have a chance to read the
+     next pointer.
+
+	CPU 0				CPU 1
+	===============================	===============================
+					down_xxx()
+					Queue waiter
+					Sleep
+	up_yyy()
+	READ waiter->task;
+	WRITE waiter->task;
+	<preempt>
+					Resume processing
+					down_xxx() returns
+					call foo()
+					foo() clobbers *waiter
+	</preempt>
+	READ waiter->list.next;
+	--- OOPS ---
+
+     This could be dealt with using a spinlock, but then the down_xxx()
+     function has to get the spinlock again after it's been woken up, which is
+     a waste of resources.
+
+     The way to deal with this is to insert an SMP memory barrier:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	smp_mb();
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+     In this case, the barrier makes a guarantee that all memory accesses
+     before the barrier will happen before all the memory accesses after the
+     barrier. It does _not_ guarantee that all memory accesses before the
+     barrier will be complete by the time the barrier is complete.
+
+     SMP memory barriers are normally no-ops on a UP system because the CPU
+     orders overlapping accesses with respect to itself.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic memory barriers:
+
+		MANDATORY (I/O)	SMP
+		===============	================
+	GENERAL	mb()		smp_mb()
+	READ	rmb()		smp_rmb()
+	WRITE	wmb()		smp_wmb()
+
+General memory barriers make a guarantee that all memory accesses specified
+before the barrier will happen before all memory accesses specified after the
+barrier.
+
+Read memory barriers make a guarantee that all memory reads specified before
+the barrier will happen before all memory reads specified after the barrier.
+
+Write memory barriers make a guarantee that all memory writes specified before
+the barrier will happen before all memory writes specified after the barrier.
+
+SMP memory barriers are no-ops on uniprocessor compiled systems because it is
+assumed that a CPU will be self-consistent, and will order overlapping accesses
+with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in the access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order the first CPU commits its accesses to the bus.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function.
+
+
+==============================
+IMPLIED KERNEL MEMORY BARRIERS
+==============================
+
+Some of the other functions in the linux kernel imply memory barriers. For
+instance all the following (pseudo-)locking functions imply barriers.
+
+ (*) interrupt disablement and/or interrupts
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the unlock accesses.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do
+anything at all, especially with respect to I/O memory barriering.
+
+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
+memory and I/O accesses individually, or interrupt handling will barrier
+memory and I/O accesses on entry and on exit. This prevents an interrupt
+routine interfering with accesses made in a disabled-interrupt section of code
+and vice versa.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events on the bus is acceptable:
+
+	LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+	*F+*A, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+	DISABLE IRQ
+	*ADR = ctl_reg_3;
+	mb();
+	x = *DATA;
+	*ADR = ctl_reg_4;
+	mb();
+	*DATA = y;
+	*ADR = ctl_reg_5;
+	mb();
+	z = *DATA;
+	ENABLE IRQ
+	<interrupt>
+	*ADR = ctl_reg_7;
+	mb();
+	q = *DATA
+	</interrupt>
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way).
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+From the Pentium-III onwards were three new memory barrier instructions:
+LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier
+functions rmb(), wmb() and mb(). However, there are additional implicit memory
+barriers in the CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds implication of mb() on whatever instruction it is
+     attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+     required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+     after that write, but reads after a write may complete before the write
+     (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+     of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. It's memory barriers are also to some extent
+more substantial than the mimimum requirement, and may directly effect
+hardware outside of the CPU.



More information about the Linuxppc64-dev mailing list