[PATCH] Document Linux's memory barriers [try #4]

David Howells dhowells at redhat.com
Fri Mar 10 07:29:22 EST 2006


The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

The per-arch notes sections are gone because it's clear that there are so many
exceptions that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

Interrupt barrier effects are now considered to be non-existent. They may be
there, but you may not rely on them.

I've added a couple of definition sections at the top of the document: one to
specify the minimum execution model that may be assumed, the other to specify
what this document refers to by the term "memory".

Signed-off-by: David Howells <dhowells at redhat.com>
---
warthog>diffstat -p1 /tmp/mb.diff 
 Documentation/memory-barriers.txt |  855 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 855 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..04c5c88
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,855 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) Assumed minimum execution ordering model.
+
+ (*) What is considered memory?
+
+     - Cached interactions.
+     - Uncached interactions.
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+     - Accessing devices.
+     - Multiprocessor interaction.
+     - Interrupts.
+
+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disabling functions.
+     - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+     - Locks vs memory accesses.
+     - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) References.
+
+
+========================================
+ASSUMED MINIMUM EXECUTION ORDERING MODEL
+========================================
+
+It has to be assumed that the conceptual CPU is weakly-ordered in all respects
+but that it will maintain the appearance of program causality with respect to
+itself.  Some CPUs (such as i386 or x86_64) are more constrained than others
+(such as powerpc or frv), and so the worst case must be assumed.
+
+This means that it must be considered that the CPU will execute its instruction
+stream in any order it feels like - or even in parallel - provided that if an
+instruction in the stream depends on an earlier instruction, then that
+earlier instruction must be sufficiently complete[*] before the later
+instruction may proceed.
+
+ [*] Some instructions have more than one effect[**] and different instructions
+     may depend on different effects.
+
+ [**] Eg: changes to condition codes and registers; memory changes; barriers.
+
+A CPU may also discard any instruction sequence that ultimately winds up having
+no effect.  For example if two adjacent instructions both load an immediate
+value into the same register, the first may be discarded.
+
+
+Similarly, it has to be assumed that the compiler might reorder the instruction
+stream in any way it sees fit, again provided the appearance of causality is
+maintained.
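+
+As a purely illustrative sketch (the function and variable names are invented
+for this document), the two stores below have no dependency on one another, and
+so under this model the compiler or the CPU may perform them in either order
+without violating the appearance of causality:
+
+	int x, y;
+
+	void set_both(void)
+	{
+		x = 1;		/* independent of the store to y */
+		y = 2;		/* may be performed before the store to x */
+	}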
+
+
+==========================
+WHAT IS CONSIDERED MEMORY?
+==========================
+
+For the purpose of this specification what's meant by "memory" needs to be
+defined, and the division between CPU and memory needs to be marked out.
+
+
+CACHED INTERACTIONS
+-------------------
+
+As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU
+caches in the system.  Although any particular read or write may not actually
+appear outside of the CPU that issued it (the CPU may have been able to
+satisfy it from its own cache), it's still as if the memory access had taken
+place as far as the other CPUs are concerned since the cache coherency and
+ejection mechanisms will propagate the effects upon conflict.
+
+ [*] Also applies to CPU vs device when accessed through a cache.
+
+The system can be considered logically as:
+
+	    <--- CPU --->         :       <----------- Memory ----------->
+	                          :
+	+--------+    +--------+  :   +--------+    +-----------+
+	|        |    |        |  :   |        |    |           |    +---------+
+	|  CPU   |    | Memory |  :   | CPU    |    |           |    |	       |
+	|  Core  |--->| Access |----->| Cache  |<-->|           |    |	       |
+	|        |    | Queue  |  :   |        |    |           |--->| Memory  |
+	|        |    |        |  :   |        |    |           |    |	       |
+	+--------+    +--------+  :   +--------+    |           |    | 	       |
+	                          :                 | Cache     |    +---------+
+	                          :                 | Coherency |
+	                          :                 | Mechanism |    +---------+
+	+--------+    +--------+  :   +--------+    |           |    |	       |
+	|        |    |        |  :   |        |    |           |    |         |
+	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device  |
+	|  Core  |--->| Access |----->| Cache  |<-->|           |    | 	       |
+	|        |    | Queue  |  :   |        |    |           |    | 	       |
+	|        |    |        |  :   |        |    |           |    +---------+
+	+--------+    +--------+  :   +--------+    +-----------+
+	                          :
+	                          :
+
+The CPU core may execute instructions in any order it deems fit, provided the
+expected program causality appears to be maintained.  Some of the instructions
+generate load and store operations which then go into the memory access queue
+to be performed.  The core may place these in the queue in any order it wishes,
+and continue execution until it is forced to wait for an instruction to
+complete.
+
+What memory barriers are concerned with is controlling the order in which
+accesses cross from the CPU side of things to the memory side of things, and
+the order in which the effects are perceived to happen by the other observers
+in the system.
+
+
+UNCACHED INTERACTIONS
+---------------------
+
+Note that the above model does not show uncached memory or I/O accesses.  These
+proceed directly from the queue to the memory or the devices, bypassing any
+cache coherency:
+
+	    <--- CPU --->         :
+       	                          :		+-----+
+	+--------+    +--------+  :             |     |
+	|        |    |        |  :             |     |              +---------+
+	|  CPU   |    | Memory |  :             |     |              |	       |
+	|  Core  |--->| Access |--------------->|     |              |	       |
+	|        |    | Queue  |  :             |     |------------->| Memory  |
+	|        |    |        |  :             |     |              |	       |
+	+--------+    +--------+  :             |     |              | 	       |
+	                          :             |     |              +---------+
+	                          :             | Bus |
+	                          :             |     |              +---------+
+	+--------+    +--------+  :             |     |              |	       |
+	|        |    |        |  :             |     |              |         |
+	|  CPU   |    | Memory |  :             |     |<------------>| Device  |
+	|  Core  |--->| Access |--------------->|     |              | 	       |
+	|        |    | Queue  |  :             |     |              | 	       |
+	|        |    |        |  :             |     |              +---------+
+	+--------+    +--------+  :             |     |
+	                          :		+-----+
+	                          :
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified on
+either side of the barrier.  They request that the sequence of memory events
+generated appears to other components of the system as if the barrier is
+effective on that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+     ordered.  It's possible for the CPU to do out-of-order accesses _as long
+     as no-one is looking_, and then fix up the memory if someone else tries to
+     see what's going on (for instance a bus master device); what matters is
+     the _apparent_ order as far as other processors and devices are concerned;
+     and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+     and are not, for the most part, guaranteed to percolate down to other CPUs
+     in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is given the
+following piece of code:
+
+	a = *A;
+	*B = b;
+	c = *C;
+	d = *D;
+	*E = e;
+
+The programmer would then expect that the CPU will complete the memory access
+for each instruction before moving on to the next one, leading to a definite
+sequence of operations as seen by external observers in the system:
+
+	read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier.  With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+     execution progress, whereas writes can often be deferred without a
+     problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+     prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+     of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+     the memory or I/O hardware that can do batched accesses of adjacent
+     locations, thus cutting down on transaction setup costs (memory and PCI
+     devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+     mechanisms should alleviate this - once the write has actually hit the
+     cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+	read *A, read {*C,*D}, write *E, write *B
+
+	(By "read {*C,*D}" I mean a combined single read).
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory barrier.
+For instance with the following code:
+
+	X = *A;
+	*A = Y;
+	Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) X will hold the old value of *A; the read will never appear to happen
+     after the write, and so X will never end up being given the value that was
+     assigned to *A from Y; and
+
+ (*) Z will always be given the value that was assigned to *A from Y; the read
+     will never appear to happen before the write, and so Z will never end up
+     with the value that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y).
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly.  There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations.  However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32.  It
+presents to the CPU an "address register" and a bunch of "data registers".  The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+	*ADR = ctl_reg_3;
+	reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register if the CPU or the compiler thinks it is more efficient to defer the
+address write:
+
+	read *DATA, write *ADR
+
+in which case things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
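+
+For example, the address/data register access shown above would normally be
+written using the MMIO accessors; in this sketch, ADR and DATA are taken to be
+the ioremap()'d addresses of the two registers:
+
+	writel(ctl_reg_3, ADR);		/* select the internal register */
+	reg = readl(DATA);		/* then read it via the data register */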
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() should be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, the CPUs in the system may
+be working on the same set of data at the same time.  This can cause
+synchronisation problems, and the usual way of dealing with them is to use
+locks - but locks are quite expensive, and so it may be preferable to operate
+without the use of a lock if at all possible.  In such a case accesses that
+affect both CPUs may have to be carefully ordered to prevent error.
+
+Consider, for example, the R/W semaphore slow path.  There, a waiting process
+is queued on the semaphore by virtue of having a record on its stack linked to
+the semaphore's list of waiters:
+
+	struct rw_semaphore {
+		...
+		struct list_head waiters;
+	};
+
+	struct rwsem_waiter {
+		struct list_head list;
+		struct task_struct *task;
+	};
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to find out where the next waiter record is, clear
+the task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared.  Since the record is on the waiter's stack,
+this means that if the task pointer is cleared _before_ the next pointer in the
+list is read, another CPU might start processing the waiter and might clobber
+the waiter's stack before the up*() function has a chance to read the next
+pointer.
+
+	CPU 0				CPU 1
+	===============================	===============================
+					down_xxx()
+					Queue waiter
+					Sleep
+	up_yyy()
+	READ waiter->task;
+	WRITE waiter->task;
+	<preempt>
+					Resume processing
+					down_xxx() returns
+					call foo()
+					foo() clobbers *waiter
+	</preempt>
+	READ waiter->list.next;
+	--- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	smp_mb();
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system.  It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
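+
+Expressed as a sketch in C (the structures follow the declarations above; sem
+points to the rw_semaphore, and the helper calls are indicative of what the
+real rwsem implementation does rather than an exact copy of it):
+
+	struct rwsem_waiter *waiter;
+	struct task_struct *tsk;
+	struct list_head *next;
+
+	waiter = list_entry(sem->waiters.next, struct rwsem_waiter, list);
+	next = waiter->list.next;	/* READ waiter->list.next */
+	tsk = waiter->task;		/* READ waiter->task */
+	smp_mb();
+	waiter->task = NULL;		/* WRITE waiter->task; after this the
+					 * waiter may resume and reuse the stack
+					 * holding *waiter */
+	wake_up_process(tsk);		/* CALL wakeup */
+	put_task_struct(tsk);		/* RELEASE task */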
+
+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two parts of the driver may interfere with each other's attempts to control or
+access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+interrupt-disabled section in the driver.  Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+However, consider a driver that is talking to an ethernet card that sports an
+address register and a data register.  If the driver's core talks to the card
+with interrupts disabled and then the driver's interrupt handler is invoked:
+
+	DISABLE IRQ
+	writew(ctl_reg_3, ADDR);
+	writew(y, DATA);
+	ENABLE IRQ
+	<interrupt>
+	writew(ctl_reg_4, ADDR);
+	q = readw(DATA);
+	</interrupt>
+
+If ordering rules are sufficiently relaxed, the write to the data register
+might happen after the second write to the address register.
+
+
+It must be assumed that accesses done inside an interrupt disabled section may
+leak outside of it and may interleave with accesses performed in an interrupt
+and vice versa unless implicit or explicit barriers are used.
+
+Normally this won't be a problem because the I/O accesses done inside such
+sections will include synchronous read operations on strictly ordered I/O
+registers that form implicit I/O barriers. If this isn't sufficient then an
+mmiowb() may need to be used explicitly.
+
+
+A similar situation may occur between an interrupt routine and two routines
+running on separate CPUs that communicate with each other. If such a case is
+likely, then interrupt-disabling locks should be used to guarantee ordering.
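+
+A sketch of such a lock in use (card_lock is an invented lock, and ADDR, DATA
+and y are as in the earlier example):
+
+	/* card_lock is also taken by the interrupt handler */
+	static spinlock_t card_lock = SPIN_LOCK_UNLOCKED;
+	unsigned long flags;
+
+	spin_lock_irqsave(&card_lock, flags);
+	writew(ctl_reg_3, ADDR);
+	writew(y, DATA);
+	spin_unlock_irqrestore(&card_lock, flags);
+
+With the interrupt handler taking the same lock with spin_lock(), the two
+critical sections cannot run concurrently, whether the handler runs on this CPU
+or on another one (though see "LOCKS VS I/O ACCESSES" below for the additional
+mmiowb() that MMIO accesses may require).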
+
+
+=================================
+EXPLICIT KERNEL COMPILER BARRIERS
+=================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses on either side of it to the other
+side:
+
+	barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
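+
+As an illustration (flag is assumed to be a variable shared with an interrupt
+handler or another context), a polling loop can use barrier() to force the
+compiler to re-read the variable on every pass rather than caching it in a
+register:
+
+	while (!flag)
+		barrier();
+
+As noted above, this constrains only the compiler; the CPU remains free to
+reorder the access with respect to other memory accesses.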
+
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers.  Note, however, that the use of
+volatile has two negative consequences:
+
+ (1) it causes the generation of poorer code, and
+
+ (2) it can affect serialisation of events in code distant from the declaration
+     (consider a structure defined in a header file that has a volatile member
+     being accessed by the code in a source file).
+
+The Linux coding style therefore strongly favours the use of explicit barriers
+except in small and specific cases.  In general, volatile should be avoided.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+The Linux kernel has six basic CPU memory barriers:
+
+		MANDATORY	SMP CONDITIONAL
+		===============	===============
+	GENERAL	mb()		smp_mb()
+	READ	rmb()		smp_rmb()
+	WRITE	wmb()		smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
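+
+As a sketch of how the read and write variants pair up (data, flag and x are
+invented for this example, and data and flag are assumed to start out zero):
+
+	CPU 1				CPU 2
+	===============================	===============================
+	data = 42;
+	smp_wmb();
+	flag = 1;
+					while (flag == 0)
+						barrier();
+					smp_rmb();
+					x = data;	/* x will be 42 */
+
+The write barrier on CPU 1 keeps the store to data apparently ahead of the
+store to flag, and the read barrier on CPU 2 keeps the read of flag apparently
+ahead of the read of data; a general barrier would also do on either side, but
+is not required.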
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on kernels compiled for
+uniprocessor systems because it is assumed that a CPU will be apparently
+self-consistent, and will order overlapping accesses correctly with respect to
+itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system.  The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses.  CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+ [*] For information on bus mastering DMA and coherency please read:
+
+	Documentation/pci.txt
+	Documentation/DMA-mapping.txt
+	Documentation/DMA-API.txt
+
+Note that these are the _minimum_ guarantees.  Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert a memory barrier
+     (set_mb) or a write barrier (set_wmb) after the assignment.  They aren't
+     guaranteed to insert anything more than a compiler barrier in a UP
+     compilation.  (A usage sketch follows below.)
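+
+For instance, a sleeper might use set_mb() along the following lines
+(event_pending is an invented flag that the waker sets before waking the task
+up):
+
+	set_mb(current->state, TASK_INTERRUPTIBLE);
+	if (!event_pending)
+		schedule();
+
+The barrier makes sure that the change of task state is ordered against the
+read of the flag, pairing with the waker's write of that flag and subsequent
+wake-up.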
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the Linux kernel imply memory barriers, amongst
+which are locking and scheduling functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+All the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+     Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
+     but a LOCK followed by an UNLOCK is not.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O accesses, unless combined with interrupt
+disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events is acceptable:
+
+	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
+barriers are required in such a situation, they must be provided by some other
+means.
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (M) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;				*E = e;
+	LOCK M				LOCK Q
+	*B = b;				*F = f;
+	*C = c;				*G = g;
+	UNLOCK M			UNLOCK Q
+	*D = d;				*H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs. It might, for example, see:
+
+	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+	*B, *C or *D preceding LOCK M
+	*A, *B or *C following UNLOCK M
+	*F, *G or *H preceding LOCK Q
+	*E, *F or *G following UNLOCK Q
+
+
+However, if the following occurs:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;
+	LOCK M		[1]
+	*B = b;
+	*C = c;
+	UNLOCK M	[1]
+	*D = d;				*E = e;
+					LOCK M		[2]
+					*F = f;
+					*G = g;
+					UNLOCK M	[2]
+					*H = h;
+
+CPU #3 might see:
+
+	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
+		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+
+But assuming CPU #1 gets the lock first, it won't see any of:
+
+	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
+	*A, *B or *C following UNLOCK M [1]
+	*F, *G or *H preceding LOCK M [2]
+	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+
+
+LOCKS VS I/O ACCESSES
+---------------------
+
+Under certain circumstances (such as NUMA), I/O accesses within two spinlocked
+sections on two different CPUs may be seen as interleaved by the PCI bridge.
+
+For example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	writel(1, DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					spin_unlock(Q);
+
+may be seen by the PCI bridge as follows:
+
+	WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5
+
+which would probably break.
+
+What is necessary here is to insert an mmiowb() before dropping the spinlock,
+for example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	writel(1, DATA);
+	mmiowb();
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					mmiowb();
+					spin_unlock(Q);
+
+this will ensure that the two writes issued on CPU #1 appear at the PCI bridge
+before either of the writes issued on CPU #2.
+
+
+Furthermore, following a write by a read to the same device is okay, because
+the read forces the write to complete before the read is performed:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	a = readl(DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					b = readl(DATA);
+					spin_unlock(Q);
+
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+==========================
+KERNEL I/O BARRIER EFFECTS
+==========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+     These are intended to talk to I/O space rather than memory space, but
+     that's primarily a CPU-specific concept. The i386 and x86_64 processors do
+     indeed have special I/O space access cycles and instructions, but many
+     CPUs don't have such a concept.
+
+     The PCI bus, amongst others, defines an I/O space concept - which on CPUs
+     such as i386 and x86_64 readily maps to the CPU's concept of I/O space.
+     However, it may also be mapped as a virtual I/O space in the CPU's memory
+     map, particularly on those CPUs that don't support alternate I/O spaces.
+
+     Accesses to this space may be fully synchronous (as on i386), but
+     intermediary bridges (such as the PCI host bridge) may not fully honour
+     that.
+
+     They are guaranteed to be fully ordered with respect to each other.
+
+     They are not guaranteed to be fully ordered with respect to other types of
+     memory and I/O operation.
+
+ (*) readX(), writeX():
+
+     Whether these are guaranteed to be fully ordered and uncombined with
+     respect to each other on the issuing CPU depends on the characteristics
+     defined for the memory window through which they're accessing. On later
+     i386 architecture machines, for example, this is controlled by way of the
+     MTRR registers.
+
+     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
+     provided they're not accessing a prefetchable device.
+
+     However, intermediary hardware (such as a PCI bridge) may indulge in
+     deferral if it so wishes; to flush a write, a read from the same location
+     is preferred[*], but a read from the same device or from configuration
+     space should suffice for PCI.  (A write-flushing sketch follows this
+     list.)
+
+     [*] NOTE! attempting to read from the same location as was written to may
+     	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
+     	 example.
+
+     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+     force writes to be ordered.
+
+     Please refer to the PCI specification for more information on interactions
+     between PCI transactions.
+
+ (*) readX_relaxed()
+
+     These are similar to readX(), but are not guaranteed to be ordered in any
+     way. Be aware that there is no I/O read barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+     These will perform as appropriate for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
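+
+As a sketch of the write-flushing technique mentioned for readX() and writeX()
+above (dev->regs, CONTROL and STATUS are invented names; STATUS is assumed to
+be a harmlessly readable register of the same device):
+
+	writel(1, dev->regs + CONTROL);		/* start the operation */
+	(void) readl(dev->regs + STATUS);	/* force the posted write out to
+						 * the device before going on */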
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+	Chapter 7.1: Memory-Access Ordering
+	Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+	Chapter 7.1: Locked Atomic Operations
+	Chapter 7.2: Memory Ordering
+	Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+	Chapter 8: Memory Models
+	Appendix D: Formal Specification of the Memory Models
+	Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+	Chapter 5: Memory Accesses and Cacheability
+	Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+	Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+	Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+	Chapter 9: Memory
+	Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+	Chapter 8: Memory Models
+	Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+	Chapter 3.3: Hardware Considerations for Locks and
+			Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+	Chapter 13: Other Memory Models
+
+Intel Itanium Architecture Software Developer's Manual: Volume 1:
+	Section 2.6: Speculation
+	Section 4.4: Memory Access


