[PATCH tip/locking/core v4 1/6] powerpc: atomic: Make xchg and cmpxchg a full barrier

Fri Oct 16 01:50:44 AEDT 2015

On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?

I re-read this and it's even more confusing than I first thought.

> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o	Commit P0's branch (not taken).
> > 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

I misread this as a propagation of PO's conditional store. What actually
happens on arm64, is that the early conditional store can only succeed
once it is placed into the coherence order of the location which it is
updating (but note that this is subtly different from multi-copy
atomicity!).

So, given that the previous conditional store succeeded, the coherence
order on A must be either {0, 1, 2} or {0, 2, 1}.

If it's {0, 1, 2} (as required by your complete example), that means
P1's a=2 write "observes" the conditional store by P0, and therefore
(because the conditional store has release semantics), also observes
P0's x=1 write.

On the other hand, if P1's a=2 write propagates first and we have a
coherence order of {0, 2, 1}, then P0 must have r3=2, because an
exclusive load returning zero would have led to a failed conditional
store thanks to the intervening write by P1.

I find it pretty weird if PPC allows the conditional store to succeed in
this way, as I think that would break simple cases like two threads
incrementing a shared variable in parallel:

""
{
0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
}
P0                 | P1                 ;
lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
add r11,r1,r11     | add r11,r1,r11     ;
stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
bne Fail0          | bne Fail1          ;
mr r3,r1           | mr r3,r1           ;
Fail0:             | Fail1:             ;
exists
(0:r3=1 /\ a=1 /\ 1:r3=1)

Is also allowed by herd and forbidden by ppcmem, for example.

Will

[PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

[PATCH tip/locking/core v4 1/6] powerpc: atomic: Make xchg and cmpxchg a full barrier