[RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants

Boqun Feng boqun.feng at gmail.com
Sun Sep 20 18:23:03 AEST 2015


On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> Hi Will,
> 
> On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > using "lwsync" and "isync".
> > 
> > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> 
> Actually the transitivity is still guaranteed here, I think ;-)
> 
> (Before I put my reasoning, I have to admit I just learned about the
> cumulativity recently, so my reasoning may be wrong. But the good thing
> is that we have our POWER experts in the CCed. In case I'm wrong, they
> could correct me.)
> 
> The thing is, on POWER, transitivity is implemented by a similar but
> slightly different concept, cumulativity, and as said in the link:
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> 
> """
> The ordering done by a memory barrier is said to be “cumulative” if it
> also orders storage accesses that are performed by processors and
> mechanisms other than P1, as follows.
> 
> *	A includes all applicable storage accesses by any such processor
> 	or mechanism that have been performed with respect to P1 before
> 	the memory barrier is created.
> 
> *	B includes all applicable storage accesses by any such processor
> 	or mechanism that are performed after a Load instruction
> 	executed by that processor or mechanism has returned the value
> 	stored by a store that is in B.
> """
> 
> Please note that the set B can be extended indefinitely without any
> other cumulative barrier.
> 
> So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
> in the RELEASE operation is cumulative, the transitivity is guaranteed.
> And lwsync is cumulative, so we are fine here.
> 
> 
> I also wrote a herd litmus to test this. Due to the tool's limitation, I
> use the xchg_release and xchg_acquire to test. And since herd doesn't

Hmm... I think I wanted to say atomic_xchg_release and
atomic_xchg_acquire here, sorry about that inaccuracy.

> support backward branching, some tricks are used here to work around:
> 

And I checked again: herd does support backward branching. The problem is
just that with backward branching there are a lot more states the tool
needs to check. It seems there are not too many in this case, though, so I
modified the litmus test a little, as follows:

PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
 P0           | P1                  | P2                  ;
 stw r1,0(r2) | lwz r1,0(r2)        | Fail2:              ;
              | lwsync              | lwarx r11, r10, r12 ;
              | Fail1:              | stwcx. r3, r10, r12 ;
              | lwarx  r11,r10,r12  | bne Fail2           ;
              | stwcx. r3,r10,r12   | isync               ;
              | bne Fail1           | lwz r1, 0(r2)       ; 

exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)

which is actually:

CPU 0			CPU 1				CPU 2
==============		=====================		=======================
{int x = 0, atomic_t a = ATOMIC_INIT(0)}
WRITE_ONCE(x,1);	t1 = READ_ONCE(x);		t2 = atomic_xchg_acquire(&a, 2);
			atomic_xchg_release(&a, 1);	t3 = READ_ONCE(x);

exists
(t1 == 1 && t2 == 1 && t3 == 0)


The result is still (it may take a while to get the result):

Test lwsync+isync-transitivity Allowed
States 11
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
Loop No
Witnesses
Positive: 0 Negative: 198
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 198

, which means the forbidden outcome is never observed, i.e. transitivity
is guaranteed.


And I think this deserves more analysis based on cumulativity:

Initially, for the lwsync on P1 (CPU 1), we have the sets A and B of
storage accesses on the same processor which lwsync orders:

A includes:
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t1 == 1, which means that before the lwsync, P1 perceives the STORE
of x on CPU 0, another storage access gets included in A:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t2 == 1, which means that on CPU 2, "lwarx  r11,r10,r12" in
atomic_xchg_acquire() reads the value stored by "stwcx. r3,r10,r12" in
atomic_xchg_release() on CPU 1, all storage accesses performed on CPU 2
after atomic_xchg_acquire() get included in set B:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12
	on CPU 2:
		lwz r1, 0(r2) // t3 = READ_ONCE(x);

Therefore the STORE of x on CPU 0 and the LOAD of x on CPU 2 cannot be
reordered in this case, which means transitivity is guaranteed.
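
For context, this is roughly how the patch places those barriers around
the _relaxed variants. This is a sketch reconstructed from the patch
description, not the exact macros in arch/powerpc (which use the
PPC_ACQUIRE_BARRIER/PPC_RELEASE_BARRIER definitions), so details may
differ:

```c
/* Sketch: acquire = relaxed op, then ctrl+isync (the bne of the
 * ll/sc loop inside the relaxed op supplies the control dependency). */
#define __atomic_op_acquire(op, args...)				\
({									\
	typeof(op##_relaxed(args)) __ret = op##_relaxed(args);		\
	__asm__ __volatile__("isync" : : : "memory");			\
	__ret;								\
})

/* Sketch: release = cumulative lwsync, then the relaxed op. */
#define __atomic_op_release(op, args...)				\
({									\
	__asm__ __volatile__("lwsync" : : : "memory");			\
	op##_relaxed(args);						\
})
```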

Regards,
Boqun