[RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
Boqun Feng
boqun.feng at gmail.com
Sun Sep 20 18:23:03 AEST 2015
On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> Hi Will,
>
> On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > using "lwsync" and "isync".
> >
> > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
>
> Actually the transitivity is still guaranteed here, I think ;-)
>
> (Before I put my reasoning, I have to admit I just learned about the
> cumulativity recently, so my reasoning may be wrong. But the good thing
> is that we have our POWER experts in the CCed. In case I'm wrong, they
> could correct me.)
>
> The thing is, on POWER, transitivity is implemented by a similar but
> slightly different concept, cumulativity, and as said in the link:
>
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
>
> """
> The ordering done by a memory barrier is said to be “cumulative” if it
> also orders storage accesses that are performed by processors and
> mechanisms other than P1, as follows.
>
> * A includes all applicable storage accesses by any such processor
> or mechanism that have been performed with respect to P1 before
> the memory barrier is created.
>
> * B includes all applicable storage accesses by any such processor
> or mechanism that are performed after a Load instruction
> executed by that processor or mechanism has returned the value
> stored by a store that is in B.
> """
>
> Please note that the set B can be extended indefinitely without any
> other cumulative barrier.
>
> So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
> in the RELEASE operation is cumulative, transitivity is guaranteed.
> And lwsync is cumulative, so we are fine here.
>
>
> I also wrote a herd litmus to test this. Due to the tool's limitation, I
> use the xchg_release and xchg_acquire to test. And since herd doesn't
Hmm.. I think I wanted to say atomic_xchg_release and
atomic_xchg_acquire here, sorry about that inaccuracy..
> support backward branching, some tricks are used here to work around:
>
And I checked again: herd does support backward branching. The problem is
just that with backward branching there are many more states the tool
needs to check, but it seems there are not too many in this case, so I
modified the litmus a little as follows:
PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
P0 | P1 | P2 ;
stw r1,0(r2) | lwz r1,0(r2) | Fail2: ;
| lwsync | lwarx r11, r10, r12 ;
| Fail1: | stwcx. r3, r10, r12 ;
| lwarx r11,r10,r12 | bne Fail2 ;
| stwcx. r3,r10,r12 | isync ;
| bne Fail1 | lwz r1, 0(r2) ;
exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)
which is actually:
CPU 0 CPU 1 CPU 2
============== ===================== =======================
{int x = 0, atomic_t a = ATOMIC_INIT(0)}
WRITE_ONCE(x,1); t1 = READ_ONCE(x); t2 = atomic_xchg_acquire(&a, 2);
atomic_xchg_release(&a, 1); t3 = READ_ONCE(x);
exists
(t1 == 1 && t2 == 1 && t3 == 0)
The result is still (it may take a while to get the result):
Test lwsync+isync-transitivity Allowed
States 11
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
Loop No
Witnesses
Positive: 0 Negative: 198
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 198
, which means transitivity is guaranteed.
And I think it deserves more analysis based on cumulativity:
Initially, for the lwsync on P1 (CPU 1), we have the sets A and B of
storage accesses on the same processor which lwsync orders:
A includes:
on CPU 1:
lwz r1, 0(r2) // t1 = READ_ONCE(x);
B includes:
on CPU 1:
lwarx r11,r10,r12 // atomic_xchg_release();
stwcx. r3,r10,r12
and as t1 == 1, which means that before the lwsync, P1 perceives the STORE
of x on CPU 0, another storage access gets included in A:
A now includes:
on CPU 0:
stw r1, 0(r2) // WRITE_ONCE(x,1);
on CPU 1:
lwz r1, 0(r2) // t1 = READ_ONCE(x);
B now includes:
on CPU 1:
lwarx r11,r10,r12 // atomic_xchg_release();
stwcx. r3,r10,r12
and as t2 == 1, which means that on CPU 2, "lwarx r11,r10,r12" in
atomic_xchg_acquire() reads the value stored by "stwcx. r3,r10,r12" in
atomic_xchg_release() on CPU 1, all storage accesses performed after
atomic_xchg_acquire() get included in set B:
A now includes:
on CPU 0:
stw r1, 0(r2) // WRITE_ONCE(x,1);
on CPU 1:
lwz r1, 0(r2) // t1 = READ_ONCE(x);
B now includes:
on CPU 1:
lwarx r11,r10,r12 // atomic_xchg_release();
stwcx. r3,r10,r12
on CPU 2:
lwz r1, 0(r2) // t3 = READ_ONCE(x);
Therefore the STORE of x on CPU 0 and the LOAD of x on CPU 2 cannot be
reordered in this case, which means transitivity is guaranteed.
Regards,
Boqun