[RFC PATCH] powerpc: Optimise barriers for fully ordered atomics

Nicholas Piggin npiggin at gmail.com
Tue Apr 16 12:12:27 AEST 2024


On Sat Apr 13, 2024 at 7:48 PM AEST, Michael Ellerman wrote:
> Nicholas Piggin <npiggin at gmail.com> writes:
> > "Fully ordered" atomics (RMW that return a value) are said to have a
> > full barrier before and after the atomic operation. This is implemented
> > as:
> >
> >   hwsync
> >   larx
> >   ...
> >   stcx.
> >   bne-
> >   hwsync
> >
> > This is slow on POWER processors because hwsync and stcx. require a
> > round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
> > the sequence:
> >
> >   lwsync
> >   larx
> >   ...
> >   stcx.
> >   bne-
> >   isync
> >
> > lwsync prevents all reorderings except store/load reordering, so the
> > larx could be execued ahead of a prior store becoming visible. However
> > the stcx. is a store, so it is ordered by the lwsync against all prior
> > access and if the value in memory had been modified since the larx, it
> > will fail. So the point at which the larx executes is not a concern
> > because the stcx. always verifies the memory was unchanged.
> >
> > The isync prevents subsequent instructions being executed before the
> > stcx. executes, and stcx. is necessarily visible to the system after
> > it executes, so there is no opportunity for it (or prior stores, thanks
> > to lwsync) to become visible after a subsequent load or store.
>
> AFAICS commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier
> semantics") disagrees.
>
> That was 2011, so maybe it's wrong or outdated?

Hmm, thanks for the reference. I didn't know about that.

isync or ordering execution / completion of a load after a previous
plain store doesn't guarantee ordering, because stores drain from
queues and become visible some time after they complete.

Okay, but I was thinking a successful stcx. should have to be visible.
I guess that's not true if it broke on P7. Maybe it's possible in the
architecture to have a successful stcx. not having this property though,
I find it pretty hard to read this part of the architecture. Clearly
it has to be visible to other processors performing larx/stcx.,
otherwise the canonical stcx. ; bne- ; isync sequence can't provide
mutual exclusion.

I wonder if the P7 breakage was caused by to the "wormhole coherency"
thing that was changed ("fixed") in POWER9. I'll have to look into it
a bit more. The cache coherency guy I was talking to retired before
answering :/ I'll have to find another victim.

I should try to break a P8 too.

>
> Either way it would be good to have some litmus tests to back this up.
>
> cheers
>
> ps. isn't there a rule against sending barrier patches after midnight? ;)

There should be if not.

Thanks,
Nick


More information about the Linuxppc-dev mailing list