[RFC PATCH] powerpc: Optimise barriers for fully ordered atomics

Sat Apr 13 19:48:34 AEST 2024

Nicholas Piggin <npiggin at gmail.com> writes:
> "Fully ordered" atomics (RMW that return a value) are said to have a
> full barrier before and after the atomic operation. This is implemented
> as:
>
>   hwsync
>   larx
>   ...
>   stcx.
>   bne-
>   hwsync
>
> This is slow on POWER processors because hwsync and stcx. require a
> round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
> the sequence:
>
>   lwsync
>   larx
>   ...
>   stcx.
>   bne-
>   isync
>
> lwsync prevents all reorderings except store/load reordering, so the
> larx could be execued ahead of a prior store becoming visible. However
> the stcx. is a store, so it is ordered by the lwsync against all prior
> access and if the value in memory had been modified since the larx, it
> will fail. So the point at which the larx executes is not a concern
> because the stcx. always verifies the memory was unchanged.
>
> The isync prevents subsequent instructions being executed before the
> stcx. executes, and stcx. is necessarily visible to the system after
> it executes, so there is no opportunity for it (or prior stores, thanks
> to lwsync) to become visible after a subsequent load or store.

AFAICS commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier
semantics") disagrees.

That was 2011, so maybe it's wrong or outdated?

Either way it would be good to have some litmus tests to back this up.

cheers

ps. isn't there a rule against sending barrier patches after midnight? ;)