[patch 2/2] powerpc: optimise smp_wmb

Thu May 22 10:30:54 EST 2008

On Wed, May 21, 2008 at 10:12:03PM +0200, Segher Boessenkool wrote:
> >From memory, I measured lwsync is 5 times faster than eieio on
> >a dual G5. This was on a simple microbenchmark that made use of
> >smp_wmb for store ordering, but it did not involve any IO access
> >(which presumably would disadvantage eieio further).
> 
> This is very much specific to your particular benchmark.
> 
> On the 970, there are two differences between lwsync and eieio:
> 
> 1) lwsync cannot issue before all previous loads are done; eieio
> does not have this restriction.
> 
> Then, they both fly through the execution core, which doesn't wait
> for the barrier insn to complete in the storage system.  In both
> cases, a barrier is inserted into both the L2 queues and the
> non-cacheable queues.  These barriers are both removed at the
> same time, that is, when both are the oldest in their queue and
> have done their thing.
> 
> 2) For eieio, the non-cacheable unit waits for all previous
> (non-cacheable) stores to complete, and then arbitrates for the
> bus and sends an EIEIO transaction.
> 
> Your benchmark doesn't do non-cacheable stores, so it would seem
> the five-time slowdown is caused by that bus arbitration (and the
> bus transaction).  Maybe your cacheable stores hit the bus as well,
> that would make this worse.  Your benchmark also doesn't see the
> negative effects from 1).
> 
> In "real" code, I expect 2) to be pretty much invisible (the store
> queues will never be completely filled up), but 1) shouldn't be very
> bad either.  So it's a wash.  But only a real benchmark will tell.

OK, interesting, thanks. Yes, the "benchmark" is not a good one, but
it verified for me that there is a difference there. That, combined
with IBM's documents saying lwsync is preferred for store/store
ordering, is my rationale for sending the patch. A real benchmark would
be nice, but it would probably be hard to notice any improvement.
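
For reference, the store/store pattern in question looks roughly like
the sketch below. It is plain C11, not kernel code and not the
microbenchmark itself: publish(), consume(), payload and ready are
hypothetical names, and the C11 fences are only stand-ins for
smp_wmb()/smp_rmb(). A release fence is somewhat stronger than a pure
store/store barrier, and as far as I know compilers usually emit lwsync
for it on powerpc, which is why it seemed a reasonable approximation.

```c
#include <stdatomic.h>

/* Hypothetical shared state: a data word and a "data is valid" flag. */
static int payload;
static atomic_int ready;

/* Writer side: the data store must be visible before the flag store. */
static void publish(int value)
{
    payload = value;                            /* store 1: the data */
    atomic_thread_fence(memory_order_release);  /* stand-in for smp_wmb() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed); /* store 2: the flag */
}

/* Reader side: pairs with the writer's barrier. */
static int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                       /* spin until the flag is set */
    atomic_thread_fence(memory_order_acquire);  /* stand-in for smp_rmb() */
    return payload;                             /* sees the published data */
}
```

In real use publish() would run on one CPU and consume() on another;
no IO or non-cacheable stores are involved, so only the cacheable
store ordering matters here.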

> >Given the G5 speedup, I'd be surprised if there is not an improvement
> >on POWER4 and 5 as well,
> 
> The 970 storage subsystem and the POWER4 one are very different.
> Or maybe these queues are just about the last thing that _is_
> identical, I dunno, there aren't public POWER4 docs for this ;-)
> 
> >although no idea about POWER6 or cell...
> 
> No idea about POWER6; for CBE, the backend works similarly to the
> 970 one.
> 
> Given that the architecture says to use lwsync for cases like this,
> it would be very surprising if it performed (much) worse than eieio,
> eh? ;-)  So I think your patch is a win; just wanted to clarify on
> your five-time slowdown number.

Sure, thanks!

Nick