[PATCH] powerpc: tiny memcpy_(to|from)io optimisation

Thu Jun 4 00:36:36 EST 2009

On Wed, 2009-06-03 at 08:51 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:
> 
> > 
> > which drops the r1 accesses, but still produces the sub-optimal loop.   
> > Is this a gcc regression, or did I miss something here?  Probably the  
> > only bullet-proof way is to write some core loops in assembly... :-/
> 
> Well, gcc may be right here. What you call the "optimal" loop uses the
> lwzu instruction. An interesting thing about this instruction is that
> it updates two GPRs at completion (I'm ignoring the load multiple and
> string instructions on purpose here).

> I wouldn't be surprised thus if the loop variant with the separate add
> ends up more efficient on most implementations around.

On an e300 core, using lwzu/stwu is about 20% faster, so at least one
core prefers that optimization.
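
For reference, the two loop shapes under discussion can be sketched in C
roughly as below. This is only an illustration, not the code from the
patch: the function names are made up, it copies plain memory rather than
going through the real I/O accessors, and which instructions gcc actually
emits for each form depends on the compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pre-increment form. Each access bumps the pointer
 * first and then uses it, which is the shape that can map onto the
 * PowerPC update-form instructions (lwzu/stwu). Whether the compiler
 * really picks them is not guaranteed. */
static void copy_words_update(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;

    /* Copy the first word directly so the pointers never step outside
     * the buffers, then pre-increment for the remaining n - 1 words. */
    *dst = *src;
    while (--n)
        *++dst = *++src;
}

/* Hypothetical helper: indexed form. The address update is kept separate
 * from the load and store, i.e. the "separate add" variant. */
static void copy_words_indexed(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions copy the same n words; the only difference is how the
address arithmetic is expressed, which is what the lwzu/stwu versus
separate-add comparison is about. Comparing the output of gcc -O2 -S for
the two on the target core is the easiest way to see what was generated.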