[PATCH] powerpc: tiny memcpy_(to|from)io optimisation

Wed Jun 3 08:51:25 EST 2009

On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:

> 
> which drops the r1 accesses, but still produces the sub-optimal loop.
> Is this a gcc regression, or did I miss something here?  Probably the
> only bullet-proof way is to write some core loops in assembly... :-/

Well, gcc may be right here. What you call the "optimal" loop uses the
lwzu instruction. An interesting thing about this instruction is that
it updates two GPRs at completion: the load target and the base
register, which receives the updated address (I'm ignoring the load
multiple and string instructions on purpose here).

Now, quite a few simple implementations don't have two write ports to
the GPR file, nor the logic to handle hazards properly with two GPRs
being updated... which means the instruction is very likely to take a
very inefficient path through the pipeline. On server processors, I'm
pretty sure it's just cracked into a load and an add anyway.

So I wouldn't be surprised if the loop variant with the separate add
ends up being the more efficient one on most implementations around.
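
To make the comparison concrete, here is a rough C sketch of the two loop
shapes being discussed. The function names are made up for illustration,
the code copies plain memory rather than going through the I/O accessors,
and which instructions gcc actually emits for either form (lwzu/stwu, or
a load, a store and a separate add) depends on the compiler version and
flags, so treat it as an illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pointer-bumping copy. On PowerPC a compiler may
 * (but need not) fold the pointer increment into lwzu/stwu. */
void copy_words_update(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Hypothetical helper: indexed copy. The index is advanced by a separate
 * add, so each load only has to write back a single GPR. */
void copy_words_indexed(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions copy the same data; the only way to know which one wins
on a given core is to look at the generated assembly and time it there.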

Of course, the loop above could use some unrolling to put some distance
between the load and the store of the loaded value.
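
As a minimal sketch of what I mean by unrolling, assuming plain 32-bit
words and a made-up function name: issue a group of loads first, then the
matching stores, so that no store has to wait directly on the load that
feeds it. The unroll factor of four is arbitrary, and the compiler is
free to reschedule all of this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n words, four at a time. All four loads are
 * written before the four stores to put distance between each load and
 * the store that consumes its value. */
void copy_words_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: runs while at least four words remain. */
    for (; n - i >= 4; i += 4) {
        uint32_t a = src[i];
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        uint32_t d = src[i + 3];

        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail: copy the remaining zero to three words one by one. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```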

Cheers,
Ben.