[PATCH] powerpc: tiny memcpy_(to|from)io optimisation
Kenneth Johansson
kenneth at southpole.se
Thu Jun 4 00:36:36 EST 2009
On Wed, 2009-06-03 at 08:51 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:
>
> >
> > which drops the r1 accesses, but still produces the sub-optimal loop.
> > Is this a gcc regression, or did I miss something here? Probably the
> > only bullet-proof way is to write some core loops in assembly... :-/
>
> Well, gcc may be right here. What you call the "optimal" loop uses the
> lwzu instruction. An interesting thing about this instruction is that
> it updates two GPRs at completion (I'm ignoring the load multiple and
> string instructions on purpose here).
> I wouldn't be surprised thus if the loop variant with the separate add
> ends up more efficient on most implementations around.
On an e300 core, the lwzu/stwu variant is about 20% faster, so at least
one core prefers that optimization.
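
For reference, a minimal sketch of the two source-level loop shapes being
compared. The function names are hypothetical and this is not the code from
the patch: it copies plain memory, not I/O space. Whether gcc actually emits
lwzu/stwu for the first shape, or a separate add for the second, depends on
the compiler version and flags, so the comments only describe what a
compiler might do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shape A: walk both pointers. A PowerPC compiler may (or may not)
 * fold the pointer bump into the access as lwzu/stwu. */
static void copy_words_ptr(uint32_t *dst, const uint32_t *src, size_t n)
{
	while (n--)
		*dst++ = *src++;
}

/* Shape B: fixed base pointers plus a separately incremented index,
 * i.e. an indexed access and an explicit add per iteration. */
static void copy_words_idx(uint32_t *dst, const uint32_t *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Both shapes must produce identical results; only timing may differ. */
static void check_copy_variants(void)
{
	const uint32_t src[4] = { 1, 2, 3, 4 };
	uint32_t a[4] = { 0 }, b[4] = { 0 };

	copy_words_ptr(a, src, 4);
	copy_words_idx(b, src, 4);
	assert(memcmp(a, src, sizeof(src)) == 0);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

Comparing the gcc -O2 -S output for both functions, and timing them on the
target, should show which form a given core prefers.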