[PATCH] powerpc: tiny memcpy_(to|from)io optimisation
Albrecht Dreß
albrecht.dress at arcor.de
Wed Jun 3 04:45:55 EST 2009
On 01.06.09 08:14, Joakim Tjernlund wrote:
> .. not even 4.2.2, which is fairly modern, will get it right. It breaks
> very easily, as gcc has never been any good at this type of
> optimization. Sometimes small changes will make gcc unhappy and it
> won't do the right optimization.
It's even worse... Looking at the assembly output of the simple
function
<snip>
void loop2(void * src, void * dst, int n)
{
    volatile uint32_t * _dst = (volatile uint32_t *) (dst - 4);
    volatile uint32_t * _src = (volatile uint32_t *) (src - 4);

    n >>= 2;
    do {
        *(++_dst) = *(++_src);
    } while (--n);
}
</snip>
gcc 4.0.1, as shipped with Apple's Developer Tools (on Tiger), with the
options "-O3 -mcpu=603e -mtune=603e" produces
<snip>
_loop2:
    srawi r5,r5,2
    mtctr r5
    addi r4,r4,-4
    addi r3,r3,-4
L11:
    lwzu r0,4(r3)
    stwu r0,4(r4)
    bdnz L11
    blr
</snip>
which looks perfect to me. However, with the same options, gcc 4.3.3 on
Ubuntu/PPC produces
<snip>
loop2:
    srawi 5,5,2
    stwu 1,-16(1)
    mtctr 5
    li 9,0
.L8:
    lwzx 0,3,9
    stwx 0,4,9
    addi 9,9,4
    bdnz .L8
    addi 1,1,16
    blr
</snip>
which wastes a register and an instruction in the loop core, and fiddles
around with the stack pointer for no good reason. gcc 4.4.0 produces
<snip>
loop2:
    srawi 5,5,2
    mtctr 5
    li 9,0
.L9:
    lwzx 0,3,9
    stwx 0,4,9
    addi 9,9,4
    bdnz .L9
    blr
</snip>
which drops the r1 accesses, but still produces the sub-optimal loop.
Is this a gcc regression, or did I miss something here? Probably the
only bullet-proof way is to write some core loops in assembly... :-/
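
For illustration, here is a minimal sketch of what such a hand-written
core loop might look like with GCC-style inline assembly. It mirrors the
lwzu/stwu/bdnz sequence that gcc 4.0.1 emitted above. The helper name
copy_words32 and its word-count parameter are hypothetical and not part
of the patch under discussion; the sketch assumes 4-byte aligned
pointers, and it falls back to plain C on non-PowerPC targets, so only
the fallback path is exercised when built elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch under discussion): copy
 * nwords 32-bit words from src to dst, one word per access.
 * Assumes both pointers are 4-byte aligned.
 */
static void copy_words32(volatile uint32_t *dst,
                         const volatile uint32_t *src,
                         size_t nwords)
{
    if (nwords == 0)
        return;                 /* bdnz with CTR == 0 would wrap around */

#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
    uint32_t tmp;

    __asm__ __volatile__(
        "   mtctr %3\n"         /* loop count into CTR */
        "   addi  %1,%1,-4\n"   /* bias src for pre-increment */
        "   addi  %2,%2,-4\n"   /* bias dst for pre-increment */
        "1: lwzu  %0,4(%1)\n"   /* tmp = *++src */
        "   stwu  %0,4(%2)\n"   /* *++dst = tmp */
        "   bdnz  1b\n"         /* decrement CTR, loop while non-zero */
        : "=&r" (tmp), "+b" (src), "+b" (dst)
        : "r" (nwords)
        : "ctr", "memory");
#else
    /* Portable fallback with the same semantics. */
    while (nwords--)
        *dst++ = *src++;
#endif
}
```

The "b" constraint keeps the pointers out of r0, which the update-form
instructions cannot use as a base register, and the early-clobber on tmp
keeps it from sharing a register with either pointer. Unlike loop2()
above, the sketch takes a word count rather than a byte count and
returns early when it is zero.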
Thanks, Albrecht.