Xorg on Fujitsu "Lime" with MPC5200b?

Fri Apr 16 19:25:30 EST 2010

On Thu, Apr 15, 2010 at 03:53:53PM +0200, Roman Fietze wrote:
> Hello Bill,
> 
> On Thursday 15 April 2010 15:01:59 Bill Gatliff wrote:
> 
> > Are you talking about this code here?
> > 
> >     void
> >     shadowUpdatePacked (ScreenPtr pScreen,
> >                         shadowBufPtr pBuf)
> >     {
> >     ...
> >                     while (i--)
> >                         *win++ = *sha++;
> 
> Yes. I added a routine like
> 
> /* Swap frame buffer bytes in 32 bit value.  */
> static __inline unsigned int
> fbbits_swap32(unsigned int __bsx)
> {
>     return ((((__bsx) & 0xff000000) >> 8) | (((__bsx) & 0x00ff0000) << 8) |
> 	    (((__bsx) & 0x0000ff00) >> 8) | (((__bsx) & 0x000000ff) << 8));
> }

I don't see the difference with:

	return (((__bsx & 0xff00ff00)>> 8) | ((__bsx & 0x00ff00ff) << 8));

for which the compiler (GCC 4.3.2) generates better code, as shown below.

In the first case:

.L3:
        lwzx 9,3,8
        rlwinm 0,9,8,0,7
        rlwinm 11,9,24,8,15
        rlwinm 10,9,24,24,31
        or 0,0,11
        or 0,0,10
        rlwinm 9,9,8,16,23
        or 0,0,9
        stwx 0,4,8
        addi 8,8,4
        bdnz .L3

in the second:

.L9:
        lwzx 0,3,11
        and 9,0,10
        and 0,0,8
        slwi 0,0,8
        srwi 9,9,8
        or 0,0,9
        stwx 0,4,11
        addi 11,11,4
        bdnz .L9

saving 2 instructions per iteration. AFAIR the MPC5200 is based on a
603e core, so the mask, shift and or instructions all have to go to the
single integer unit that can handle them (the second IU can only handle
add and cmp), so the minimum is 5 clocks/iteration versus 7. Even with
two IUs (or three), the second version has better latency.
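
For what it's worth, the claim that the two expressions compute the same
thing can be checked with a small test. This is only a sketch: the names
swap_four_masks, swap_two_masks, count_mismatches and check_swap_variants
are made up for this example (they are not in the Xorg tree), and the
test patterns are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Four-mask form, as in the quoted routine: one mask and shift per byte. */
static inline uint32_t swap_four_masks(uint32_t x)
{
    return ((x & 0xff000000u) >> 8) | ((x & 0x00ff0000u) << 8) |
           ((x & 0x0000ff00u) >> 8) | ((x & 0x000000ffu) << 8);
}

/* Two-mask form: both high bytes share one mask, both low bytes the other. */
static inline uint32_t swap_two_masks(uint32_t x)
{
    return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}

/* Returns the number of (arbitrary) test patterns on which the forms differ. */
static unsigned count_mismatches(void)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xffffffffu, 0x11223344u, 0xdeadbeefu,
        0x80000001u, 0x00ff00ffu, 0xff00ff00u, 0x01020304u
    };
    unsigned i, bad = 0;

    for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++)
        if (swap_four_masks(patterns[i]) != swap_two_masks(patterns[i]))
            bad++;
    return bad;
}

/* Hypothetical self-check: aborts if the two forms ever disagree. */
static void check_swap_variants(void)
{
    /* Bytes are swapped within each 16-bit half of the word. */
    assert(swap_two_masks(0x11223344u) == 0x22114433u);
    assert(count_mismatches() == 0);
}
```

Calling check_swap_variants() from a test main() should return silently.
The two forms agree because the masks 0xff000000 and 0x0000ff00 are
shifted by the same amount in the same direction, so they can be merged
into 0xff00ff00 before the shift; likewise for the other pair.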

	Gabriel