arch/powerpc/math-emu/mtfsf.c - incorrect mask?

Sat Feb 8 07:49:40 EST 2014

On Fri, 7 Feb 2014, Gabriel Paubert wrote:

> 	Hi Stephen,
> 
> On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > Gabriel Paubert <paubert at iram.es> wrote on 02/06/2014 07:26:37 PM:
> > 
> > > From: Gabriel Paubert <paubert at iram.es>
> > > To: Stephen N Chivers <schivers at csc.com.au>
> > > Cc: linuxppc-dev at lists.ozlabs.org, Chris Proctor <cproctor at csc.com.au>
> > > Date: 02/06/2014 07:26 PM
> > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > 
> > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:

> > > > 
> > > >                 mask = 0;
> > > >                 if (FM & (1 << 0))
> > > >                         mask |= 0x0000000f;
> > > >                 if (FM & (1 << 1))
> > > >                         mask |= 0x000000f0;
> > > >                 if (FM & (1 << 2))
> > > >                         mask |= 0x00000f00;
> > > >                 if (FM & (1 << 3))
> > > >                         mask |= 0x0000f000;
> > > >                 if (FM & (1 << 4))
> > > >                         mask |= 0x000f0000;
> > > >                 if (FM & (1 << 5))
> > > >                         mask |= 0x00f00000;
> > > >                 if (FM & (1 << 6))
> > > >                         mask |= 0x0f000000;
> > > >                 if (FM & (1 << 7))
> > > >                         mask |= 0x90000000;
> > > > 
> > > > With the above mask computation I get consistent results for 
> > > > both the MPC8548 and MPC7410 boards.
> > > > 
> > > > Am I missing something subtle?
> > > 
> > > No I think you are correct. This said, this code may probably be 
> > optimized 
> > > to eliminate a lot of the conditional branches. I think that:

If the compiler is enabled to generate isel instructions, it would not 
use a conditional branch for this code. (ignore the andi's values, 
this is an old compile)

c0037c2c <mtfsf>:
c0037c2c:       2c 03 00 00     cmpwi   r3,0
c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
c0037c3c:       70 66 00 01     andi.   r6,r3,1
c0037c40:       3d 00 90 00     lis     r8,-28672
c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
c0037c48:       70 6a 00 02     andi.   r10,r3,2
c0037c4c:       65 28 0f 00     oris    r8,r9,3840
c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c54:       70 66 00 04     andi.   r6,r3,4
c0037c58:       65 28 00 f0     oris    r8,r9,240
c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c60:       70 6a 00 08     andi.   r10,r3,8
c0037c64:       65 28 00 0f     oris    r8,r9,15
c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c6c:       70 66 00 10     andi.   r6,r3,16
c0037c70:       61 28 f0 00     ori     r8,r9,61440
c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c78:       70 6a 00 20     andi.   r10,r3,32
c0037c7c:       61 28 0f 00     ori     r8,r9,3840
c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
c0037c8c:       70 66 00 40     andi.   r6,r3,64
c0037c90:       61 28 00 f0     ori     r8,r9,240
c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
c0037c9c:       61 29 00 0f     ori     r9,r9,15
	...

However, your other solutions are better.

> > > 
> > > mask = (FM & 1);
> > > mask |= (FM << 3) & 0x10;
> > > mask |= (FM << 6) & 0x100;
> > > mask |= (FM << 9) & 0x1000;
> > > mask |= (FM << 12) & 0x10000;
> > > mask |= (FM << 15) & 0x100000;
> > > mask |= (FM << 18) & 0x1000000;
> > > mask |= (FM << 21) & 0x10000000;
> > > mask *= 15;
> > > 
> > > should do the job, in less code space and without a single branch.
> > > 
> > > Each one of the "mask |=" lines should be translated into an
> > > rlwinm instruction followed by an "or". Actually it should be possible
> > > to transform each of these lines into a single "rlwimi" instruction
> > > but I don't know how to coerce gcc to reach this level of optimization.
> > > 
> > > Another way of optomizing this could be:
> > > 
> > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > mask *= 15;
> > > 

> Ok, I finally edited my sources and test compiled the suggestions
> I gave. I'd say that method2 is the best overall indeed. You can
> actually save one more instruction by setting mask to all ones in 
> the case FM=0xff, but that's about all in this area.

My measurements show method1 to be smaller and faster than method2 due 
to the number of instructions needed to generate the constant masks in 
method2, but it may depend upon your compiler and hardware.  Both are 
faster than the original with isel.