arch/powerpc/math-emu/mtfsf.c - incorrect mask?
David Laight
David.Laight at ACULAB.COM
Mon Feb 10 22:17:38 EST 2014
> > However, your other solutions are better.
> >
> >
> > > > >
> > > > > mask = (FM & 1);
> > > > > mask |= (FM << 3) & 0x10;
> > > > > mask |= (FM << 6) & 0x100;
> > > > > mask |= (FM << 9) & 0x1000;
> > > > > mask |= (FM << 12) & 0x10000;
> > > > > mask |= (FM << 15) & 0x100000;
> > > > > mask |= (FM << 18) & 0x1000000;
> > > > > mask |= (FM << 21) & 0x10000000;
> > > > > mask *= 15;
> > > > >
> > > > > should do the job, in less code space and without a single branch.
...
> > > > > Another way of optimizing this could be:
> > > > >
> > > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > mask *= 15;
...
> Ok, if you have measured that method1 is faster than method2, let us go for it.
> I believe method2 would be faster if you had a large out-of-order execution
> window, because more parallelism can be extracted from it, but this is probably
> only true for high end cores, which do not need FPU emulation in the first place.
FWIW the second has a long dependency chain on 'mask': each statement needs the
result of the previous one. The first can execute the shift/and pairs in any
order and then merge the results.
So on most superscalar CPUs, or on ones with result delays for arithmetic, the
first is likely to be faster.
David
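
For reference, the two quoted snippets can be checked against each other. The following is only a sketch, not code from the patch under discussion: the function names (mask_ref, mask_method1, mask_method2, check_all_fm_values) are made up for illustration, and FM is assumed to be an 8-bit value held in a uint32_t. Reading the quoted code, bit i of FM (counting from the least significant bit) is expanded into nibble i of the mask; mask_ref spells that out with a loop so the two branch-free versions can be compared against it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: bit i of fm selects nibble i of the mask. */
uint32_t mask_ref(uint32_t fm)
{
	uint32_t mask = 0;
	int i;

	for (i = 0; i < 8; i++)
		if (fm & (1u << i))
			mask |= (uint32_t)0xf << (4 * i);
	return mask;
}

/* First quoted method: eight independent shift/and terms, merged with OR. */
uint32_t mask_method1(uint32_t fm)
{
	uint32_t mask;

	mask  = (fm & 1);
	mask |= (fm << 3) & 0x10;
	mask |= (fm << 6) & 0x100;
	mask |= (fm << 9) & 0x1000;
	mask |= (fm << 12) & 0x10000;
	mask |= (fm << 15) & 0x100000;
	mask |= (fm << 18) & 0x1000000;
	mask |= (fm << 21) & 0x10000000;
	return mask * 15;	/* turn each selected bit into a full nibble */
}

/* Second quoted method: each statement depends on the previous 'mask'. */
uint32_t mask_method2(uint32_t fm)
{
	uint32_t mask;

	mask = (fm & 0x0f) | ((fm << 12) & 0x000f0000);
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	return mask * 15;
}

/* Compare both methods with the reference for every 8-bit FM value. */
void check_all_fm_values(void)
{
	uint32_t fm;

	for (fm = 0; fm < 256; fm++) {
		assert(mask_method1(fm) == mask_ref(fm));
		assert(mask_method2(fm) == mask_ref(fm));
	}
}
```

Calling check_all_fm_values() exercises all 256 inputs; for example FM = 0x81 yields 0xf000000f with either method. This only shows that the two produce the same mask. It says nothing about which is faster on a given core, which still has to be measured.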
More information about the Linuxppc-dev
mailing list