[PATCH 1/2] powerpc: Add 64bit optimised memcmp

Mon Jan 12 20:45:21 AEDT 2015

From: Joakim Tjernlund 
> On Mon, 2015-01-12 at 11:55 +1100, Anton Blanchard wrote:
> > Hi David,
> >
> > > The unrolled loop (deleted) looks excessive.
> > > On a modern cpu with multiple execution units you can usually
> > > manage to get the loop overhead to execute in parallel to the
> > > actual 'work'.
> > > So I suspect that a much simpler 'word at a time' loop will be almost
> > > as fast - especially in the case where the code isn't already in the
> > > cache and the compare is relatively short.
> >
> > I'm always keen to keep things as simple as possible, but your loop is
> > over 50% slower. Once the loop hits a steady state you are going to run
> > into front end issues with instruction fetch on POWER8.

Interesting. I'm not an expert on ppc scheduling, but on my old x86 Athlon 700 (I think
it was that one) a similar loop ran as fast as 'rep movsw'.

> Out of curiosity, does preincrement make any difference (or can gcc do that for you nowadays)?

It will only change register pressure slightly, and might allow any execution
delays to be filled - but that is very processor dependent.
Actually you probably want to do 'a += 2' somewhere to reduce the instruction count.
Similarly the end condition needs to compare one of the pointers.

Elsewhere (not ppc) I've used (++p)[-1] instead of *p++ to move the increment
before the load to get better scheduling.
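
To illustrate the two forms, here is a minimal sketch. The function names
are made up for the example, and whether the second form actually schedules
better depends on the compiler and the processor, so treat it as something
to measure rather than a rule:

```c
#include <stddef.h>

/* Post-increment form: load from p, then advance the pointer. */
unsigned long sum_postinc(const unsigned long *p, size_t n)
{
	unsigned long sum = 0;

	while (n--)
		sum += *p++;
	return sum;
}

/* Increment-first form: advance the pointer, then load the previous slot.
 * Reads the same elements as above; only the order of the increment and
 * the load differs. */
unsigned long sum_incfirst(const unsigned long *p, size_t n)
{
	unsigned long sum = 0;

	while (n--)
		sum += (++p)[-1];
	return sum;
}
```

Both functions read exactly the same words, so they return the same result.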

>          a1 = *a;
>          b1 = *b;
>          while {
>                  a2 = *++a;
>                  b2 = *++b;
>                  if (a1 != a2)
     That should have been a1 != b1
>                  	break;
>                  a1 = *++a;
>                  b1 = *++b;
>          } while (a2 != a1);
     and a2 != b2
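
With both corrections applied, the loop might look roughly like the sketch
below. This is not the code from the patch: the function name, the word
count argument and the tail handling are assumptions added so the example
is complete and never reads past the end of either buffer. It returns the
index of the first differing word, or n if all n words match.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first word where a[] and b[] differ,
 * or n if the first n words are equal. Loads run one word ahead of the
 * compares, as in the loop quoted above. */
size_t first_diff_word(const unsigned long *a, const unsigned long *b, size_t n)
{
	const unsigned long *start = a;
	const unsigned long *end = a + n;
	unsigned long a1, b1, a2, b2;

	if (n == 0)
		return 0;

	a1 = *a;
	b1 = *b;
	/* Only enter the loop while two more words can be loaded. */
	while (end - a > 2) {
		a2 = *++a;
		b2 = *++b;
		if (a1 != b1)		/* compares the word before a */
			return (size_t)(a - 1 - start);
		a1 = *++a;
		b1 = *++b;
		if (a2 != b2)		/* compares the word before a */
			return (size_t)(a - 1 - start);
	}

	/* Tail: a1/b1 hold the word at a, which is not yet compared. */
	if (a1 != b1)
		return (size_t)(a - start);
	if (end - a > 1 && a[1] != b[1])
		return (size_t)(a + 1 - start);
	return n;
}
```

A memcmp built on this would still need to handle unaligned heads, trailing
bytes and the byte order of the differing word.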

	David