[PATCH 1/2] powerpc: Add 64bit optimised memcmp

Fri Jan 9 21:06:59 AEDT 2015

From: Anton Blanchard
> I noticed ksm spending quite a lot of time in memcmp on a large
> KVM box. The current memcmp loop is very unoptimised - byte at a
> time compares with no loop unrolling. We can do much much better.
> 
> Optimise the loop in a few ways:
> 
> - Unroll the byte at a time loop
> 
> - For large (at least 32 byte) comparisons that are also 8 byte
>   aligned, use an unrolled modulo scheduled loop using 8 byte
>   loads. This is similar to our glibc memcmp.
> 
> A simple microbenchmark testing 10000000 iterations of an 8192 byte
> memcmp was used to measure the performance:
> 
> baseline:	29.93 s
> 
> modified:	 1.70 s
> 
> Just over 17x faster.

The unrolled loop (deleted) looks excessive.
On a modern CPU with multiple execution units you can usually
get the loop overhead to execute in parallel with the actual 'work'.
So I suspect that a much simpler 'word at a time' loop will be
almost as fast - especially when the code isn't already in the
cache and the compare is relatively short.
Try something based on:
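	/* a, b point to words; length check omitted */
	a1 = *a++;
	b1 = *b++;
	do {
		/* load the next pair before testing the current one */
		a2 = *a++;
		b2 = *b++;
		if (a1 != b1)
			break;
		a1 = *a++;
		b1 = *b++;
	} while (a2 == b2);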
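
For illustration, a fuller C sketch of the same idea with a length
check added. The names memcmp_words, load64 and cmp_bytes are made up
here and are not from the patch. It assumes nothing about alignment
(the loads go through memcpy()), and it drops to a byte loop to find
the differing byte so the sign of the result does not depend on
endianness. It has not been benchmarked, so treat it as a starting
point for measurement rather than a replacement for the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes without assuming alignment. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Byte loop: handles the tail and locates the first differing byte. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (; n; a++, b++, n--)
		if (*a != *b)
			return *a < *b ? -1 : 1;
	return 0;
}

/* Illustrative name only; returns <0, 0 or >0 like memcmp(). */
int memcmp_words(const void *s1, const void *s2, size_t n)
{
	const unsigned char *a = s1, *b = s2;

	if (n >= 8) {
		/* a1/b1 always hold the words at a/b while n >= 8. */
		uint64_t a1 = load64(a), b1 = load64(b);

		while (n >= 16) {
			/* Issue the next loads before testing the current pair. */
			uint64_t a2 = load64(a + 8), b2 = load64(b + 8);

			if (a1 != b1)
				break;
			a += 8;
			b += 8;
			n -= 8;
			a1 = a2;
			b1 = b2;
		}
		/* Current word matched: step over it, leaving under 8 bytes. */
		if (a1 == b1) {
			a += 8;
			b += 8;
			n -= 8;
		}
	}
	/* Either the short tail, or the word known to contain a difference. */
	return cmp_bytes(a, b, n);
}
```

Unlike the loop above this one copies a2/b2 into a1/b1 instead of
unrolling by two; whether that costs anything measurable would need
checking on the target CPU.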

	David