[PATCH 1/2] powerpc: Add 64bit optimised memcmp

Fri Jan 9 22:01:29 AEDT 2015

On 08-01-2015 23:56, Anton Blanchard wrote:
> I noticed ksm spending quite a lot of time in memcmp on a large
> KVM box. The current memcmp loop is very unoptimised - byte at a
> time compares with no loop unrolling. We can do much much better.
>
> Optimise the loop in a few ways:
>
> - Unroll the byte at a time loop
>
> - For large (at least 32 byte) comparisons that are also 8 byte
>   aligned, use an unrolled modulo scheduled loop using 8 byte
>   loads. This is similar to our glibc memcmp.
>
> A simple microbenchmark testing 10000000 iterations of an 8192 byte
> memcmp was used to measure the performance:
>
> baseline:	29.93 s
>
> modified:	 1.70 s
>
> Just over 17x faster.
>
> Signed-off-by: Anton Blanchard <anton at samba.org>
>
Why not use the glibc implementations instead? All of them (ppc64, power4,
and power7) avoid byte-at-a-time compares for unaligned inputs, while
showing the same performance as this new implementation for aligned ones.
To give you an example, an 8192-byte compare with input alignments of 63/18
shows:

__memcmp_power7:  320 cycles
__memcmp_power4:  320 cycles
__memcmp_ppc64:   340 cycles
this memcmp:     3185 cycles
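
For illustration only, here is a minimal portable C sketch of the general
idea behind comparing 8 bytes per step instead of one. It is not the code
from the patch and not any of the glibc routines; the function name is
hypothetical, and it uses memcpy for the 8-byte loads so that it stays
well-defined for any input alignment. The real implementations are
hand-written assembly with unrolling and scheduling that this sketch does
not attempt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: an illustrative sketch, not the kernel or glibc code. */
static int memcmp_words_sketch(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a;
	const unsigned char *q = b;

	/* Fast path: compare 8 bytes per iteration while at least 8 remain. */
	while (n >= 8) {
		uint64_t x, y;

		/* memcpy keeps the loads well-defined for any alignment. */
		memcpy(&x, p, sizeof(x));
		memcpy(&y, q, sizeof(y));
		if (x != y)
			break;	/* the differing byte is in these 8 bytes */
		p += 8;
		q += 8;
		n -= 8;
	}

	/* Tail, or the word that differed: fall back to byte compares. */
	while (n--) {
		if (*p != *q)
			return (int)*p - (int)*q;
		p++;
		q++;
	}
	return 0;
}
```

Resolving a mismatch with the byte loop keeps the result independent of
endianness, at the cost of up to 8 extra byte compares on the final word.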