[PATCH v4 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()

Michael Ellerman mpe at ellerman.id.au
Fri May 18 00:13:52 AEST 2018


wei.guo.simon at gmail.com writes:
> From: Simon Guo <wei.guo.simon at gmail.com>
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instructions, we need to think about
> the VMX penalty that comes with them: if the kernel uses VMX instructions,
> it needs to save/restore the current thread's VMX registers. There are
> 32 x 128-bit VMX registers in PPC, which means 32 x 16 = 512 bytes to
> load and store.
>
> The major concern regarding memcmp() performance in the kernel is KSM,
> which uses memcmp() frequently to merge identical pages. So it makes
> sense to tune memcmp() for the KSM case and see whether any improvement
> can be made there. In the mail below, Cyril Bur points out that memcmp()
> calls from KSM have a high probability of failing (mismatching) within
> the first few bytes:
> 	https://patchwork.ozlabs.org/patch/817322/#1773629
> This patch is a follow-up on that observation.
>
> Testing shows that KSM memcmp() calls tend to fail within the first 32
> bytes. More specifically:
>     - 76% of cases fail/mismatch before 16 bytes;
>     - 83% of cases fail/mismatch before 32 bytes;
>     - 84% of cases fail/mismatch before 64 bytes;
> So 32 bytes looks like a better pre-check length than the alternatives.
>
> This patch adds a 32-byte pre-check before jumping into the VMX
> operations, to avoid the unnecessary VMX penalty. Testing shows a ~20%
> improvement in average memcmp() execution time with this patch.
>
> The detailed data and analysis are at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.

Thanks for digging into that, really great work.

I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.

So can we just call it the 'pre-check' or something, and always do it?
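
In C terms, the shape I have in mind is roughly the sketch below. This is
purely illustrative, not the actual implementation: vmx_memcmp_body() is a
made-up name standing in for the existing VMX compare path, and the entry
conditions (a long compare with both buffers at the same 8-byte alignment
offset) are assumed from the point in memcmp_64.S where the pre-check sits.

#include <stddef.h>
#include <stdint.h>

/* Made-up stand-in for the existing VMX compare path. */
int vmx_memcmp_body(const uint64_t *p1, const uint64_t *p2, size_t n);

/* Assumes n >= 32 and both buffers share the same 8-byte alignment. */
static int memcmp_long(const void *s1, const void *s2, size_t n)
{
	const uint64_t *p1 = s1, *p2 = s2;
	int i;

	/* Pre-check 4 x 8 = 32 bytes with plain loads; no VMX touched. */
	for (i = 0; i < 4; i++) {
		if (p1[i] != p2[i]) {
			/*
			 * Byte order matters for the sign of the result;
			 * the real asm handles that when it computes the
			 * return value.
			 */
			return p1[i] > p2[i] ? 1 : -1;
		}
	}

	/* Only now is the 512-byte VMX register save/restore worth paying. */
	return vmx_memcmp_body(p1 + 4, p2 + 4, n - 32);
}

ie. no CONFIG_KSM guard, the pre-check just always runs before
ENTER_VMX_OPS.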

cheers

> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
>  	/* Enter with src/dst addrs has the same offset with 8 bytes
>  	 * align boundary
>  	 */
> +
> +#ifdef CONFIG_KSM
> +	/* KSM always compares at a page boundary, so it falls into
> +	 * .Lsameoffset_vmx_cmp.
> +	 *
> +	 * There is an optimization for KSM based on the following fact:
> +	 * memcmp() on KSM pages tends to fail within the first few bytes.
> +	 * Statistics show that 76% of KSM memcmp() calls fail within the
> +	 * first 16 bytes, 83% within the first 32 bytes, and 84% within
> +	 * the first 64 bytes.
> +	 *
> +	 * Before using VMX instructions, which incur a 32 x 128-bit VMX
> +	 * register save/restore penalty, compare the first 32 bytes so
> +	 * that we catch the ~80% of cases which fail early.
> +	 */
> +
> +	li	r0,4			/* 4 iterations x 8 bytes = 32 bytes */
> +	mtctr	r0
> +.Lksm_32B_loop:
> +	LD	rA,0,r3			/* load 8 bytes from each buffer */
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	addi	r3,r3,8
> +	addi	r4,r4,8
> +	bne	cr0,.LcmpAB_lightweight	/* mismatch: compute return value */
> +	addi	r5,r5,-8		/* account for the 8 bytes compared */
> +	bdnz	.Lksm_32B_loop
> +#endif
> +
>  	ENTER_VMX_OPS
>  	beq     cr1,.Llong_novmx_cmp
>  
> -- 
> 1.8.3.1

