[PATCH v4 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
Michael Ellerman
mpe at ellerman.id.au
Fri May 18 00:13:52 AEST 2018
wei.guo.simon at gmail.com writes:
> From: Simon Guo <wei.guo.simon at gmail.com>
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instructions, we need to think about
> the VMX penalty this brings: if the kernel uses VMX instructions, it needs
> to save/restore the current thread's VMX registers. There are 32 x 128-bit
> VMX registers in PPC, which means 32 x 16 = 512 bytes to load and store.
>
> The major concern regarding memcmp() performance in the kernel is KSM,
> which uses memcmp() frequently to merge identical pages. So it makes
> sense to take some measures/enhancements around KSM to see whether any
> improvement can be made here. In the mail below, Cyril Bur points out
> that memcmp() for KSM has a high probability of failing (mismatching)
> early, within the first few bytes:
> https://patchwork.ozlabs.org/patch/817322/#1773629
> This patch is a follow-up on that.
>
> Per some testing, KSM memcmp() tends to fail early, within the first 32
> bytes. More specifically:
> - 76% of cases fail/mismatch within the first 16 bytes;
> - 83% of cases fail/mismatch within the first 32 bytes;
> - 84% of cases fail/mismatch within the first 64 bytes;
> So 32 bytes looks like a better pre-check length than the alternatives.
>
> This patch adds a 32-byte pre-check before jumping into the VMX
> operations, to avoid the unnecessary VMX penalty. Testing shows a ~20%
> improvement in average memcmp() execution time with this patch.
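
Just to check I'm reading it right: in C terms the idea is roughly the
sketch below (the helper name is made up and this is only an
illustration, the real implementation is the assembly in the hunk quoted
below):

  /*
   * Compare the first 32 bytes with plain 64-bit loads and bail out on
   * the first mismatch, so the 32 x 128-bit VMX register save/restore
   * cost is only paid when the buffers really are identical this far.
   * The real assembly additionally computes the proper memcmp() return
   * value for the mismatching doubleword (.LcmpAB_lightweight).
   */
  static inline int precheck_32b_equal(const unsigned long *a,
                                       const unsigned long *b)
  {
          int i;

          for (i = 0; i < 4; i++)         /* 4 x 8 bytes = 32 bytes */
                  if (a[i] != b[i])
                          return 0;       /* mismatch, skip the VMX path */
          return 1;                       /* first 32 bytes are equal */
  }
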
>
> The detailed data and analysis are at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.
Thanks for digging into that, really great work.
I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.
So can we just call it the 'pre-check' or something, and always do it?
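
At the C level that would mean keeping the 32-byte loop but dropping the
#ifdef CONFIG_KSM / #endif around it, roughly like the sketch below
(again only an illustration reusing the hypothetical precheck_32b_equal()
from above; memcmp_scalar() and memcmp_vmx() don't exist under those
names, the real code branches to .LcmpAB_lightweight and ENTER_VMX_OPS):

  /* Callers only reach this path with a long length and both pointers
   * sharing the same 8-byte alignment offset, as in the hunk below. */
  static int memcmp_long(const void *s1, const void *s2, size_t n)
  {
          /* Always do the cheap 32-byte early-out, not only under KSM. */
          if (!precheck_32b_equal(s1, s2))
                  return memcmp_scalar(s1, s2, 32);         /* hypothetical */

          /* Only now is the VMX enter/exit cost worth paying. */
          return memcmp_vmx((const char *)s1 + 32,
                            (const char *)s2 + 32, n - 32); /* hypothetical */
  }
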
cheers
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
> /* Enter with src/dst addrs has the same offset with 8 bytes
> * align boundary
> */
> +
> +#ifdef CONFIG_KSM
> + /* KSM always compares at a page boundary, so it falls into
> + * .Lsameoffset_vmx_cmp.
> + *
> + * There is an optimization for KSM based on the following fact:
> + * KSM page memcmp() tends to fail early, within the first bytes.
> + * Statistics show that 76% of KSM memcmp() calls fail within the
> + * first 16 bytes, 83% within the first 32 bytes, and 84% within
> + * the first 64 bytes.
> + *
> + * Before applying VMX instructions, which incur a 32 x 128-bit VMX
> + * register save/restore penalty, compare the first 32 bytes so
> + * that we can catch ~80% of the failing cases cheaply.
> + */
> +
> + li r0,4
> + mtctr r0
> +.Lksm_32B_loop:
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + addi r3,r3,8
> + addi r4,r4,8
> + bne cr0,.LcmpAB_lightweight
> + addi r5,r5,-8
> + bdnz .Lksm_32B_loop
> +#endif
> +
> ENTER_VMX_OPS
> beq cr1,.Llong_novmx_cmp
>
> --
> 1.8.3.1