[PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

Tue Sep 26 09:59:46 AEST 2017

On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:
> Hi Cyril,
> On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
> > On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon at gmail.com wrote:
> > > From: Simon Guo <wei.guo.simon at gmail.com>
> > > 
> > > This patch adds VMX primitives to do memcmp() in case the compare size
> > > exceeds 4K bytes.
> > > 
> > 
> > Hi Simon,
> > 
> > Sorry I didn't see this sooner. I've actually been working on a kernel
> > version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
> > for POWER8), but unfortunately I've been distracted and it still isn't
> > done.
> 
> Thanks for syncing with me. Let's consolidate our efforts :)
> 
> I had a quick look at glibc commit dec4a7105e.
> It looks like the aligned-case comparison with VSX is launched without any
> lower bound on the size (rN), which means it will incur a VSX register
> load penalty even when the length is only 9 bytes.
> 

This was written for userspace which doesn't have to explicitly enable
VMX in order to use it - we need to be smarter in the kernel.

> It also optimizes the case where the src/dest addresses don't have the
> same offset relative to an 8-byte alignment boundary. I need to read it
> more closely.
> 
> > I wonder if we can consolidate our efforts here. One thing I did come
> > across in my testing is that for a memcmp() that will fail early (I
> > haven't narrowed down the optimal number yet), the cost of enabling
> > VMX actually turns out to be a performance regression. As such, I've
> > added a small check of the first 64 bytes at the start, before enabling
> > VMX, to ensure the penalty is worth taking.
> 
> Will there still be a penalty if the 65th byte differs?  
> 

I haven't benchmarked it exactly. My rationale for 64 bytes was that it
is the stride of the vectorised copy loop, so if we know we'll fail
before even completing one iteration of the vectorised loop, there isn't
any point using the vector regs.
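
A rough userspace sketch of that pre-check idea, not the actual patch:
compare the first 64 bytes with scalar code and only fall through to the
vector path if they match. The names memcmp_precheck_sketch(),
vector_compare_stub() and PRECHECK_BYTES are made up for illustration, and
the stub just calls memcmp() where the kernel would enable VMX and run the
vector loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed to match the stride of the vector loop (64 bytes). */
#define PRECHECK_BYTES 64

/*
 * Hypothetical stand-in for the VMX path. In the kernel this is where
 * the vector unit would be enabled and the vector loop would run.
 */
static int vector_compare_stub(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n);
}

/*
 * Scalar check of the first PRECHECK_BYTES bytes. The vector path is
 * only taken when that prefix is identical and bytes remain.
 * *used_vector reports which path produced the result.
 */
static int memcmp_precheck_sketch(const void *a, const void *b, size_t n,
				  int *used_vector)
{
	const unsigned char *pa = a, *pb = b;
	size_t head = n < PRECHECK_BYTES ? n : PRECHECK_BYTES;
	int ret;

	assert(a && b && used_vector);
	*used_vector = 0;

	/* Cheap path: an early mismatch never pays the vector setup cost. */
	ret = memcmp(pa, pb, head);
	if (ret || head == n)
		return ret;

	/* The prefix matched, so the setup cost is more likely worth it. */
	*used_vector = 1;
	return vector_compare_stub(pa + head, pb + head, n - head);
}
```

With this shape, a difference in the 65th byte (index 64) does reach the
vector path, so that case still pays the full enable cost.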

> > 
> > Also, you should consider doing 4K and greater: KSM (Kernel Samepage
> > Merging) uses PAGE_SIZE, which can be as small as 4K.
> 
> Currently VMX will only be applied when the size exceeds 4K. Are you
> suggesting a bigger threshold than 4K?
> 

Equal to or greater than 4K, so that KSM will benefit.
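
To make the boundary concrete, a tiny sketch of the two conditions. The
names VMX_THRESHOLD, use_vmx_exclusive() and use_vmx_inclusive() are made
up for illustration and are not from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cutoff: the 4K discussed in this thread. */
#define VMX_THRESHOLD 4096

/* v2 as described: only sizes strictly above 4K take the VMX path. */
static bool use_vmx_exclusive(size_t n)
{
	return n > VMX_THRESHOLD;
}

/* Suggested: 4K and above, so a 4K PAGE_SIZE compare also qualifies. */
static bool use_vmx_inclusive(size_t n)
{
	return n >= VMX_THRESHOLD;
}
```

A page-sized compare with a 4K PAGE_SIZE passes n == 4096, which only the
second condition accepts.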

> We can sync more offline for v3.
> 
> Thanks,
> - Simon