[PATCH 2/3] powerpc: POWER7 optimised memcpy using VMX

David Laight David.Laight at ACULAB.COM
Fri Jun 17 19:02:55 EST 2011


 
> On Fri, Jun 17, 2011 at 02:54:00PM +1000, Anton Blanchard wrote:
> > Implement a POWER7 optimised memcpy using VMX. For large aligned
> > copies this new loop is over 10% faster and for large unaligned
> > copies it is over 200% faster.
...

> BTW: do you have any statistics on the size distribution
> of memcpy memcpy_to_from_usr?
> 
> My gut feeling is that the intermediate case is the most
> important, and the short case the less critical (drowned
> in overhead's noise) but that's the kind of things on which
> I've often been wrong.

My feeling is certainly that the code is too big, and that the
'cold cache' case, and possibly the effect of enlarging the
working set (ie displacing other code from the caches), may
be significant in real life.

For memcpy() the 'short' case will happen surprisingly often;
I suspect the fixed costs of the short path may dominate some
real workloads.

I'm not sure the speed of misaligned copies matters enough
to take the hit of the alignment test!
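
(To illustrate what I mean - a rough sketch, not the patch's code,
and the helper names are made up: a front end that tests alignment
before picking a path makes every caller pay for that test, even
tiny copies that are already aligned.)

#include <stddef.h>
#include <stdint.h>

/* Illustration only: hypothetical helpers, not the POWER7 patch. */
void *memcpy_aligned(void *dst, const void *src, size_t len);
void *memcpy_unaligned(void *dst, const void *src, size_t len);

void *memcpy_dispatch(void *dst, const void *src, size_t len)
{
	/* Every call, however short, pays for this alignment test
	 * before a single byte is copied. */
	if (((uintptr_t)dst | (uintptr_t)src) & (sizeof(long) - 1))
		return memcpy_unaligned(dst, src, len);
	return memcpy_aligned(dst, src, len);
}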

Of course, I don't actually remember doing any instrumentation
of this, but I have changed an i386/amd64 memcpy (not the
linux/glibc one) to avoid the 'rep movsb' used for the trailing
bytes (by copying the last 'word' first) - the setup cost for
'rep movsb' is over 40 clocks on a netburst P4!
(It is possible to get amd64 to copy data as fast as 'rep movsd',
but the setup times are longer. And very recent Intel cpus contain
hardware acceleration for aligned and misaligned 'rep movsd' - so
trying anything clever isn't worthwhile there.)
I do realise this doesn't directly apply to ppc :-)
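
(For what it's worth, the 'copy the last word first' trick looks
roughly like this in C - a sketch from memory, not the actual code
I changed, assuming non-overlapping buffers:)

#include <stddef.h>
#include <string.h>

/* Sketch only: cover the trailing bytes with one (possibly
 * overlapping) word-sized copy up front, so the main loop can run
 * in whole words and no byte-at-a-time tail ('rep movsb') is needed.
 * Assumes dst and src do not overlap. */
static void *copy_tail_first(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	if (len < sizeof(long)) {
		for (i = 0; i < len; i++)
			d[i] = s[i];
		return dst;
	}

	/* Copy the last 'word' first; it may overlap the final word
	 * written by the loop below, which simply stores the same data. */
	memcpy(d + len - sizeof(long), s + len - sizeof(long), sizeof(long));

	/* Bulk copy in whole words; any remainder past the last full
	 * word was already covered above. */
	for (i = 0; i + sizeof(long) <= len; i += sizeof(long))
		memcpy(d + i, s + i, sizeof(long));

	return dst;
}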

	David



