Efficient memcpy()/memmove() for G2/G3 cores...

Thu Sep 4 12:04:58 EST 2008

prodyut hazarika writes:

> glibc memxxx for powerpc are horribly inefficient. For optimal performance,
> we should use the dcbt instruction to establish the source address in cache, and
> dcbz to establish the destination address in cache. We should do
> dcbt and dcbz such that the touches happen a line ahead of the actual copy.
> 
> The problem which I see is that dcbt and dcbz instructions don't work on
> non-cacheable memory (obviously!). But memxxx functions are used for both
> cached and non-cached memory. Thus this optimized memcpy should be smart enough
> to figure out that both source and destination addresses fall in
> cacheable space, and only then
> use the optimized dcbt/dcbz instructions.

I would be careful about adding overhead to memcpy.  I found that in
the kernel, almost all calls to memcpy are for less than 128 bytes (1
cache line on most 64-bit machines).  So, adding a lot of code to
detect cacheability and do prefetching is just going to slow down the
common case, which is short copies.  I don't have statistics for glibc
but I wouldn't be surprised if most copies were short there also.
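
To get the same kind of numbers for a userspace program, one could
count copy lengths with a small wrapper.  This is only a rough sketch:
counted_memcpy, size_bucket, copy_hist and short_copy_fraction are
made-up names, not anything in glibc or the kernel, and the bucket
boundaries are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical instrumentation, not part of glibc or the kernel. */
#define NBUCKETS 8   /* <16, <32, <64, <128, <256, <512, <1024, >=1024 */

static unsigned long copy_hist[NBUCKETS];

/* Map a length to a power-of-two bucket: 0 for n < 16, 1 for n < 32, ... */
static int size_bucket(size_t n)
{
    int b = 0;
    size_t limit = 16;

    while (b < NBUCKETS - 1 && n >= limit) {
        limit <<= 1;
        b++;
    }
    return b;
}

/* Wrapper: record the length, then defer to the real memcpy. */
static void *counted_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    copy_hist[size_bucket(n)]++;
    return memcpy(dst, src, n);
}

/* Fraction of recorded copies shorter than 128 bytes (buckets 0..3). */
static double short_copy_fraction(void)
{
    unsigned long total = 0, shorter = 0;

    for (int b = 0; b < NBUCKETS; b++) {
        total += copy_hist[b];
        if (b < 4)
            shorter += copy_hist[b];
    }
    return total ? (double)shorter / (double)total : 0.0;
}
```

If short_copy_fraction() comes out close to 1 for a real workload,
that workload is dominated by short copies and the fast path is what
matters.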

The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower on cache-hot copies
than code tuned for the cache-hot case, because the cache management
instructions consume cycles and don't help when the data is already
in cache.

In other words, I don't think we should be tuning the glibc memcpy
based on tests of how fast it copies multiple megabytes.

Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.
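
A minimal sketch of that split, in C rather than the assembly a real
memcpy would use.  Assumptions: copy_with_hints is a made-up name, the
32-byte line size and the 128-byte cutoff are illustrative guesses
that would need measuring, and __builtin_prefetch (a GCC/Clang
builtin) stands in for dcbt.  dcbz is left out on purpose, since it is
only safe when the destination is known to be cacheable and the line
size is known.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed values, for illustration only. */
#define LINE_SIZE        32u   /* cache line on 6xx/e300-class cores */
#define PREFETCH_CUTOFF  128u  /* below this, skip all cache hints */

static void *copy_with_hints(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Common case: short copy, one compare and no cache management. */
    if (n < PREFETCH_CUTOFF)
        return memcpy(dst, src, n);

    /* Long copy: touch the source one line ahead of each line copied. */
    while (n >= 2 * LINE_SIZE) {
        __builtin_prefetch(s + LINE_SIZE, 0 /* read access */);
        memcpy(d, s, LINE_SIZE);
        d += LINE_SIZE;
        s += LINE_SIZE;
        n -= LINE_SIZE;
    }

    /* Tail: fewer than two lines remain, nothing left to prefetch. */
    memcpy(d, s, n);
    return dst;
}
```

The point of the structure is that the short-copy path pays for a
single length compare and nothing else.  On the larger 64-bit
processors the hinted loop would simply be compiled out.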

Paul.