Efficient memcpy()/memmove() for G2/G3 cores...
Paul Mackerras
paulus at samba.org
Thu Sep 4 12:04:58 EST 2008
prodyut hazarika writes:
> glibc memxxx for powerpc are horribly inefficient. For optimal performance,
> we should use the dcbt instruction to establish the source address in cache, and
> dcbz to establish the destination address in cache. We should do
> dcbt and dcbz such that the touches happen a line ahead of the actual copy.
>
> The problem which I see is that dcbt and dcbz instructions don't work on
> non-cacheable memory (obviously!). But memxxx functions are used for both
> cached and non-cached memory. Thus this optimized memcpy should be smart enough
> to figure out that both source and destination address fall in
> cacheable space, and only then use the optimized dcbt/dcbz instructions.

I would be careful about adding overhead to memcpy. I found that in
the kernel, almost all calls to memcpy are for less than 128 bytes (one
cache line on most 64-bit machines). So, adding a lot of code to
detect cacheability and do prefetching is just going to slow down the
common case, which is short copies. I don't have statistics for glibc
but I wouldn't be surprised if most copies were short there also.
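
To illustrate the point about short copies, here is a minimal sketch (not
taken from glibc or the kernel) of a copy routine that spends exactly one
compare before the short path. The names copy_dispatch, copy_large and
SHORT_COPY_MAX are hypothetical, and the 128-byte cutoff simply reuses the
cache-line figure above rather than a measured value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: one 128-byte cache line, the figure quoted above. */
#define SHORT_COPY_MAX 128

/* Hypothetical stand-in for a cache-aware bulk routine. Here it just
 * defers to memcpy so that the sketch stays runnable. */
static void *copy_large(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Short copies pay for one size compare and nothing else: no
 * cacheability detection and no prefetch setup. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < SHORT_COPY_MAX) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (n--)
            *d++ = *s++;
        return dst;
    }
    return copy_large(dst, src, n);
}
```

Any extra checks (cacheability, alignment for cache instructions) would
then live inside copy_large, where their cost is spread over a longer copy.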

The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower on cache-hot copies
than code tuned for the cache-hot case, because the cache management
instructions consume cycles and don't help when the data is already
in cache.

In other words, I don't think we should be tuning the glibc memcpy
based on tests of how fast it copies multiple megabytes.

Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.
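
For the larger-copy case, a rough sketch of a dcbt/dcbz loop might look
like the following. It is an illustration under stated assumptions, not a
tuned implementation: the 32-byte line size, the use of
`__powerpc__` without `__powerpc64__` to approximate "6xx/e300-class
core", and the names copy_cached, touch_src and zero_dst_line are all
assumptions. On other targets it falls back to the GCC/Clang
`__builtin_prefetch` hint and skips dcbz entirely. The caller is assumed to
have already established that both buffers are cacheable and do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 32  /* assumed cache-line size for 6xx/e300-class cores */

#if defined(__powerpc__) && !defined(__powerpc64__)
/* 32-bit PowerPC: issue the real cache instructions. */
static inline void touch_src(const void *p)
{
    __asm__ volatile("dcbt 0,%0" : : "r"(p));
}
static inline void zero_dst_line(void *p)
{
    /* Only valid on cacheable memory; zeroes the whole line at p. */
    __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
}
#else
/* Elsewhere: prefetch hint only; there is no portable dcbz equivalent. */
static inline void touch_src(const void *p) { __builtin_prefetch(p); }
static inline void zero_dst_line(void *p)   { (void)p; }
#endif

/* Copy n bytes between non-overlapping, cacheable buffers. */
static void *copy_cached(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte-copy until the destination is line-aligned, because dcbz
     * always operates on a whole cache line. */
    while (n && ((uintptr_t)d & (LINE - 1))) {
        *d++ = *s++;
        n--;
    }

    /* Whole lines: touch the source one line ahead, establish the
     * destination line without reading it from memory, then copy. */
    while (n >= LINE) {
        if (n >= 2 * LINE)
            touch_src(s + LINE);
        zero_dst_line(d);
        for (size_t i = 0; i < LINE; i++)
            d[i] = s[i];
        d += LINE;
        s += LINE;
        n -= LINE;
    }

    /* Tail shorter than one line. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

On POWER4/5/6 one would leave both helpers out altogether, for the
reasons given above.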

Paul.