Efficient memcpy()/memmove() for G2/G3 cores...
prodyut hazarika
prodyuth at gmail.com
Fri Sep 5 04:14:56 EST 2008
> I would be careful about adding overhead to memcpy. I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines). So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies. I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.
>
You are right. For small copies, it is not advisable.
What I did was put a small check at the beginning of memcpy: if the copy
is less than 5 cache lines, I don't do dcbt/dcbz at all. Thus we see a big jump
for copies larger than 5 cache lines, while the overhead on the short-copy path
is only two assembly instructions (a compare on the byte count followed by a
conditional branch).
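The dispatch described above can be sketched in C roughly as follows. This is an illustrative sketch, not the actual patch: the 32-byte line size (typical of G2/G3), the 5-line threshold, and the function names are assumptions, and the cache-line loop is stubbed out since dcbt/dcbz are PowerPC-specific instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed G2/G3 cache geometry; the real code would use the core's line size. */
#define CACHE_LINE         32
#define PREFETCH_THRESHOLD (5 * CACHE_LINE)

/* Stub for the optimized path. The real version would walk the buffers one
 * cache line at a time, issuing dcbt on the source (touch/prefetch) and
 * dcbz on the destination (zero-allocate the line without reading it). */
static void *memcpy_cached(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

void *memcpy_opt(void *dst, const void *src, size_t n)
{
    /* The only overhead for short copies: one compare and one branch. */
    if (n < PREFETCH_THRESHOLD)
        return memcpy(dst, src, n);     /* short copy: skip dcbt/dcbz */
    return memcpy_cached(dst, src, n);  /* long copy: prefetching path */
}
```

Both paths must of course produce identical results; only copies of at least 5 cache lines pay for (and benefit from) the cache-management instructions.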
One question - how can we quickly determine whether both the source and
destination address ranges fall in cacheable memory? The user can mmap a region
of memory as non-cacheable and then call memcpy with that address.
The optimized version must quickly determine that dcbt/dcbz must not
be used in this case.
I don't know of a good way to achieve this.
Regards,
Prodyut Hazarika
More information about the Linuxppc-dev
mailing list