Efficient memcpy()/memmove() for G2/G3 cores...
Josh Boyer
jwboyer at linux.vnet.ibm.com
Thu Sep 4 22:19:26 EST 2008
On Thu, Sep 04, 2008 at 02:05:16PM +0200, David Jander wrote:
>> I would be careful about adding overhead to memcpy. I found that in
>> the kernel, almost all calls to memcpy are for less than 128 bytes (1
>> cache line on most 64-bit machines). So, adding a lot of code to
>> detect cacheability and do prefetching is just going to slow down the
>> common case, which is short copies. I don't have statistics for glibc
>> but I wouldn't be surprised if most copies were short there also.
>
>Then please explain the following. This is a memcpy() speed test for
>different-sized blocks on an MPC5121e (with the DIU turned on). The first
>case is the glibc code without optimizations; the second case uses a
>16-register stride with dcbt/dcbz instructions, written in assembly
>language (see attachment).
>
>$ ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes : 3.48 Mbyte/s ( throughput: 6.96 Mbytes/s)
>50000 chunks of 16 bytes : 14.3 Mbyte/s ( throughput: 28.6 Mbytes/s)
>10000 chunks of 100 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
>5000 chunks of 256 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
>1000 chunks of 1000 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
>50 chunks of 16384 bytes : 14.2 Mbyte/s ( throughput: 28.4 Mbytes/s)
>1 chunks of 1048576 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
>
>$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes : 7.44 Mbyte/s ( throughput: 14.9 Mbytes/s)
>50000 chunks of 16 bytes : 13.1 Mbyte/s ( throughput: 26.2 Mbytes/s)
>10000 chunks of 100 bytes : 29.4 Mbyte/s ( throughput: 58.8 Mbytes/s)
>5000 chunks of 256 bytes : 90.2 Mbyte/s ( throughput: 180 Mbytes/s)
>1000 chunks of 1000 bytes : 77 Mbyte/s ( throughput: 154 Mbytes/s)
>50 chunks of 16384 bytes : 96.8 Mbyte/s ( throughput: 194 Mbytes/s)
>1 chunks of 1048576 bytes : 97.6 Mbyte/s ( throughput: 195 Mbytes/s)
>
>(I have edited the output of this tool so it fits into an e-mail without
>wrapped lines, for readability.)
>Please tell me how on earth there can be such a big difference???
>Note that on an MPC5200B this is TOTALLY different, even though both
>processors have an e300 core (different versions of it, though).
How can there be such a big difference in throughput? Well, your algorithm
seems better optimized than the glibc one for your testcase :).
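For readers who don't have the attachment handy, the inner loop of such a
routine might look roughly like this in C with inline assembly. This is an
illustrative sketch only, not David's attached code: the 32-byte line size,
the alignment requirements, and the function name are my assumptions, and
dcbz will trap on cache-inhibited memory (such as a framebuffer), so a real
routine has to know its buffers are cacheable.

#include <stddef.h>

/* Sketch only: assumes a 32-byte cache line (e300) and cacheable,
 * 32-byte-aligned src/dst.  dcbz on cache-inhibited memory raises an
 * alignment exception on these cores. */
void *memcpy_e300_sketch(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;

    while (len >= 32) {
        /* Establish the destination line in the cache (zeroed) without
         * fetching its old contents, and touch the next source line
         * ahead of time. */
        __asm__ volatile ("dcbz 0,%0" : : "r" (d) : "memory");
        __asm__ volatile ("dcbt 0,%0" : : "r" (s + 32));

        /* Copy one 32-byte line.  The real routine would keep a whole
         * stride of lines in registers (runs of lwz/stw). */
        for (int i = 0; i < 32; i += 4)
            *(unsigned int *)(d + i) = *(const unsigned int *)(s + i);

        d += 32;
        s += 32;
        len -= 32;
    }
    while (len--)                /* trailing bytes */
        *d++ = *s++;
    return dst;
}

The point of the dcbz is that the destination line gets allocated in the
cache without being read from memory first, so a cache-cold copy does one
less bus transaction per line.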
>> The other thing that I have found is that code that is optimal for
>> cache-cold copies is usually significantly slower than code tuned for
>> cache-hot copies, because the cache management instructions consume
>> cycles and don't help when the data is already in the cache.
>>
>> In other words, I don't think we should be tuning the glibc memcpy
>> based on tests of how fast it copies multiple megabytes.
>
>I don't just copy multiple megabytes! See the example above. I also do
>constant performance testing of different applications using LD_PRELOAD, to
>see the impact. Recently I even tried prboom (a free Doom port), to relive
>the good old days of PC benchmarking ;-)
>I have yet to come across a test that performs worse with this
>optimization (on an MPC5121e, that is).
>
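For anyone who wants to repeat this kind of experiment, the LD_PRELOAD
override itself can be very small. A minimal sketch (this is not the actual
libmemcpye300dj.so; the file and library names here are made up):

#include <stddef.h>

/* Build:  gcc -O2 -shared -fPIC -o libmemcpy_shim.so memcpy_shim.c
 * Run:    LD_PRELOAD=./libmemcpy_shim.so ./memcpyspeed
 * Exporting a symbol named memcpy from a shared object interposes on
 * glibc's memcpy for every dynamically linked call in the process. */
void *memcpy(void *dst, const void *src, size_t len)
{
    /* Swap this plain byte loop for the optimized routine under test. */
    char *d = dst;
    const char *s = src;
    while (len--)
        *d++ = *s++;
    return dst;
}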
>> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
>> larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
>
>At least on the MPC5121e, you really, really need it!
>
>> processors (POWER4/5/6) because the hardware prefetching and
>> write-combining mean that dcbt/dcbz don't help and just slow things
>> down.
>
>That's explainable.
>What's not explainable are the results I am getting on the MPC5121e.
>Please, could someone tell me what I am doing wrong? (I must be doing
>something wrong, I'm almost sure.)
I don't think you're doing anything wrong exactly. But it seems that
your testcase just sits there and copies data with memcpy in varying
sizes and amounts. That's not exactly a real-world use case, is it?
I think what Paul was saying is that during the runtime of a normal
program (the kernel or userspace), most memcpy operations will be small.
They will also be scattered among code that does _other_ things besides
memcpy. So he's concerned about the overhead of an implementation that
sets up the cache just to do a single 32-byte memcpy.
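Something like the following dispatch is presumably what both concerns point
toward: keep the common short-copy path free of cache-management overhead,
and only pay the setup cost where it can amortize. This is a sketch of the
trade-off, not a proposed implementation; the 128-byte cutoff is an
assumption (taken from Paul's one-cache-line observation), and
memcpy_e300_sketch() is the hypothetical routine from the earlier sketch.

#include <stddef.h>

void *memcpy_e300_sketch(void *dst, const void *src, size_t len);

#define CACHE_COPY_THRESHOLD 128   /* assumed cutoff; tune per core */

void *memcpy_dispatch(void *dst, const void *src, size_t len)
{
    if (len < CACHE_COPY_THRESHOLD) {
        /* Short copy: no dcbt/dcbz, minimal branching, so the common
         * case stays cheap. */
        char *d = dst;
        const char *s = src;
        while (len--)
            *d++ = *s++;
        return dst;
    }
    /* Large copy: the cache-management setup amortizes over many lines. */
    return memcpy_e300_sketch(dst, src, len);
}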
Of course, I could be totally wrong. I haven't had my coffee yet this
morning after all.
josh