Efficient memcpy()/memmove() for G2/G3 cores...
Joakim Tjernlund
joakim.tjernlund at transmode.se
Mon Sep 1 19:36:15 EST 2008
On Mon, 2008-09-01 at 09:23 +0200, David Jander wrote:
> On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
> >[...]
> > > The problem is: I have very little experience with powerpc assembly and
> > > only very limited time to dedicate to this and I am looking for others
> > > who have
> >
> > I improved the PowerPC memcpy and friends in uClibc a while ago. It does
> > basically the same as the kernel memcpy but without any cache
> > instructions. It is written in C, but in such a way that
> > optimal assembly is generated.
>
> Hmm, isn't that going to break on a different version of gcc?
Not break, but gcc might generate non-optimal code. However, the code
is laid out to make it easy for gcc to do the right thing.
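
To illustrate what "laid out to make it easy for gcc" can mean, here is a
minimal sketch. It is not the uClibc source: the name sketch_memcpy, the
4-byte word size and the unroll factor of four are assumptions made for
illustration only. The idea is to align the destination first, then read a
group of words into independent temporaries before storing them, so the
compiler can schedule the loads ahead of the stores.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/clang extension: a 32-bit word type that may alias any object. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical name; a sketch of the approach, not the uClibc routine. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy single bytes until the destination is word aligned. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Use word accesses only if the source is now word aligned too. */
    if (((uintptr_t)s & 3) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;

        /* Four words per iteration: all loads first, then all stores. */
        while (n >= 16) {
            word_t a = sw[0], b = sw[1], c = sw[2], e = sw[3];
            dw[0] = a;
            dw[1] = b;
            dw[2] = c;
            dw[3] = e;
            sw += 4;
            dw += 4;
            n -= 16;
        }
        /* Remaining whole words. */
        while (n >= 4) {
            *dw++ = *sw++;
            n -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Trailing bytes, or the whole copy if the source was misaligned. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whether a given gcc version turns this into the intended load/store sequence
has to be checked in the generated assembly (gcc -O2 -S), which is the
"non-optimal code" risk mentioned above. If such a routine is built under the
name memcpy, gcc's loop pattern recognition may replace a loop with a call to
memcpy itself; -fno-tree-loop-distribute-patterns turns that off.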
> I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c
> from subversion as uclibc-memcpy.c, removed the last line and did this:
>
> $ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c
>
> (should I use other compiler options?)
These are fine.
>
> Then I started my test program with LD_PRELOAD=...
>
> My test program only copies big chunks of aligned memory, so it will only test
> for maximum throughput (such as copying video frames). I will make a better
> one, to measure throughput on different sized blocks of aligned and unaligned
> memory, but first I want to find out why I can't seem to get even close to
> the expected RAM bandwidth (bursts occur at 1.6 Gbyte/s, sustained transfers
> might be able to reach 400 Mbyte/s in theory, taking into account the video
> controller eating almost half of it, I'd like to get somewhere close to 200).
>
> The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --> 22
> Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger
> strides of 16 registers load/store at a time.
> Note that this is copy performance; one-way throughput should be double these
> figures.
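
The kind of measurement described in the quoted text can be sketched roughly
as follows. This is not David's test program: the function name
copy_mbytes_per_sec and its parameters are hypothetical. It times repeated
memcpy() calls on one buffer pair and reports bytes copied per second. Each
copied byte is read once and written once, which is why the one-way figure is
taken as double the copy figure.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `len` bytes from src to dst `reps` times and
 * return the copy throughput in Mbyte/s (10^6 bytes per second).
 */
double copy_mbytes_per_sec(void *dst, const void *src, size_t len, int reps)
{
    struct timespec t0, t1;

    assert(len > 0 && reps > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier so the repeated copies are not optimised away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Guard against a zero interval on a coarse clock. */
    return secs > 0.0 ? ((double)len * (double)reps) / secs / 1e6 : 0.0;
}
```

Because len is not a compile-time constant, the memcpy() call should normally
go through the dynamic linker, so a replacement loaded with LD_PRELOAD is the
one being timed. The buffers should be much larger than the data cache,
otherwise the loop measures cache bandwidth instead of RAM bandwidth.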
Yeah, the code is trying to do a reasonable job without knowing which
microarchitecture it is running on. These could probably go into glibc
as new general purpose memxxx() routines. You will probably see
a big increase once dcbz is added to the copy/memset functions.
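
A rough sketch of the dcbz idea, under stated assumptions: 32-byte cache
lines (as on 603e/750-class cores), a destination that is cache-line aligned
and in cacheable memory, a length that is a multiple of the line size, and
buffers that do not overlap. The names CACHE_LINE, prep_dest_line and
copy_lines_dcbz are made up for this example. dcbz establishes the
destination line in the data cache as zeros, so the core does not first fetch
from RAM a line that is about to be overwritten completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32  /* assumption: 32-byte data cache lines */

/* Hypothetical helper: claim one destination line without reading RAM. */
static inline void prep_dest_line(void *line)
{
#if defined(__powerpc__) || defined(__PPC__)
    /* Zero the cache block containing `line` directly in the data cache. */
    __asm__ volatile("dcbz 0,%0" : : "r"(line) : "memory");
#else
    (void)line;  /* no dcbz on this target; the copy below is still correct */
#endif
}

/* Hypothetical name; copies n bytes in whole cache lines. */
void *copy_lines_dcbz(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* dcbz wipes a whole line, so only whole, aligned lines are allowed. */
    assert(((uintptr_t)dst % CACHE_LINE) == 0);
    assert(n % CACHE_LINE == 0);

    for (; n > 0; n -= CACHE_LINE, d += CACHE_LINE, s += CACHE_LINE) {
        prep_dest_line(d);         /* must come before the stores */
        memcpy(d, s, CACHE_LINE);  /* overwrite the zeroed line */
    }
    return dst;
}
```

A general-purpose memcpy would still need byte or word loops for the
unaligned head and tail, and must not use dcbz on a line it does not
overwrite entirely. As far as I know, dcbz on cache-inhibited or
write-through memory raises an alignment exception on these cores, so this
only suits ordinary cached RAM.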
Fire away :)
Jocke