Optimised memset64/memset32 for powerpc
Matthew Wilcox
willy at infradead.org
Wed Mar 22 00:29:10 AEDT 2017
On Tue, Mar 21, 2017 at 01:23:36PM +0100, Christophe LEROY wrote:
> > It doesn't look free for you as you only store one register each time
> > around the loop in the 32-bit memset implementation:
> >
> > 1:	stwu	r4,4(r6)
> > 	bdnz	1b
> >
> > (wouldn't you get better performance on 32-bit powerpc by unrolling that
> > loop like you do on 64-bit?)
>
> In arch/powerpc/lib/copy_32.S, the implementation of memset() is optimised
> when the value to be set is zero. It makes use of the 'dcbz' instruction
> which zeroes a complete cache line.
>
> Not much effort has been put into optimising non-zero memset() because there
> are almost no such calls.
Yes, bzero() is much more common than setting an 8-bit pattern.
And setting an 8-bit pattern is almost certainly more common than setting
a 32 or 64 bit pattern.
> Unrolling the loop could help a bit on old powerpc32s that don't have branch
> units, but on those processors the main driver is the time spent to do the
> effective write to memory, and the operations necessary to unroll the loop
> are not worth the cycle added by the branch.
>
> On more modern powerpc32s, the branch unit implies that branches have a zero
> cost.
Fair enough. I'm just surprised it was worth unrolling the loop on
powerpc64 and not on powerpc32 -- see mem_64.S.
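To be concrete about what I mean by unrolling, here is a rough C sketch that stores four words per trip around the loop instead of one. It is illustrative only: memset32_unrolled is a hypothetical name, the factor of four is an arbitrary assumption, and this is not the code in mem_64.S.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's implementation: store v into
 * count consecutive 32-bit words, four stores per loop iteration.
 */
static void memset32_unrolled(uint32_t *p, uint32_t v, size_t count)
{
	size_t blocks = count / 4;	/* full groups of four words */
	size_t tail = count % 4;	/* leftover words after the groups */

	while (blocks--) {
		p[0] = v;
		p[1] = v;
		p[2] = v;
		p[3] = v;
		p += 4;
	}

	/* Finish the remaining 0-3 words one at a time. */
	while (tail--)
		*p++ = v;
}
```

Whether this beats the one-store loop depends on the core, which is exactly the point you make below about the branch unit.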
> A simple static inline C function would probably do the job, based on what I
> get below:
>
> void memset32(int *p, int v, unsigned int c)
> {
> 	int i;
>
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }
>
> void memset64(long long *p, long long v, unsigned int c)
> {
> 	int i;
>
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }
Well, those are the generic versions in the first patch:
http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b9776ac925199969bd5af4e994da776d461e7
so if those are good enough for you guys, there's no need for you to
do anything.
Thanks for your time!