[PATCH 1/2] powerpc: string: implement optimized memset variants

Thu Apr 13 01:05:03 AEST 2017

Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:
> On 30 March 2017 at 12:46, Naveen N. Rao
> <naveen.n.rao at linux.vnet.ibm.com> wrote:
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
>> are the results:
>> generic:        0.245315533 seconds time elapsed        ( +-  1.83% )
>> optimized:      0.169282701 seconds time elapsed        ( +-  1.96% )
> 
> Wondering what makes gcc not to produce efficient assembly code. Can
> you please post the disassembly of C implementation of memset64? Just
> for info purpose.

It's largely the same as what Christophe posted for powerpc32.

Others will have better insights, but afaics gcc seems to unroll the
loop only with -funroll-loops (which we don't use).
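
To make the difference concrete, here is a minimal sketch of a
one-store-per-iteration loop next to a hand-unrolled one. This is
neither the kernel's memset64() nor gcc's actual output; the names
memset64_generic and memset64_unrolled and the unroll factor of four
are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a plain C memset64(): one store per iteration. */
static uint64_t *memset64_generic(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *p = s;

	assert(s != NULL || count == 0);
	while (count--)
		*p++ = v;
	return s;
}

/*
 * Hand-unrolled by four: four stores per loop iteration, then a tail
 * loop for the remaining 0-3 elements. Roughly the shape a compiler
 * could produce when it does unroll; the factor is an assumption.
 */
static uint64_t *memset64_unrolled(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *p = s;

	assert(s != NULL || count == 0);
	for (; count >= 4; count -= 4, p += 4) {
		p[0] = v;
		p[1] = v;
		p[2] = v;
		p[3] = v;
	}
	while (count--)
		*p++ = v;
	return s;
}
```

Both functions write the same values; the unrolled one just spends
fewer branch and counter-update instructions per store.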

As an aside, it looks like gcc recently picked up an optimization in v7 
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to 
adjacent memory locations into fewer, wider, stores. It is enabled by 
the -fstore-merging option and at the -O2 optimization level or higher 
(and -Os)."
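
As a rough illustration of the kind of code that description covers,
the sketch below does four constant byte stores to adjacent fields,
which a compiler with store merging may combine into one wider store.
The struct and function names are made up for this example, and
whether the pass actually fires depends on the target and alignment.
The sketch does not show whether it helps a memset64() loop that
stores a run-time value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-byte header; the fields are adjacent in memory. */
struct hdr {
	uint8_t a, b, c, d;
};

/*
 * Four constant stores to adjacent locations. With store merging
 * enabled, the compiler is allowed to emit a single 32-bit store here.
 */
static void init_hdr(struct hdr *h)
{
	assert(h != NULL);
	h->a = 1;
	h->b = 2;
	h->c = 3;
	h->d = 4;
}
```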

- Naveen