[PATCH 1/2] powerpc: string: implement optimized memset variants
Naveen N. Rao
naveen.n.rao at linux.vnet.ibm.com
Thu Apr 13 01:05:03 AEST 2017
Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:
> On 30 March 2017 at 12:46, Naveen N. Rao
> <naveen.n.rao at linux.vnet.ibm.com> wrote:
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
>> are the results:
>> generic: 0.245315533 seconds time elapsed ( +- 1.83% )
>> optimized: 0.169282701 seconds time elapsed ( +- 1.96% )
>
> I wonder what keeps gcc from producing efficient assembly code. Can
> you please post the disassembly of the C implementation of memset64?
> Just for information.
It's largely the same as what Christophe posted for powerpc32.
Others will have better insight, but as far as I can see, gcc only
seems to unroll the loop when -funroll-loops is passed (which we don't
use).
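
To illustrate the difference, here is a rough userspace sketch, not the
code from this patch and not actual compiler output. The function names
memset64_simple() and memset64_unrolled() are made up for this example,
and the first one only assumes the general shape of a plain C loop. The
second shows by hand the kind of 4x unrolling that -funroll-loops might
roughly produce:

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one 64-bit store per iteration (hypothetical name). */
static uint64_t *memset64_simple(uint64_t *s, uint64_t v, size_t n)
{
	uint64_t *p = s;

	while (n--)
		*p++ = v;
	return s;
}

/* Hand-unrolled by four, with a tail loop for the remainder
 * (hypothetical name; only an approximation of compiler unrolling). */
static uint64_t *memset64_unrolled(uint64_t *s, uint64_t v, size_t n)
{
	uint64_t *p = s;

	for (; n >= 4; n -= 4, p += 4) {
		p[0] = v;
		p[1] = v;
		p[2] = v;
		p[3] = v;
	}
	while (n--)
		*p++ = v;
	return s;
}
```

Both variants fill the buffer identically; the unrolled one just pays
the loop overhead (counter update and branch) once per four stores
instead of once per store.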
As an aside, it looks like gcc recently picked up an optimization in v7
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to
adjacent memory locations into fewer, wider, stores. It is enabled by
the -fstore-merging option and at the -O2 optimization level or higher
(and -Os)."
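
A minimal sketch of the kind of code that description is about, with a
made-up struct and function name. Note the quoted text speaks of
constant stores, so this example uses constants; it does not show what
the pass does for a loop that stores a run-time value:

```c
#include <stdint.h>

/* Hypothetical structure with four adjacent one-byte fields. */
struct hdr {
	uint8_t a, b, c, d;
};

/* Four constant stores to adjacent memory locations. A store merging
 * pass may combine these into fewer, wider stores (for example a
 * single 32-bit store); whether it does depends on the compiler
 * version, target and options. */
static void init_hdr(struct hdr *h)
{
	h->a = 1;
	h->b = 2;
	h->c = 3;
	h->d = 4;
}
```

Comparing the output of "gcc -O2 -S" with and without
-fno-store-merging on a file like this is one way to see whether the
stores were merged.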
- Naveen