[PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
Scott Wood
scottwood at freescale.com
Tue Aug 18 07:08:14 AEST 2015
On Mon, 2015-08-17 at 19:16 +0800, Kevin Hao wrote:
> On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> > I tried a couple different benchmarks and didn't find a significant
> > difference, relative to the variability of the results running on the
> > same kernel. A patch that claims to "optimize a bit" as its main purpose
> > ought to show some results. :-)
>
> I tried to compare the execution time of these two code sequences with the
> following test module:
>
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/printk.h>
>
> static void test1(void)
> {
> int i;
> unsigned char lock, c;
> unsigned short cpu, s;
>
> for (i = 0; i < 100000; i++) {
> lock = 0;
> cpu = 1;
>
> asm volatile (
> "1: lbarx %0,0,%2\n\
> lhz %1,0(%3)\n\
> cmpdi %0,0\n\
> cmpdi cr1,%1,1\n\

This should be either "cmpdi cr1,%0,1" or crclr, not that it made much
difference. The test seemed to be rather sensitive to additional
instructions inserted at the beginning of the asm statement (especially
isync), so the initial instructions before the loop are probably pairing with
something outside the asm.

That said, it looks like this patch at least doesn't make things worse, and
it does convert cmpdi to a more readable crclr, so I guess I'll apply it even
though it doesn't show any measurable benefit when testing entire TLB misses
(much less actual applications).

I suspect the point where I misunderstood the core manual was where it listed
lbarx as having a repeat rate of 3 cycles. I probably assumed that this was
because of the presync, and thus that a subsequent unrelated load could
execute partially in parallel, but it looks like the repeat rate specifically
describes how long it is until the execution unit can accept any other
instruction.

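To make the two readings of "repeat rate" concrete, here is a toy sketch. It
is not a model of the real e6500 pipeline: it assumes a single in-order
execution unit, the function names are made up for illustration, and only the
3-cycle figure for lbarx comes from the manual; any other repeat rates passed
in are assumptions.

```c
#include <assert.h>

/*
 * Toy model, not the real e6500 pipeline: one in-order execution unit.
 * repeat_rate[i] is the number of cycles after instruction i issues
 * before the unit can accept the next instruction.
 */

/* Reading that matches the manual: the unit is busy for the whole repeat rate. */
static int issue_cycle_blocking(const int *repeat_rate, int n, int idx)
{
	int cycle = 0;
	int i;

	assert(idx >= 0 && idx < n);
	for (i = 0; i < idx; i++)
		cycle += repeat_rate[i];
	return cycle;
}

/* The misreading: an unrelated instruction can always issue one cycle later. */
static int issue_cycle_overlapped(int idx)
{
	assert(idx >= 0);
	return idx;
}
```

With repeat_rate = {3, 1} (lbarx followed by lhz, the 1 being an assumed
value), issue_cycle_blocking() puts the lhz at cycle 3, whereas
issue_cycle_overlapped() would put it at cycle 1.
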
-Scott