[PATCH] net: Remove branch in csum_shift()

Segher Boessenkool segher at kernel.crashing.org
Mon Feb 14 20:29:18 AEDT 2022


On Sun, Feb 13, 2022 at 05:47:52PM +0000, David Laight wrote:
> From: Segher Boessenkool 
> > Sent: 13 February 2022 09:16
> > In an ideal world the compiler could choose the optimal code sequences
> > everywhere.  But that won't ever happen, the search space is way too
> > big.  So compilers just use heuristics, not exhaustive search like
> > superopt does.  There is a middle way of course, something with directed
> > searches, and maybe in a few decades systems will be fast enough.  Until
> > then we will very often see code that is 10% slower and 30% bigger than
> > necessary.  A single insn more than needed isn't so bad :-)
> 
> But it can be a lot more than that.

Obviously, but that isn't the case here (for powerpc anyway).  My point
here is that you won't ever get ideal generated code from your high-
level code (which is what C is), certainly not for all architectures.
But it *is* possible to get something reasonably good.

> > Making things branch-free is very much worth it here though!
> 
> I tried to find out where 'here' is.

I meant "with this patch".

Unpredictable branches are very expensive.  They were already something
to worry about on single-issue in-order processors, but they are much
more expensive now, when a mispredict flushes a deep, wide pipeline.

> I can't get godbolt to generate anything like that object code
> for a call to csum_shift().
> 
> I can't actually get it to issue a rotate (x86 or ppc).

All powerpc rotate insns start with "rl", and no other insns do.  There
are also extended mnemonics to ease programming, like "rotlw", which is
just a form of rlwnm (rotlw d,s,n is rlwnm d,s,n,0,31).

Depending on what tool you use to disassemble binary code, it will show
extended mnemonics for some insns, or only the basic insns.

> I think it is only a single instruction because the compiler
> has saved 'offset & 1' much earlier, instead of testing
> 'offset & 1' just prior to the conditional.

rlwinm -- the "nm" suffix means "and with mask".  rlwnm d,s,n,mb,me
rotates register s left by the contents of register n bits, then
logically ands the result with the mask running from bit mb through bit
me (rlwinm is the same, except the rotate count n is an immediate).
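For illustration, here is a rough C model of those semantics (a sketch,
not kernel code; IBM bit numbering, so bit 0 is the most significant
bit, and only the usual non-wrapping masks with mb <= me are handled):

/* Model of rlwnm d,s,n,mb,me: rotate left, then and with a mask. */
static unsigned int rlwnm_model(unsigned int s, unsigned int n,
				unsigned int mb, unsigned int me)
{
	unsigned int r = n & 31;	/* rotate count, mod 32 */
	/* rotate left; "& 31" keeps the right shift well defined at r=0 */
	unsigned int rot = (s << r) | (s >> ((32 - r) & 31));
	/* ones from bit mb through bit me, IBM numbering (0 = MSB) */
	unsigned int mask = (0xffffffffu >> mb) & (0xffffffffu << (31 - me));

	return rot & mask;
}

With mb = 0 and me = 31 the mask is all ones and only the rotate
remains, which is exactly why rotlw d,s,n is just rlwnm d,s,n,0,31.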

> It certainly has a nasty habit of doing that pessimisation.

?  Not sure what you mean here.

> I also suspect that the addc/addze pair could be removed
> by passing the old checksum into csum_partial.

Maybe?  Does it not have to return a reduced result here?
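If a reduced result is not needed at that point, the suggestion would
presumably look something like this (a hypothetical call site;
csum_partial() and csum_add() are the kernel's, from
include/net/checksum.h, and csum_add() is where the addc/addze pair
comes from on powerpc):

#include <net/checksum.h>

/* Hypothetical helper folding a buffer into a running checksum. */
static __wsum add_region(__wsum sum, const void *buf, int len)
{
	/* Two steps: sum the region from zero, then fold it in with
	 * csum_add() -- the addc/addze pair:
	 *
	 *	return csum_add(sum, csum_partial(buf, len, 0));
	 *
	 * One step: csum_partial() already takes an initial sum, so
	 * accumulating on top of the old value makes the separate
	 * carry-propagating add unnecessary:
	 */
	return csum_partial(buf, len, sum);
}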


Segher

