[PATCH] net: Remove branch in csum_shift()
Christophe Leroy
christophe.leroy at csgroup.eu
Tue Mar 1 21:20:26 AEDT 2022
On 13/02/2022 at 18:47, David Laight wrote:
> From: Segher Boessenkool
>> Sent: 13 February 2022 09:16
> ....
>>
>>> What happens on x86-64?
>>>
>>> Trying to do the same in the x86 ipcsum code tended to make the code worse.
>>> (Although that test is for an odd length fragment and can just be removed.)
>>
>> In an ideal world the compiler could choose the optimal code sequences
>> everywhere. But that won't ever happen, the search space is way too
>> big. So compilers just use heuristics, not exhaustive search like
>> superopt does. There is a middle way of course, something with directed
>> searches, and maybe in a few decades systems will be fast enough. Until
>> then we will very often see code that is 10% slower and 30% bigger than
>> necessary. A single insn more than needed isn't so bad :-)
>
> But it can be a lot more than that.
>
>> Making things branch-free is very much worth it here though!
>
> I tried to find out where 'here' is.
>
> I can't get godbolt to generate anything like that object code
> for a call to csum_shift().
>
> I can't actually get it to issue a rotate (x86 or ppc).
>
> I think it is only a single instruction because the compiler
> has saved 'offset & 1' much earlier instead of testing
> 'offset & 1' just prior to the conditional.
> It certainly has a nasty habit of doing that pessimisation.
>
> So while it helps a specific call site it may be much
> worse in general.
>
The main user of csum_shift() is csum_and_copy_to_iter().
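For context, the C being compared is roughly the following (reconstructed from the source lines objdump -S interleaves below; csum_shift() lives in include/net/checksum.h). The old form:

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	if (offset & 1)
		return (__force __wsum)ror32((__force u32)sum, 8);
	return sum;
}

and the patched form, where the conditional is replaced by an unconditional rotate whose shift count (0 or 8) is computed from the offset parity:

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
}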
You can clearly see the difference in one of the instances below,
extracted from the output of objdump -S lib/iov_iter.o:
Without the patch:
sum = csum_shift(csstate->csum, csstate->off);
21a8: 92 e1 00 4c stw r23,76(r1)
21ac: 7c 77 1b 78 mr r23,r3
21b0: 93 01 00 50 stw r24,80(r1)
21b4: 7c b8 2b 78 mr r24,r5
21b8: 93 61 00 5c stw r27,92(r1)
21bc: 7c db 33 78 mr r27,r6
21c0: 93 81 00 60 stw r28,96(r1)
21c4: 81 05 00 04 lwz r8,4(r5)
21c8: 83 85 00 00 lwz r28,0(r5)
}
static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
/* rotate sum to align it with a 16b boundary */
if (offset & 1)
21cc: 71 09 00 01 andi. r9,r8,1 <== test whether offset is odd
21d0: 41 a2 00 08 beq 21d8 <== branch
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 ror32(__u32 word, unsigned int shift)
{
return (word >> (shift & 31)) | (word << ((-shift) & 31));
21d4: 57 9c c0 3e rotlwi r28,r28,24 <== rotate
21d8: 2b 8a 00 03 cmplwi cr7,r10,3
21dc: 41 9e 01 ec beq cr7,23c8 <csum_and_copy_to_iter+0x234>
With the patch:
sum = csum_shift(csstate->csum, csstate->off);
21a8: 92 c1 00 48 stw r22,72(r1)
if (unlikely(iov_iter_is_pipe(i)))
21ac: 28 08 00 03 cmplwi r8,3
21b0: 92 e1 00 4c stw r23,76(r1)
21b4: 7c 76 1b 78 mr r22,r3
21b8: 93 41 00 58 stw r26,88(r1)
21bc: 7c b7 2b 78 mr r23,r5
21c0: 93 81 00 60 stw r28,96(r1)
21c4: 7c da 33 78 mr r26,r6
sum = csum_shift(csstate->csum, csstate->off);
21c8: 80 e5 00 04 lwz r7,4(r5)
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << (shift & 31)) | (word >> ((-shift) & 31));
21cc: 81 25 00 00 lwz r9,0(r5)
}
static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
/* rotate sum to align it with a 16b boundary */
return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
21d0: 54 ea 1f 38 rlwinm r10,r7,3,28,28
21d4: 5d 3c 50 3e rotlw r28,r9,r10 <== rotate
if (unlikely(iov_iter_is_pipe(i)))
21d8: 41 82 01 e0 beq 23b8 <csum_and_copy_to_iter+0x224>
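As a quick sanity check, here is a minimal stand-alone user-space sketch (with local copies of rol32()/ror32() and a simplified fold, so not the exact kernel code) showing that both forms give the same result once the sum is folded to 16 bits; csum_fold() just adds the two 16-bit halves, so whether the odd case rotates left or right by 8 does not change the folded value, and a rotate by 0 leaves even offsets untouched:

#include <stdint.h>
#include <stdio.h>

/* Local copies of the kernel rotate helpers. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Old csum_shift(): branch on the offset parity. */
static uint32_t csum_shift_old(uint32_t sum, int offset)
{
	if (offset & 1)
		return ror32(sum, 8);
	return sum;
}

/* New csum_shift(): unconditional rotate by 0 or 8. */
static uint32_t csum_shift_new(uint32_t sum, int offset)
{
	return rol32(sum, (offset & 1) << 3);
}

/* Simplified csum_fold(): fold the 32-bit sum down to 16 bits. */
static uint16_t fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

int main(void)
{
	uint32_t sum = 0x89abcdef;
	int offset;

	for (offset = 0; offset < 4; offset++)
		printf("offset %d: old %#06x new %#06x\n", offset,
		       fold(csum_shift_old(sum, offset)),
		       fold(csum_shift_new(sum, offset)));
	return 0;
}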
Christophe