[PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

Fri May 18 20:35:48 AEST 2018

On 05/17/2018 03:55 PM, Segher Boessenkool wrote:
> On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
>> In my 8xx configuration, I get 208 calls to memcmp()
>> Within those 208 calls, about half of them have constant sizes,
>> 46 have a size of 8, 17 have a size of 16, only a few have a
>> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>>
>> This patch inlines calls to memcmp() when size
>> is constant and lower than or equal to 16
>>
>> In my 8xx configuration, this reduces the number of calls
>> to memcmp() from 208 to 123
>>
>> The following table shows the number of TB timeticks to perform
>> a constant size memcmp() before and after the patch depending on
>> the size
>>
>> 	Before	After	Improvement
>> 01:	 7577	 5682	25%
>> 02:	41668	 5682	86%
>> 03:	51137	13258	74%
>> 04:	45455	 5682	87%
>> 05:	58713	13258	77%
>> 06:	58712	13258	77%
>> 07:	68183	20834	70%
>> 08:	56819	15153	73%
>> 09:	70077	28411	60%
>> 10:	70077	28411	60%
>> 11:	79546	35986	55%
>> 12:	68182	28411	58%
>> 13:	81440	35986	55%
>> 14:	81440	39774	51%
>> 15:	94697	43562	54%
>> 16:	79546	37881	52%
> 
> Could you show results with a more recent GCC?  What version was this?

It was with the latest GCC version I have available in my environment, 
that is GCC 5.4. Is that too old ?

It seems that version inlines memcmp() when length is 1. All other 
lengths call memcmp()

> 
> What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
> ticks to do a 1-byte memcmp, which is just 3 instructions after all.

Well I looked again in my tests and it seems some results are wrong, can 
remember why, I probably did something wrong when I did the tests.

Anyway, the principle is to call a function tstcmpX() 100000 times from 
a loop, and getting the mftb before and after the loop.
Then we remove from the elapsed time the time spent when calling 
tstcmp0() which is only a blr.
Therefore, we get really the time spent in the comparison only.

Here is the loop:

c06243b0:	7f ac 42 e6 	mftb    r29
c06243b4:	3f 60 00 01 	lis     r27,1
c06243b8:	63 7b 86 a0 	ori     r27,r27,34464
c06243bc:	38 a0 00 02 	li      r5,2
c06243c0:	7f c4 f3 78 	mr      r4,r30
c06243c4:	7f 83 e3 78 	mr      r3,r28
c06243c8:	4b 9e 8c 09 	bl      c000cfd0 <tstcmp2>
c06243cc:	2c 1b 00 01 	cmpwi   r27,1
c06243d0:	3b 7b ff ff 	addi    r27,r27,-1
c06243d4:	40 82 ff e8 	bne     c06243bc <setup_arch+0x294>
c06243d8:	7c ac 42 e6 	mftb    r5
c06243dc:	7c bd 28 50 	subf    r5,r29,r5
c06243e0:	7c bf 28 50 	subf    r5,r31,r5
c06243e4:	38 80 00 02 	li      r4,2
c06243e8:	7f 43 d3 78 	mr      r3,r26
c06243ec:	4b a2 e4 45 	bl      c0052830 <printk>

Before the patch:
c000cfc4 <tstcmp0>:
c000cfc4:	4e 80 00 20 	blr

c000cfc8 <tstcmp1>:
c000cfc8:	38 a0 00 01 	li      r5,1
c000cfcc:	48 00 72 08 	b       c00141d4 <__memcmp>

c000cfd0 <tstcmp2>:
c000cfd0:	38 a0 00 02 	li      r5,2
c000cfd4:	48 00 72 00 	b       c00141d4 <__memcmp>

c000cfd8 <tstcmp3>:
c000cfd8:	38 a0 00 03 	li      r5,3
c000cfdc:	48 00 71 f8 	b       c00141d4 <__memcmp>

After the patch:
c000cfc4 <tstcmp0>:
c000cfc4:	4e 80 00 20 	blr

c000cfd8 <tstcmp1>:
c000cfd8:	88 64 00 00 	lbz     r3,0(r4)
c000cfdc:	89 25 00 00 	lbz     r9,0(r5)
c000cfe0:	7c 69 18 50 	subf    r3,r9,r3
c000cfe4:	4e 80 00 20 	blr

c000cfe8 <tstcmp2>:
c000cfe8:	a0 64 00 00 	lhz     r3,0(r4)
c000cfec:	a1 25 00 00 	lhz     r9,0(r5)
c000cff0:	7c 69 18 50 	subf    r3,r9,r3
c000cff4:	4e 80 00 20 	blr

c000cff8 <tstcmp3>:
c000cff8:	a1 24 00 00 	lhz     r9,0(r4)
c000cffc:	a0 65 00 00 	lhz     r3,0(r5)
c000d000:	7c 63 48 51 	subf.   r3,r3,r9
c000d004:	4c 82 00 20 	bnelr
c000d008:	88 64 00 02 	lbz     r3,2(r4)
c000d00c:	89 25 00 02 	lbz     r9,2(r5)
c000d010:	7c 69 18 50 	subf    r3,r9,r3
c000d014:	4e 80 00 20 	blr

c000d018 <tstcmp4>:
c000d018:	80 64 00 00 	lwz     r3,0(r4)
c000d01c:	81 25 00 00 	lwz     r9,0(r5)
c000d020:	7c 69 18 50 	subf    r3,r9,r3
c000d024:	4e 80 00 20 	blr

c000d028 <tstcmp5>:
c000d028:	81 24 00 00 	lwz     r9,0(r4)
c000d02c:	80 65 00 00 	lwz     r3,0(r5)
c000d030:	7c 63 48 51 	subf.   r3,r3,r9
c000d034:	4c 82 00 20 	bnelr
c000d038:	88 64 00 04 	lbz     r3,4(r4)
c000d03c:	89 25 00 04 	lbz     r9,4(r5)
c000d040:	7c 69 18 50 	subf    r3,r9,r3
c000d044:	4e 80 00 20 	blr

c000d048 <tstcmp6>:
c000d048:	81 24 00 00 	lwz     r9,0(r4)
c000d04c:	80 65 00 00 	lwz     r3,0(r5)
c000d050:	7c 63 48 51 	subf.   r3,r3,r9
c000d054:	4c 82 00 20 	bnelr
c000d058:	a0 64 00 04 	lhz     r3,4(r4)
c000d05c:	a1 25 00 04 	lhz     r9,4(r5)
c000d060:	7c 69 18 50 	subf    r3,r9,r3
c000d064:	4e 80 00 20 	blr

c000d068 <tstcmp7>:
c000d068:	81 24 00 00 	lwz     r9,0(r4)
c000d06c:	80 65 00 00 	lwz     r3,0(r5)
c000d070:	7d 23 48 51 	subf.   r9,r3,r9
c000d074:	40 82 00 20 	bne     c000d094 <tstcmp7+0x2c>
c000d078:	a0 64 00 04 	lhz     r3,4(r4)
c000d07c:	a1 25 00 04 	lhz     r9,4(r5)
c000d080:	7d 29 18 51 	subf.   r9,r9,r3
c000d084:	40 82 00 10 	bne     c000d094 <tstcmp7+0x2c>
c000d088:	88 64 00 06 	lbz     r3,6(r4)
c000d08c:	89 25 00 06 	lbz     r9,6(r5)
c000d090:	7d 29 18 50 	subf    r9,r9,r3
c000d094:	7d 23 4b 78 	mr      r3,r9
c000d098:	4e 80 00 20 	blr

c000d09c <tstcmp8>:
c000d09c:	81 25 00 04 	lwz     r9,4(r5)
c000d0a0:	80 64 00 04 	lwz     r3,4(r4)
c000d0a4:	81 04 00 00 	lwz     r8,0(r4)
c000d0a8:	81 45 00 00 	lwz     r10,0(r5)
c000d0ac:	7c 69 18 10 	subfc   r3,r9,r3
c000d0b0:	7d 2a 41 10 	subfe   r9,r10,r8
c000d0b4:	7d 2a fe 70 	srawi   r10,r9,31
c000d0b8:	7d 48 4b 79 	or.     r8,r10,r9
c000d0bc:	4d a2 00 20 	bclr+   12,eq
c000d0c0:	7d 23 4b 78 	mr      r3,r9
c000d0c4:	4e 80 00 20 	blr

c000d0c8 <tstcmp9>:
c000d0c8:	81 25 00 04 	lwz     r9,4(r5)
c000d0cc:	80 64 00 04 	lwz     r3,4(r4)
c000d0d0:	81 04 00 00 	lwz     r8,0(r4)
c000d0d4:	81 45 00 00 	lwz     r10,0(r5)
c000d0d8:	7c 69 18 10 	subfc   r3,r9,r3
c000d0dc:	7d 2a 41 10 	subfe   r9,r10,r8
c000d0e0:	7d 2a fe 70 	srawi   r10,r9,31
c000d0e4:	7d 48 4b 79 	or.     r8,r10,r9
c000d0e8:	41 82 00 08 	beq     c000d0f0 <tstcmp9+0x28>
c000d0ec:	7d 23 4b 78 	mr      r3,r9
c000d0f0:	2f 83 00 00 	cmpwi   cr7,r3,0
c000d0f4:	4c 9e 00 20 	bnelr   cr7
c000d0f8:	88 64 00 08 	lbz     r3,8(r4)
c000d0fc:	89 25 00 08 	lbz     r9,8(r5)
c000d100:	7c 69 18 50 	subf    r3,r9,r3
c000d104:	4e 80 00 20 	blr

This shows that on PPC32, the 8 bytes comparison is not optimal, I will 
improve it.

We also see in tstcmp7() that GCC is a bit stupid, it should use r3 as 
result of the sub as he does with all previous ones, then do bnelr 
instead of bne+mr+blr

Below are the results of the measurement redone today:

     Before After  Improvment
01  24621   5681  77%
02  24621   5681  77%
03  34091  13257  61%
04  28409   5681  80%
05  41667  13257  68%
06  41668  13257  68%
07  51138  22727  56%
08  39772  15151  62%
09  53031  28409  46%
10  53031  28409  46%
11  62501  35986  42%
12  51137  28410  44%
13  64395  35985  44%
14  68182  39774  42%
15  73865  43560  41%
16  62500  37879  39%

We also see here that 08 is not optimal, it should have given same 
results as 05 and 06. I will keep it as is for PPC64 but will rewrite it 
as two 04 comparisons for PPC32

Christophe

> 
> 
> Segher
>