[PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes

Christophe Leroy christophe.leroy at csgroup.eu
Sat Nov 11 21:33:50 AEDT 2023



On 02/11/2023 at 12:39, Michael Ellerman wrote:
> Matthew Wilcox <willy at infradead.org> writes:
>> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>>   		ptep++;
>>> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>>   		addr += PAGE_SIZE;
>>> +		/*
>>> +		 * increment the pfn.
>>> +		 */
>>> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>
>> when i looked at this, it generated shit code.  did you check?
> 
> I didn't look ...
> 
> <goes and looks>
> 
> It's not super clear cut. There's some difference because pfn_pte()
> contains two extra VM_BUG_ONs.
> 
> But with DEBUG_VM *off* the version using pfn_pte() generates *better*
> code, or at least less code, ~160 instructions vs ~200.
> 
> For some reason the version using PTE_RPN_SHIFT seems to be byte
> swapping the pte an extra two times, each of which generates ~8
> instructions. But I can't see why.
> 
> I tried a few other things and couldn't come up with anything that
> generated better code. But I'll keep poking at it tomorrow.

On PPC32 the version using PTE_RPN_SHIFT is better. Here is what the
main loop of set_ptes() looks like:

  22c:	55 29 f0 be 	srwi    r9,r9,2
  230:	7d 29 03 a6 	mtctr   r9
  234:	39 3f 10 00 	addi    r9,r31,4096
  238:	39 1f 20 00 	addi    r8,r31,8192
  23c:	39 5f 30 00 	addi    r10,r31,12288
  240:	3b ff 40 00 	addi    r31,r31,16384
  244:	91 3e 00 04 	stw     r9,4(r30)
  248:	91 1e 00 08 	stw     r8,8(r30)
  24c:	91 5e 00 0c 	stw     r10,12(r30)
  250:	97 fe 00 10 	stwu    r31,16(r30)
  254:	42 00 ff e0 	bdnz    234 <set_ptes+0x78>

With the version using pfn_pte(), the main loop is:

  218:	54 e9 f8 7e 	srwi    r9,r7,1
  21c:	7d 29 03 a6 	mtctr   r9
  220:	57 e9 00 26 	clrrwi  r9,r31,12
  224:	39 29 10 00 	addi    r9,r9,4096
  228:	57 ff 05 3e 	clrlwi  r31,r31,20
  22c:	7d 29 fb 78 	or      r9,r9,r31
  230:	55 3f 00 26 	clrrwi  r31,r9,12
  234:	3b ff 10 00 	addi    r31,r31,4096
  238:	55 28 05 3e 	clrlwi  r8,r9,20
  23c:	7f ff 43 78 	or      r31,r31,r8
  240:	91 3d 00 04 	stw     r9,4(r29)
  244:	93 fd 00 08 	stw     r31,8(r29)
  248:	3b bd 00 08 	addi    r29,r29,8
  24c:	42 00 ff d4 	bdnz    220 <set_ptes+0x64>

Not only is the loop bigger (12 instructions versus 9), but it is also
unrolled only by 2 while the first one is unrolled by 4 (r7 and r9
contain the same value).

Therefore, although the PTE_RPN_SHIFT version is 87 instructions while
the other one is only 81 instructions, the former looks better.
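
To make the difference concrete, here is a minimal standalone sketch of
the two ways of stepping to the next PTE. It is not the kernel code: the
32-bit layout (flags in the low 12 bits, PFN from bit 12 up) and all the
sketch_* names are assumptions made for illustration, and real powerpc
PTE formats differ per platform. The first helper is a single add, which
matches the addi in the first loop. The second one splits the value,
increments the PFN and recombines it with the flags, which matches the
clrrwi/addi/clrlwi/or sequence in the second loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit PTE layout, for illustration only:
 * bits 0..11 hold protection flags, the PFN starts at bit 12. */
#define SKETCH_PTE_RPN_SHIFT 12
#define SKETCH_PROT_MASK     ((UINT32_C(1) << SKETCH_PTE_RPN_SHIFT) - 1)

typedef uint32_t sketch_pte_t;

/* PTE_RPN_SHIFT style: one add on the raw PTE value. */
static inline sketch_pte_t next_pte_shift(sketch_pte_t pte)
{
	return pte + (UINT32_C(1) << SKETCH_PTE_RPN_SHIFT);
}

/* pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)) style:
 * extract the PFN and the flags, bump the PFN, recombine. */
static inline sketch_pte_t next_pte_pfn(sketch_pte_t pte)
{
	uint32_t pfn  = pte >> SKETCH_PTE_RPN_SHIFT;
	uint32_t prot = pte & SKETCH_PROT_MASK;

	return ((pfn + 1) << SKETCH_PTE_RPN_SHIFT) | prot;
}

/* Simplified stand-in for the set_ptes() loop: store nr consecutive
 * PTEs, advancing with whichever helper is passed in. */
static void fill_ptes(sketch_pte_t *ptep, sketch_pte_t pte, unsigned int nr,
		      sketch_pte_t (*next)(sketch_pte_t))
{
	assert(nr > 0);

	for (;;) {
		*ptep = pte;
		if (--nr == 0)
			break;
		ptep++;
		pte = next(pte);
	}
}
```

Both helpers give the same values as long as the PFN field does not
overflow, so the only difference is in what the compiler has to emit
for each step.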

Christophe

