[PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
Christophe Leroy
christophe.leroy at csgroup.eu
Sat Nov 11 21:33:50 AEDT 2023
On 02/11/2023 at 12:39, Michael Ellerman wrote:
> Matthew Wilcox <willy at infradead.org> writes:
>> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>> ptep++;
>>> - pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>> addr += PAGE_SIZE;
>>> + /*
>>> + * increment the pfn.
>>> + */
>>> + pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>
>> when i looked at this, it generated shit code. did you check?
>
> I didn't look ...
>
> <goes and looks>
>
> It's not super clear cut. There's some difference because pfn_pte()
> contains two extra VM_BUG_ONs.
>
> But with DEBUG_VM *off* the version using pfn_pte() generates *better*
> code, or at least less code, ~160 instructions vs ~200.
>
> For some reason the version using PTE_RPN_SHIFT seems to be byte
> swapping the pte an extra two times, each of which generates ~8
> instructions. But I can't see why.
>
> I tried a few other things and couldn't come up with anything that
> generated better code. But I'll keep poking at it tomorrow.
On PPC32 the version using PTE_RPN_SHIFT is better. Here is what the
main loop of set_ptes() looks like:
22c: 55 29 f0 be srwi r9,r9,2
230: 7d 29 03 a6 mtctr r9
234: 39 3f 10 00 addi r9,r31,4096
238: 39 1f 20 00 addi r8,r31,8192
23c: 39 5f 30 00 addi r10,r31,12288
240: 3b ff 40 00 addi r31,r31,16384
244: 91 3e 00 04 stw r9,4(r30)
248: 91 1e 00 08 stw r8,8(r30)
24c: 91 5e 00 0c stw r10,12(r30)
250: 97 fe 00 10 stwu r31,16(r30)
254: 42 00 ff e0 bdnz 234 <set_ptes+0x78>
With the version using pfn_pte(), the main loop is:
218: 54 e9 f8 7e srwi r9,r7,1
21c: 7d 29 03 a6 mtctr r9
220: 57 e9 00 26 clrrwi r9,r31,12
224: 39 29 10 00 addi r9,r9,4096
228: 57 ff 05 3e clrlwi r31,r31,20
22c: 7d 29 fb 78 or r9,r9,r31
230: 55 3f 00 26 clrrwi r31,r9,12
234: 3b ff 10 00 addi r31,r31,4096
238: 55 28 05 3e clrlwi r8,r9,20
23c: 7f ff 43 78 or r31,r31,r8
240: 91 3d 00 04 stw r9,4(r29)
244: 93 fd 00 08 stw r31,8(r29)
248: 3b bd 00 08 addi r29,r29,8
24c: 42 00 ff d4 bdnz 220 <set_ptes+0x64>
Not only is the loop bigger, it is also unrolled only by 2, while the
first one is unrolled by 4 (r7 and r9 contain the same value).
Therefore, although the PTE_RPN_SHIFT version is 87 instructions while
the other one is only 81 instructions, the former looks better.
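To make the difference concrete, here is a rough user-space sketch of the two ways of stepping to the next PTE. It is not the kernel code: the sketch_* names, the 32-bit layout (pfn in the upper 20 bits, protection bits in the lower 12) and the helpers are assumptions made only for illustration. next_pte_add() corresponds to the single addi in the first listing, next_pte_rebuild() to the clrrwi/addi/clrlwi/or sequence in the second.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical simplified PTE layout, NOT the real powerpc definitions:
 * pfn in the upper 20 bits, protection/flag bits in the lower 12.
 */
#define SKETCH_RPN_SHIFT 12
#define SKETCH_PROT_MASK ((UINT32_C(1) << SKETCH_RPN_SHIFT) - 1)
#define SKETCH_MAX_NR    16

typedef uint32_t sketch_pte_t;

/* PTE_RPN_SHIFT style: bump the pfn with a single add. */
static sketch_pte_t next_pte_add(sketch_pte_t pte)
{
	return pte + (UINT32_C(1) << SKETCH_RPN_SHIFT);
}

/* pfn_pte(pte_pfn() + 1, pte_pgprot()) style: split, increment, recombine. */
static sketch_pte_t next_pte_rebuild(sketch_pte_t pte)
{
	uint32_t pfn = pte >> SKETCH_RPN_SHIFT;   /* extract the pfn      */
	uint32_t prot = pte & SKETCH_PROT_MASK;   /* extract the low bits */

	return ((pfn + 1) << SKETCH_RPN_SHIFT) | prot;
}

/* Store nr consecutive entries, stepping with the given helper. */
static void fill_ptes(sketch_pte_t *ptep, sketch_pte_t pte, unsigned int nr,
		      sketch_pte_t (*next)(sketch_pte_t))
{
	for (unsigned int i = 0; i < nr; i++) {
		ptep[i] = pte;
		pte = next(pte);
	}
}

/* Return 1 if both stepping helpers produce identical entries. */
static int variants_agree(sketch_pte_t first, unsigned int nr)
{
	sketch_pte_t a[SKETCH_MAX_NR], b[SKETCH_MAX_NR];

	assert(nr <= SKETCH_MAX_NR);
	fill_ptes(a, first, nr, next_pte_add);
	fill_ptes(b, first, nr, next_pte_rebuild);

	for (unsigned int i = 0; i < nr; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```

As long as the pfn does not overflow its field, both helpers give the same values, so any difference is purely in the code the compiler generates for each form.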
Christophe