[RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
Michael Ellerman
mpe at ellerman.id.au
Mon Aug 16 17:03:55 AEST 2021
"Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> writes:
> On 8/12/21 6:19 PM, Michael Ellerman wrote:
>> "Puvichakravarthy Ramachandran" <puvichakravarthy at in.ibm.com> writes:
>>>> With a shared mapping, even though we are unmapping a large range, the kernel
>>>> will force a TLB flush with the ptl lock held to avoid the race mentioned in
>>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
>>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>>> range. This can be improved by making sure the kernel switches to a PID-based
>>>> flush if it is unmapping a 2M range.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com>
>>>> ---
>>>> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> index aefc100d79a7..21d0f098e43b 100644
>>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>>   * invalidating a full PID, so it has a far lower threshold to change from
>>>>   * individual page flushes to full-pid flushes.
>>>>   */
>>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>>>
>>>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>>  	if (fullmm)
>>>>  		flush_pid = true;
>>>>  	else if (type == FLUSH_TYPE_GLOBAL)
>>>> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>>> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>>  	else
>>>>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>>
>>> Additional details on the test environment: this was tested on a
>>> 2-node/8-socket Power10 system. The LPAR had 105 cores and spanned
>>> all the sockets.
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 176
>>> # time counts unit events
>>> 1.029206442 4198594519 cycles
>>> 1.029206442 2458254252 instructions # 0.59 insn per cycle
>>> 1.029206442 3004031488 PM_EXEC_STALL
>>> 1.029206442 1798186036 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 181
>>> 2.054288539 4183883450 cycles
>>> 2.054288539 2472178171 instructions # 0.59 insn per cycle
>>> 2.054288539 3014609313 PM_EXEC_STALL
>>> 2.054288539 1797851642 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 180
>>> 3.078306883 4171250717 cycles
>>> 3.078306883 2468341094 instructions # 0.59 insn per cycle
>>> 3.078306883 2993036205 PM_EXEC_STALL
>>> 3.078306883 1798181890 PM_EXEC_STALL_TLBIE
>>> .
>>> .
>>>
>>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>> 34
>>>
>>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 313
>>> # time counts unit events
>>> 1.030310506 4206071143 cycles
>>> 1.030310506 4314716958 instructions # 1.03 insn per cycle
>>> 1.030310506 2157762167 PM_EXEC_STALL
>>> 1.030310506 110825573 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 322
>>> 2.056034068 4331745630 cycles
>>> 2.056034068 4531658304 instructions # 1.05 insn per cycle
>>> 2.056034068 2288971361 PM_EXEC_STALL
>>> 2.056034068 111267927 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 321
>>> 3.081216434 4327050349 cycles
>>> 3.081216434 4379679508 instructions # 1.01 insn per cycle
>>> 3.081216434 2252602550 PM_EXEC_STALL
>>> 3.081216434 110974887 PM_EXEC_STALL_TLBIE
>>
>>
>> What is the tlbie test actually doing?
>>
>> Does it do anything to measure the cost of refilling after the full mm flush?
>
> That is essentially
>
> for () {
>     shmat()
>     fillshm()
>     shmdt()
> }
>
> for a 256MB range. So it is not really a fair benchmark, because it
> doesn't take into account the impact of throwing away the full PID
> translation. But even then, aren't the TLBIE stalls an important data point?
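
For reference, my mental model of that loop is something like the sketch
below. The segment size, iteration count and fill pattern are my guesses,
not taken from the actual tlbie test:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEG_SIZE	(256UL << 20)	/* 256MB, matching the range above */
#define ITERATIONS	1000		/* arbitrary */

int main(void)
{
	/* One System V shared memory segment, reused across iterations. */
	int shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	if (shmid < 0) {
		perror("shmget");
		return 1;
	}

	for (int i = 0; i < ITERATIONS; i++) {
		char *p = shmat(shmid, NULL, 0);
		if (p == (void *)-1) {
			perror("shmat");
			return 1;
		}

		/* "fillshm": touch every page so translations are faulted in. */
		memset(p, 0x5a, SEG_SIZE);

		/*
		 * Detach the whole 256MB range; this is where
		 * __radix__flush_tlb_range() picks between per-page
		 * tlbie and a full PID flush.
		 */
		shmdt(p);
	}

	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}

If that's roughly right, then the measured rate of work is dominated by
the attach/fill/detach cycle itself.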
Choosing the ceiling is a trade-off, and this test only measures one
side of the trade-off.

It tells us that the actual time taken to execute the full flush is less
than the time to do 32 individual flushes, but that's not the full story.

To decide, I think we need some numbers for some more "real" workloads,
to at least see that there's no change, or preferably some improvement.

Another interesting test might be to do the shmat/fillshm/shmdt, and
then chase some pointers to provoke TLB misses. Then we could work out
the relative cost of TLB misses vs the time to do the flush.
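
Something like the sketch below is what I have in mind for the
pointer-chase part. It's only illustrative: the buffer size, the linear
chain and the clock_gettime() timing are my choices, and a randomised
chain would defeat hardware prefetching better.

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define CHASE_SIZE	(256UL << 20)		/* separate 256MB anonymous buffer */
#define PAGE_SZ		4096UL
#define NPAGES		(CHASE_SIZE / PAGE_SZ)

int main(void)
{
	char *buf = mmap(NULL, CHASE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One pointer per page, so every hop lands on a new page. */
	for (unsigned long i = 0; i < NPAGES; i++)
		*(char **)(buf + i * PAGE_SZ) = buf + ((i + 1) % NPAGES) * PAGE_SZ;

	/* ... do the shmat()/fillshm()/shmdt() of the 256MB segment here ... */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	/*
	 * Chase the chain. If the shmdt() above took the full PID flush
	 * path, this buffer's translations were discarded as well, so
	 * the walk measures the refill cost.
	 */
	char *p = buf;
	for (unsigned long i = 0; i < NPAGES; i++)
		p = *(char **)p;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Print p so the dependent loads can't be optimised away. */
	printf("%p: %ld ns\n", (void *)p,
	       (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec));

	munmap(buf, CHASE_SIZE);
	return 0;
}

Comparing the walk time for runs that take the per-page path against
runs that take the PID-flush path would give us the other side of the
trade-off.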
cheers