[RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
Michael Ellerman
mpe at ellerman.id.au
Mon Aug 16 17:03:55 AEST 2021
"Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> writes:
> On 8/12/21 6:19 PM, Michael Ellerman wrote:
>> "Puvichakravarthy Ramachandran" <puvichakravarthy at in.ibm.com> writes:
>>>> With a shared mapping, even though we are unmapping a large range, the kernel
>>>> will force a TLB flush with the ptl lock held to avoid the race mentioned in
>>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
>>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>>> range. This can be improved by making sure the kernel switches to a PID-based
>>>> flush if it is unmapping a 2M range.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com>
>>>> ---
>>>> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> index aefc100d79a7..21d0f098e43b 100644
>>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>>   * invalidating a full PID, so it has a far lower threshold to change from
>>>>   * individual page flushes to full-pid flushes.
>>>>   */
>>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>>>
>>>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>>  	if (fullmm)
>>>>  		flush_pid = true;
>>>>  	else if (type == FLUSH_TYPE_GLOBAL)
>>>> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>>> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>>  	else
>>>>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>>
>>> Additional details on the test environment: this was tested on a
>>> 2-node/8-socket Power10 system. The LPAR had 105 cores and spanned
>>> all the sockets.
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 176
>>> # time counts unit events
>>> 1.029206442 4198594519 cycles
>>> 1.029206442 2458254252 instructions # 0.59 insn per cycle
>>> 1.029206442 3004031488 PM_EXEC_STALL
>>> 1.029206442 1798186036 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 181
>>> 2.054288539 4183883450 cycles
>>> 2.054288539 2472178171 instructions # 0.59 insn per cycle
>>> 2.054288539 3014609313 PM_EXEC_STALL
>>> 2.054288539 1797851642 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 180
>>> 3.078306883 4171250717 cycles
>>> 3.078306883 2468341094 instructions # 0.59 insn per cycle
>>> 3.078306883 2993036205 PM_EXEC_STALL
>>> 3.078306883 1798181890 PM_EXEC_STALL_TLBIE
>>> .
>>> .
>>>
>>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>> 34
>>>
>>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 313
>>> # time counts unit events
>>> 1.030310506 4206071143 cycles
>>> 1.030310506 4314716958 instructions # 1.03 insn per cycle
>>> 1.030310506 2157762167 PM_EXEC_STALL
>>> 1.030310506 110825573 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 322
>>> 2.056034068 4331745630 cycles
>>> 2.056034068 4531658304 instructions # 1.05 insn per cycle
>>> 2.056034068 2288971361 PM_EXEC_STALL
>>> 2.056034068 111267927 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 321
>>> 3.081216434 4327050349 cycles
>>> 3.081216434 4379679508 instructions # 1.01 insn per cycle
>>> 3.081216434 2252602550 PM_EXEC_STALL
>>> 3.081216434 110974887 PM_EXEC_STALL_TLBIE
>>
>>
>> What is the tlbie test actually doing?
>>
>> Does it do anything to measure the cost of refilling after the full mm flush?
>
> That is essentially
>
> for () {
>     shmat()
>     fillshm()
>     shmdt()
> }
>
> for a 256MB range. So it is not really a fair benchmark, because it
> doesn't take into account the impact of throwing away the full PID
> translation. But even then, aren't the TLBIE stalls an important data point?
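
For reference, my mental model of that loop is something like the sketch
below. The segment size, iteration count and fill pattern are my guesses,
not taken from the actual tlbie test:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEG_SIZE	(256UL << 20)	/* 256MB, matching the range above */
#define ITERATIONS	1000		/* arbitrary */

int main(void)
{
	/* One System V shared memory segment, reused across iterations. */
	int shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	if (shmid < 0) {
		perror("shmget");
		return 1;
	}

	for (int i = 0; i < ITERATIONS; i++) {
		char *p = shmat(shmid, NULL, 0);
		if (p == (void *)-1) {
			perror("shmat");
			return 1;
		}

		/* "fillshm": touch every page so translations are faulted in. */
		memset(p, 0x5a, SEG_SIZE);

		/*
		 * Detach the whole 256MB range; this is where
		 * __radix__flush_tlb_range() picks between per-page
		 * tlbie and a full PID flush.
		 */
		shmdt(p);
	}

	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}

If that's roughly right, then the measured rate of work is dominated by
the attach/fill/detach cycle itself.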
Choosing the ceiling is a trade-off, and this test only measures one
side of the trade-off.

It tells us that the actual time taken to execute the full flush is less
than the time to do 32 individual flushes, but that's not the full story.

To decide, I think we need some numbers for some more "real" workloads,
to at least see that there's no change, or preferably some improvement.

Another interesting test might be to do the shmat/fillshm/shmdt, and
then chase some pointers to provoke TLB misses. Then we could work out
the relative cost of TLB misses vs the time to do the flush.
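
Something like the sketch below is what I have in mind for the
pointer-chase part. It's only illustrative: the buffer size, the linear
chain and the clock_gettime() timing are my choices, and a randomised
chain would defeat hardware prefetching better.

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define CHASE_SIZE	(256UL << 20)		/* separate 256MB anonymous buffer */
#define PAGE_SZ		4096UL
#define NPAGES		(CHASE_SIZE / PAGE_SZ)

int main(void)
{
	char *buf = mmap(NULL, CHASE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One pointer per page, so every hop lands on a new page. */
	for (unsigned long i = 0; i < NPAGES; i++)
		*(char **)(buf + i * PAGE_SZ) = buf + ((i + 1) % NPAGES) * PAGE_SZ;

	/* ... do the shmat()/fillshm()/shmdt() of the 256MB segment here ... */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	/*
	 * Chase the chain. If the shmdt() above took the full PID flush
	 * path, this buffer's translations were discarded as well, so
	 * the walk measures the refill cost.
	 */
	char *p = buf;
	for (unsigned long i = 0; i < NPAGES; i++)
		p = *(char **)p;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Print p so the dependent loads can't be optimised away. */
	printf("%p: %ld ns\n", (void *)p,
	       (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec));

	munmap(buf, CHASE_SIZE);
	return 0;
}

Comparing the walk time for runs that take the per-page path against
runs that take the PID-flush path would give us the other side of the
trade-off.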
cheers