[RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
Michael Ellerman
mpe at ellerman.id.au
Wed Aug 4 16:59:25 AEST 2021
Nicholas Piggin <npiggin at gmail.com> writes:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>
> It would be good to have a bit more description here.
>
> In any patch that changes a heuristic like this, I would like to see
> some justification or reasoning that could be refuted or used as a
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
>
> This "improves" things here, but what if it hurt things elsewhere, how
> would we come in later and decide to change it back?
>
> THP flushes for example, I think now they'll do PID flushes (if they
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
>
> So where do you notice this? What's the benefit?
Ack. Needs some numbers and supporting evidence.
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>> * invalidating a full PID, so it has a far lower threshold to change from
>> * individual page flushes to full-pid flushes.
>> */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>
>> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> if (fullmm)
>> flush_pid = true;
>> else if (type == FLUSH_TYPE_GLOBAL)
>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>
> Arguably >= is nicer than > here, but this shouldn't be in the same
> patch as the value change.
>
>> else
>> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> And it should change everything to be consistent. Although I'm not sure
> it's worth changing even though I highly doubt any administrator would
> be tweaking this.
This made me look at how an administrator tweaks these thresholds, and
AFAICS the answer is "recompile the kernel"?
It looks like x86 have a debugfs file for tlb_single_page_flush_ceiling,
but we don't. I guess we meant to copy that but never did?
So at the moment both thresholds could just be #defines.
Making them tweakable at runtime would be nice, it would give us an
escape hatch if we ever hit a workload in production that wants a
different value.
cheers
More information about the Linuxppc-dev
mailing list