[RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
Ryan Roberts
ryan.roberts at arm.com
Sat May 31 01:55:36 AEST 2025
On 30/05/2025 15:47, Lorenzo Stoakes wrote:
> +cc Jann who is a specialist in all things page table-y and especially scary
> edge cases :)
>
> On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
>> Hi All,
>>
>> I recently added support for lazy mmu mode on arm64. The series is now in
>> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
>> found some ugly corners (unexpected nesting). I was able to fix those issues by
>> making the arm64 implementation more permissive (like the other arches). But
>> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
>> lazy mmu mode never nests, and more importantly, that code never makes pgtable
>> modifications expecting them to be immediate, not knowing that it's actually in
>> lazy mmu mode so the changes get deferred.
>
> When you say fragile, are you confident it _works_ but perhaps not quite as well
> as you want? Or are you concerned this might be broken upstream in any way?
I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
were to go in _without_ the lazy_mmu bracketting in some manner, then it would
be broken if the config includes CONFIG_DEBUG_PAGEALLOC.
There's a lot more explanation in the later patches as to how it can be broken,
but for arm64, the situation is currently like this, because our implementation
of __change_memory_common() uses apply_to_page_range() which implicitly starts
an inner lazy_mmu_mode. We enter multiple times, but we exit one the first call
to exit. Everything works correctly but it's not optimal because C is no longer
deferred:
arch_enter_lazy_mmu_mode() << outer lazy mmu region
<do some pte changes (A)>
alloc_pages()
debug_pagealloc_map_pages()
__kernel_map_pages()
__change_memory_common()
arch_enter_lazy_mmu_mode() << inner lazy mmu region
<change kernel pte to make valid (B)>
arch_leave_lazy_mmu_mode() << exit; complete A + B
clear_page()
<do some more pte changes (C)> << no longer in lazy mode
arch_leave_lazy_mmu_mode() << nop
An alternative implementation would not add the nested lazy mmu mode, so we end
up with this:
arch_enter_lazy_mmu_mode() << outer lazy mmu region
<do some pte changes (A)>
alloc_pages()
debug_pagealloc_map_pages()
__kernel_map_pages()
__change_memory_common()
<change kernel pte to make valid (B)> << deferred due to lazy mmu
clear_page() << BANG! B has not be actioned
<do some more pte changes (C)>
arch_leave_lazy_mmu_mode()
This is clearly a much worse outcome. It's not happening today but it could in
future. That's why I'm claiming it's fragile. It's much better (IMHO) to
disallow calling the page allocator when in lazy mmu mode.
I won't speak for other arches; there may be more or less potential impact for them.
>
> I am thinking specifically about the proposed use in Dev's new series [0] and
> obviously hoping (and assuming in fact) that it's the former :)
Dev's changes aren't directly related to this, but if a version was accepted
that didn't include the lazy mmu mode, that would cause non-obvious issues.
Hope that helps?
Thanks,
Ryan
>
> [0]: https://lore.kernel.org/linux-mm/20250530090407.19237-1-dev.jain@arm.com/
>
>>
>> The first 2 patches are unrelated, very obvious bug fixes. They don't affect
>> arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
>> during code review and think they should be fixed.
>>
>> The next 3 patches are aimed at solving the nesting issue.
>>
>> And the final patch is reverting the "permissive" fix I did for arm64, which is
>> no longer needed after the previous 3 patches.
>>
>> I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
>> in Linus's master, so it won't apply to mm-unstable. But I'm keen to get review
>> and siince I'm touching various arches and modifying some core mm stuff, I
>> thought that might take a while so thought I'd beat the rush and get a first
>> version out early.
>>
>> I've build-tested all the affected arches. And I've run mm selftests for the
>> arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).
>>
>> Applies against Linus's master branch (f66bc387efbe).
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (6):
>> fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
>> pagemap_scan_pmd_entry()
>> mm: Fix pte update and tlb maintenance ordering in
>> migrate_vma_collect_pmd()
>> mm: Avoid calling page allocator from apply_to_page_range()
>> mm: Introduce arch_in_lazy_mmu_mode()
>> mm: Avoid calling page allocator while in lazy mmu mode
>> Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
>>
>> arch/arm64/include/asm/pgtable.h | 22 ++++----
>> .../include/asm/book3s/64/tlbflush-hash.h | 15 ++++++
>> arch/sparc/include/asm/tlbflush_64.h | 1 +
>> arch/sparc/mm/tlb.c | 12 +++++
>> arch/x86/include/asm/paravirt.h | 5 ++
>> arch/x86/include/asm/paravirt_types.h | 1 +
>> arch/x86/kernel/paravirt.c | 6 +++
>> arch/x86/xen/mmu_pv.c | 6 +++
>> fs/proc/task_mmu.c | 3 +-
>> include/asm-generic/tlb.h | 2 +
>> include/linux/mm.h | 6 +++
>> include/linux/pgtable.h | 1 +
>> kernel/bpf/arena.c | 6 +--
>> mm/kasan/shadow.c | 2 +-
>> mm/memory.c | 54 ++++++++++++++-----
>> mm/migrate_device.c | 3 +-
>> mm/mmu_gather.c | 15 ++++++
>> 17 files changed, 128 insertions(+), 32 deletions(-)
>>
>> --
>> 2.43.0
>>
More information about the Linuxppc-dev
mailing list