[RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
Liam R. Howlett
Liam.Howlett at oracle.com
Sat May 31 02:23:19 AEST 2025
* Ryan Roberts <ryan.roberts at arm.com> [250530 10:05]:
> Lazy mmu mode applies to the current task and permits pte modifications
> to be deferred and updated at a later time in a batch to improve
> performance. apply_to_page_range() calls its callback in lazy mmu mode
> and some of those callbacks call into the page allocator to either
> allocate or free pages.
>
> This is problematic with CONFIG_DEBUG_PAGEALLOC because
> debug_pagealloc_[un]map_pages() calls the arch implementation of
> __kernel_map_pages() which must modify the ptes for the linear map.
>
> There are two possibilities at this point:
>
> - If the arch implementation modifies the ptes directly without first
> entering lazy mmu mode, the pte modifications may get deferred until
> the existing lazy mmu mode is exited. This could result in taking
> spurious faults for example.
>
> - If the arch implementation enters a nested lazy mmu mode before
> modification of the ptes (many arches use apply_to_page_range()),
> then the linear map updates will definitely be applied upon leaving
> the inner lazy mmu mode. But because lazy mmu mode does not support
> nesting, the remainder of the outer user is no longer in lazy mmu
> mode and the optimization opportunity is lost.
>
> So let's just ensure that the page allocator is never called from within
> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
> apply_to_existing_page_range() are introduced which don't enter lazy mmu
> mode. Then users which need to call into the page allocator within their
> callback are updated to use the _nolazy variants.
>
> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
> ---
> include/linux/mm.h | 6 ++++++
> kernel/bpf/arena.c | 6 +++---
> mm/kasan/shadow.c | 2 +-
> mm/memory.c | 54 +++++++++++++++++++++++++++++++++++-----------
> 4 files changed, 51 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e51dba8398f7..11cae6ce04ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
> typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
> extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> unsigned long size, pte_fn_t fn, void *data);
> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
> + unsigned long address, unsigned long size,
> + pte_fn_t fn, void *data);
We are removing externs as things are edited, so probably drop them
here.
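i.e. just:

int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long address,
			       unsigned long size, pte_fn_t fn, void *data);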
> extern int apply_to_existing_page_range(struct mm_struct *mm,
> unsigned long address, unsigned long size,
> pte_fn_t fn, void *data);
> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> + unsigned long address, unsigned long size,
> + pte_fn_t fn, void *data);
>
> #ifdef CONFIG_PAGE_POISONING
> extern void __kernel_poison_pages(struct page *page, int numpages);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..ca833cfeefb7 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
> /*
> * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
> * It unmaps everything from vmalloc area and clears pgtables.
> - * Call apply_to_existing_page_range() first to find populated ptes and
> - * free those pages.
> + * Call apply_to_existing_page_range_nolazy() first to find populated
> + * ptes and free those pages.
> */
> - apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> + apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
> KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
> free_vm_area(arena->kern_vm);
> range_tree_destroy(&arena->rt);
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..2325c5166c3a 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>
>
> if (flags & KASAN_VMALLOC_PAGE_RANGE)
> - apply_to_existing_page_range(&init_mm,
> + apply_to_existing_page_range_nolazy(&init_mm,
> (unsigned long)shadow_start,
> size, kasan_depopulate_vmalloc_pte,
> NULL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..24436074ce48 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
> static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> pte_fn_t fn, void *data, bool create,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, bool lazy_mmu)
> {
> pte_t *pte, *mapped_pte;
> int err = 0;
> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> return -EINVAL;
> }
>
> - arch_enter_lazy_mmu_mode();
> + if (lazy_mmu)
> + arch_enter_lazy_mmu_mode();
>
> if (fn) {
> do {
> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> }
> *mask |= PGTBL_PTE_MODIFIED;
>
> - arch_leave_lazy_mmu_mode();
> + if (lazy_mmu)
> + arch_leave_lazy_mmu_mode();
>
> if (mm != &init_mm)
> pte_unmap_unlock(mapped_pte, ptl);
> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> unsigned long addr, unsigned long end,
> pte_fn_t fn, void *data, bool create,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, bool lazy_mmu)
I am having a hard time understanding why the other lazy mmu
implementations managed to stay self-contained, but arm ends up adding
arguments to generic code?
> {
> pmd_t *pmd;
> unsigned long next;
> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> pmd_clear_bad(pmd);
> }
> err = apply_to_pte_range(mm, pmd, addr, next,
> - fn, data, create, mask);
> + fn, data, create, mask, lazy_mmu);
> if (err)
> break;
> } while (pmd++, addr = next, addr != end);
> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> unsigned long addr, unsigned long end,
> pte_fn_t fn, void *data, bool create,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, bool lazy_mmu)
> {
> pud_t *pud;
> unsigned long next;
> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> pud_clear_bad(pud);
> }
> err = apply_to_pmd_range(mm, pud, addr, next,
> - fn, data, create, mask);
> + fn, data, create, mask, lazy_mmu);
> if (err)
> break;
> } while (pud++, addr = next, addr != end);
> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
> unsigned long addr, unsigned long end,
> pte_fn_t fn, void *data, bool create,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, bool lazy_mmu)
> {
> p4d_t *p4d;
> unsigned long next;
> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
> p4d_clear_bad(p4d);
> }
> err = apply_to_pud_range(mm, p4d, addr, next,
> - fn, data, create, mask);
> + fn, data, create, mask, lazy_mmu);
> if (err)
> break;
> } while (p4d++, addr = next, addr != end);
> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>
> static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> unsigned long size, pte_fn_t fn,
> - void *data, bool create)
> + void *data, bool create, bool lazy_mmu)
> {
> pgd_t *pgd;
> unsigned long start = addr, next;
> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> pgd_clear_bad(pgd);
> }
> err = apply_to_p4d_range(mm, pgd, addr, next,
> - fn, data, create, &mask);
> + fn, data, create, &mask, lazy_mmu);
This is annoying. We now have a bool, bool passed through with mask
inserted in the middle.
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> /*
> * Scan a region of virtual memory, filling in page tables as necessary
> * and calling a provided function on each leaf page table.
> + *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> */
> int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> unsigned long size, pte_fn_t fn, void *data)
> {
> - return __apply_to_page_range(mm, addr, size, fn, data, true);
> + return __apply_to_page_range(mm, addr, size, fn, data, true, true);
Please add something here to tell me what these true/false arguments
mean.
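Even just annotating the call sites would help, e.g. (keeping the
current argument order of create then lazy_mmu):

	return __apply_to_page_range(mm, addr, size, fn, data,
				     /* create = */ true, /* lazy_mmu = */ true);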
> }
> EXPORT_SYMBOL_GPL(apply_to_page_range);
>
> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
> * Scan a region of virtual memory, calling a provided function on
> * each leaf page table where it exists.
> *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> + *
> * Unlike apply_to_page_range, this does _not_ fill in page tables
> * where they are absent.
> */
> int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
> unsigned long size, pte_fn_t fn, void *data)
> {
> - return __apply_to_page_range(mm, addr, size, fn, data, false);
> + return __apply_to_page_range(mm, addr, size, fn, data, false, true);
every..
> +}
> +
> +/*
> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
> + */
> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
> + unsigned long size, pte_fn_t fn, void *data)
> +{
> + return __apply_to_page_range(mm, addr, size, fn, data, true, false);
one...
> +}
> +
> +/*
> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
> + * mode.
> + */
> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> + unsigned long addr, unsigned long size,
> + pte_fn_t fn, void *data)
> +{
> + return __apply_to_page_range(mm, addr, size, fn, data, false, false);
adds confusion :)
These wrappers are terrible for readability and annoying for argument
lists too.
Could we do something like pgtbl_mod_mask or zap_details and pass
through a struct (or a single unsigned int of flags) for create and
lazy_mmu?
At least we'd have better self-documenting code in the wrappers, and if
we ever need a third boolean, we could avoid multiplying the wrappers
again.
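Untested sketch of what I mean (struct and field names made up):

struct apply_range_ctrl {
	bool create;		/* fill in missing page tables */
	bool lazy_mmu;		/* run fn() in lazy mmu mode */
};

static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
				 unsigned long size, pte_fn_t fn, void *data,
				 const struct apply_range_ctrl *ctrl);

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
			unsigned long size, pte_fn_t fn, void *data)
{
	const struct apply_range_ctrl ctrl = {
		.create = true,
		.lazy_mmu = true,
	};

	return __apply_to_page_range(mm, addr, size, fn, data, &ctrl);
}

Then the pte/pmd/pud/p4d helpers just pass ctrl along next to the mask
instead of growing one bool per option.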
WDYT?
Cheers,
Liam