[RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()

Liam R. Howlett <Liam.Howlett at oracle.com>
Sat May 31 02:23:19 AEST 2025


* Ryan Roberts <ryan.roberts at arm.com> [250530 10:05]:
> Lazy mmu mode applies to the current task and permits pte modifications
> to be deferred and updated at a later time in a batch to improve
> performance. apply_to_page_range() calls its callback in lazy mmu mode
> and some of those callbacks call into the page allocator to either
> allocate or free pages.
> 
> This is problematic with CONFIG_DEBUG_PAGEALLOC because
> debug_pagealloc_[un]map_pages() calls the arch implementation of
> __kernel_map_pages() which must modify the ptes for the linear map.
> 
> There are two possibilities at this point:
> 
>  - If the arch implementation modifies the ptes directly without first
>    entering lazy mmu mode, the pte modifications may get deferred until
>    the existing lazy mmu mode is exited. This could result in taking
>    spurious faults for example.
> 
>  - If the arch implementation enters a nested lazy mmu mode before
>    modification of the ptes (many arches use apply_to_page_range()),
>    then the linear map updates will definitely be applied upon leaving
>    the inner lazy mmu mode. But because lazy mmu mode does not support
>    nesting, the remainder of the outer user is no longer in lazy mmu
>    mode and the optimization opportunity is lost.
> 
> So let's just ensure that the page allocator is never called from within
> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
> apply_to_existing_page_range() are introduced which don't enter lazy mmu
> mode. Then users which need to call into the page allocator within their
> callback are updated to use the _nolazy variants.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  kernel/bpf/arena.c |  6 +++---
>  mm/kasan/shadow.c  |  2 +-
>  mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e51dba8398f7..11cae6ce04ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  			       unsigned long size, pte_fn_t fn, void *data);
> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
> +				      unsigned long address, unsigned long size,
> +				      pte_fn_t fn, void *data);

We are removing externs as things are edited, so probably drop them
here.
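i.e. the new declarations can just be (same prototype as in the patch, extern dropped):

	int apply_to_page_range_nolazy(struct mm_struct *mm,
				       unsigned long address, unsigned long size,
				       pte_fn_t fn, void *data);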

>  extern int apply_to_existing_page_range(struct mm_struct *mm,
>  				   unsigned long address, unsigned long size,
>  				   pte_fn_t fn, void *data);
> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +				   unsigned long address, unsigned long size,
> +				   pte_fn_t fn, void *data);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern void __kernel_poison_pages(struct page *page, int numpages);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..ca833cfeefb7 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
>  	/*
>  	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
>  	 * It unmaps everything from vmalloc area and clears pgtables.
> -	 * Call apply_to_existing_page_range() first to find populated ptes and
> -	 * free those pages.
> +	 * Call apply_to_existing_page_range_nolazy() first to find populated
> +	 * ptes and free those pages.
>  	 */
> -	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +	apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
>  				     KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
>  	free_vm_area(arena->kern_vm);
>  	range_tree_destroy(&arena->rt);
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..2325c5166c3a 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>  
>  
>  		if (flags & KASAN_VMALLOC_PAGE_RANGE)
> -			apply_to_existing_page_range(&init_mm,
> +			apply_to_existing_page_range_nolazy(&init_mm,
>  					     (unsigned long)shadow_start,
>  					     size, kasan_depopulate_vmalloc_pte,
>  					     NULL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..24436074ce48 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
>  static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pte_t *pte, *mapped_pte;
>  	int err = 0;
> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  			return -EINVAL;
>  	}
>  
> -	arch_enter_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_enter_lazy_mmu_mode();
>  
>  	if (fn) {
>  		do {
> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  	}
>  	*mask |= PGTBL_PTE_MODIFIED;
>  
> -	arch_leave_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_leave_lazy_mmu_mode();
>  
>  	if (mm != &init_mm)
>  		pte_unmap_unlock(mapped_pte, ptl);
> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)

I am having a hard time understanding why the other lazy mmu
implementations managed to stay self-contained, but arm needs extra
arguments added to this generic code?

>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  			pmd_clear_bad(pmd);
>  		}
>  		err = apply_to_pte_range(mm, pmd, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pmd++, addr = next, addr != end);
> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  			pud_clear_bad(pud);
>  		}
>  		err = apply_to_pmd_range(mm, pud, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pud++, addr = next, addr != end);
> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	p4d_t *p4d;
>  	unsigned long next;
> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  			p4d_clear_bad(p4d);
>  		}
>  		err = apply_to_pud_range(mm, p4d, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (p4d++, addr = next, addr != end);
> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  
>  static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn,
> -				 void *data, bool create)
> +				 void *data, bool create, bool lazy_mmu)
>  {
>  	pgd_t *pgd;
>  	unsigned long start = addr, next;
> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			pgd_clear_bad(pgd);
>  		}
>  		err = apply_to_p4d_range(mm, pgd, addr, next,
> -					 fn, data, create, &mask);
> +					 fn, data, create, &mask, lazy_mmu);

This is annoying.  We now have two bools passed all the way down, with
the mask inserted in the middle of them.

>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  /*
>   * Scan a region of virtual memory, filling in page tables as necessary
>   * and calling a provided function on each leaf page table.
> + *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
>   */
>  int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, true);
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, true);

Please add something here to tell me what the true, true means.
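Even inline comments naming the bool arguments (they are create and
lazy_mmu in the patch) would go a long way, something like:

	return __apply_to_page_range(mm, addr, size, fn, data,
				     /* create = */ true, /* lazy_mmu = */ true);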

>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
>   * Scan a region of virtual memory, calling a provided function on
>   * each leaf page table where it exists.
>   *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> + *
>   * Unlike apply_to_page_range, this does _not_ fill in page tables
>   * where they are absent.
>   */
>  int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, false);
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, true);

every..

> +}
> +
> +/*
> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
> + */
> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
> +			       unsigned long size, pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, false);

one...

> +}
> +
> +/*
> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
> + * mode.
> + */
> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +					unsigned long addr, unsigned long size,
> +					pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, false);

adds confusion :)


These wrappers are terrible for readability, and the argument lists are
getting annoying too.

Could we do something like pgtbl_mod_mask or zap_details and pass
through a struct, or a single unsigned int of flags, to cover both
create and lazy_mmu?

At least we'd have better self-documenting code in the wrappers.. and if
we ever need a third boolean, we could avoid multiplying the wrappers
again.
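Rough, untested sketch of what I mean (struct and field names made up):

	struct apply_range_ctrl {
		bool create;		/* allocate missing page tables */
		bool lazy_mmu;		/* call fn() in lazy mmu mode */
	};

	static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
					 unsigned long size, pte_fn_t fn, void *data,
					 const struct apply_range_ctrl *ctrl);

	int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
				unsigned long size, pte_fn_t fn, void *data)
	{
		const struct apply_range_ctrl ctrl = {
			.create = true,
			.lazy_mmu = true,
		};

		return __apply_to_page_range(mm, addr, size, fn, data, &ctrl);
	}

That way each wrapper documents itself via the designated initializers,
and the inner helpers just pass ctrl down unchanged instead of growing a
new parameter every time.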

WDYT?

Cheers,
Liam

