[v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time

Hugh Dickins hughd at google.com
Mon Apr 6 09:34:46 AEST 2026


On Thu, 26 Mar 2026, Usama Arif wrote:

> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
> 
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
> 
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
> All of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, if the system cannot satisfy a single order-0
> allocation for a page table, it is under extreme memory pressure and
> failing the operation is the correct response.
> 
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern: every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
> 
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.
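
The "roughly 200MB wasted per 100GB" figure in the quoted cover letter
can be checked with a few lines of arithmetic. This is only a sketch:
the 4KB base page and 2MB PMD sizes are assumptions (typical of x86-64,
not stated in the thread), and the names are made up for illustration.

```python
# Assumed sizes (x86-64 style); neither is stated in the thread.
PAGE_SIZE = 4 * 1024          # one deposited PTE page table
PMD_SIZE = 2 * 1024 * 1024    # memory covered by one PMD-mapped THP

def deposit_overhead_bytes(thp_bytes):
    """Bytes held in deposited page tables for thp_bytes of anon THP."""
    return (thp_bytes // PMD_SIZE) * PAGE_SIZE

hundred_gib = 100 * 1024**3
overhead_mib = deposit_overhead_bytes(hundred_gib) // 1024**2
print(overhead_mib, "MiB of deposited page tables per 100 GiB of THP")
```

Under those assumptions, 100GB of THP is 51200 huge pages, each with one
4KB deposit, which gives the 200MB quoted above.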

I see no mention of the big problem,
which has stopped us all from trying this before.

Reclaim: the split_folio_to_list() in shrink_folio_list().

Imagine a process which has forked a thousand times, containing
anon THPs, which should now be swapped out and reclaimed.

To swap out one of those THPs, reclaim will have to allocate a
thousand page tables (one for each process mapping it), all with
PF_MEMALLOC set (to give some access to reserves, while preventing
recursion into reclaim).

Elsewhere, we go to great lengths (e.g. mempools) to give
guaranteed access to the memory needed when freeing memory.
In the case of an anon THP, the guaranteed pool has been the
deposited page table. Now what?

And the worst is that when the 501st attempt to allocate a page
table fails, it has allocated and is using 500 pages from reserve,
without reaching the point of freeing any memory at all.
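
The failure mode above can be put as a toy model. This is a sketch, not
kernel code: the function name, the fixed-size "reserve", and the
one-allocation-per-mapper loop are simplifying assumptions made only to
show the accounting.

```python
def split_for_reclaim(mappers, reserve_pages, predeposited):
    """Toy model: split one anon THP's PMD in each of `mappers` processes.

    Returns (success, reserve_pages_used). Hypothetical helper; the
    reserve is modelled as a simple counter of available pages.
    """
    used = 0
    for _ in range(mappers):
        if predeposited:
            continue              # page table comes from the deposit
        if used == reserve_pages:
            return False, used    # allocation fails, nothing freed yet
        used += 1                 # one page table taken from reserve
    return True, used

# Lazy allocation: a thousand mappers, room for 500 page tables.
print(split_for_reclaim(1000, 500, predeposited=False))
# Pre-deposit: the same split needs no allocation at all.
print(split_for_reclaim(1000, 500, predeposited=True))
```

With lazy allocation the model stops at the 501st mapper holding 500
reserve pages and having freed nothing; with the deposit it completes
without touching the reserve.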

Maybe watermark boosting (I barely know whereof I speak) can help
a bit nowadays.  Has anything else changed to solve the problem?

What would help a lot would be the implementation of swap entries
at the PMD level.  Whether that would help enough, I'm sceptical:
I do think it's foolish to depend upon the availability of huge
contiguous swap extents, whatever the recent improvements there;
but it would at least be an arguable justification.

Shared page tables?  Generally I run away, but perhaps
manageable in this limited context (a store of not-present
swap entries, to be copied on fault).

Hugh

