[PATCH v6 00/13] Remove device private pages from physical address space
David Hildenbrand (Arm)
david at kernel.org
Tue Mar 24 07:10:42 AEDT 2026
On 3/20/26 06:52, Alistair Popple wrote:
> On 2026-03-18 at 19:44 +1100, "David Hildenbrand (Arm)" <david at kernel.org> wrote...
>> On 3/17/26 02:47, Alistair Popple wrote:
>>> On 2026-03-07 at 03:16 +1100, "David Hildenbrand (Arm)" <david at kernel.org> wrote...
>>>
>>> Thanks David for taking the time to do a thorough review. I will let Jordan
>>> respond to most of the comments but wanted to add some of my own as I helped
>>> with the initial idea.
>>>
>>>
>>> I disagree - this isn't hacking in another/new zone-device thing, it is
>>> cleaning up/reworking a pre-existing zone-device thing (DEVICE_PRIVATE
>>> pages). My initial hope was that it wouldn't involve too much churn on
>>> the core-mm side.
>>
>> ... and there is quite a bit of it.
>>
>> stuff like make_readable_exclusive_migration_entry_from_page() must be
>> reworked.
>
> Yeah, I was displeased to (re)discover the migration entry business when we
> fleshed this series out. The idea was basically that raw device-private pfns
> can't be used sensibly by anything in the core-mm anyway so presumably nothing
> was.
>
> That turned out to be only somewhat true. The exceptions are:
>
> 1. page_vma_mapped which I think we have a solution for based on the comments to
> patch 5.
Yes, if we just have the page/folio we are in a better position. I
*suspect* that we want to pass a page range, as the other two weird
cases might pass a page that, in the future, might no longer be part of
a folio.
>
> 2. migration entries which obviously we will have to see if we can rework.
Please look into encoding this internally, using one of the highest PFN
bits or something like that. We don't have to support this on all weird
architectures.
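Roughly what I have in mind (a minimal user-space sketch; all names and
the exact bit choice are made up here, not an existing kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: reserve the top bit of an unsigned long as a
 * "device private" marker. It exceeds the physical addressing capability
 * of the architectures we care about, so it can never collide with an
 * ordinary PFN. Helper names are hypothetical.
 */
#define DEVICE_PRIVATE_PFN_BIT	(1UL << (sizeof(unsigned long) * 8 - 1))

static inline unsigned long encode_device_private_pfn(unsigned long offset)
{
	/* The device-private offset must not already use the marker bit. */
	assert(!(offset & DEVICE_PRIVATE_PFN_BIT));
	return offset | DEVICE_PRIVATE_PFN_BIT;
}

static inline bool is_device_private_pfn(unsigned long pfn)
{
	return pfn & DEVICE_PRIVATE_PFN_BIT;
}

static inline unsigned long decode_device_private_pfn(unsigned long pfn)
{
	return pfn & ~DEVICE_PRIVATE_PFN_BIT;
}
```

The migration-entry code could then encode/decode internally without any
caller ever seeing a raw device-private "pfn".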
>
> 3. hmm_range_fault()
Yes.
>
> 4. page snapshots, although that's actually only used to test zero_pfn so we
> could probably drop that if we just guarantee device private offsets are
> always invalid pfns.
Right, I think that can be more reasonably cleaned up.
[...]
>>
>> It will likely still be error prone, but I have no idea how on earth we
>> could possibly check reliably, for an "unsigned long" pfn, whether it is a
>> PFN (it's right there in the name ...) or something completely different.
>
> The idea was (at least for device-private) that you never needed the PFN,
> only the page. I.e., calling page_to_pfn() on a device-private page could,
> conceptually at least, just crash the kernel because it should never happen.
>
> Obviously we identified some exceptions to that rule, the biggest being
> migration entries, hence the helpers for those.
>
>> We don't want another pfn_t, it would be too much churn to convert most
>> of MM.
>
> Given I removed pfn_t I don't need convincing of that :-)
:)
>>>
>>> So any core-mm churn is really just making this more explicit, but this series
>>> doesn't add any new requirements.
>>
>> Again, maybe it can be done in a better way. I did not enjoy some of the
>> code changes I was reading.
>
> Ok. Was there anything outside the exceptions above that you did not enjoy?
The last patch was hard to review and I am not sure what else is hiding
in there. As said, breaking the patch into logical pieces will make this
a lot easier to review.
>
> One idea we did have was to make the PFNs "obviously" invalid PFNs, for example
> by setting the MSB which exceeds the physical addressing capabilities of
> every arch/platform. That would allow dropping the hmm and page-snapshot flags
> although is still a bit of a hack.
I mean, that might be cleaner, because *maybe* one could just teach
pfn_valid() about that? Or have another, more lightweight helper that
really just checks for "ordinary" vs. "special" pfns. Needs some thought.
Using the highest bit as "this is not an ordinary pfn" might just do.
There may be some highmem considerations (making sure we don't run into
weird stuff).
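Under that scheme, the lightweight check could be as cheap as a single
compare (again just a user-space sketch; pfn_is_ordinary() is a
hypothetical name and the MAX_PHYSMEM_BITS value is a placeholder, not
taken from any particular architecture):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch: if device-private offsets are guaranteed to live
 * above the largest physical address any supported platform can have, an
 * "ordinary vs. special" test needs no memmap lookup at all.
 * Both constants below are invented for the example.
 */
#define MAX_PHYSMEM_BITS	52
#define PAGE_SHIFT		12
#define MAX_ORDINARY_PFN	((1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT)) - 1)

static inline bool pfn_is_ordinary(unsigned long pfn)
{
	return pfn <= MAX_ORDINARY_PFN;
}
```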
>
> Ultimately, one of the issues we are trying to resolve is that to get a PFN
> range we use get_free_mem_region(), which essentially just returns a random
> unused PFN range from the platform/arch perspective. An architecture may
> therefore not recognise those PFNs as valid and hence may not have allocated
> enough vmemmap space for them. That results in pfn_to_page() overflowing into
> something else (usually user space VAs, at least in the case of RISC-V).
Yes, I think it's a noble goal :)
--
Cheers,
David