[PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

Wed Nov 6 11:08:32 AEDT 2019

On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
<sean.j.christopherson at intel.com> wrote:
>
> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
> > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams at intel.com> wrote:
> > >
> > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> > > <sean.j.christopherson at intel.com> wrote:
> > > >
> > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david at redhat.com> wrote:
> > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > > > interaction between THP and _PAGE_DEVMAP.
> > > > > >
> > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > > > > had to be said :/ ). Luckily, this should be independent of the
> > > > > > PG_reserved thingy AFAIKs.
> > > > >
> > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > > > page count gets mismanaged and leads to the reported hang.
> > > >
> > > > When mapping pages into the guest, KVM gets the page via gup(), which
> > > > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > > > and so never puts its reference to ZONE_DEVICE pages.
> > >
> > > Oh, yeah, that's busted.
> >
> > Ugh, it's extra busted because every other gup user in the kernel
> > tracks the pages resulting from gup and puts them (put_page()) when
> > they are done. KVM wants to forget about whether it did a gup to get
> > the page and optionally trigger put_page() based purely on the pfn.
> > Outside of VFIO device assignment that needs pages pinned for DMA, why
> > does KVM itself need to pin pages? If pages are pinned over a return
> > to userspace that needs to be a FOLL_LONGTERM gup.
>
> Short answer, KVM pins the page to ensure correctness with respect to the
> primary MMU invalidating the associated host virtual address, e.g. when
> the page is being migrated or unmapped from host userspace.
>
> The main use of gup() is to handle guest page faults and map pages into
> the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
> PFN and to temporarily pin the page.  The pin is held just long enough to
> guaranteed that any invalidation via the mmu_notifier will be stalled
> until after KVM finishes installing the page into the secondary MMU, i.e.
> the pin is short-term and not held across a return to userspace or entry
> into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
> pulls the PFN from the secondary MMU and uses that to update accessed
> and dirty bits in the host.
>
> There are a few other KVM flows that eventually call into gup(), but those
> are "traditional" short-term pins and use put_page() directly.

Ok, I was misinterpreting the effect of the bug with what KVM is using
the reference to do.

To your other point:

> But David's proposed fix for the above refcount bug is to omit the patch
> so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
> like the right thing to do, including for thp_adjust(), e.g. it would
> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> mapped with a huge page (2mb or above) in the host.  The only hiccup is
> figuring out how to correctly transfer the reference.

That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It seems there are other paths in KVM that assume
that more typical page machinery is active like SetPageDirty() based
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), this patch below (untested)
seems a cleaner path with less surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
                put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

This is safe because the reference that KVM took earlier protects the
is_zone_device_page() lookup from racing device teardown. Otherwise,
if KVM does not have a reference it's unsafe, but that's already even
more broken because KVM would be releasing a page that it never
referenced. Every other KVM operation that assumes page allocator
pages would continue to honor kvm_is_reserved_pfn().