[PATCH kernel v3 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

David Gibson david at gibson.dropbear.id.au
Fri Jul 6 15:06:59 AEST 2018


On Thu, Jul 05, 2018 at 03:19:04PM +1000, Alexey Kardashevskiy wrote:
> On Thu, 5 Jul 2018 12:42:20 +1000
> David Gibson <david at gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jul 04, 2018 at 03:00:52PM +1000, Alexey Kardashevskiy wrote:
> > > A VM which has:
> > >  - a DMA capable device passed through to it (eg. network card);
> > >  - running a malicious kernel that ignores H_PUT_TCE failure;
> > >  - capability of using IOMMU pages bigger than physical pages
> > > can create an IOMMU mapping that exposes (for example) 16MB of
> > > the host physical memory to the device when only 64K was allocated to the VM.
> > > 
> > > The remaining 16MB - 64K will be some other content of host memory, possibly
> > > including pages of the VM, but also pages of host kernel memory, host
> > > programs or other VMs.
> > > 
> > > The attacking VM does not control the location of the page it can map,
> > > and is only allowed to map as many pages as it has pages of RAM.
> > > 
> > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > get access to unassigned host memory; however this check is missing in
> > > the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
> > > and have not hit this yet: the very first time a mapping happens,
> > > tbl::it_userspace is not yet allocated, so we fall back to userspace,
> > > which in turn calls the VFIO IOMMU driver; that fails and the guest
> > > does not retry.
> > > 
> > > This stores the smallest preregistered page size in the preregistered
> > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > the IOMMU page size. This only allows huge page use if the entire
> > > preregistered block is backed with huge pages which are completely
> > > contained in the preregistered chunk; otherwise it defaults to PAGE_SIZE.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>  
> > 
> > Reviewed-by: David Gibson <david at gibson.dropbear.id.au>
> > 
> > On the grounds that I think this version is safe, which the old one
> > wasn't.  However, it still has some flaws...
> > 
> > [snip]
> > > @@ -125,7 +126,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  {
> > >  	struct mm_iommu_table_group_mem_t *mem;
> > >  	long i, j, ret = 0, locked_entries = 0;
> > > -	struct page *page = NULL;
> > > +	unsigned int pageshift;
> > > +	struct page *page = NULL, *head = NULL;
> > >  
> > >  	mutex_lock(&mem_list_mutex);
> > >  
> > > @@ -159,6 +161,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  		goto unlock_exit;
> > >  	}
> > >  
> > > +	mem->pageshift = 64;
> > >  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
> > >  	if (!mem->hpas) {
> > >  		kfree(mem);
> > > @@ -199,9 +202,35 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  			}
> > >  		}
> > >  populate:
> > > +		pageshift = PAGE_SHIFT;
> > > +		if (PageCompound(page)) {
> > > +			/* Make sure huge page is contained completely */
> > > +			struct page *tmphead = compound_head(page);
> > > +			unsigned int n = compound_order(tmphead);
> > > +
> > > +			if (!head) {
> > > +				/* Is it a head of a huge page? */
> > > +				if (page == tmphead) {
> > > +					head = tmphead;
> > > +					pageshift += n;
> > > +				}
> > > +			} else if (head == tmphead) {
> > > +				/* Still same huge page, good */
> > > +				pageshift += n;
> > > +
> > > +				/* End of the huge page */
> > > +				if (page - head == (1UL << n) - 1)
> > > +					head = NULL;
> > > +			}
> > > +		}
> > > +		mem->pageshift = min(mem->pageshift, pageshift);
> > >  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> > >  	}
> > >  
> > > +	/* We have an incomplete huge page, default to PAGE_SHIFT */
> > > +	if (head)
> > > +		mem->pageshift = PAGE_SHIFT;
> > > +  
> > 
> > So, if the user attempts to prereg a region which starts or ends in
> > the middle of a hugepage, this logic will clamp the region's max page
> > shift down to PAGE_SHIFT.  That's safe, but not optimal.
> > 
> > Suppose userspace had an area backed with 16MiB hugepages, and wanted
> > to pre-reg a window that was 2MiB aligned, but not 16MiB aligned.  It
> > would still be safe to allow 2MiB TCEs, but the code above would clamp
> > it down to 64kiB (or 4kiB).
> > 
> > The code to do it is also pretty convoluted.
> > 
> > I think you'd be better off initializing mem->pageshift to the largest
> > possible natural alignment of the region:
> > 	mem->pageshift = ctz64(ua | (entries << PAGE_SHIFT));
> > 
> > Then it should just be sufficient to clamp pageshift down to
> > compound_order() + PAGE_SHIFT for each entry.
> 
> 
> I like this better, just one question - does hugetlbfs guarantee the @ua
> alignment if backed with an actual huge page?

So, yeah it does, as you determined.  And it has to - I don't know of
any MMU that allows large pages that aren't naturally aligned, so
the UAs would have to be aligned to actually map the pages into
userspace.

But... there's another more subtle case that I'm less sure about.
What you're actually checking for here is a compound page on the
physical side.  A hugetlbfs mapping in userspace is the main case
where I'd expect that, but, I'm not absolutely certain there can't be
some other case where a compound page is used to back a normal 64k
mapping in a user process.  If that is possible, it would probably
also be possible for the UA to end up misaligned with the compound
page's natural alignment.

I don't know of any case where that could happen, but I'm far from
confident it doesn't exist.  Things to consider:
   - mapping hugetlbfs, then mremap()ing part of it
   - a SHARED mapping, where it's aligned in one process and gets
     THPed, but is not aligned in the other
   - mmap() from a device or subsystem that provides some kind
     of IO or special memory that's handled with compound pages on the
     kernel side, but is just mapped into userspace with regular 64k
     PTEs
   - One process mapping libhugetlbfs, then another (say a debugger)
     attempting to map the first process's address space via
     /proc/pid/mem
   - ...and that's just the ones I could think of quickly

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson