[PATCH kernel v3 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

Thu Jul 5 12:42:20 AEST 2018

On Wed, Jul 04, 2018 at 03:00:52PM +1000, Alexey Kardashevskiy wrote:
> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger that physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time when the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls VFIO IOMMU driver, this fails and
> the guest does not retry,
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size. This only allows huge pages use if the entire
> preregistered block is backed with huge pages which are completely
> contained the preregistered chunk; otherwise this defaults to PAGE_SIZE.
> 
> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>

Reviewed-by: David Gibson <david at gibson.dropbear.id.au>

On the grounds that I think this version is safe, which the old one
wasn't.  However it still has some flaws..

[snip]
> @@ -125,7 +126,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	long i, j, ret = 0, locked_entries = 0;
> -	struct page *page = NULL;
> +	unsigned int pageshift;
> +	struct page *page = NULL, *head = NULL;
>  
>  	mutex_lock(&mem_list_mutex);
>  
> @@ -159,6 +161,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		goto unlock_exit;
>  	}
>  
> +	mem->pageshift = 64;
>  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
>  	if (!mem->hpas) {
>  		kfree(mem);
> @@ -199,9 +202,35 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  			}
>  		}
>  populate:
> +		pageshift = PAGE_SHIFT;
> +		if (PageCompound(page)) {
> +			/* Make sure huge page is contained completely */
> +			struct page *tmphead = compound_head(page);
> +			unsigned int n = compound_order(tmphead);
> +
> +			if (!head) {
> +				/* Is it a head of a huge page? */
> +				if (page == tmphead) {
> +					head = tmphead;
> +					pageshift += n;
> +				}
> +			} else if (head == tmphead) {
> +				/* Still same huge page, good */
> +				pageshift += n;
> +
> +				/* End of the huge page */
> +				if (page - head == (1UL << n) - 1)
> +					head = NULL;
> +			}
> +		}
> +		mem->pageshift = min(mem->pageshift, pageshift);
>  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>  	}
>  
> +	/* We have an incomplete huge page, default to PAGE_SHIFT */
> +	if (head)
> +		mem->pageshift = PAGE_SHIFT;
> +

So, if the user attempts to prereg a region which starts or ends in
the middle of a hugepage, this logic will clamp the region's max page
shift down to PAGE_SHIFT.  That's safe, but not optimal.

Suppose userspace had an area backed with 16MiB hugepages, and wanted
to pre-reg a window that was 2MiB aligned, but not 16MiB aligned.  It
would still be safe to allow 2MiB TCEs, but the code above would clamp
it down to 64kiB (or 4kiB).

The code to do it is also pretty convoluted.

I think you'd be better off initializing mem->pageshift to the largest
possible natural alignment of the region:
	mem->pageshift = ctz64(ua | (entries << PAGE_SHIFT));

Then it should just be sufficient to clamp pageshift down to
compound_order() + PAGE_SHIFT for each entry.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20180705/b4e2587e/attachment.sig>