[PATCH kernel v3 9/9] powerpc/powernv/npu: Enable NVLink pass through

Alistair Popple alistair at popple.id.au
Mon Apr 18 11:52:55 AEST 2016


Hi David,

On Fri, 15 Apr 2016 14:40:20 David Gibson wrote:
> On Tue, Apr 12, 2016 at 06:37:50PM +1000, Alexey Kardashevskiy wrote:
> > IBM POWER8 NVlink systems come with Tesla K40-ish GPUs, each of which
> > also has a couple of high-speed links (NVLink). The interface to these links
> > is exposed as an emulated PCI bridge which is included in the same
> > IOMMU group as the corresponding GPU.
> > 
> > In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE.
> > 
> > In order to make these links work when the GPU is passed through to
> > a guest, these bridges need to be passed through as well; otherwise
> > performance will degrade.
> > 
> > This implements and exports API to manage NPU state in regard to VFIO;
> > it replicates iommu_table_group_ops.
> > 
> > This defines a new pnv_pci_ioda2_npu_ops which is assigned to
> > the IODA2 bridge if there are NPUs for a GPU on the bridge.
> > The new callbacks call the default IODA2 callbacks plus new NPU API.
> > This adds a gpe_table_group_to_npe() helper to find the NPU PE for an IODA2
> > table_group; it is not expected to fail as the helper is only called
> > from pnv_pci_ioda2_npu_ops.
> > 
> > This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
> > the GPU group if any are found. It looks for the "ibm,gpu" property
> > in the device tree, which is a phandle of the corresponding GPU.
> > 
> > This adds an additional loop over PEs in pnv_ioda_setup_dma(), since the
> > main loop skips NPU PEs as they do not have 32-bit DMA segments.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> > ---
> > Changes:
> > v3:
> > * moved NPU-to-GPU IOMMU grouping later after all PHBs are discovered
> > * removed hack to make iommu_add_device() work, iommu_group_add_device()
> > is used instead
> > * cleanup in gpe_table_group_to_npe_cb()
> > 
> > v2:
> > * reimplemented to support NPU + GPU in the same group
> > * merged "powerpc/powernv/npu: Add NPU devices to IOMMU group" and
> > "powerpc/powernv/npu: Enable passing through via VFIO" into this patch
> > ---
> >  arch/powerpc/platforms/powernv/npu-dma.c  | 126 ++++++++++++++++++++++++++++++
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 105 +++++++++++++++++++++++++
> >  arch/powerpc/platforms/powernv/pci.h      |   6 ++
> >  3 files changed, 237 insertions(+)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> > index 8e70221..7cb9f6a 100644
> > --- a/arch/powerpc/platforms/powernv/npu-dma.c
> > +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/export.h>
> >  #include <linux/pci.h>
> >  #include <linux/memblock.h>
> > +#include <linux/iommu.h>
> >  
> >  #include <asm/iommu.h>
> >  #include <asm/pnv-pci.h>
> > @@ -262,3 +263,128 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
> >  		}
> >  	}
> >  }
> > +
> > +long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
> > +		struct iommu_table *tbl)
> > +{
> > +	struct pnv_phb *phb = npe->phb;
> > +	int64_t rc;
> > +	const unsigned long size = tbl->it_indirect_levels ?
> > +		tbl->it_level_size : tbl->it_size;
> > +	const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> > +	const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> > +
> > +	pe_info(npe, "Setting up window#%d %llx..%llx pg=%lx\n", num,
> > +			start_addr, start_addr + win_size - 1,
> > +			IOMMU_PAGE_SIZE(tbl));
> > +
> > +	/* Ignore @num as there is just one window per NPU */
> > +	rc = opal_pci_map_pe_dma_window(phb->opal_id,
> > +			npe->pe_number,
> > +			npe->pe_number,
> > +			tbl->it_indirect_levels + 1,
> > +			__pa(tbl->it_base),
> > +			size << 3,
> > +			IOMMU_PAGE_SIZE(tbl));
> > +	if (rc) {
> > +		pe_err(npe, "Failed to configure TCE table, err %lld\n", rc);
> > +		return rc;
> > +	}
> > +
> > +	pnv_pci_link_table_and_group(phb->hose->node, num,
> > +			tbl, &npe->table_group);
> > +	pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
> > +
> > +	return rc;
> > +}
> > +
> > +long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
> > +{
> > +	struct pnv_phb *phb = npe->phb;
> > +	long ret;
> > +
> > +	pe_info(npe, "Removing DMA window #%d\n", num);
> > +
> > +	/* Ignore @num as there is just one window per NPU */
> > +	ret = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
> > +			npe->pe_number,
> > +			0/* levels */, 0/* table address */,
> > +			0/* table size */, 0/* page size */);
> > +	if (ret)
> > +		pe_warn(npe, "Unmapping failed, ret = %ld\n", ret);
> > +	else
> > +		pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
> > +
> > +	pnv_pci_unlink_table_and_group(npe->table_group.tables[num],
> > +			&npe->table_group);
> > +
> > +	return ret;
> > +}
> > +
> > +/* Switch ownership from platform code to external user (e.g. VFIO) */
> > +void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
> > +{
> > +	struct pnv_phb *phb = npe->phb;
> > +	int64_t ret;
> > +
> > +	if (npe->table_group.tables[0]) {
> > +		/* Disable 32bit window */
> > +		pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
> > +				&npe->table_group);
> > +		npe->table_group.tables[0] = NULL;
> > +		ret = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
> > +				npe->pe_number,
> > +				0/* levels */, 0/* table address */,
> > +				0/* table size */, 0/* page size */);
> > +	} else {
> > +		/* Disable bypass */
> > +		ret = opal_pci_map_pe_dma_window_real(phb->opal_id,
> > +				npe->pe_number, npe->pe_number,
> > +				0 /* bypass base */, 0);
> > +	}
> 
> It's not immediately obvious to me why this is an if/else.  I'm
> assuming that the way the kernel uses the NPE IOMMU it always either
> has a 32-bit DMA window, or it's in bypass mode.  Is that inherent to
> the way the hardware works, or just the way Linux uses it?

Unlike IODA2, the NPU has only a single TVE/window, meaning the NPU IOMMU is
either disabled, in bypass mode, or has a 32-bit DMA window. Alexey can comment
on the if/else, but I agree that it isn't strictly necessary - calling either
map_pe_dma_window() or map_pe_dma_window_real() with the arguments in both the
if and else cases will zero the TVE, effectively disabling DMA.
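For illustration, an unconditional belt-and-braces variant of the
take-ownership path might look something like the sketch below. This is
untested and only reuses the helpers already visible in the patch; on current
NPU hardware either OPAL call alone already zeroes the single TVE, so issuing
both merely guards against a hypothetical future kernel that leaves both a
window and bypass configured:

```c
/* Untested sketch: unconditionally disable both the 32-bit window and
 * bypass. With a single TVE only one can be active at a time, so each
 * call on its own already clears it; doing both is purely defensive.
 */
void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
{
	struct pnv_phb *phb = npe->phb;

	if (npe->table_group.tables[0]) {
		pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
				&npe->table_group);
		npe->table_group.tables[0] = NULL;
	}

	/* Zero the TVE via the TCE-window path... */
	opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
			npe->pe_number,
			0 /* levels */, 0 /* table address */,
			0 /* table size */, 0 /* page size */);

	/* ...and via the bypass path, regardless of which was active */
	opal_pci_map_pe_dma_window_real(phb->opal_id,
			npe->pe_number, npe->pe_number,
			0 /* bypass base */, 0 /* bypass size */);
}
```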

> I'm just worrying if this could open an exploitable hole if the host
> kernel ever changes so that it could have bypass windows and TCE
> windows simultaneously active.  Is there any way to unconditionally
> disable bypass *and* disable any existing DMA window.

I'm not a VFIO expert, but from the perspective of the NPU hardware this
situation couldn't exist, for the reasons described above. Calling
opal_pci_map_pe_dma_window()/_real() with either a zero table size or a zero
PCI memory size will unconditionally disable all DMA windows for that PE# on
the NPU.

- Alistair
 
> Apart from that nit,
> 
> Reviewed-by: David Gibson <david at gibson.dropbear.id.au>
> 
> 


