[PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

Thu Mar 21 06:09:08 AEDT 2019

On Wed, 20 Mar 2019 15:38:24 +1100
David Gibson <david at gibson.dropbear.id.au> wrote:

> On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > On Fri, 15 Mar 2019 19:18:35 +1100
> > Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >   
> > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > platform puts all interconnected GPUs to the same IOMMU group.
> > > 
> > > However the user may want to pass individual GPUs to the userspace so
> > > in order to do so we need to put them into separate IOMMU groups and
> > > cut off the interconnects.
> > > 
> > > Thankfully V100 GPUs implement an interface to do by programming link
> > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > issued to the GPU.
> > > 
> > > This defines a reset_done() handler for V100 NVlink2 device which
> > > determines what links need to be disabled. This relies on presence
> > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > 
> > > This does not change the existing behaviour and instead adds
> > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > 
> > > The alternative approaches would be:
> > > 
> > > 1. do this in the system firmware (skiboot) but for that we would need
> > > to tell skiboot via an additional OPAL call whether or not we want this
> > > isolation - skiboot is unaware of IOMMU groups.
> > > 
> > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > the problem with that is at that point the device is not enabled, i.e.
> > > config space is not restored so we need to enable the device (i.e. MMIO
> > > bit in CMD register + program valid address to BAR0) in order to disable
> > > links and then perhaps undo all this initialization to bring the device
> > > back to the state where pci_try_reset_function() expects it to be.  
> > 
> > The trouble seems to be that this approach only maintains the isolation
> > exposed by the IOMMU group when vfio-pci is the active driver for the
> > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > incorporating groups in various ways.  
> 
> I don't think that reasoning is quite right.  An IOMMU group doesn't
> necessarily represent devices which *are* isolated, just devices which
> *can be* isolated.  There are plenty of instances when we don't need
> to isolate devices in different IOMMU groups: passing both groups to
> the same guest or userspace VFIO driver for example, or indeed when
> both groups are owned by regular host kernel drivers.
> 
> In at least some of those cases we also don't want to isolate the
> devices when we don't have to, usually for performance reasons.

I see IOMMU groups as representing the current isolation of the device,
not just the possible isolation.  If there are ways to break down that
isolation then ideally the group would be updated to reflect it.  The
ACS disable patches seem to support this, at boot time we can choose to
disable ACS at certain points in the topology to favor peer-to-peer
performance over isolation.  This is then reflected in the group
composition, because even though ACS *can be* enabled at the given
isolation points, it's intentionally not with this option.  Whether or
not a given user who owns multiple devices needs that isolation is
really beside the point, the user can choose to connect groups via IOMMU
mappings or reconfigure the system to disable ACS and potentially more
direct routing.  The IOMMU groups are still accurately reflecting the
topology and IOMMU based isolation.

> > So, if there's a device specific
> > way to configure the isolation reported in the group, which requires
> > some sort of active management against things like secondary bus
> > resets, then I think we need to manage it above the attached endpoint
> > driver.  
> 
> The problem is that above the endpoint driver, we don't actually have
> enough information about what should be isolated.  For VFIO we want to
> isolate things if they're in different containers, for most regular
> host kernel drivers we don't need to isolate at all (although we might
> as well when it doesn't have a cost).

This idea that we only want to isolate things if they're in different
containers is bogus, imo.  There are performance reasons why we might
not want things isolated, but there are also address space reasons why
we do.  If there are direct routes between devices, the user needs to
be aware of the IOVA pollution, if we maintain singleton groups, they
don't.  Granted we don't really account for this well in most
userspaces and fumble through it by luck of the address space layout
and lack of devices really attempting peer to peer access.

For in-kernel users, we're still theoretically trying to isolate
devices such that they have restricted access to only the resources
they need.  Disabling things like ACS in the topology reduces that
isolation.  AFAICT, most users don't really care about that degree of
isolation, so they run with iommu=pt for native driver performance
while still having the IOMMU available for isolation use cases running
in parallel.  We don't currently have support for on-demand enabling
isolation.

> The host side nVidia GPGPU
> drivers also won't want to isolate the (host owned) NVLink devices
> from each other, since they'll want to use the fast interconnects

This falls into the same mixed use case scenario above where we don't
really have a good solution today.  Things like ACS are dynamically
configurable, but we don't expose any interfaces to let drivers or
users change it (aside from setpci, which we don't account for
dynamically).  We assume a simplistic model where if you want IOMMU,
then you must also want the maximum configurable isolation.
Dynamically changing routing is not necessarily the most foolproof
thing either with potentially in-flight transactions and existing DMA
mappings, which is why I've suggested a couple times that perhaps we
could do a software hot-unplug of a sub-hierarchy, muck with isolation
at the remaining node, then re-discover the removed devices.

Of course when we bring NVIDIA into the mix, I have little sympathy
that the NVLink interfaces are all proprietary and we have no idea how
to make those dynamic changes or discover the interconnected-ness of a
device.  Thanks,

Alex