[PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
David Gibson
david at gibson.dropbear.id.au
Mon Nov 12 15:23:43 AEDT 2018
On Mon, Nov 12, 2018 at 01:36:45PM +1100, Alexey Kardashevskiy wrote:
>
>
> On 12/11/2018 12:08, David Gibson wrote:
> > On Fri, Oct 19, 2018 at 11:53:53AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 19/10/2018 05:05, Alex Williamson wrote:
> >>> On Thu, 18 Oct 2018 10:37:46 -0700
> >>> Piotr Jaroszynski <pjaroszynski at nvidia.com> wrote:
> >>>
> >>>> On 10/18/18 9:55 AM, Alex Williamson wrote:
> >>>>> On Thu, 18 Oct 2018 11:31:33 +1100
> >>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>
> >>>>>> On 18/10/2018 08:52, Alex Williamson wrote:
> >>>>>>> On Wed, 17 Oct 2018 12:19:20 +1100
> >>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>>>
> >>>>>>>> On 17/10/2018 06:08, Alex Williamson wrote:
> >>>>>>>>> On Mon, 15 Oct 2018 20:42:33 +1100
> >>>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>>>>>> +
> >>>>>>>>>> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> >>>>>>>>>> +	    pdev->device == 0x04ea) {
> >>>>>>>>>> +		ret = vfio_pci_ibm_npu2_init(vdev);
> >>>>>>>>>> +		if (ret) {
> >>>>>>>>>> +			dev_warn(&vdev->pdev->dev,
> >>>>>>>>>> +				 "Failed to setup NVIDIA NV2 ATSD region\n");
> >>>>>>>>>> +			goto disable_exit;
> >>>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?
> >>>>>>>>
> >>>>>>>> Yes. On a running system it looks like:
> >>>>>>>>
> >>>>>>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> >>>>>>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
> >>>>>>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>>>>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>>>>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>>>>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>>>>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>>>>>> (rev a1)
> >>>>>>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>>>>>> (rev a1)
> >>>>>>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>>>>>> (rev a1)
> >>>>>>>>
> >>>>>>>> One "IBM Device" bridge represents one NVLink2, i.e. a piece of NPU.
> >>>>>>>> All of them, together with the 3 GPUs, go into the same IOMMU group and
> >>>>>>>> get passed through to a guest.
> >>>>>>>>
> >>>>>>>> The entire NPU does not have representation via sysfs as a whole though.
> >>>>>>>
> >>>>>>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
> >>>>>>> will bind to it?
> >>>>>>
> >>>>>> An NPU is an NVLink bridge; it is not PCI in any sense. We (the host
> >>>>>> powerpc firmware known as "skiboot" or "opal") have chosen to emulate a
> >>>>>> virtual bridge per 1 NVLink on the firmware level. So for each physical
> >>>>>> NPU there are 6 virtual bridges. So the NVIDIA driver does not need to
> >>>>>> know much about NPUs.
> >>>>>>
> >>>>>>> And the ATSD register that we need on it is not
> >>>>>>> accessible through these PCI representations of the sub-pieces of the
> >>>>>>> NPU? Thanks,
> >>>>>>
> >>>>>> No, only via the device tree. Skiboot puts the ATSD register address
> >>>>>> into the PHB's DT property called 'ibm,mmio-atsd' of these virtual bridges.
> >>>>>
> >>>>> Ok, so the NPU is essentially a virtual device already, mostly just a
> >>>>> stub. But it seems that each NPU is associated to a specific GPU, how
> >>>>> is that association done? In the use case here it seems like it's just
> >>>>> a vehicle to provide this ibm,mmio-atsd property to guest DT and the tgt
> >>>>> routing information to the GPU. So if both of those were attached to
> >>>>> the GPU, there'd be no purpose in assigning the NPU other than it's in
> >>>>> the same IOMMU group with a type 0 header, so something needs to be
> >>>>> done with it. If it's a virtual device, perhaps it could have a type 1
> >>>>> header so vfio wouldn't care about it, then we would only assign the
> >>>>> GPU with these extra properties, which seems easier for management
> >>>>> tools and users. If the guest driver needs a visible NPU device, QEMU
> >>>>> could possibly emulate one to make the GPU association work
> >>>>> automatically. Maybe this isn't really a problem, but I wonder if
> >>>>> you've looked up the management stack to see what tools need to know to
> >>>>> assign these NPU devices and whether specific configurations are
> >>>>> required to make the NPU to GPU association work. Thanks,
> >>>>
> >>>> I'm not that familiar with how this was originally set up, but note that
> >>>> Alexey is just making it work exactly like baremetal does. The baremetal
> >>>> GPU driver works as-is in the VM and expects the same properties in the
> >>>> device-tree. Obviously it doesn't have to be that way, but there is
> >>>> value in keeping it identical.
> >>>>
> >>>> Another probably bigger point is that the NPU device also implements the
> >>>> nvlink HW interface and is required for actually training and
> >>>> maintaining the link up. The driver in the guest trains the links by
> >>>> programming both the GPU end and the NPU end of each link so the NPU
> >>>> device needs to be exposed to the guest.
> >>>
> >>> Ok, so there is functionality in assigning the NPU device itself, it's
> >>> not just an attachment point for meta data. But it still seems there
> >>> must be some association of NPU to GPU, the tgt address seems to pair
> >>> the NPU with a specific GPU, they're not simply a fungible set of NPUs
> >>> and GPUs. Is that association explicit anywhere or is it related to
> >>> the topology or device numbering that needs to match between the host
> >>> and guest? Thanks,
> >>
> >> It is in the device tree (phandle is a node ID).
> >
> > Hrm. But the device tree just publishes information about the
> > hardware. What's the device tree value actually exposing here?
> >
> > Is there an inherent hardware connection between one NPU and one GPU?
> > Or is there just an arbitrary assignment performed by the firmware
> > which it then exposed to the device tree?
>
> I am not sure I understood the question...
>
> The ibm,gpu and ibm,npu values (which are phandles) of NPUs and GPUs
> represent physical wiring.

So you're saying there is specific physical wiring between one
particular NPU and one particular GPU? And the device tree properties
describe that wiring?

I think what Alex and I are both trying to determine is whether the
binding of NPUs to GPUs is the result of physical wiring constraints,
or just a firmware-imposed convention.

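To make the question concrete, here is a minimal sketch of how a
phandle cross-reference pairs a link with a GPU. It is not taken from
skiboot or from the patch: struct dt_node, find_by_phandle(),
gpu_for_link() and the example table are all hypothetical, and only
the property name "ibm,gpu" comes from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a device tree node. */
struct dt_node {
	uint32_t phandle;	/* node ID that other nodes can reference */
	const char *name;
	uint32_t gpu_ref;	/* assumed "ibm,gpu" value; 0 if absent */
};

/* Resolve a phandle to its node by linear search. */
static const struct dt_node *find_by_phandle(const struct dt_node *nodes,
					     size_t n, uint32_t phandle)
{
	assert(nodes != NULL);

	for (size_t i = 0; i < n; i++)
		if (nodes[i].phandle == phandle)
			return &nodes[i];
	return NULL;
}

/* Follow a link node's "ibm,gpu" reference to the GPU node it names. */
static const struct dt_node *gpu_for_link(const struct dt_node *nodes,
					  size_t n,
					  const struct dt_node *link)
{
	if (link == NULL || link->gpu_ref == 0)
		return NULL;
	return find_by_phandle(nodes, n, link->gpu_ref);
}

/* Made-up example data: two NVLink bridges that both name one GPU. */
static const struct dt_node example_nodes[] = {
	{ .phandle = 0x10, .name = "gpu@0",  .gpu_ref = 0 },
	{ .phandle = 0x20, .name = "link@0", .gpu_ref = 0x10 },
	{ .phandle = 0x21, .name = "link@1", .gpu_ref = 0x10 },
};
```

The lookup behaves identically whether the 0x10 in that table was
dictated by board wiring or simply chosen by firmware, which is why
the device tree alone does not answer the question above.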
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson