[PATCH kernel v3] powerpc/powernv: Isolate NVLinks between GV100GL on Witherspoon

Alexey Kardashevskiy aik at ozlabs.ru
Tue Apr 30 16:14:35 AEST 2019



On 30/04/2019 15:45, Alistair Popple wrote:
> Alexey,
> 
>>>>> +void pnv_try_isolate_nvidia_v100(struct pci_dev *bridge)
>>>>> +{
>>>>> +	u32 mask, val;
>>>>> +	void __iomem *bar0_0, *bar0_120000, *bar0_a00000;
>>>>> +	struct pci_dev *pdev;
>>>>> +	u16 cmd = 0, cmdmask = PCI_COMMAND_MEMORY;
>>>>> +
>>>>> +	if (!bridge->subordinate)
>>>>> +		return;
>>>>> +
>>>>> +	pdev = list_first_entry_or_null(&bridge->subordinate->devices,
>>>>> +			struct pci_dev, bus_list);
>>>>> +	if (!pdev)
>>>>> +		return;
>>>>> +
>>>>> +	if (pdev->vendor != PCI_VENDOR_ID_NVIDIA)
> 
> Don't you also need to check the PCIe devid to match only [PV]100 devices as 
> well? I doubt there's any guarantee these registers will remain the same for 
> all future (or older) NVIDIA devices.


I do not have the complete list of IDs (I have already seen 3 different
device IDs), and this only works on machines with the ibm,npu/gpu/nvlink
properties, so for now it works. For the future we are hoping to either
have an open source nvidia driver or some small minidriver (also from
nvidia, or maybe a spec allowing us to write one) to allow topology
discovery on the host, so we would not depend on skiboot's powernv DT.
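
If we ever do get a complete list, gating on it is easy enough; a
minimal sketch (the device IDs below are placeholders, not verified
GV100GL SKUs) would be:

static const struct pci_device_id nvlinkgpu_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x1db1) }, /* placeholder device id */
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x1db5) }, /* placeholder device id */
	{ 0, }
};

	/* ... in pnv_try_isolate_nvidia_v100(), instead of the bare vendor check: */
	if (!pci_match_id(nvlinkgpu_ids, pdev))
		return;

Until then the ibm,npu/gpu/nvlink properties are the effective gate.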

> IMHO this should really be done in the device driver in the guest. A malicious 
> guest could load a modified driver that doesn't do this, but that should not 
> compromise other guests, which presumably load a non-compromised driver that 
> disables the links on that guest's GPU. However I guess in practice what you 
> have here should work equally well.

Doing it in the guest means a well-behaved guest needs an updated
driver, and we do not really want to depend on that. The idea of IOMMU
groups is that the hypervisor provides isolation irrespective of what
the guest does.

Also, vfio+qemu+slof would need to convey the NVLink topology to the
guest, which seems like an unnecessary complication.



> - Alistair
> 
>>>>> +		return;
>>>>> +
>>>>> +	mask = nvlinkgpu_get_disable_mask(&pdev->dev);
>>>>> +	if (!mask)
>>>>> +		return;
>>>>> +
>>>>> +	bar0_0 = pci_iomap_range(pdev, 0, 0, 0x10000);
>>>>> +	if (!bar0_0) {
>>>>> +		pci_err(pdev, "Error mapping BAR0 @0\n");
>>>>> +		return;
>>>>> +	}
>>>>> +	bar0_120000 = pci_iomap_range(pdev, 0, 0x120000, 0x10000);
>>>>> +	if (!bar0_120000) {
>>>>> +		pci_err(pdev, "Error mapping BAR0 @120000\n");
>>>>> +		goto bar0_0_unmap;
>>>>> +	}
>>>>> +	bar0_a00000 = pci_iomap_range(pdev, 0, 0xA00000, 0x10000);
>>>>> +	if (!bar0_a00000) {
>>>>> +		pci_err(pdev, "Error mapping BAR0 @A00000\n");
>>>>> +		goto bar0_120000_unmap;
>>>>> +	}
>>>>
>>>> Is it really necessary to do three separate ioremaps vs one that would
>>>> cover them all here?  I suspect you're just sneaking in PAGE_SIZE with
>>>> the 0x10000 size mappings anyway.  Seems like it would simplify setup,
>>>> error reporting, and cleanup to to ioremap to the PAGE_ALIGN'd range
>>>> of the highest register accessed. Thanks,
>>>
>>> Sure, I can map it once; I just do not see the point in mapping/unmapping
>>> all 0xa10000>>16=161 system pages for a very short period of time when
>>> we know precisely that we need just 3 pages.
>>>
>>> Repost?
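
For the archive, the single-mapping variant Alex suggested would look
roughly like this (untested sketch; 0xA00148 is the last register the
sequence touches, plus 4 bytes for the access itself):

	void __iomem *bar0;

	/* One PAGE_ALIGN'd mapping covering every register we touch */
	bar0 = pci_iomap_range(pdev, 0, 0, PAGE_ALIGN(0xA00000 + 0x14C));
	if (!bar0) {
		pci_err(pdev, "Error mapping BAR0\n");
		return;
	}
	/* ... */
	iowrite32(0x4, bar0 + 0x120000 + 0x4C);
	iowrite32(0x2, bar0 + 0x120000 + 0x2204);
	/* ... */
	pci_iounmap(pdev, bar0);

With 64K pages that is the 161 pages mentioned above, versus the three
single-page windows the patch maps now.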
>>
>> Ping?
>>
>> Can this go in as it is (i.e. should I ping Michael) or does it need
>> another round? It would be nice to get some formal acks. Thanks,
>>
>>>> Alex
>>>>
>>>>> +
>>>>> +	pci_restore_state(pdev);
>>>>> +	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
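>>>>> +	/* Make sure memory decoding is enabled for the MMIO sequence below */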
>>>>> +	if ((cmd & cmdmask) != cmdmask)
>>>>> +		pci_write_config_word(pdev, PCI_COMMAND, cmd | cmdmask);
>>>>> +
>>>>> +	/*
>>>>> +	 * The sequence is from "Tesla P100 and V100 SXM2 NVLink Isolation on
>>>>> +	 * Multi-Tenant Systems".
>>>>> +	 * The register names are not provided there either, hence raw values.
>>>>> +	 */
>>>>> +	iowrite32(0x4, bar0_120000 + 0x4C);
>>>>> +	iowrite32(0x2, bar0_120000 + 0x2204);
>>>>> +	val = ioread32(bar0_0 + 0x200);
>>>>> +	val |= 0x02000000;
>>>>> +	iowrite32(val, bar0_0 + 0x200);
>>>>> +	val = ioread32(bar0_a00000 + 0x148);
>>>>> +	val |= mask;
>>>>> +	iowrite32(val, bar0_a00000 + 0x148);
>>>>> +
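>>>>> +	/* Restore the original command register if we changed it above */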
>>>>> +	if ((cmd | cmdmask) != cmd)
>>>>> +		pci_write_config_word(pdev, PCI_COMMAND, cmd);
>>>>> +
>>>>> +	pci_iounmap(pdev, bar0_a00000);
>>>>> +bar0_120000_unmap:
>>>>> +	pci_iounmap(pdev, bar0_120000);
>>>>> +bar0_0_unmap:
>>>>> +	pci_iounmap(pdev, bar0_0);
>>>>> +}
> 
> 

-- 
Alexey

