[RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

Wed Aug 1 18:37:35 AEST 2018

On 01/08/2018 00:29, Alex Williamson wrote:
> On Tue, 31 Jul 2018 14:03:35 +1000
> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> 
>> On 31/07/2018 02:29, Alex Williamson wrote:
>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
>>>> After some local discussions, it was pointed out that force disabling
>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>> unless a good guest enabled the link but won't happen with a well
>>>> behaving guest. And if two guests became malicious, then can still only
>>>> harm each other, and so can they via other ways such network. This is
>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>> behaving device cannot firewall itself from peers as it is up to the
>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>> network.
>>>>
>>>> Although it would be a nice feature to have an extra barrier between
>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>> V100 pass through?  
>>>
>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>> specific routes configured?   
>>
>> The GPU-GPU links need not to be blocked and need to be enabled
>> (==trained) by a driver in the guest. There are no routes between GPUs
>> in NVLink fabric, these are direct links, it is just a switch on each
>> side, both switches need to be on for a link to work.
> 
> Ok, but there is at least the possibility of multiple direct links per
> GPU, the very first diagram I find of NVlink shows 8 interconnected
> GPUs:
> 
> https://www.nvidia.com/en-us/data-center/nvlink/

Out design is like the left part of the picture but it is just a detail.

> So if each switch enables one direct, point to point link, how does the
> guest know which links to open for which peer device?

It uses PCI config space on GPUs to discover the topology.

> And of course
> since we can't see the spec, a security audit is at best hearsay :-\

Yup, the exact discovery protocol is hidden.

>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>> is controlled via the emulated PCI bridges which I pass through together
>> with the GPU.
> 
> So there's a special emulated switch, is that how the guest knows which
> GPUs it can enable NVLinks to?

Since it only has PCI config space (there is nothing relevant in the
device tree at all), I assume (double checking with the NVIDIA folks
now) the guest driver enables them all, tests which pair works and
disables the ones which do not. This gives a malicious guest a tiny
window of opportunity to break into a good guest. Hm :-/

>>> If the former, then isn't a non-malicious
>>> guest still susceptible to a malicious guest?  
>>
>> A non-malicious guest needs to turn its switch on for a link to a GPU
>> which belongs to a malicious guest.
> 
> Actual security, or obfuscation, will we ever know...
>>>> If the latter, how is
>>> routing configured by the guest given that the guest view of the
>>> topology doesn't match physical hardware?  Are these routes
>>> deconfigured by device reset?  Are they part of the save/restore
>>> state?  Thanks,  
> 
> Still curious what happens to these routes on reset.  Can a later user
> of a GPU inherit a device where the links are already enabled?  Thanks,

I am told that the GPU reset disables links. As a side effect, we get an
HMI (a hardware fault which reset the host machine) when trying
accessing the GPU RAM which indicates that the link is down as the
memory is only accessible via the nvlink. We have special fencing code
in our host firmware (skiboot) to fence this memory on PCI reset so
reading from it returns zeroes instead of HMIs.

-- 
Alexey