[Skiboot] nvlink2 topology

Alistair Popple alistair@popple.id.au
Fri Jul 27 13:13:32 AEST 2018


On Friday, 27 July 2018 11:12:32 AM AEST Alexey Kardashevskiy wrote:
> 
> On 26/07/2018 18:08, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 26/07/2018 16:38, Alistair Popple wrote:
> >> On Thursday, 26 July 2018 4:10:21 PM AEST Alexey Kardashevskiy wrote:
> >>>
> >>> On 26/07/2018 14:34, Alistair Popple wrote:
> >>>> Hi Alexey,
> >>>>
> >>>> On Thursday, 26 July 2018 12:56:20 PM AEST Alexey Kardashevskiy wrote:
> >>>>>
> >>>>> On 26/07/2018 03:53, Reza Arbab wrote:
> >>>>>> On Tue, Jul 24, 2018 at 12:12:43AM +1000, Alexey Kardashevskiy wrote:
> >>>>>>> But before I try this, the existing tree seems to have a problem at
> >>>>>>> (same with another xscom node):
> >>>>>>> /sys/firmware/devicetree/base/xscom@603fc00000000/npu@5011000
> >>>>>>> ./link@4/ibm,slot-label
> >>>>>>>                 "GPU2"
> >>>>>>> ./link@2/ibm,slot-label
> >>>>>>>                 "GPU1"
> >>>>>>> ./link@0/ibm,slot-label
> >>>>>>>                 "GPU0"
> >>>>>>> ./link@5/ibm,slot-label
> >>>>>>>                 "GPU2"
> >>>>>>> ./link@3/ibm,slot-label
> >>>>>>>                 "GPU1"
> >>>>>>> ./link@1/ibm,slot-label
> >>>>>>>                 "GPU0"
> >>>>>>>
> >>>>>>> This comes from hostboot.
> >>>>>>> Witherspoon_Design_Workbook_v1.7_19June2018.pdf on page 39 suggests that
> >>>>>>> link@3 and link@5 should be swapped. Which one is correct?
> >>>>>>
> >>>>>> I would think link@3 should be "GPU2" and link@5 should be "GPU1".
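
For what it's worth, a quick way to double-check what firmware actually exposes
is to dump every ibm,slot-label under the npu nodes and compare against the
workbook. A minimal host-side sketch (plain C, assuming the usual
/sys/firmware/devicetree mount point; this is not skiboot code):

#include <glob.h>
#include <stdio.h>

int main(void)
{
        glob_t g;
        size_t i;

        /* ibm,slot-label is a NUL-terminated string property */
        if (glob("/sys/firmware/devicetree/base/xscom@*/npu@*/link@*/ibm,slot-label",
                 0, NULL, &g))
                return 1;

        for (i = 0; i < g.gl_pathc; i++) {
                char buf[64];
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (!f)
                        continue;
                size_t n = fread(buf, 1, sizeof(buf) - 1, f);
                buf[n] = '\0';
                printf("%s: %s\n", g.gl_pathv[i], buf);
                fclose(f);
        }
        globfree(&g);
        return 0;
}
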
> >>>>
> >>>> The link numbering in the device-tree is based on CPU NDL link index. As the
> >>>> workbook does not contain CPU link indices
> >>>
> >>> It does, page 39.
> >>
> >> Where? I see the GPU link numbers in the GPU boxes on the right but none on the
> >> CPU side (yellow boxes on the left). The CPU side only has PHY lane masks
> >> listed. The numbers in the GPU boxes are GPU link numbers.
> > 
> > 
> > Ah, counting them from top to bottom does not work. Anyway, I got this
> > from Ryan:
> > 
> > P90_0 -> GPU0_1; P90_1 -> GPU0_5; P90_2 -> GPU1_1; P90_5 -> GPU1_5;
> > P90_4 -> GPU2_3; P90_3 -> GPU2_5
> > 
> > and he could not tell what document this is from. And it was
> > specifically mentioned that 'nvlinks 3 and 5 are "swapped"'.
> 
> 
> Update:
> witherspoon_seq_red.ppt has these mappings; they are NDL.
> 
> 
> 
> >>>> I suspect you are mixing these up
> >>>> with the GPU link numbers which are shown. The device-tree currently contains no
> >>>> information on what the GPU side link numbers are.
> >>>
> >>> Correct, this is what I want to add.
> >>>
> >>>>>> If so, it's a little surprising that this hasn't broken anything. The
> >>>>>> driver has its own way of discovering what connects to what, so maybe
> >>>>>> there really just isn't a consumer of these labels yet.
> >>>>
> >>>> You need to be careful what you are referring to here - PHY link index, NDL link
> >>>> index or NTL link index. The lane-mask corresponds to the PHY link index which
> >>>> is different to the CPU NDL/NTL link index as there are multiple muxes which
> >>>> switch these around.
> >>>
> >>> So what are the link@x nodes about? PHY, NDL, NTL? The workbook does not
> >>> mention NDL/NTL. What links does page 39 refer to?
> >>
> >> The link nodes are about NTL index.
> > 
> > What is swapped from my comment above? Or is it totally irrelevant?
> 
> 
> Figured it out, it is NDL. Which spec describes this relationship?
> 
> 
> >>>>> Can you please 1) make sure we do understand things right and these are
> >>>>> not some weird muxes somewhere between GPU and P9, and 2) fix it? Thanks :)
> >>>>
> >>>> I don't think there is anything to fix here. On your original question we have
> >>>> no knowledge of GPU<->GPU link topology so you would need to either hard code
> >>>> this in Skiboot or get it added to the HDAT.
> >>>
> >>> So which one is it then - HDAT or Skiboot?
> >>
> >> Perhaps Oliver or Stewart has an opinion here? Ideally this would be in HDAT and
> >> encoded in the MRW. In practice HDAT seems to just hardcode things anyway, so I'm
> >> not sure what value there is in putting it there, and a hardcoded platform-specific
> >> table in Skiboot might be no worse.
> > 
> > They do not, it is either you or Reza ;)

Well I don't especially care either :-) A Skiboot table is probably easier for
you to implement.
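
To sketch what I mean, something along these lines would probably be enough.
The struct and function names are hypothetical, not existing skiboot symbols,
and the GPU-side link numbers come straight from the NDL mapping Ryan gave you,
so correct them if that turns out to be wrong. The GPU<->GPU links would need
extra entries taken from the workbook, since nothing in this thread describes
them:

/* Witherspoon: CPU NDL link index -> GPU and GPU-side link number.
 * Sketch only; the mapping below is the one quoted earlier in this thread. */
struct wspoon_npu_link {
        int ndl_index;          /* CPU-side NDL index, i.e. link@N in the DT */
        const char *slot_label; /* matches ibm,slot-label */
        int gpu_link;           /* GPU-side link number */
};

static const struct wspoon_npu_link wspoon_npu_links[] = {
        { 0, "GPU0", 1 },
        { 1, "GPU0", 5 },
        { 2, "GPU1", 1 },
        { 5, "GPU1", 5 },
        { 4, "GPU2", 3 },
        { 3, "GPU2", 5 },
};

/* Return the GPU-side link number for a given NDL index, or -1 if unknown. */
static int wspoon_gpu_link_for_ndl(int ndl_index)
{
        unsigned int i;

        for (i = 0; i < sizeof(wspoon_npu_links) / sizeof(wspoon_npu_links[0]); i++)
                if (wspoon_npu_links[i].ndl_index == ndl_index)
                        return wspoon_npu_links[i].gpu_link;
        return -1;
}

Skiboot could then export the GPU-side number as an extra property alongside
ibm,slot-label in each link@N node; the exact property name is up to you and
Reza.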

> >>>> Or better yet get the driver enhanced so that it uses its own topology
> >>>> detection to only bring up CPU->GPU links in the virtualised pass-thru case.
> >>>
> >>> How? Enhancing VFIO or IODA2 with topology detection does not seem
> >>> possible without a document describing it. And we do not need to detect
> >>> anything; we actually know exactly what the topology is, from the workbook.
> >>
> >> Enhance the NVIDIA Device Driver. The device driver running in the guest should
> >> be able to determine which links are CPU-GPU vs. GPU-GPU links and disable just
> >> the GPU-GPU links.
> > 
> > 
> > No, we do not want to trust the guest to do the right thing.

What do you mean? The guest is the thing running the driver, so you have to trust
it to do the right thing. Otherwise, how can you trust it to give you the correct
answers?

Perhaps I should elaborate. As I understand things, you are concerned about a
malicious guest gaining access to other guests via the GPU-GPU links. If the
driver in the normal guest does not enable these links, it shouldn't matter what
a malicious guest does. Even if the malicious guest enables the links on its side,
the normal guest has its links disabled, so nothing will get in/out.

- Alistair


