[PATCH 01/11] powerpc/powernv/ioda: Fix ref count for devices with their own PE

Alistair Popple alistair at popple.id.au
Mon Nov 18 13:36:29 AEDT 2019


On Monday, 18 November 2019 12:24:24 PM AEDT Oliver O'Halloran wrote:
> On Mon, Nov 18, 2019 at 12:06 PM Alistair Popple <alistair at popple.id.au> 
wrote:
> >
> > On Wednesday, 13 November 2019 4:38:21 AM AEDT Frederic Barrat wrote:
> > >
> > > However, one question is whether this patch breaks nvlink and if nvlink
> > > assumes the devices won’t go away because we explicitly take a reference
> > > forever. In npu_dma.c, there are 2 functions which allow to find the GPU
> > > associated to a npu device, and vice-versa. Both functions return a
> > > pointer to a struct pci_dev, but they don’t take a reference on the
> > > device being returned. So that seems dangerous. I’m probably missing
> > > something.
> > >
> > > Alexey, Alistair: what, if anything, guarantees, that the npu or gpu
> > > devices stay valid. Is it because we simply don’t provide any means to
> > > get rid of them ? Otherwise, don’t we need the callers of
> > > pnv_pci_get_gpu_dev() and pnv_pci_get_npu_dev() to worry about reference
> > > counting ? I’ve started looking into it and the changes are scary, which
> > > explains Greg’s related commit 02c5f5394918b.
> >
> > To be honest the reference counting looks like it has evolved into 
something
> > quite suspect and I don't think you're missing anything. In practice 
though we
> > likely haven't hit any issues because the original callers didn't store
> > references to the pdev which would make the window quite small (although 
the
> > pass through work may have changed that). And as you say there simply 
wasn't
> > any means to get rid of them anyway (EEH, hotplug, etc. has never been
> > implemented or supported for GPUs and all sorts of terrible things happen 
if
> > you try).
> 
> In other words: leaking a ref is the only safe thing to do.

Correct.

> > The best solution would likely be to review the reference counting and to
> > teach callers of get_*_dev() to call pci_put_dev(), etc.
> 
> The issue is that the two callers of get_pci_dev() are non-GPL
> exported symbols so we don't know what's calling them or what their
> expectations are. We be doing whatever makes sense for OpenCAPI and if
> that happens to cause a problem for someone else, then they can deal
> with it.

The issue isn't that it's exported non-GPL vs. GPL, rather that there are 
probably out-of-tree modules like the NVIDIA driver using it. However as you 
say supporting out-of-tree modules is not generally a concern for kernel 
developers so we certainly shouldn't let that prevent us doing a fix.

- Alistair

> Oliver






More information about the Linuxppc-dev mailing list