[PATCH kernel 2/2] powerpc/powernv/ioda2: Delay PE disposal
Alexey Kardashevskiy
aik at ozlabs.ru
Tue Apr 26 12:29:21 AEST 2016
On 04/21/2016 01:20 PM, Alexey Kardashevskiy wrote:
> On 04/21/2016 10:21 AM, Gavin Shan wrote:
>> On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote:
>>> When SRIOV is disabled, the existing code presumes there is no
>>> virtual function (VF) in use and destroys all associated PEs.
>>> However it is possible for the user to disable SRIOV while a VF is
>>> still in use via VFIO. For example, unbinding a physical function
>>> (PF) while a guest is running with a VF passed through via VFIO
>>> will trigger the bug.
>>>
>>> This defines an IODA2-specific IOMMU group release() callback and
>>> moves all the disposal code from pnv_ioda_release_vf_PE() to this
>>> new callback, so the cleanup happens when the last user of the IOMMU
>>> group releases its reference.
>>>
>>> As pnv_pci_ioda2_release_dma_pe() was reduced to just calling
>>> iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe()
>>> into pnv_ioda_release_vf_PE().
>>>
>>
>> Sorry, I don't understand how it works. When the PF's driver disables
>> the IOV capability, the VF cannot work. The guest is unlikely to know
>> that and will continue accessing the VF's resources (e.g. config
>> space and MMIO registers). That would cause EEH errors.
>
> The host disables IOV, which removes the VF devices, which unbinds the
> vfio-pci driver and does all the cleanup; eventually we get to QEMU's
> vfio_req_notifier_handler(), PCI hot unplug is initiated and the device
> disappears from the guest.
>
> If the guest cannot do PCI hot unplug, then EEH will make the host stop it anyway.
>
> Here we do not really care what happens to the guest (it can detect EEH or
> hot unplug or simply crash); we need to make sure that the _host_ does not
> crash in any case, because the root user did something weird.
Ping?
>
>
>>
>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>> ---
>>> arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------
>>> 1 file changed, 14 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> index ce9f2bf..8108c54 100644
>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
>>> static void pnv_pci_ioda2_group_release(void *iommu_data)
>>> {
>>> struct iommu_table_group *table_group = iommu_data;
>>> + struct pnv_ioda_pe *pe = container_of(table_group,
>>> + struct pnv_ioda_pe, table_group);
>>> + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus);
>>
>> pe->parent_dev would be NULL for non-VF-PEs, and it's protected by
>> CONFIG_PCI_IOV in pci.h.
>
>
> Yeah, I'll fix it.
>
>>
>>> + struct pnv_phb *phb = hose->private_data;
>>> + struct iommu_table *tbl = pe->table_group.tables[0];
>>> + int64_t rc;
>>>
>>> - table_group->group = NULL;
>>> -}
>>> -
>>> -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
>>> -{
>>> - struct iommu_table *tbl;
>>> - int64_t rc;
>>> -
>>> - tbl = pe->table_group.tables[0];
>>> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>>> if (rc)
>>> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
>>>
>>> pnv_pci_ioda2_set_bypass(pe, false);
>>> - if (pe->table_group.group) {
>>> - iommu_group_put(pe->table_group.group);
>>> - BUG_ON(pe->table_group.group);
>>> - }
>>> +
>>> + BUG_ON(!tbl);
>>> pnv_pci_ioda2_table_free_pages(tbl);
>>> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>>> + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node));
>>> +
>>> + pnv_ioda_deconfigure_pe(phb, pe);
>>> + pnv_ioda_free_pe(phb, pe->pe_number);
>>> }
>>
>> It's not quite correct. One PE is comprised of DMA, MMIO, mapping info, etc.
>> This function disposes of all of them when DMA finishes its job. I can't
>> figure out a better way to represent all of them and their relationship.
>> I guess it's worth having something in the long term, though it's not
>> trivial work.
>
>
> Sorry, I am missing your point here. I am not changing the resource
> deallocation here, I am just doing it slightly later, and all I wonder at
> the moment is whether there are races - like having two scripts, one doing
> unbind PF and another doing bind PF - will this crash the host in theory?
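The race being asked about could be exercised with something like the following sysfs fragment; the PCI address and the driver name are hypothetical stand-ins, and `sriov_numvfs` is the standard sysfs knob for toggling SRIOV on a PF:

```shell
# Hypothetical PF address and driver name; adjust for the real device.
PF=0000:01:00.0
DRV=example_pf_driver

# Script 1: repeatedly disable and re-enable SRIOV on the PF.
while true; do
	echo 0  > /sys/bus/pci/devices/$PF/sriov_numvfs
	echo 16 > /sys/bus/pci/devices/$PF/sriov_numvfs
done &

# Script 2: concurrently unbind and rebind the PF's driver.
while true; do
	echo $PF > /sys/bus/pci/drivers/$DRV/unbind
	echo $PF > /sys/bus/pci/drivers/$DRV/bind
done &
```

With the patch, the PE disposal in the first loop can now be deferred until the last IOMMU group reference drops, so the concern is whether the second loop can re-create PEs while a deferred disposal is still pending.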
>
>
>>
>>>
>>> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>>> @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>>> if (pe->parent_dev != pdev)
>>> continue;
>>>
>>> - pnv_pci_ioda2_release_dma_pe(pdev, pe);
>>> -
>>> /* Remove from list */
>>> mutex_lock(&phb->ioda.pe_list_mutex);
>>> list_del(&pe->list);
>>> mutex_unlock(&phb->ioda.pe_list_mutex);
>>>
>>> - pnv_ioda_deconfigure_pe(phb, pe);
>>> -
>>> - pnv_ioda_free_pe(phb, pe->pe_number);
>>> + if (pe->table_group.group)
>>> + iommu_group_put(pe->table_group.group);
>>> }
>>> }
>>>
>>> --
>>> 2.5.0.rc3
>>>
>>
>
>
--
Alexey