[PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device
Alexander Graf
agraf at suse.de
Fri May 23 16:55:15 EST 2014
> Am 23.05.2014 um 06:37 schrieb Gavin Shan <gwshan at linux.vnet.ibm.com>:
>
>> On Thu, May 22, 2014 at 09:10:53PM -0600, Alex Williamson wrote:
>>> On Thu, 2014-05-22 at 18:23 +1000, Gavin Shan wrote:
>>> The patch adds new IOCTL commands for VFIO PCI device to support
>>> EEH functionality for PCI devices, which have been passed through
>>> from host to somebody else via VFIO.
>>>
>>> Signed-off-by: Gavin Shan <gwshan at linux.vnet.ibm.com>
>>> ---
>>> Documentation/vfio.txt | 88 ++++++++++-
>>> arch/powerpc/include/asm/eeh.h | 17 +++
>>> arch/powerpc/kernel/eeh.c | 321 +++++++++++++++++++++++++++++++++++++++++
>>> drivers/vfio/pci/vfio_pci.c | 131 ++++++++++++++++-
>>> include/uapi/linux/vfio.h | 53 +++++++
>>> 5 files changed, 603 insertions(+), 7 deletions(-)
>>
>> Maybe a chicken and egg problem, but it seems like we could split the
>> platform code and vfio code into separate patches.
>
> Ok. I'll keep egg/chicken separated in next revision.
>
>>>
>>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>> index b9ca023..dd13db6 100644
>>> --- a/Documentation/vfio.txt
>>> +++ b/Documentation/vfio.txt
>>> @@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
>>> an excellent performance which has limitations such as inability to do
>>> locked pages accounting in real time.
>>>
>>> -So 3 additional ioctls have been added:
>>> +4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
>>> +which works on the basis of additional ioctl commands.
>>> +
>>> +So 8 additional ioctls have been added:
>>>
>>> VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>>> of the DMA window on the PCI bus.
>>> @@ -316,6 +319,20 @@ So 3 additional ioctls have been added:
>>>
>>> VFIO_IOMMU_DISABLE - disables the container.
>>>
>>> + VFIO_EEH_PE_SET_OPTION - enables or disables EEH functinality on the
>>> + specified device. Also, it can be used to remove IO or DMA
>>> + stopped state on the frozen PE.
>>> +
>>> + VFIO_EEH_PE_GET_ADDR - retrieve the unique address of the specified
>>> + PE or query PE sharing mode.
>>> +
>>> + VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
>>> +
>>> + VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
>>> + error recovering.
>>> +
>>> + VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
>>> + one of the major steps for error recoverying.
>>>
>>> The code flow from the example above should be slightly changed:
>>>
>>> @@ -346,6 +363,75 @@ The code flow from the example above should be slightly changed:
>>> ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>>> .....
>>>
>>> +Based on the initial example we have, the following piece of code could be
>>> +reference for EEH setup and error handling:
>>> +
>>> + struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
>>> + struct vfio_eeh_pe_get_addr addr = { .argsz = sizeof(addr) };
>>> + struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
>>> + struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
>>> + struct vfio_eeh_pe_configure config = { .argsz = sizeof(config) };
>>> +
>>> + ....
>>> +
>>> + /* Get a file descriptor for the device */
>>> + device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
>>> +
>>> + /* Enable the EEH functionality on the device */
>>> + option.option = EEH_OPT_ENABLE;
>>> + ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
>>> +
>>> + /* Retrieve PE address and create and maintain PE by yourself */
>>> + addr.option = EEH_OPT_GET_PE_ADDR;
>>> + ioctl(device, VFIO_EEH_PE_GET_ADDR, &addr);
>>> +
>>> + /* Assure EEH is supported on the PE and make PE functional */
>>> + ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
>>> +
>>> + /* Save device's state. pci_save_state() would be good enough
>>> + * as an example.
>>> + */
>>> +
>>> + /* Test and setup the device */
>>> + ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
>>> +
>>> + ....
>>> +
>>> + /* When 0xFF's returned from reading PCI config space or IO BARs
>>> + * of the PCI device. Check the PE state to see if that has been
>>> + * frozen.
>>> + */
>>> + ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
>>
>> There's no notification, the user needs to observe the return value an
>> poll? Should we be enabling an eventfd to notify the user of the state
>> change?
>
> Yes. The user needs to monitor the return value. we should have one notification,
> but it's for later as we discussed :-)
>
>>> +
>>> + /* Waiting for pending PCI transactions to be completed and don't
>>> + * produce any more PCI traffic from/to the affected PE until
>>> + * recovery is finished.
>>> + */
>>> +
>>> + /* Enable IO for the affected PE and collect logs. Usually, the
>>> + * standard part of PCI config space, AER registers are dumped
>>> + * as logs for further analysis.
>>> + */
>>> + option.option = EEH_OPT_THAW_MMIO;
>>> + ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
>>
>> How does the guest learn about the error? Does it need to?
>
> When guest detects 0xFF's from reading PCI config space or IO, it's going
> check the device (PE) state. If the device (PE) has been put into frozen
> state, the recovery will be started.
>
>>> +
>>> + /* Issue PE reset */
>>> + reset.option = EEH_RESET_HOT;
>>> + ioctl(device, VFIO_EEH_PE_RESET, &reset);
>>> +
>>> + /* Configure the PCI bridges for the affected PE */
>>> + ioctl(device, VFIO_EEH_PE_CONFIGURE, NULL);
>>> +
>>
>> I'm not sure I see why we've split these into separate ioctls. FWIW,
>> the one ioctl we currently have for reset that takes no options is
>> probably going to be the first to get deprecated because of it.
>>
>>> + /* Restored state we saved at initialization time. pci_restore_state()
>>> + * is good enough as an example.
>>> + */
>>> +
>>> + /* Hopefully, error is recovered successfully. Now, you can resume to
>>> + * start PCI traffic to/from the affected PE.
>>> + */
>>> +
>>> + ....
>>> +
>>> -------------------------------------------------------------------------------
>>>
>>> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>>> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>> index 34a2d83..dd5f1cf 100644
>>> --- a/arch/powerpc/include/asm/eeh.h
>>> +++ b/arch/powerpc/include/asm/eeh.h
>>> @@ -191,6 +191,11 @@ enum {
>>> #define EEH_OPT_ENABLE 1 /* EEH enable */
>>> #define EEH_OPT_THAW_MMIO 2 /* MMIO enable */
>>> #define EEH_OPT_THAW_DMA 3 /* DMA enable */
>>> +#define EEH_OPT_GET_PE_ADDR 0 /* Get PE addr */
>>> +#define EEH_OPT_GET_PE_MODE 1 /* Get PE mode */
>>> +#define EEH_PE_MODE_NONE 0 /* Invalid mode */
>>> +#define EEH_PE_MODE_NOT_SHARED 1 /* Not shared */
>>> +#define EEH_PE_MODE_SHARED 2 /* Shared mode */
>>> #define EEH_STATE_UNAVAILABLE (1 << 0) /* State unavailable */
>>> #define EEH_STATE_NOT_SUPPORT (1 << 1) /* EEH not supported */
>>> #define EEH_STATE_RESET_ACTIVE (1 << 2) /* Active reset */
>>> @@ -198,6 +203,11 @@ enum {
>>> #define EEH_STATE_DMA_ACTIVE (1 << 4) /* Active DMA */
>>> #define EEH_STATE_MMIO_ENABLED (1 << 5) /* MMIO enabled */
>>> #define EEH_STATE_DMA_ENABLED (1 << 6) /* DMA enabled */
>>> +#define EEH_PE_STATE_NORMAL 0 /* Normal state */
>>> +#define EEH_PE_STATE_RESET 1 /* PE reset */
>>> +#define EEH_PE_STATE_STOPPED_IO_DMA 2 /* Stopped */
>>> +#define EEH_PE_STATE_STOPPED_DMA 4 /* Stopped DMA */
>>> +#define EEH_PE_STATE_UNAVAIL 5 /* Unavailable */
>>> #define EEH_RESET_DEACTIVATE 0 /* Deactivate the PE reset */
>>> #define EEH_RESET_HOT 1 /* Hot reset */
>>> #define EEH_RESET_FUNDAMENTAL 3 /* Fundamental reset */
>>> @@ -305,6 +315,13 @@ void eeh_add_device_late(struct pci_dev *);
>>> void eeh_add_device_tree_late(struct pci_bus *);
>>> void eeh_add_sysfs_files(struct pci_bus *);
>>> void eeh_remove_device(struct pci_dev *);
>>> +int eeh_dev_open(struct pci_dev *pdev);
>>> +void eeh_dev_release(struct pci_dev *pdev);
>>> +int eeh_pe_set_option(struct pci_dev *pdev, int option);
>>> +int eeh_pe_get_addr(struct pci_dev *pdev, int option);
>>> +int eeh_pe_get_state(struct pci_dev *pdev);
>>> +int eeh_pe_reset(struct pci_dev *pdev, int option);
>>> +int eeh_pe_configure(struct pci_dev *pdev);
>>>
>>> /**
>>> * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
>>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>> index 9c6b899..b90a474 100644
>>> --- a/arch/powerpc/kernel/eeh.c
>>> +++ b/arch/powerpc/kernel/eeh.c
>>> @@ -108,6 +108,9 @@ struct eeh_ops *eeh_ops = NULL;
>>> /* Lock to avoid races due to multiple reports of an error */
>>> DEFINE_RAW_SPINLOCK(confirm_error_lock);
>>>
>>> +/* Lock to protect passed flags */
>>> +static DEFINE_MUTEX(eeh_dev_mutex);
>>> +
>>> /* Buffer for reporting pci register dumps. Its here in BSS, and
>>> * not dynamically alloced, so that it ends up in RMO where RTAS
>>> * can access it.
>>> @@ -1098,6 +1101,324 @@ void eeh_remove_device(struct pci_dev *dev)
>>> edev->mode &= ~EEH_DEV_SYSFS;
>>> }
>>>
>>> +/**
>>> + * eeh_dev_open - Mark EEH device and PE as passed through
>>> + * @pdev: PCI device
>>> + *
>>> + * Mark the indicated EEH device and PE as passed through.
>>> + * In the result, the EEH errors detected on the PE won't be
>>> + * reported. The owner of the device will be responsible for
>>> + * detection and recovery.
>>> + */
>>> +int eeh_dev_open(struct pci_dev *pdev)
>>> +{
>>> + struct eeh_dev *edev;
>>> +
>>> + mutex_lock(&eeh_dev_mutex);
>>> +
>>> + /* No PCI device ? */
>>> + if (!pdev) {
>>> + mutex_unlock(&eeh_dev_mutex);
>>> + return -ENODEV;
>>> + }
>>> +
>>> + /* No EEH device ? */
>>> + edev = pci_dev_to_eeh_dev(pdev);
>>> + if (!edev || !edev->pe) {
>>> + mutex_unlock(&eeh_dev_mutex);
>>> + return -ENODEV;
>>> + }
>>> +
>>> + eeh_dev_set_passed(edev, true);
>>> + eeh_pe_set_passed(edev->pe, true);
>>> + mutex_unlock(&eeh_dev_mutex);
>>> +
>>> + return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_dev_open);
>>> +
>>> +/**
>>> + * eeh_dev_release - Reclaim the ownership of EEH device
>>> + * @pdev: PCI device
>>> + *
>>> + * Reclaim ownership of EEH device, potentially the corresponding
>>> + * PE. In the result, the EEH errors detected on the PE will be
>>> + * reported and handled as usual.
>>> + */
>>> +void eeh_dev_release(struct pci_dev *pdev)
>>> +{
>>> + bool release_pe = true;
>>> + struct eeh_pe *pe = NULL;
>>> + struct eeh_dev *tmp, *edev;
>>> +
>>> + mutex_lock(&eeh_dev_mutex);
>>> +
>>> + /* No PCI device ? */
>>> + if (!pdev) {
>>> + mutex_unlock(&eeh_dev_mutex);
>>> + return;
>>> + }
>>> +
>>> + /* No EEH device ? */
>>> + edev = pci_dev_to_eeh_dev(pdev);
>>> + if (!edev || !eeh_dev_passed(edev) ||
>>> + !edev->pe || !eeh_pe_passed(pe)) {
>>> + mutex_unlock(&eeh_dev_mutex);
>>> + return;
>>> + }
>>> +
>>> + /* Release device */
>>> + pe = edev->pe;
>>> + eeh_dev_set_passed(edev, false);
>>> +
>>> + /* Release PE */
>>> + eeh_pe_for_each_dev(pe, edev, tmp) {
>>> + if (eeh_dev_passed(edev)) {
>>> + release_pe = false;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + if (release_pe)
>>> + eeh_pe_set_passed(pe, false);
>>> +
>>> + mutex_unlock(&eeh_dev_mutex);
>>> +}
>>> +EXPORT_SYMBOL(eeh_dev_release);
>>> +
>>> +static int eeh_dev_check(struct pci_dev *pdev,
>>> + struct eeh_dev **pedev,
>>> + struct eeh_pe **ppe)
>>> +{
>>> + struct eeh_dev *edev;
>>> +
>>> + /* No device ? */
>>> + if (!pdev)
>>> + return -ENODEV;
>>> +
>>> + edev = pci_dev_to_eeh_dev(pdev);
>>> + if (!edev || !eeh_dev_passed(edev) ||
>>> + !edev->pe || !eeh_pe_passed(edev->pe))
>>> + return -ENODEV;
>>> +
>>> + if (pedev)
>>> + *pedev = edev;
>>> + if (ppe)
>>> + *ppe = edev->pe;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * eeh_pe_set_option - Set options for the indicated PE
>>> + * @pdev: PCI device
>>> + * @option: requested option
>>> + *
>>> + * The routine is called to enable or disable EEH functionality
>>> + * on the indicated PE, to enable IO or DMA for the frozen PE.
>>> + */
>>> +int eeh_pe_set_option(struct pci_dev *pdev, int option)
>>> +{
>>> + struct eeh_dev *edev;
>>> + struct eeh_pe *pe;
>>> + int ret = 0;
>>> +
>>> + /* Device existing ? */
>>> + ret = eeh_dev_check(pdev, &edev, &pe);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + switch (option) {
>>> + case EEH_OPT_DISABLE:
>>> + case EEH_OPT_ENABLE:
>>> + break;
>>> + case EEH_OPT_THAW_MMIO:
>>> + case EEH_OPT_THAW_DMA:
>>> + if (!eeh_ops || !eeh_ops->set_option) {
>>> + ret = -ENOENT;
>>> + break;
>>> + }
>>> +
>>> + ret = eeh_ops->set_option(pe, option);
>>> + break;
>>> + default:
>>> + pr_debug("%s: Option %d out of range (%d, %d)\n",
>>> + __func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
>>> + ret = -EINVAL;
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_pe_set_option);
>>> +
>>> +/**
>>> + * eeh_pe_get_addr - Retrieve the PE address or sharing mode
>>> + * @pdev: PCI device
>>> + * @option: option
>>> + *
>>> + * Retrieve the PE address or sharing mode.
>>> + */
>>> +int eeh_pe_get_addr(struct pci_dev *pdev, int option)
>>> +{
>>> + struct pci_bus *bus;
>>> + struct eeh_dev *edev;
>>> + struct eeh_pe *pe;
>>> + int ret = 0;
>>> +
>>> + /* Device existing ? */
>>> + ret = eeh_dev_check(pdev, &edev, &pe);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + switch (option) {
>>> + case EEH_OPT_GET_PE_ADDR:
>>> + bus = eeh_pe_bus_get(pe);
>>> + if (!bus) {
>>> + ret = -ENODEV;
>>> + break;
>>> + }
>>> +
>>> + /* PE address has format "00BBSS00" */
>>> + ret = bus->number << 16;
>>> + break;
>>> + case EEH_OPT_GET_PE_MODE:
>>> + /* Wa always have shared PE */
>>> + ret = EEH_PE_MODE_SHARED;
>>> + break;
>>> + default:
>>> + pr_debug("%s: option %d out of range (0, 1)\n",
>>> + __func__, option);
>>> + ret = -EINVAL;
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_pe_get_addr);
>>> +
>>> +/**
>>> + * eeh_pe_get_state - Retrieve PE's state
>>> + * @pdev: PCI device
>>> + *
>>> + * Retrieve the PE's state, which includes 3 aspects: enabled
>>> + * DMA, enabled IO and asserted reset.
>>> + */
>>> +int eeh_pe_get_state(struct pci_dev *pdev)
>>> +{
>>> + struct eeh_dev *edev;
>>> + struct eeh_pe *pe;
>>> + int result, ret = 0;
>>> +
>>> + /* Device existing ? */
>>> + ret = eeh_dev_check(pdev, &edev, &pe);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (!eeh_ops || !eeh_ops->get_state)
>>> + return -ENOENT;
>>> +
>>> + result = eeh_ops->get_state(pe, NULL);
>>> + if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>> + (result & EEH_STATE_DMA_ENABLED) &&
>>> + (result & EEH_STATE_MMIO_ENABLED))
>>> + ret = EEH_PE_STATE_NORMAL;
>>> + else if (result & EEH_STATE_RESET_ACTIVE)
>>> + ret = EEH_PE_STATE_RESET;
>>> + else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>> + !(result & EEH_STATE_DMA_ENABLED) &&
>>> + !(result & EEH_STATE_MMIO_ENABLED))
>>> + ret = EEH_PE_STATE_STOPPED_IO_DMA;
>>> + else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>> + (result & EEH_STATE_DMA_ENABLED) &&
>>> + !(result & EEH_STATE_MMIO_ENABLED))
>>> + ret = EEH_PE_STATE_STOPPED_DMA;
>>> + else
>>> + ret = EEH_PE_STATE_UNAVAIL;
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_pe_get_state);
>>> +
>>> +/**
>>> + * eeh_pe_reset - Issue PE reset according to specified type
>>> + * @pdev: PCI device
>>> + * @option: reset type
>>> + *
>>> + * The routine is called to reset the specified PE with the
>>> + * indicated type, either fundamental reset or hot reset.
>>> + * PE reset is the most important part for error recovery.
>>> + */
>>> +int eeh_pe_reset(struct pci_dev *pdev, int option)
>>> +{
>>> + struct eeh_dev *edev;
>>> + struct eeh_pe *pe;
>>> + int ret = 0;
>>> +
>>> + /* Device existing ? */
>>> + ret = eeh_dev_check(pdev, &edev, &pe);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset)
>>> + return -ENOENT;
>>> +
>>> + switch (option) {
>>> + case EEH_RESET_DEACTIVATE:
>>> + ret = eeh_ops->reset(pe, option);
>>> + if (ret)
>>> + break;
>>> +
>>> + /*
>>> + * The PE is still in frozen state and we need clear
>>> + * that. It's good to clear frozen state after deassert
>>> + * to avoid messy IO access during reset, which might
>>> + * cause recursive frozen PE.
>>> + */
>>> + ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
>>> + if (!ret)
>>> + ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
>>> + if (!ret)
>>> + eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
>>> + break;
>>> + case EEH_RESET_HOT:
>>> + case EEH_RESET_FUNDAMENTAL:
>>> + ret = eeh_ops->reset(pe, option);
>>> + break;
>>> + default:
>>> + pr_debug("%s: Unsupported option %d\n",
>>> + __func__, option);
>>> + ret = -EINVAL;
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_pe_reset);
>>> +
>>> +/**
>>> + * eeh_pe_configure - Configure PCI bridges after PE reset
>>> + * @pdev: PCI device
>>> + *
>>> + * The routine is called to restore the PCI config space for
>>> + * those PCI devices, especially PCI bridges affected by PE
>>> + * reset issued previously.
>>> + */
>>> +int eeh_pe_configure(struct pci_dev *pdev)
>>> +{
>>> + struct eeh_dev *edev;
>>> + struct eeh_pe *pe;
>>> + int ret = 0;
>>> +
>>> + /* Device existing ? */
>>> + ret = eeh_dev_check(pdev, &edev, &pe);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + /* Restore config space for the affected devices */
>>> + eeh_pe_restore_bars(pe);
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(eeh_pe_configure);
>>> +
>>> static int proc_eeh_show(struct seq_file *m, void *v)
>>> {
>>> if (!eeh_enabled()) {
>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>> index 7ba0424..301ac18 100644
>>> --- a/drivers/vfio/pci/vfio_pci.c
>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>> @@ -25,6 +25,9 @@
>>> #include <linux/types.h>
>>> #include <linux/uaccess.h>
>>> #include <linux/vfio.h>
>>> +#ifdef CONFIG_EEH
>>> +#include <asm/eeh.h>
>>> +#endif
>>
>> Can we make a vfio_pci_eeh file that properly stubs out anything called
>> from common code? I don't want to see all these inline ifdefs.
>
> well. do you want see drivers/vfio/vfio-pci/vfio_pci_eeh.c and export
> following functins for vfio_pci.c to use?
>
> vfio_pci_eeh_open()
> vfio_pci_eeh_release()
> vfio_pci_eeh_ioctl()
>
>
>>>
>>> #include "vfio_pci_private.h"
>>>
>>> @@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>> pci_restore_state(pdev);
>>> }
>>>
>>> +static void vfio_eeh_pci_release(struct pci_dev *pdev)
>>> +{
>>> +#ifdef CONFIG_EEH
>>> + eeh_dev_release(pdev);
>>> +#endif
>>> +}
>>> +
>>> static void vfio_pci_release(void *device_data)
>>> {
>>> struct vfio_pci_device *vdev = device_data;
>>>
>>> - if (atomic_dec_and_test(&vdev->refcnt))
>>> + if (atomic_dec_and_test(&vdev->refcnt)) {
>>> + vfio_eeh_pci_release(vdev->pdev);
>>> vfio_pci_disable(vdev);
>>> + }
>>>
>>> module_put(THIS_MODULE);
>>> }
>>>
>>> +static int vfio_eeh_pci_open(struct pci_dev *pdev)
>>> +{
>>> + int ret = 0;
>>> +
>>> +#ifdef CONFIG_EEH
>>> + ret = eeh_dev_open(pdev);
>>> +#endif
>>> + return ret;
>>> +}
>>> +
>>> static int vfio_pci_open(void *device_data)
>>> {
>>> struct vfio_pci_device *vdev = device_data;
>>> + int ret;
>>>
>>> if (!try_module_get(THIS_MODULE))
>>> return -ENODEV;
>>>
>>> if (atomic_inc_return(&vdev->refcnt) == 1) {
>>> - int ret = vfio_pci_enable(vdev);
>>> - if (ret) {
>>> - module_put(THIS_MODULE);
>>> - return ret;
>>> - }
>>> + ret = vfio_pci_enable(vdev);
>>> + if (ret)
>>> + goto error;
>>> +
>>> + ret = vfio_eeh_pci_open(vdev->pdev);
>>> + if (ret)
>>> + goto error;
>>> }
>>>
>>> return 0;
>>> +error:
>>> + module_put(THIS_MODULE);
>>> + return ret;
>>> }
>>>
>>> static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>>> @@ -321,6 +349,91 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>>> return walk.ret;
>>> }
>>>
>>> +static int vfio_eeh_pci_ioctl(struct pci_dev *pdev,
>>> + unsigned int cmd,
>>> + unsigned long arg)
>>> +{
>>> + unsigned long minsz;
>>> + int ret = 0;
>>> +
>>> +#ifdef CONFIG_EEH
>>> + switch (cmd) {
>>> + case VFIO_EEH_PE_SET_OPTION: {
>>> + struct vfio_eeh_pe_set_option option;
>>> +
>>> + minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
>>> + if (copy_from_user(&option, (void __user *)arg, minsz))
>>> + return -EFAULT;
>>> + if (option.argsz < minsz)
>>> + return -EINVAL;
>>> +
>>> + ret = eeh_pe_set_option(pdev, option.option);
>>> + break;
>>> + }
>>> + case VFIO_EEH_PE_GET_ADDR: {
>>> + struct vfio_eeh_pe_get_addr addr;
>>> +
>>> + minsz = offsetofend(struct vfio_eeh_pe_get_addr, info);
>>> + if (copy_from_user(&addr, (void __user *)arg, minsz))
>>> + return -EFAULT;
>>> + if (addr.argsz < minsz)
>>> + return -EINVAL;
>>> +
>>> + ret = eeh_pe_get_addr(pdev, addr.option);
>>> + if (ret >= 0) {
>>> + addr.info = ret;
>>> + if (copy_to_user((void __user *)arg, &addr, minsz))
>>> + return -EFAULT;
>>> + ret = 0;
>>> + }
>>> +
>>> + break;
>>> + }
>>> + case VFIO_EEH_PE_GET_STATE: {
>>> + struct vfio_eeh_pe_get_state state;
>>> +
>>> + minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
>>> + if (copy_from_user(&state, (void __user *)arg, minsz))
>>> + return -EFAULT;
>>> + if (state.argsz < minsz)
>>> + return -EINVAL;
>>> +
>>> + ret = eeh_pe_get_state(pdev);
>>> + if (ret >= 0) {
>>> + state.state = ret;
>>> + if (copy_to_user((void __user *)arg, &state, minsz))
>>> + return -EFAULT;
>>> + ret = 0;
>>> + }
>>> + break;
>>> + }
>>> + case VFIO_EEH_PE_RESET: {
>>> + struct vfio_eeh_pe_reset reset;
>>> +
>>> + minsz = offsetofend(struct vfio_eeh_pe_reset, option);
>>> + if (copy_from_user(&reset, (void __user *)arg, minsz))
>>> + return -EFAULT;
>>> + if (reset.argsz < minsz)
>>> + return -EINVAL;
>>> +
>>> + ret = eeh_pe_reset(pdev, reset.option);
>>> + break;
>>> + }
>>> + case VFIO_EEH_PE_CONFIGURE:
>>> + ret = eeh_pe_configure(pdev);
>>> + break;
>>> + default:
>>> + ret = -EINVAL;
>>> + pr_debug("%s: Cannot handle command %d\n",
>>> + __func__, cmd);
>>> + }
>>> +#else
>>> + ret = -ENOENT;
>>> +#endif
>>> +
>>> + return ret;
>>> +}
>>> +
>>> static long vfio_pci_ioctl(void *device_data,
>>> unsigned int cmd, unsigned long arg)
>>> {
>>> @@ -682,6 +795,12 @@ hot_reset_release:
>>>
>>> kfree(groups);
>>> return ret;
>>> + } else if (cmd == VFIO_EEH_PE_SET_OPTION ||
>>> + cmd == VFIO_EEH_PE_GET_ADDR ||
>>> + cmd == VFIO_EEH_PE_GET_STATE ||
>>> + cmd == VFIO_EEH_PE_RESET ||
>>> + cmd == VFIO_EEH_PE_CONFIGURE) {
>>> + return vfio_eeh_pci_ioctl(vdev->pdev, cmd, arg);
>>> }
>>>
>>> return -ENOTTY;
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index cb9023d..ef55682 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -455,6 +455,59 @@ struct vfio_iommu_spapr_tce_info {
>>>
>>> #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>>>
>>> +/*
>>> + * EEH functionality can be enabled or disabled on one specific device.
>>> + * Also, the DMA or IO frozen state can be removed from the frozen PE
>>> + * if required.
>>> + */
>>> +struct vfio_eeh_pe_set_option {
>>> + __u32 argsz;
>>> + __u32 option;
>>> +};
>>> +
>>> +#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
>>
>> What proposed ioctls are making you jump to 21?
>>
>> argsz is probably not as useful without a flags field. Otherwise the
>> caller can't indicate what the extra space is.
>
> The QEMU patches are based on Alexey's additional feature ("ddw"), which
> consumed several ioctl commands.
>
> ok. So you also prefer to remove "argsz"?
>
>>> +
>>> +/*
>>> + * Each EEH PE should have unique address to be identified. The command
>>> + * helps to retrieve the address and the sharing mode of the PE.
>>> + */
>>> +struct vfio_eeh_pe_get_addr {
>>> + __u32 argsz;
>>> + __u32 option;
>>> + __u32 info;
>>> +};
>>> +
>>> +#define VFIO_EEH_PE_GET_ADDR _IO(VFIO_TYPE, VFIO_BASE + 22)
>>> +
>>> +/*
>>> + * EEH PE might have been frozen because of PCI errors. Also, it might
>>> + * be experiencing reset for error revoery. The following command helps
>>> + * to get the state.
>>> + */
>>> +struct vfio_eeh_pe_get_state {
>>> + __u32 argsz;
>>> + __u32 state;
>>> +};
>>> +
>>> +#define VFIO_EEH_PE_GET_STATE _IO(VFIO_TYPE, VFIO_BASE + 23)
>>> +
>>> +/*
>>> + * Reset is the major step to recover problematic PE. The following
>>> + * command helps on that.
>>> + */
>>> +struct vfio_eeh_pe_reset {
>>> + __u32 argsz;
>>> + __u32 option;
>>> +};
>>> +
>>> +#define VFIO_EEH_PE_RESET _IO(VFIO_TYPE, VFIO_BASE + 24)
>>> +
>>> +/*
>>> + * One of the steps for recovery after PE reset is to configure the
>>> + * PCI bridges affected by the PE reset.
>>> + */
>>> +#define VFIO_EEH_PE_CONFIGURE _IO(VFIO_TYPE, VFIO_BASE + 25)
>>
>> What can the user do differently by making these separate ioctls?
>
> hrm, I didn't understood as well. Alex.G could have the explaination.
Alex raised the same concern as me: why separate reset and configure? When we want to recover a device, we need a reset call anyway, right?
Alex
>
>>> +
>>> /* ***************************************************************** */
>>>
>>> #endif /* _UAPIVFIO_H */
>
> Thanks,
> Gavin
>
More information about the Linuxppc-dev
mailing list