[PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

Fri May 23 10:17:30 EST 2014

On Thu, May 22, 2014 at 11:55:29AM +0200, Alexander Graf wrote:
>
>On 22.05.14 10:23, Gavin Shan wrote:
>>The patch adds new IOCTL commands for VFIO PCI device to support
>>EEH functionality for PCI devices, which have been passed through
>>from host to somebody else via VFIO.
>>
>>Signed-off-by: Gavin Shan <gwshan at linux.vnet.ibm.com>
>
>This already looks a *lot* more sane than the previous versions.
>We're slowly getting there I think ;).
>
>Ben, could you please check through the exported EEH interface itself
>and verify whether it does all the lockings correctly, uses the API
>properly and doesn't allow for a user space program to blow up the
>system?
>
>
>>---
>>  Documentation/vfio.txt         |  88 ++++++++++-
>>  arch/powerpc/include/asm/eeh.h |  17 +++
>>  arch/powerpc/kernel/eeh.c      | 321 +++++++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci.c    | 131 ++++++++++++++++-
>>  include/uapi/linux/vfio.h      |  53 +++++++
>>  5 files changed, 603 insertions(+), 7 deletions(-)
>>
>>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>index b9ca023..dd13db6 100644
>>--- a/Documentation/vfio.txt
>>+++ b/Documentation/vfio.txt
>>@@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
>>  an excellent performance which has limitations such as inability to do
>>  locked pages accounting in real time.
>>-So 3 additional ioctls have been added:
>>+4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
>>+which works on the basis of additional ioctl commands.
>>+
>>+So 8 additional ioctls have been added:
>>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>>  		of the DMA window on the PCI bus.
>>@@ -316,6 +319,20 @@ So 3 additional ioctls have been added:
>>  	VFIO_IOMMU_DISABLE - disables the container.
>>+	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functinality on the
>
>functionality
>

Typo. Will fix in next revision.

>>+		specified device. Also, it can be used to remove IO or DMA
>>+		stopped state on the frozen PE.
>>+
>>+	VFIO_EEH_PE_GET_ADDR - retrieve the unique address of the specified
>>+		PE or query PE sharing mode.
>
>What is PE?
>

will add description about that to Documentation/vfio.txt.

>>+
>>+	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
>>+
>>+	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
>>+		error recovering.
>>+
>>+	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
>>+		one of the major steps for error recoverying.
>>  The code flow from the example above should be slightly changed:
>>@@ -346,6 +363,75 @@ The code flow from the example above should be slightly changed:
>>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>>  	.....
>>+Based on the initial example we have, the following piece of code could be
>>+reference for EEH setup and error handling:
>>+
>>+	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
>>+	struct vfio_eeh_pe_get_addr addr = { .argsz = sizeof(addr) };
>>+	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
>>+	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
>>+	struct vfio_eeh_pe_configure config = { .argsz = sizeof(config) };
>>+
>>+	....
>>+
>>+	/* Get a file descriptor for the device */
>>+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
>>+
>>+	/* Enable the EEH functionality on the device */
>>+	option.option = EEH_OPT_ENABLE;
>>+	ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
>>+
>>+	/* Retrieve PE address and create and maintain PE by yourself */
>>+	addr.option = EEH_OPT_GET_PE_ADDR;
>>+	ioctl(device, VFIO_EEH_PE_GET_ADDR, &addr);
>>+
>>+	/* Assure EEH is supported on the PE and make PE functional */
>>+	ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
>>+
>>+	/* Save device's state. pci_save_state() would be good enough
>>+	 * as an example.
>>+	 */
>>+
>>+	/* Test and setup the device */
>>+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
>>+
>>+	....
>>+
>>+	/* When 0xFF's returned from reading PCI config space or IO BARs
>>+	 * of the PCI device. Check the PE state to see if that has been
>>+	 * frozen.
>>+	 */
>>+	ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
>>+
>>+	/* Waiting for pending PCI transactions to be completed and don't
>>+	 * produce any more PCI traffic from/to the affected PE until
>>+	 * recovery is finished.
>>+	 */
>>+
>>+	/* Enable IO for the affected PE and collect logs. Usually, the
>>+	 * standard part of PCI config space, AER registers are dumped
>>+	 * as logs for further analysis.
>>+	 */
>>+	option.option = EEH_OPT_THAW_MMIO;
>>+	ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
>>+
>>+	/* Issue PE reset */
>>+	reset.option = EEH_RESET_HOT;
>>+	ioctl(device, VFIO_EEH_PE_RESET, &reset);
>>+
>>+	/* Configure the PCI bridges for the affected PE */
>>+	ioctl(device, VFIO_EEH_PE_CONFIGURE, NULL);
>>+
>>+	/* Restored state we saved at initialization time. pci_restore_state()
>>+	 * is good enough as an example.
>>+	 */
>>+
>>+	/* Hopefully, error is recovered successfully. Now, you can resume to
>>+	 * start PCI traffic to/from the affected PE.
>>+	 */
>>+
>>+	....
>>+
>>  -------------------------------------------------------------------------------
>>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 34a2d83..dd5f1cf 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -191,6 +191,11 @@ enum {
>>  #define EEH_OPT_ENABLE		1	/* EEH enable	*/
>>  #define EEH_OPT_THAW_MMIO	2	/* MMIO enable	*/
>>  #define EEH_OPT_THAW_DMA	3	/* DMA enable	*/
>>+#define EEH_OPT_GET_PE_ADDR	0	/* Get PE addr	*/
>>+#define EEH_OPT_GET_PE_MODE	1	/* Get PE mode	*/
>>+#define EEH_PE_MODE_NONE	0	/* Invalid mode	*/
>>+#define EEH_PE_MODE_NOT_SHARED	1	/* Not shared	*/
>>+#define EEH_PE_MODE_SHARED	2	/* Shared mode	*/
>>  #define EEH_STATE_UNAVAILABLE	(1 << 0)	/* State unavailable	*/
>>  #define EEH_STATE_NOT_SUPPORT	(1 << 1)	/* EEH not supported	*/
>>  #define EEH_STATE_RESET_ACTIVE	(1 << 2)	/* Active reset		*/
>>@@ -198,6 +203,11 @@ enum {
>>  #define EEH_STATE_DMA_ACTIVE	(1 << 4)	/* Active DMA		*/
>>  #define EEH_STATE_MMIO_ENABLED	(1 << 5)	/* MMIO enabled		*/
>>  #define EEH_STATE_DMA_ENABLED	(1 << 6)	/* DMA enabled		*/
>>+#define EEH_PE_STATE_NORMAL		0	/* Normal state		*/
>>+#define EEH_PE_STATE_RESET		1		/* PE reset	*/
>>+#define EEH_PE_STATE_STOPPED_IO_DMA	2		/* Stopped	*/
>>+#define EEH_PE_STATE_STOPPED_DMA	4		/* Stopped DMA	*/
>>+#define EEH_PE_STATE_UNAVAIL		5		/* Unavailable	*/
>>  #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>>  #define EEH_RESET_HOT		1	/* Hot reset			*/
>>  #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
>>@@ -305,6 +315,13 @@ void eeh_add_device_late(struct pci_dev *);
>>  void eeh_add_device_tree_late(struct pci_bus *);
>>  void eeh_add_sysfs_files(struct pci_bus *);
>>  void eeh_remove_device(struct pci_dev *);
>>+int eeh_dev_open(struct pci_dev *pdev);
>>+void eeh_dev_release(struct pci_dev *pdev);
>>+int eeh_pe_set_option(struct pci_dev *pdev, int option);
>>+int eeh_pe_get_addr(struct pci_dev *pdev, int option);
>>+int eeh_pe_get_state(struct pci_dev *pdev);
>>+int eeh_pe_reset(struct pci_dev *pdev, int option);
>>+int eeh_pe_configure(struct pci_dev *pdev);
>>  /**
>>   * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 9c6b899..b90a474 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -108,6 +108,9 @@ struct eeh_ops *eeh_ops = NULL;
>>  /* Lock to avoid races due to multiple reports of an error */
>>  DEFINE_RAW_SPINLOCK(confirm_error_lock);
>>+/* Lock to protect passed flags */
>>+static DEFINE_MUTEX(eeh_dev_mutex);
>>+
>>  /* Buffer for reporting pci register dumps. Its here in BSS, and
>>   * not dynamically alloced, so that it ends up in RMO where RTAS
>>   * can access it.
>>@@ -1098,6 +1101,324 @@ void eeh_remove_device(struct pci_dev *dev)
>>  	edev->mode &= ~EEH_DEV_SYSFS;
>>  }
>>+/**
>>+ * eeh_dev_open - Mark EEH device and PE as passed through
>>+ * @pdev: PCI device
>>+ *
>>+ * Mark the indicated EEH device and PE as passed through.
>>+ * In the result, the EEH errors detected on the PE won't be
>>+ * reported. The owner of the device will be responsible for
>>+ * detection and recovery.
>>+ */
>>+int eeh_dev_open(struct pci_dev *pdev)
>>+{
>>+	struct eeh_dev *edev;
>>+
>>+	mutex_lock(&eeh_dev_mutex);
>>+
>>+	/* No PCI device ? */
>>+	if (!pdev) {
>>+		mutex_unlock(&eeh_dev_mutex);
>>+		return -ENODEV;
>>+	}
>>+
>>+	/* No EEH device ? */
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !edev->pe) {
>>+		mutex_unlock(&eeh_dev_mutex);
>>+		return -ENODEV;
>>+	}
>>+
>>+	eeh_dev_set_passed(edev, true);
>>+	eeh_pe_set_passed(edev->pe, true);
>>+	mutex_unlock(&eeh_dev_mutex);
>>+
>>+	return 0;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_dev_open);
>>+
>>+/**
>>+ * eeh_dev_release - Reclaim the ownership of EEH device
>>+ * @pdev: PCI device
>>+ *
>>+ * Reclaim ownership of EEH device, potentially the corresponding
>>+ * PE. In the result, the EEH errors detected on the PE will be
>>+ * reported and handled as usual.
>>+ */
>>+void eeh_dev_release(struct pci_dev *pdev)
>>+{
>>+	bool release_pe = true;
>>+	struct eeh_pe *pe = NULL;
>>+	struct eeh_dev *tmp, *edev;
>>+
>>+	mutex_lock(&eeh_dev_mutex);
>>+
>>+	/* No PCI device ? */
>>+	if (!pdev) {
>>+		mutex_unlock(&eeh_dev_mutex);
>>+		return;
>>+	}
>>+
>>+	/* No EEH device ? */
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !eeh_dev_passed(edev) ||
>>+	    !edev->pe || !eeh_pe_passed(pe)) {
>>+		mutex_unlock(&eeh_dev_mutex);
>>+		return;
>>+	}
>>+
>>+	/* Release device */
>>+	pe = edev->pe;
>>+	eeh_dev_set_passed(edev, false);
>>+
>>+	/* Release PE */
>>+	eeh_pe_for_each_dev(pe, edev, tmp) {
>>+		if (eeh_dev_passed(edev)) {
>>+			release_pe = false;
>>+			break;
>>+		}
>>+	}
>>+
>>+	if (release_pe)
>>+		eeh_pe_set_passed(pe, false);
>>+
>>+	mutex_unlock(&eeh_dev_mutex);
>>+}
>>+EXPORT_SYMBOL(eeh_dev_release);
>>+
>>+static int eeh_dev_check(struct pci_dev *pdev,
>>+			 struct eeh_dev **pedev,
>>+			 struct eeh_pe **ppe)
>>+{
>>+	struct eeh_dev *edev;
>>+
>>+	/* No device ? */
>>+	if (!pdev)
>>+		return -ENODEV;
>>+
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !eeh_dev_passed(edev) ||
>>+	    !edev->pe || !eeh_pe_passed(edev->pe))
>>+		return -ENODEV;
>>+
>>+	if (pedev)
>>+		*pedev = edev;
>>+	if (ppe)
>>+		*ppe = edev->pe;
>>+
>>+	return 0;
>>+}
>>+
>>+/**
>>+ * eeh_pe_set_option - Set options for the indicated PE
>>+ * @pdev: PCI device
>>+ * @option: requested option
>>+ *
>>+ * The routine is called to enable or disable EEH functionality
>>+ * on the indicated PE, to enable IO or DMA for the frozen PE.
>>+ */
>>+int eeh_pe_set_option(struct pci_dev *pdev, int option)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_dev_check(pdev, &edev, &pe);
>>+	if (ret)
>>+		return ret;
>>+
>>+	switch (option) {
>>+	case EEH_OPT_DISABLE:
>>+	case EEH_OPT_ENABLE:
>
>This deserves a comment
>

Yep. will add it, thanks.

>>+		break;
>>+	case EEH_OPT_THAW_MMIO:
>>+	case EEH_OPT_THAW_DMA:
>>+		if (!eeh_ops || !eeh_ops->set_option) {
>>+			ret = -ENOENT;
>>+			break;
>>+		}
>>+
>>+		ret = eeh_ops->set_option(pe, option);
>>+		break;
>>+	default:
>>+		pr_debug("%s: Option %d out of range (%d, %d)\n",
>>+			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
>>+		ret = -EINVAL;
>>+	}
>>+
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_pe_set_option);
>>+
>>+/**
>>+ * eeh_pe_get_addr - Retrieve the PE address or sharing mode
>>+ * @pdev: PCI device
>>+ * @option: option
>>+ *
>>+ * Retrieve the PE address or sharing mode.
>>+ */
>>+int eeh_pe_get_addr(struct pci_dev *pdev, int option)
>>+{
>>+	struct pci_bus *bus;
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_dev_check(pdev, &edev, &pe);
>>+	if (ret)
>
>Probably safer to write this as ret < 0. Positive return values are
>success now.
>

Ok. Will fix it.

>>+		return ret;
>>+
>>+	switch (option) {
>>+	case EEH_OPT_GET_PE_ADDR:
>>+		bus = eeh_pe_bus_get(pe);
>>+		if (!bus) {
>>+			ret = -ENODEV;
>>+			break;
>>+		}
>>+
>>+		/* PE address has format "00BBSS00" */
>>+		ret = bus->number << 16;
>>+		break;
>>+	case EEH_OPT_GET_PE_MODE:
>>+		/* Wa always have shared PE */
>
>Wa?
>

Basically, PE could have one single PCI device (function), or a PCI bus
(including subordinate PCI devices). we always have the later case currently.
And it's called PE in "shared mode" and I called it as "shared PE" :-)

>>+		ret = EEH_PE_MODE_SHARED;
>>+		break;
>>+	default:
>>+		pr_debug("%s: option %d out of range (0, 1)\n",
>>+			__func__, option);
>>+		ret = -EINVAL;
>>+	}
>>+
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_pe_get_addr);
>>+
>>+/**
>>+ * eeh_pe_get_state - Retrieve PE's state
>>+ * @pdev: PCI device
>>+ *
>>+ * Retrieve the PE's state, which includes 3 aspects: enabled
>>+ * DMA, enabled IO and asserted reset.
>>+ */
>>+int eeh_pe_get_state(struct pci_dev *pdev)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int result, ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_dev_check(pdev, &edev, &pe);
>>+	if (ret)
>>+		return ret;
>>+
>>+	if (!eeh_ops || !eeh_ops->get_state)
>>+		return -ENOENT;
>>+
>>+	result = eeh_ops->get_state(pe, NULL);
>>+	if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+	     (result & EEH_STATE_DMA_ENABLED) &&
>>+	     (result & EEH_STATE_MMIO_ENABLED))
>>+		ret = EEH_PE_STATE_NORMAL;
>>+	else if (result & EEH_STATE_RESET_ACTIVE)
>>+		ret = EEH_PE_STATE_RESET;
>>+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+		 !(result & EEH_STATE_DMA_ENABLED) &&
>>+		 !(result & EEH_STATE_MMIO_ENABLED))
>>+		ret = EEH_PE_STATE_STOPPED_IO_DMA;
>>+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+		 (result & EEH_STATE_DMA_ENABLED) &&
>>+		 !(result & EEH_STATE_MMIO_ENABLED))
>>+		ret = EEH_PE_STATE_STOPPED_DMA;
>>+	else
>>+		ret = EEH_PE_STATE_UNAVAIL;
>>+
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_pe_get_state);
>>+
>>+/**
>>+ * eeh_pe_reset - Issue PE reset according to specified type
>>+ * @pdev: PCI device
>>+ * @option: reset type
>>+ *
>>+ * The routine is called to reset the specified PE with the
>>+ * indicated type, either fundamental reset or hot reset.
>>+ * PE reset is the most important part for error recovery.
>>+ */
>>+int eeh_pe_reset(struct pci_dev *pdev, int option)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_dev_check(pdev, &edev, &pe);
>>+	if (ret)
>>+		return ret;
>>+
>>+	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset)
>>+		return -ENOENT;
>>+
>>+	switch (option) {
>>+	case EEH_RESET_DEACTIVATE:
>>+		ret = eeh_ops->reset(pe, option);
>>+		if (ret)
>>+			break;
>>+
>>+		/*
>>+		 * The PE is still in frozen state and we need clear
>
>to
>

Will fix, thanks.

>>+		 * that. It's good to clear frozen state after deassert
>>+		 * to avoid messy IO access during reset, which might
>>+		 * cause recursive frozen PE.
>>+		 */
>>+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
>>+		if (!ret)
>>+			ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
>>+		if (!ret)
>>+			eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
>>+		break;
>>+	case EEH_RESET_HOT:
>>+	case EEH_RESET_FUNDAMENTAL:
>>+		ret = eeh_ops->reset(pe, option);
>>+		break;
>>+	default:
>>+		pr_debug("%s: Unsupported option %d\n",
>>+			__func__, option);
>>+		ret = -EINVAL;
>>+	}
>>+
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_pe_reset);
>>+
>>+/**
>>+ * eeh_pe_configure - Configure PCI bridges after PE reset
>>+ * @pdev: PCI device
>>+ *
>>+ * The routine is called to restore the PCI config space for
>>+ * those PCI devices, especially PCI bridges affected by PE
>>+ * reset issued previously.
>
>So would it make sense to combine this with the reset call?
>

I hope to keep it as it's one of the major steps to do error
recovery. Lets keep it.

>>+ */
>>+int eeh_pe_configure(struct pci_dev *pdev)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_dev_check(pdev, &edev, &pe);
>>+	if (ret)
>>+		return ret;
>>+
>>+	/* Restore config space for the affected devices */
>>+	eeh_pe_restore_bars(pe);
>>+
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_pe_configure);
>>+
>>  static int proc_eeh_show(struct seq_file *m, void *v)
>>  {
>>  	if (!eeh_enabled()) {
>>diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>index 7ba0424..301ac18 100644
>>--- a/drivers/vfio/pci/vfio_pci.c
>>+++ b/drivers/vfio/pci/vfio_pci.c
>>@@ -25,6 +25,9 @@
>>  #include <linux/types.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>>+#ifdef CONFIG_EEH
>>+#include <asm/eeh.h>
>>+#endif
>>  #include "vfio_pci_private.h"
>>@@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>  	pci_restore_state(pdev);
>>  }
>>+static void vfio_eeh_pci_release(struct pci_dev *pdev)
>>+{
>>+#ifdef CONFIG_EEH
>>+	eeh_dev_release(pdev);
>>+#endif
>>+}
>>+
>>  static void vfio_pci_release(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>>-	if (atomic_dec_and_test(&vdev->refcnt))
>>+	if (atomic_dec_and_test(&vdev->refcnt)) {
>>+		vfio_eeh_pci_release(vdev->pdev);
>>  		vfio_pci_disable(vdev);
>>+	}
>>  	module_put(THIS_MODULE);
>>  }
>>+static int vfio_eeh_pci_open(struct pci_dev *pdev)
>>+{
>>+	int ret = 0;
>>+
>>+#ifdef CONFIG_EEH
>>+	ret = eeh_dev_open(pdev);
>>+#endif
>>+	return ret;
>>+}
>>+
>>  static int vfio_pci_open(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>>+	int ret;
>>  	if (!try_module_get(THIS_MODULE))
>>  		return -ENODEV;
>>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
>>-		int ret = vfio_pci_enable(vdev);
>>-		if (ret) {
>>-			module_put(THIS_MODULE);
>>-			return ret;
>>-		}
>>+		ret = vfio_pci_enable(vdev);
>>+		if (ret)
>>+			goto error;
>>+
>>+		ret = vfio_eeh_pci_open(vdev->pdev);
>>+		if (ret)
>>+			goto error;
>>  	}
>>  	return 0;
>>+error:
>>+	module_put(THIS_MODULE);
>>+	return ret;
>>  }
>>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>>@@ -321,6 +349,91 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>>  	return walk.ret;
>>  }
>>+static int vfio_eeh_pci_ioctl(struct pci_dev *pdev,
>>+			      unsigned int cmd,
>>+			      unsigned long arg)
>>+{
>>+	unsigned long minsz;
>>+	int ret = 0;
>>+
>>+#ifdef CONFIG_EEH
>>+	switch (cmd) {
>>+	case VFIO_EEH_PE_SET_OPTION: {
>>+		struct vfio_eeh_pe_set_option option;
>>+
>>+		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
>>+		if (copy_from_user(&option, (void __user *)arg, minsz))
>>+			return -EFAULT;
>>+		if (option.argsz < minsz)
>>+			return -EINVAL;
>>+
>>+		ret = eeh_pe_set_option(pdev, option.option);
>>+		break;
>>+	}
>>+	case VFIO_EEH_PE_GET_ADDR: {
>>+		struct vfio_eeh_pe_get_addr addr;
>>+
>>+		minsz = offsetofend(struct vfio_eeh_pe_get_addr, info);
>>+		if (copy_from_user(&addr, (void __user *)arg, minsz))
>>+			return -EFAULT;
>>+		if (addr.argsz < minsz)
>>+			return -EINVAL;
>>+
>>+		ret = eeh_pe_get_addr(pdev, addr.option);
>>+		if (ret >= 0) {
>>+			addr.info = ret;
>>+			if (copy_to_user((void __user *)arg, &addr, minsz))
>>+				return -EFAULT;
>>+			ret = 0;
>>+		}
>>+
>>+		break;
>>+	}
>>+	case VFIO_EEH_PE_GET_STATE: {
>>+		struct vfio_eeh_pe_get_state state;
>>+
>>+		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
>>+		if (copy_from_user(&state, (void __user *)arg, minsz))
>>+			return -EFAULT;
>>+		if (state.argsz < minsz)
>>+			return -EINVAL;
>>+
>>+		ret = eeh_pe_get_state(pdev);
>>+		if (ret >= 0) {
>>+			state.state = ret;
>>+			if (copy_to_user((void __user *)arg, &state, minsz))
>>+				return -EFAULT;
>>+			ret = 0;
>>+		}
>>+		break;
>>+	}
>>+	case VFIO_EEH_PE_RESET: {
>>+		struct vfio_eeh_pe_reset reset;
>>+
>>+		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
>>+		if (copy_from_user(&reset, (void __user *)arg, minsz))
>>+			return -EFAULT;
>>+		if (reset.argsz < minsz)
>>+			return -EINVAL;
>>+
>>+		ret = eeh_pe_reset(pdev, reset.option);
>>+		break;
>>+	}
>>+	case VFIO_EEH_PE_CONFIGURE:
>>+		ret = eeh_pe_configure(pdev);
>>+		break;
>>+	default:
>>+		ret = -EINVAL;
>>+		pr_debug("%s: Cannot handle command %d\n",
>>+			__func__, cmd);
>>+	}
>>+#else
>>+	ret = -ENOENT;
>>+#endif
>>+
>>+	return ret;
>>+}
>>+
>>  static long vfio_pci_ioctl(void *device_data,
>>  			   unsigned int cmd, unsigned long arg)
>>  {
>>@@ -682,6 +795,12 @@ hot_reset_release:
>>  		kfree(groups);
>>  		return ret;
>>+	} else if (cmd == VFIO_EEH_PE_SET_OPTION ||
>>+		   cmd == VFIO_EEH_PE_GET_ADDR ||
>>+		   cmd == VFIO_EEH_PE_GET_STATE ||
>>+		   cmd == VFIO_EEH_PE_RESET ||
>>+		   cmd == VFIO_EEH_PE_CONFIGURE) {
>>+			return vfio_eeh_pci_ioctl(vdev->pdev, cmd, arg);
>>  	}
>>  	return -ENOTTY;
>>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>index cb9023d..ef55682 100644
>>--- a/include/uapi/linux/vfio.h
>>+++ b/include/uapi/linux/vfio.h
>>@@ -455,6 +455,59 @@ struct vfio_iommu_spapr_tce_info {
>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>+/*
>>+ * EEH functionality can be enabled or disabled on one specific device.
>>+ * Also, the DMA or IO frozen state can be removed from the frozen PE
>>+ * if required.
>>+ */
>>+struct vfio_eeh_pe_set_option {
>>+	__u32 argsz;
>
>What is this argsz thing? Is this your way of maintaining backwards
>compatibility when we introduce new fields? A new field will change
>the ioctl number, so I don't think that makes a lot of sense :).
>
>Just make the ioctl have a u32 as incoming argument. No fancy
>structs, no complicated code.
>
>The same applies for a number of structs below.
>

ok. Will do in next revision.

>>+	__u32 option;
>>+};
>>+
>>+#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>>+
>>+/*
>>+ * Each EEH PE should have unique address to be identified. The command
>>+ * helps to retrieve the address and the sharing mode of the PE.
>>+ */
>>+struct vfio_eeh_pe_get_addr {
>>+	__u32 argsz;
>>+	__u32 option;
>>+	__u32 info;
>
>Any particular reason you need the info field? Can't the return value
>of the ioctl hold this? Then you only have a single u32 argument left
>to the ioctl again.
>

ok. Will do in next revision.

>>+};
>>+
>>+#define VFIO_EEH_PE_GET_ADDR		_IO(VFIO_TYPE, VFIO_BASE + 22)
>>+
>>+/*
>>+ * EEH PE might have been frozen because of PCI errors. Also, it might
>>+ * be experiencing reset for error revoery. The following command helps
>>+ * to get the state.
>>+ */
>>+struct vfio_eeh_pe_get_state {
>>+	__u32 argsz;
>>+	__u32 state;
>>+};
>>+
>>+#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 23)
>>+
>>+/*
>>+ * Reset is the major step to recover problematic PE. The following
>>+ * command helps on that.
>>+ */
>>+struct vfio_eeh_pe_reset {
>>+	__u32 argsz;
>>+	__u32 option;
>>+};
>>+
>>+#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 24)
>>+
>>+/*
>>+ * One of the steps for recovery after PE reset is to configure the
>>+ * PCI bridges affected by the PE reset.
>>+ */
>>+#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 25)
>>+
>>  /* ***************************************************************** */
>>  #endif /* _UAPIVFIO_H */

Thanks,
Gavin