[PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

Alex Williamson alex.williamson at redhat.com
Sat May 24 00:29:59 EST 2014


On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
> On Thu, May 22, 2014 at 09:10:53PM -0600, Alex Williamson wrote:
> >On Thu, 2014-05-22 at 18:23 +1000, Gavin Shan wrote:
> >> The patch adds new IOCTL commands for VFIO PCI device to support
> >> EEH functionality for PCI devices, which have been passed through
> >> from host to somebody else via VFIO.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan at linux.vnet.ibm.com>
> >> ---
> >>  Documentation/vfio.txt         |  88 ++++++++++-
> >>  arch/powerpc/include/asm/eeh.h |  17 +++
> >>  arch/powerpc/kernel/eeh.c      | 321 +++++++++++++++++++++++++++++++++++++++++
> >>  drivers/vfio/pci/vfio_pci.c    | 131 ++++++++++++++++-
> >>  include/uapi/linux/vfio.h      |  53 +++++++
> >>  5 files changed, 603 insertions(+), 7 deletions(-)
> >
> >Maybe a chicken and egg problem, but it seems like we could split the
> >platform code and vfio code into separate patches.
> >
> 
> Ok. I'll keep egg/chicken separated in next revision.
> 
> >> 
> >> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >> index b9ca023..dd13db6 100644
> >> --- a/Documentation/vfio.txt
> >> +++ b/Documentation/vfio.txt
> >> @@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
> >>  an excellent performance which has limitations such as inability to do
> >>  locked pages accounting in real time.
> >>  
> >> -So 3 additional ioctls have been added:
> >> +4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
> >> +which works on the basis of additional ioctl commands.
> >> +
> >> +So 8 additional ioctls have been added:
> >>  
> >>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
> >>  		of the DMA window on the PCI bus.
> >> @@ -316,6 +319,20 @@ So 3 additional ioctls have been added:
> >>  
> >>  	VFIO_IOMMU_DISABLE - disables the container.
> >>  
> >> +	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functinality on the
> >> +		specified device. Also, it can be used to remove IO or DMA
> >> +		stopped state on the frozen PE.
> >> +
> >> +	VFIO_EEH_PE_GET_ADDR - retrieve the unique address of the specified
> >> +		PE or query PE sharing mode.
> >> +
> >> +	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
> >> +
> >> +	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
> >> +		error recovering.
> >> +
> >> +	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
> >> +		one of the major steps for error recoverying.
> >>  
> >>  The code flow from the example above should be slightly changed:
> >>  
> >> @@ -346,6 +363,75 @@ The code flow from the example above should be slightly changed:
> >>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> >>  	.....
> >>  
> >> +Based on the initial example we have, the following piece of code could be
> >> +reference for EEH setup and error handling:
> >> +
> >> +	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
> >> +	struct vfio_eeh_pe_get_addr addr = { .argsz = sizeof(addr) };
> >> +	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
> >> +	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
> >> +	struct vfio_eeh_pe_configure config = { .argsz = sizeof(config) };
> >> +
> >> +	....
> >> +
> >> +	/* Get a file descriptor for the device */
> >> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> >> +
> >> +	/* Enable the EEH functionality on the device */
> >> +	option.option = EEH_OPT_ENABLE;
> >> +	ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
> >> +
> >> +	/* Retrieve PE address and create and maintain PE by yourself */
> >> +	addr.option = EEH_OPT_GET_PE_ADDR;
> >> +	ioctl(device, VFIO_EEH_PE_GET_ADDR, &addr);
> >> +
> >> +	/* Assure EEH is supported on the PE and make PE functional */
> >> +	ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
> >> +
> >> +	/* Save device's state. pci_save_state() would be good enough
> >> +	 * as an example.
> >> +	 */
> >> +
> >> +	/* Test and setup the device */
> >> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> >> +
> >> +	....
> >> +
> >> +	/* When 0xFF's returned from reading PCI config space or IO BARs
> >> +	 * of the PCI device. Check the PE state to see if that has been
> >> +	 * frozen.
> >> +	 */
> >> +	ioctl(device, VFIO_EEH_PE_GET_STATE, &state);
> >
> >There's no notification, the user needs to observe the return value an
> >poll?  Should we be enabling an eventfd to notify the user of the state
> >change?

Ok

> Yes. The user needs to monitor the return value. we should have one notification,
> but it's for later as we discussed :-)
> 
> >> +
> >> +	/* Waiting for pending PCI transactions to be completed and don't
> >> +	 * produce any more PCI traffic from/to the affected PE until
> >> +	 * recovery is finished.
> >> +	 */
> >> +
> >> +	/* Enable IO for the affected PE and collect logs. Usually, the
> >> +	 * standard part of PCI config space, AER registers are dumped
> >> +	 * as logs for further analysis.
> >> +	 */
> >> +	option.option = EEH_OPT_THAW_MMIO;
> >> +	ioctl(device, VFIO_EEH_PE_SET_OPTION, &option);
> >
> >How does the guest learn about the error?  Does it need to?
> 
> When guest detects 0xFF's from reading PCI config space or IO, it's going
> check the device (PE) state. If the device (PE) has been put into frozen
> state, the recovery will be started.

No, sorry, I mean how does the user get information about the error?
The interface we have here is:
a) find that something bad has happened
b) kick it into working again
c) continue

How does the user figure out what happened and if it makes sense to
attempt to recover?  Where does the user learn that their disk is on
fire?

> >> +
> >> +	/* Issue PE reset */
> >> +	reset.option = EEH_RESET_HOT;
> >> +	ioctl(device, VFIO_EEH_PE_RESET, &reset);
> >> +
> >> +	/* Configure the PCI bridges for the affected PE */
> >> +	ioctl(device, VFIO_EEH_PE_CONFIGURE, NULL);
> >> +
> >
> >I'm not sure I see why we've split these into separate ioctls.  FWIW,
> >the one ioctl we currently have for reset that takes no options is
> >probably going to be the first to get deprecated because of it.
> >
> >> +	/* Restored state we saved at initialization time. pci_restore_state()
> >> +	 * is good enough as an example.
> >> +	 */
> >> +
> >> +	/* Hopefully, error is recovered successfully. Now, you can resume to
> >> +	 * start PCI traffic to/from the affected PE.
> >> +	 */
> >> +
> >> +	....
> >> +
> >>  -------------------------------------------------------------------------------
> >>  
> >>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> >> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> >> index 34a2d83..dd5f1cf 100644
> >> --- a/arch/powerpc/include/asm/eeh.h
> >> +++ b/arch/powerpc/include/asm/eeh.h
> >> @@ -191,6 +191,11 @@ enum {
> >>  #define EEH_OPT_ENABLE		1	/* EEH enable	*/
> >>  #define EEH_OPT_THAW_MMIO	2	/* MMIO enable	*/
> >>  #define EEH_OPT_THAW_DMA	3	/* DMA enable	*/
> >> +#define EEH_OPT_GET_PE_ADDR	0	/* Get PE addr	*/
> >> +#define EEH_OPT_GET_PE_MODE	1	/* Get PE mode	*/
> >> +#define EEH_PE_MODE_NONE	0	/* Invalid mode	*/
> >> +#define EEH_PE_MODE_NOT_SHARED	1	/* Not shared	*/
> >> +#define EEH_PE_MODE_SHARED	2	/* Shared mode	*/
> >>  #define EEH_STATE_UNAVAILABLE	(1 << 0)	/* State unavailable	*/
> >>  #define EEH_STATE_NOT_SUPPORT	(1 << 1)	/* EEH not supported	*/
> >>  #define EEH_STATE_RESET_ACTIVE	(1 << 2)	/* Active reset		*/
> >> @@ -198,6 +203,11 @@ enum {
> >>  #define EEH_STATE_DMA_ACTIVE	(1 << 4)	/* Active DMA		*/
> >>  #define EEH_STATE_MMIO_ENABLED	(1 << 5)	/* MMIO enabled		*/
> >>  #define EEH_STATE_DMA_ENABLED	(1 << 6)	/* DMA enabled		*/
> >> +#define EEH_PE_STATE_NORMAL		0	/* Normal state		*/
> >> +#define EEH_PE_STATE_RESET		1		/* PE reset	*/
> >> +#define EEH_PE_STATE_STOPPED_IO_DMA	2		/* Stopped	*/
> >> +#define EEH_PE_STATE_STOPPED_DMA	4		/* Stopped DMA	*/
> >> +#define EEH_PE_STATE_UNAVAIL		5		/* Unavailable	*/
> >>  #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
> >>  #define EEH_RESET_HOT		1	/* Hot reset			*/
> >>  #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
> >> @@ -305,6 +315,13 @@ void eeh_add_device_late(struct pci_dev *);
> >>  void eeh_add_device_tree_late(struct pci_bus *);
> >>  void eeh_add_sysfs_files(struct pci_bus *);
> >>  void eeh_remove_device(struct pci_dev *);
> >> +int eeh_dev_open(struct pci_dev *pdev);
> >> +void eeh_dev_release(struct pci_dev *pdev);
> >> +int eeh_pe_set_option(struct pci_dev *pdev, int option);
> >> +int eeh_pe_get_addr(struct pci_dev *pdev, int option);
> >> +int eeh_pe_get_state(struct pci_dev *pdev);
> >> +int eeh_pe_reset(struct pci_dev *pdev, int option);
> >> +int eeh_pe_configure(struct pci_dev *pdev);
> >>  
> >>  /**
> >>   * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
> >> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> >> index 9c6b899..b90a474 100644
> >> --- a/arch/powerpc/kernel/eeh.c
> >> +++ b/arch/powerpc/kernel/eeh.c
> >> @@ -108,6 +108,9 @@ struct eeh_ops *eeh_ops = NULL;
> >>  /* Lock to avoid races due to multiple reports of an error */
> >>  DEFINE_RAW_SPINLOCK(confirm_error_lock);
> >>  
> >> +/* Lock to protect passed flags */
> >> +static DEFINE_MUTEX(eeh_dev_mutex);
> >> +
> >>  /* Buffer for reporting pci register dumps. Its here in BSS, and
> >>   * not dynamically alloced, so that it ends up in RMO where RTAS
> >>   * can access it.
> >> @@ -1098,6 +1101,324 @@ void eeh_remove_device(struct pci_dev *dev)
> >>  	edev->mode &= ~EEH_DEV_SYSFS;
> >>  }
> >>  
> >> +/**
> >> + * eeh_dev_open - Mark EEH device and PE as passed through
> >> + * @pdev: PCI device
> >> + *
> >> + * Mark the indicated EEH device and PE as passed through.
> >> + * In the result, the EEH errors detected on the PE won't be
> >> + * reported. The owner of the device will be responsible for
> >> + * detection and recovery.
> >> + */
> >> +int eeh_dev_open(struct pci_dev *pdev)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +
> >> +	mutex_lock(&eeh_dev_mutex);
> >> +
> >> +	/* No PCI device ? */
> >> +	if (!pdev) {
> >> +		mutex_unlock(&eeh_dev_mutex);
> >> +		return -ENODEV;
> >> +	}
> >> +
> >> +	/* No EEH device ? */
> >> +	edev = pci_dev_to_eeh_dev(pdev);
> >> +	if (!edev || !edev->pe) {
> >> +		mutex_unlock(&eeh_dev_mutex);
> >> +		return -ENODEV;
> >> +	}
> >> +
> >> +	eeh_dev_set_passed(edev, true);
> >> +	eeh_pe_set_passed(edev->pe, true);
> >> +	mutex_unlock(&eeh_dev_mutex);
> >> +
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_dev_open);
> >> +
> >> +/**
> >> + * eeh_dev_release - Reclaim the ownership of EEH device
> >> + * @pdev: PCI device
> >> + *
> >> + * Reclaim ownership of EEH device, potentially the corresponding
> >> + * PE. In the result, the EEH errors detected on the PE will be
> >> + * reported and handled as usual.
> >> + */
> >> +void eeh_dev_release(struct pci_dev *pdev)
> >> +{
> >> +	bool release_pe = true;
> >> +	struct eeh_pe *pe = NULL;
> >> +	struct eeh_dev *tmp, *edev;
> >> +
> >> +	mutex_lock(&eeh_dev_mutex);
> >> +
> >> +	/* No PCI device ? */
> >> +	if (!pdev) {
> >> +		mutex_unlock(&eeh_dev_mutex);
> >> +		return;
> >> +	}
> >> +
> >> +	/* No EEH device ? */
> >> +	edev = pci_dev_to_eeh_dev(pdev);
> >> +	if (!edev || !eeh_dev_passed(edev) ||
> >> +	    !edev->pe || !eeh_pe_passed(pe)) {
> >> +		mutex_unlock(&eeh_dev_mutex);
> >> +		return;
> >> +	}
> >> +
> >> +	/* Release device */
> >> +	pe = edev->pe;
> >> +	eeh_dev_set_passed(edev, false);
> >> +
> >> +	/* Release PE */
> >> +	eeh_pe_for_each_dev(pe, edev, tmp) {
> >> +		if (eeh_dev_passed(edev)) {
> >> +			release_pe = false;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	if (release_pe)
> >> +		eeh_pe_set_passed(pe, false);
> >> +
> >> +	mutex_unlock(&eeh_dev_mutex);
> >> +}
> >> +EXPORT_SYMBOL(eeh_dev_release);
> >> +
> >> +static int eeh_dev_check(struct pci_dev *pdev,
> >> +			 struct eeh_dev **pedev,
> >> +			 struct eeh_pe **ppe)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +
> >> +	/* No device ? */
> >> +	if (!pdev)
> >> +		return -ENODEV;
> >> +
> >> +	edev = pci_dev_to_eeh_dev(pdev);
> >> +	if (!edev || !eeh_dev_passed(edev) ||
> >> +	    !edev->pe || !eeh_pe_passed(edev->pe))
> >> +		return -ENODEV;
> >> +
> >> +	if (pedev)
> >> +		*pedev = edev;
> >> +	if (ppe)
> >> +		*ppe = edev->pe;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * eeh_pe_set_option - Set options for the indicated PE
> >> + * @pdev: PCI device
> >> + * @option: requested option
> >> + *
> >> + * The routine is called to enable or disable EEH functionality
> >> + * on the indicated PE, to enable IO or DMA for the frozen PE.
> >> + */
> >> +int eeh_pe_set_option(struct pci_dev *pdev, int option)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +	struct eeh_pe *pe;
> >> +	int ret = 0;
> >> +
> >> +	/* Device existing ? */
> >> +	ret = eeh_dev_check(pdev, &edev, &pe);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	switch (option) {
> >> +	case EEH_OPT_DISABLE:
> >> +	case EEH_OPT_ENABLE:
> >> +		break;
> >> +	case EEH_OPT_THAW_MMIO:
> >> +	case EEH_OPT_THAW_DMA:
> >> +		if (!eeh_ops || !eeh_ops->set_option) {
> >> +			ret = -ENOENT;
> >> +			break;
> >> +		}
> >> +
> >> +		ret = eeh_ops->set_option(pe, option);
> >> +		break;
> >> +	default:
> >> +		pr_debug("%s: Option %d out of range (%d, %d)\n",
> >> +			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
> >> +		ret = -EINVAL;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_pe_set_option);
> >> +
> >> +/**
> >> + * eeh_pe_get_addr - Retrieve the PE address or sharing mode
> >> + * @pdev: PCI device
> >> + * @option: option
> >> + *
> >> + * Retrieve the PE address or sharing mode.
> >> + */
> >> +int eeh_pe_get_addr(struct pci_dev *pdev, int option)
> >> +{
> >> +	struct pci_bus *bus;
> >> +	struct eeh_dev *edev;
> >> +	struct eeh_pe *pe;
> >> +	int ret = 0;
> >> +
> >> +	/* Device existing ? */
> >> +	ret = eeh_dev_check(pdev, &edev, &pe);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	switch (option) {
> >> +	case EEH_OPT_GET_PE_ADDR:
> >> +		bus = eeh_pe_bus_get(pe);
> >> +		if (!bus) {
> >> +			ret = -ENODEV;
> >> +			break;
> >> +		}
> >> +
> >> +		/* PE address has format "00BBSS00" */
> >> +		ret = bus->number << 16;
> >> +		break;
> >> +	case EEH_OPT_GET_PE_MODE:
> >> +		/* Wa always have shared PE */
> >> +		ret = EEH_PE_MODE_SHARED;
> >> +		break;
> >> +	default:
> >> +		pr_debug("%s: option %d out of range (0, 1)\n",
> >> +			__func__, option);
> >> +		ret = -EINVAL;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_pe_get_addr);
> >> +
> >> +/**
> >> + * eeh_pe_get_state - Retrieve PE's state
> >> + * @pdev: PCI device
> >> + *
> >> + * Retrieve the PE's state, which includes 3 aspects: enabled
> >> + * DMA, enabled IO and asserted reset.
> >> + */
> >> +int eeh_pe_get_state(struct pci_dev *pdev)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +	struct eeh_pe *pe;
> >> +	int result, ret = 0;
> >> +
> >> +	/* Device existing ? */
> >> +	ret = eeh_dev_check(pdev, &edev, &pe);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	if (!eeh_ops || !eeh_ops->get_state)
> >> +		return -ENOENT;
> >> +
> >> +	result = eeh_ops->get_state(pe, NULL);
> >> +	if (!(result & EEH_STATE_RESET_ACTIVE) &&
> >> +	     (result & EEH_STATE_DMA_ENABLED) &&
> >> +	     (result & EEH_STATE_MMIO_ENABLED))
> >> +		ret = EEH_PE_STATE_NORMAL;
> >> +	else if (result & EEH_STATE_RESET_ACTIVE)
> >> +		ret = EEH_PE_STATE_RESET;
> >> +	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> >> +		 !(result & EEH_STATE_DMA_ENABLED) &&
> >> +		 !(result & EEH_STATE_MMIO_ENABLED))
> >> +		ret = EEH_PE_STATE_STOPPED_IO_DMA;
> >> +	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> >> +		 (result & EEH_STATE_DMA_ENABLED) &&
> >> +		 !(result & EEH_STATE_MMIO_ENABLED))
> >> +		ret = EEH_PE_STATE_STOPPED_DMA;
> >> +	else
> >> +		ret = EEH_PE_STATE_UNAVAIL;
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_pe_get_state);
> >> +
> >> +/**
> >> + * eeh_pe_reset - Issue PE reset according to specified type
> >> + * @pdev: PCI device
> >> + * @option: reset type
> >> + *
> >> + * The routine is called to reset the specified PE with the
> >> + * indicated type, either fundamental reset or hot reset.
> >> + * PE reset is the most important part for error recovery.
> >> + */
> >> +int eeh_pe_reset(struct pci_dev *pdev, int option)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +	struct eeh_pe *pe;
> >> +	int ret = 0;
> >> +
> >> +	/* Device existing ? */
> >> +	ret = eeh_dev_check(pdev, &edev, &pe);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset)
> >> +		return -ENOENT;
> >> +
> >> +	switch (option) {
> >> +	case EEH_RESET_DEACTIVATE:
> >> +		ret = eeh_ops->reset(pe, option);
> >> +		if (ret)
> >> +			break;
> >> +
> >> +		/*
> >> +		 * The PE is still in frozen state and we need clear
> >> +		 * that. It's good to clear frozen state after deassert
> >> +		 * to avoid messy IO access during reset, which might
> >> +		 * cause recursive frozen PE.
> >> +		 */
> >> +		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
> >> +		if (!ret)
> >> +			ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
> >> +		if (!ret)
> >> +			eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
> >> +		break;
> >> +	case EEH_RESET_HOT:
> >> +	case EEH_RESET_FUNDAMENTAL:
> >> +		ret = eeh_ops->reset(pe, option);
> >> +		break;
> >> +	default:
> >> +		pr_debug("%s: Unsupported option %d\n",
> >> +			__func__, option);
> >> +		ret = -EINVAL;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_pe_reset);
> >> +
> >> +/**
> >> + * eeh_pe_configure - Configure PCI bridges after PE reset
> >> + * @pdev: PCI device
> >> + *
> >> + * The routine is called to restore the PCI config space for
> >> + * those PCI devices, especially PCI bridges affected by PE
> >> + * reset issued previously.
> >> + */
> >> +int eeh_pe_configure(struct pci_dev *pdev)
> >> +{
> >> +	struct eeh_dev *edev;
> >> +	struct eeh_pe *pe;
> >> +	int ret = 0;
> >> +
> >> +	/* Device existing ? */
> >> +	ret = eeh_dev_check(pdev, &edev, &pe);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	/* Restore config space for the affected devices */
> >> +	eeh_pe_restore_bars(pe);
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(eeh_pe_configure);
> >> +
> >>  static int proc_eeh_show(struct seq_file *m, void *v)
> >>  {
> >>  	if (!eeh_enabled()) {
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7ba0424..301ac18 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -25,6 +25,9 @@
> >>  #include <linux/types.h>
> >>  #include <linux/uaccess.h>
> >>  #include <linux/vfio.h>
> >> +#ifdef CONFIG_EEH
> >> +#include <asm/eeh.h>
> >> +#endif
> >
> >Can we make a vfio_pci_eeh file that properly stubs out anything called
> >from common code?  I don't want to see all these inline ifdefs.
> >
> 
> well. do you want see drivers/vfio/vfio-pci/vfio_pci_eeh.c and export
> following functins for vfio_pci.c to use?
> 
> vfio_pci_eeh_open()
> vfio_pci_eeh_release()
> vfio_pci_eeh_ioctl()

Yes

> >>  
> >>  #include "vfio_pci_private.h"
> >>  
> >> @@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
> >>  	pci_restore_state(pdev);
> >>  }
> >>  
> >> +static void vfio_eeh_pci_release(struct pci_dev *pdev)
> >> +{
> >> +#ifdef CONFIG_EEH
> >> +	eeh_dev_release(pdev);
> >> +#endif
> >> +}
> >> +
> >>  static void vfio_pci_release(void *device_data)
> >>  {
> >>  	struct vfio_pci_device *vdev = device_data;
> >>  
> >> -	if (atomic_dec_and_test(&vdev->refcnt))
> >> +	if (atomic_dec_and_test(&vdev->refcnt)) {
> >> +		vfio_eeh_pci_release(vdev->pdev);
> >>  		vfio_pci_disable(vdev);
> >> +	}
> >>  
> >>  	module_put(THIS_MODULE);
> >>  }
> >>  
> >> +static int vfio_eeh_pci_open(struct pci_dev *pdev)
> >> +{
> >> +	int ret = 0;
> >> +
> >> +#ifdef CONFIG_EEH
> >> +	ret = eeh_dev_open(pdev);
> >> +#endif
> >> +	return ret;
> >> +}
> >> +
> >>  static int vfio_pci_open(void *device_data)
> >>  {
> >>  	struct vfio_pci_device *vdev = device_data;
> >> +	int ret;
> >>  
> >>  	if (!try_module_get(THIS_MODULE))
> >>  		return -ENODEV;
> >>  
> >>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
> >> -		int ret = vfio_pci_enable(vdev);
> >> -		if (ret) {
> >> -			module_put(THIS_MODULE);
> >> -			return ret;
> >> -		}
> >> +		ret = vfio_pci_enable(vdev);
> >> +		if (ret)
> >> +			goto error;
> >> +
> >> +		ret = vfio_eeh_pci_open(vdev->pdev);
> >> +		if (ret)
> >> +			goto error;
> >>  	}
> >>  
> >>  	return 0;
> >> +error:
> >> +	module_put(THIS_MODULE);
> >> +	return ret;
> >>  }
> >>  
> >>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
> >> @@ -321,6 +349,91 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
> >>  	return walk.ret;
> >>  }
> >>  
> >> +static int vfio_eeh_pci_ioctl(struct pci_dev *pdev,
> >> +			      unsigned int cmd,
> >> +			      unsigned long arg)
> >> +{
> >> +	unsigned long minsz;
> >> +	int ret = 0;
> >> +
> >> +#ifdef CONFIG_EEH
> >> +	switch (cmd) {
> >> +	case VFIO_EEH_PE_SET_OPTION: {
> >> +		struct vfio_eeh_pe_set_option option;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
> >> +		if (copy_from_user(&option, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (option.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_set_option(pdev, option.option);
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_GET_ADDR: {
> >> +		struct vfio_eeh_pe_get_addr addr;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_get_addr, info);
> >> +		if (copy_from_user(&addr, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (addr.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_get_addr(pdev, addr.option);
> >> +		if (ret >= 0) {
> >> +			addr.info = ret;
> >> +			if (copy_to_user((void __user *)arg, &addr, minsz))
> >> +				return -EFAULT;
> >> +			ret = 0;
> >> +		}
> >> +
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_GET_STATE: {
> >> +		struct vfio_eeh_pe_get_state state;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
> >> +		if (copy_from_user(&state, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (state.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_get_state(pdev);
> >> +		if (ret >= 0) {
> >> +			state.state = ret;
> >> +			if (copy_to_user((void __user *)arg, &state, minsz))
> >> +				return -EFAULT;
> >> +			ret = 0;
> >> +		}
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_RESET: {
> >> +		struct vfio_eeh_pe_reset reset;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
> >> +		if (copy_from_user(&reset, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (reset.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_reset(pdev, reset.option);
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_CONFIGURE:
> >> +		ret = eeh_pe_configure(pdev);
> >> +		break;
> >> +	default:
> >> +		ret = -EINVAL;
> >> +		pr_debug("%s: Cannot handle command %d\n",
> >> +			__func__, cmd);
> >> +	}
> >> +#else
> >> +	ret = -ENOENT;
> >> +#endif
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static long vfio_pci_ioctl(void *device_data,
> >>  			   unsigned int cmd, unsigned long arg)
> >>  {
> >> @@ -682,6 +795,12 @@ hot_reset_release:
> >>  
> >>  		kfree(groups);
> >>  		return ret;
> >> +	} else if (cmd == VFIO_EEH_PE_SET_OPTION ||
> >> +		   cmd == VFIO_EEH_PE_GET_ADDR ||
> >> +		   cmd == VFIO_EEH_PE_GET_STATE ||
> >> +		   cmd == VFIO_EEH_PE_RESET ||
> >> +		   cmd == VFIO_EEH_PE_CONFIGURE) {
> >> +			return vfio_eeh_pci_ioctl(vdev->pdev, cmd, arg);
> >>  	}
> >>  
> >>  	return -ENOTTY;
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index cb9023d..ef55682 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -455,6 +455,59 @@ struct vfio_iommu_spapr_tce_info {
> >>  
> >>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> >>  
> >> +/*
> >> + * EEH functionality can be enabled or disabled on one specific device.
> >> + * Also, the DMA or IO frozen state can be removed from the frozen PE
> >> + * if required.
> >> + */
> >> +struct vfio_eeh_pe_set_option {
> >> +	__u32 argsz;
> >> +	__u32 option;
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
> >
> >What proposed ioctls are making you jump to 21?
> >
> >argsz is probably not as useful without a flags field.  Otherwise the
> >caller can't indicate what the extra space is.
> >
> 
> The QEMU patches are based on Alexey's additional feature ("ddw"), which
> consumed several ioctl commands.
> 
> ok. So you also prefer to remove "argsz"?

No, I prefer to stay consistent with the rest of the VFIO API and use
argsz + flags.

> >> +
> >> +/*
> >> + * Each EEH PE should have unique address to be identified. The command
> >> + * helps to retrieve the address and the sharing mode of the PE.
> >> + */
> >> +struct vfio_eeh_pe_get_addr {
> >> +	__u32 argsz;
> >> +	__u32 option;
> >> +	__u32 info;
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_GET_ADDR		_IO(VFIO_TYPE, VFIO_BASE + 22)
> >> +
> >> +/*
> >> + * EEH PE might have been frozen because of PCI errors. Also, it might
> >> + * be experiencing reset for error revoery. The following command helps
> >> + * to get the state.
> >> + */
> >> +struct vfio_eeh_pe_get_state {
> >> +	__u32 argsz;
> >> +	__u32 state;
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 23)
> >> +
> >> +/*
> >> + * Reset is the major step to recover problematic PE. The following
> >> + * command helps on that.
> >> + */
> >> +struct vfio_eeh_pe_reset {
> >> +	__u32 argsz;
> >> +	__u32 option;
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 24)
> >> +
> >> +/*
> >> + * One of the steps for recovery after PE reset is to configure the
> >> + * PCI bridges affected by the PE reset.
> >> + */
> >> +#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 25)
> >
> >What can the user do differently by making these separate ioctls?
> >
> 
> hrm, I didn't understood as well. Alex.G could have the explaination.

As agraf noted, I'm asking why reset and configure are separate when
they seem to be used together.  Thanks,

Alex

> >> +
> >>  /* ***************************************************************** */
> >>  
> >>  #endif /* _UAPIVFIO_H */
> 
> Thanks,
> Gavin
> 





More information about the Linuxppc-dev mailing list