[PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device
Gavin Shan
gwshan at linux.vnet.ibm.com
Sat May 24 12:06:21 EST 2014
On Fri, May 23, 2014 at 08:29:59AM -0600, Alex Williamson wrote:
>On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
>> On Thu, May 22, 2014 at 09:10:53PM -0600, Alex Williamson wrote:
>> >On Thu, 2014-05-22 at 18:23 +1000, Gavin Shan wrote:
.../...
>No, sorry, I mean how does the user get information about the error?
>The interface we have here is:
>a) find that something bad has happened
>b) kick it into working again
>c) continue
>
>How does the user figure out what happened and if it makes sense to
>attempt to recover? Where does the user learn that their disk is on
>fire?
>
When 0xFF's are returned from a config or IO read, the user should check the
device (PE)'s state with the ioctl command VFIO_EEH_PE_GET_STATE. If the
device (PE) has been put into the "frozen" state, it's confirmed that the
device ("disk" you mentioned) is on fire. The user should kick off recovery,
which includes:
- User stops any operations (config, IO, DMA) on the device because any
PCI traffic to a "frozen" device will be dropped at the software or hardware
level. Also, we don't expect DMA traffic during recovery. Otherwise,
we will bump into recursive errors and the recovery is expected to fail.
- VFIO_EEH_PE_SET_OPTION to enable the I/O path (the "DMA" path is still in
the frozen state). VFIO_EEH_PE_CONFIGURE to reconfigure the affected PCI
bridges, and then do error log retrieval.
- VFIO_EEH_PE_RESET to reset the affected device (PE). VFIO_EEH_PE_CONFIGURE
to restore BARs.
- User resumes the device to start PCI traffic and the device is brought back
to a functional state.
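
As a rough sketch only, the sequence above could be driven from user space
along the following lines. The command names are the ones used in this
thread; the numeric values, EEH_OPT_ENABLE_IO, the PE state names and
fake_pe_ioctl() (a stand-in for ioctl() so the flow can be exercised without
hardware) are assumptions made for illustration, not the definitions from
the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Command names follow the thread; the values are placeholders only. */
enum eeh_cmd {
	VFIO_EEH_PE_SET_OPTION,
	VFIO_EEH_PE_GET_STATE,
	VFIO_EEH_PE_RESET,
	VFIO_EEH_PE_CONFIGURE,
};

/* Hypothetical option and state values, invented for this sketch. */
enum { EEH_OPT_ENABLE_IO = 1 };
enum pe_state { PE_NORMAL, PE_FROZEN, PE_IO_ENABLED, PE_RESET_DONE };

/* Simulated PE so that the sequence can run without real hardware. */
struct fake_pe {
	enum pe_state state;
	int configure_calls;
};

/* Stand-in for ioctl(fd, cmd, arg); returns a state or 0/-1. */
static int fake_pe_ioctl(struct fake_pe *pe, enum eeh_cmd cmd, int arg)
{
	assert(pe != NULL);

	switch (cmd) {
	case VFIO_EEH_PE_GET_STATE:
		return pe->state;
	case VFIO_EEH_PE_SET_OPTION:
		if (arg != EEH_OPT_ENABLE_IO || pe->state != PE_FROZEN)
			return -1;
		pe->state = PE_IO_ENABLED;
		return 0;
	case VFIO_EEH_PE_RESET:
		pe->state = PE_RESET_DONE;
		return 0;
	case VFIO_EEH_PE_CONFIGURE:
		pe->configure_calls++;
		/* After a reset, configure restores BARs and ends recovery. */
		if (pe->state == PE_RESET_DONE)
			pe->state = PE_NORMAL;
		return 0;
	}
	return -1;
}

/* Returns 0 on recovery, 1 if the PE was not frozen, -1 on failure. */
static int recover_pe(struct fake_pe *pe)
{
	if (fake_pe_ioctl(pe, VFIO_EEH_PE_GET_STATE, 0) != PE_FROZEN)
		return 1;

	/* Step 1: the caller must already have stopped config, IO and DMA. */

	/* Step 2: enable the I/O path, reconfigure the bridges. */
	if (fake_pe_ioctl(pe, VFIO_EEH_PE_SET_OPTION, EEH_OPT_ENABLE_IO))
		return -1;
	if (fake_pe_ioctl(pe, VFIO_EEH_PE_CONFIGURE, 0))
		return -1;
	/* Error log retrieval would happen here. */

	/* Step 3: reset the PE, then configure again to restore BARs. */
	if (fake_pe_ioctl(pe, VFIO_EEH_PE_RESET, 0))
		return -1;
	if (fake_pe_ioctl(pe, VFIO_EEH_PE_CONFIGURE, 0))
		return -1;

	/* Step 4: the caller resumes PCI traffic. */
	return 0;
}
```

Note that the configure command appears twice, once after enabling the I/O
path and once after the reset, which matches the two uses listed above.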
.../...
>
>No, I prefer to stay consistent with the rest of the VFIO API and use
>argsz + flags.
>
Here's the recap of my previous reply: I have several cases for ioctl().
- ioctl(fd, cmd, NULL): I don't need any input info.
- ioctl(fd, cmd, &data): I need input info.
For all the cases, should I simply have a data struct that includes "argsz+flags"?
For the return value from ioctl(), can we simply have an additional field in the
above data struct to carry it? "0" is a piece of information I have to return
in some of the cases.
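
To make the question concrete, here is a minimal sketch of what such a
struct might look like. The struct name, its "option" and "result" fields
and both helper functions are hypothetical and only illustrate the idea;
the argsz check mirrors the usual VFIO convention of rejecting a buffer
smaller than the fields the kernel needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: argsz + flags first, as in other VFIO structs. */
struct eeh_pe_arg {
	uint32_t argsz;		/* size of this struct, set by the user */
	uint32_t flags;		/* no flags defined in this sketch */
	uint32_t option;	/* input, for commands that need one */
	uint32_t result;	/* output, e.g. a PE state (0 is valid) */
};

#define EEH_ARG_MINSZ \
	(offsetof(struct eeh_pe_arg, result) + sizeof(uint32_t))

/* User side: fill in the common header before calling ioctl(). */
static void eeh_arg_init(struct eeh_pe_arg *arg, uint32_t option)
{
	assert(arg != NULL);
	arg->argsz = sizeof(*arg);
	arg->flags = 0;
	arg->option = option;
	arg->result = 0;
}

/*
 * Kernel-side stand-in for a "get state" handler. The return value only
 * says whether the call succeeded; the state itself goes in arg->result,
 * so a state of 0 cannot be confused with the success return code.
 */
static int eeh_get_state(struct eeh_pe_arg *arg, uint32_t current_state)
{
	if (arg == NULL || arg->argsz < EEH_ARG_MINSZ)
		return -1;	/* a real handler would return -EINVAL */
	if (arg->flags != 0)
		return -1;	/* unknown flags are rejected */

	arg->result = current_state;
	return 0;
}
```

With this shape the commands that need no input would still pass the
struct, leaving "option" unused.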
.../...
>As agraf noted, I'm asking why reset and configure are separate when
>they seem to be used together.
>
OK. Here's the recap: they're 2 separate steps of error recovery as
defined in the PAPR spec. Also, they correspond to 2 separate RTAS calls.
So I don't think we can put them together.
Thanks,
Gavin