[PATCH/RFC] PCI Error Recovery

Sun Mar 13 04:22:25 EST 2005

On Sat, Mar 12, 2005 at 10:50:35PM +1100, Benjamin Herrenschmidt wrote:
...
> > To use the device again call "foo" first to fix the device.
> > foo then returns if it fixed the device or not.
> 
> I want something along those lines, except that I want it asynchronous,
> because drivers sharing the same bus segment all need to be notified
> before we can re-enable things. Also, I can either
> just re-enable IOs, re-enable DMA, both, reset the slot, etc....
> 
> I may not offer that rich functionality in the generic API,

Why not?
Can't we do that today with various PCI initialization
routines that provide arch (pcibios) specific hooks?
e.g. pci_set_master vs pci_enable_device

I'm wondering if the second part of the error recovery path in
the driver can use its "normal" initialization sequence.
It probably needs adjusting to look for error states, and the first
part will need to clean up pending IO requests.
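
As a rough illustration of that two-part split, here is a minimal userspace
model. Every foo_* name is hypothetical and nothing here is an existing
kernel API; it only shows the control flow: the first part aborts pending
IO, the second part re-runs the same init routine that probe uses, and that
routine has been adjusted to check for an error state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All foo_* names are hypothetical; this only models the control flow. */
#define FOO_MAX_REQ 4

enum foo_req_state { FOO_REQ_FREE, FOO_REQ_PENDING, FOO_REQ_ABORTED };

struct foo_dev {
    bool hw_ready;                        /* set by the init sequence */
    bool in_error;                        /* set while the device is unusable */
    enum foo_req_state req[FOO_MAX_REQ];  /* in-flight IO requests */
};

/* The "normal" init sequence, shared by probe and by error recovery. */
static int foo_hw_init(struct foo_dev *dev)
{
    if (dev->in_error)
        return -1;   /* the adjustment: refuse to touch a device in error */
    dev->hw_ready = true;
    return 0;
}

static int foo_probe(struct foo_dev *dev)
{
    dev->in_error = false;
    for (size_t i = 0; i < FOO_MAX_REQ; i++)
        dev->req[i] = FOO_REQ_FREE;
    return foo_hw_init(dev);
}

/* First part: stop using the device and clean up pending IO.
 * Returns the number of requests that were aborted. */
static int foo_error_detected(struct foo_dev *dev)
{
    int aborted = 0;

    dev->in_error = true;
    dev->hw_ready = false;
    for (size_t i = 0; i < FOO_MAX_REQ; i++) {
        if (dev->req[i] == FOO_REQ_PENDING) {
            dev->req[i] = FOO_REQ_ABORTED;
            aborted++;
        }
    }
    return aborted;
}

/* Second part: the platform says the device is reachable again,
 * so clear the error state and reuse the normal init sequence. */
static int foo_error_resume(struct foo_dev *dev)
{
    dev->in_error = false;
    return foo_hw_init(dev);
}
```

The point of the model is only that foo_hw_init() is written once and
called from both paths; how the two parts get invoked is left open.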

>  but I need
> to find the right "cutting point". Just re-enabling IOs is useful for
> drivers who can extract diagnostic infos from the device, for example
> after a DMA error.

By "IO", I'm guessing you mean MMIO or IO Port space access.
This implies only the device driver knows what/where any diag info lives.
But some of the info is architected in PCI: SERR and PERR status bits.
PCIe seems to be richer in error reporting but I don't know details.
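
For the architected part, a small sketch of decoding the error bits in the
16-bit PCI Status register (config offset 0x06). The bit values are the
standard ones and the macro names follow Linux's pci_regs.h; the helper
functions pci_status_errors() and pci_status_report() are made up for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Standard PCI Status register error bits (names as in Linux pci_regs.h). */
#define PCI_STATUS                   0x06    /* config space offset */
#define PCI_STATUS_PARITY            0x0100  /* master data parity error */
#define PCI_STATUS_SIG_TARGET_ABORT  0x0800  /* signaled target abort */
#define PCI_STATUS_REC_TARGET_ABORT  0x1000  /* received target abort */
#define PCI_STATUS_REC_MASTER_ABORT  0x2000  /* received master abort */
#define PCI_STATUS_SIG_SYSTEM_ERROR  0x4000  /* device asserted SERR# */
#define PCI_STATUS_DETECTED_PARITY   0x8000  /* device detected a parity error */

#define PCI_STATUS_ERROR_MASK \
    (PCI_STATUS_PARITY | PCI_STATUS_SIG_TARGET_ABORT | \
     PCI_STATUS_REC_TARGET_ABORT | PCI_STATUS_REC_MASTER_ABORT | \
     PCI_STATUS_SIG_SYSTEM_ERROR | PCI_STATUS_DETECTED_PARITY)

/* Hypothetical helper: keep only the error bits of a status word. */
static uint16_t pci_status_errors(uint16_t status)
{
    return status & PCI_STATUS_ERROR_MASK;
}

/* Hypothetical helper: print one line per error bit set and
 * return how many error bits were set. */
static int pci_status_report(uint16_t status, FILE *out)
{
    static const struct { uint16_t bit; const char *name; } tbl[] = {
        { PCI_STATUS_PARITY,           "master data parity error" },
        { PCI_STATUS_SIG_TARGET_ABORT, "signaled target abort" },
        { PCI_STATUS_REC_TARGET_ABORT, "received target abort" },
        { PCI_STATUS_REC_MASTER_ABORT, "received master abort" },
        { PCI_STATUS_SIG_SYSTEM_ERROR, "signaled system error (SERR#)" },
        { PCI_STATUS_DETECTED_PARITY,  "detected parity error" },
    };
    int n = 0;

    for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
        if (status & tbl[i].bit) {
            fprintf(out, "  %s\n", tbl[i].name);
            n++;
        }
    }
    return n;
}
```

In a Linux driver the status word would come from
pci_read_config_word(pdev, PCI_STATUS, &status). These bits are
write-one-to-clear, so whoever reads them also has to decide who clears them.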

I think the majority of the error info is much more likely to be held
in driver state and platform chipset state. E.g. only the driver will
be able to associate a particular IO request with the invalid DMA or
MMIO address that the chipset captured. The driver can reject that IO
(with extreme prejudice so it doesn't get retried) and restart the PCI
device.
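
A sketch of that association step, assuming the platform can hand the
driver the address it captured and that the address falls inside a buffer
the driver mapped for an in-flight request. A real driver may need a looser
match, e.g. for a device that ran past the end of its buffer. The
foo_request structure and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request bookkeeping kept by the driver. */
struct foo_request {
    uint64_t dma_addr;        /* bus address of the mapped buffer */
    uint32_t dma_len;         /* length of the mapping in bytes */
    bool     in_flight;       /* handed to the device, not yet completed */
    bool     failed_no_retry; /* rejected; upper layers must not retry */
};

/* Find the in-flight request whose DMA window contains bad_addr. */
static struct foo_request *foo_find_request(struct foo_request *reqs,
                                            size_t n, uint64_t bad_addr)
{
    for (size_t i = 0; i < n; i++) {
        struct foo_request *r = &reqs[i];

        if (r->in_flight && bad_addr >= r->dma_addr &&
            bad_addr - r->dma_addr < r->dma_len)
            return r;
    }
    return NULL;
}

/* Reject the request that owns the captured address so that it is
 * not retried. Returns false if no in-flight request matches. */
static bool foo_reject_faulting_request(struct foo_request *reqs,
                                        size_t n, uint64_t bad_addr)
{
    struct foo_request *r = foo_find_request(reqs, n, bad_addr);

    if (!r)
        return false;
    r->in_flight = false;
    r->failed_no_retry = true;
    return true;
}
```

Requests that do not match stay in flight here; whether they are replayed
or also failed after the device restart is a separate policy decision.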

In case it's not obvious, this is all just hand waving and maybe
it will inspire something more realistic...

> Resetting the slot may be necessary to get some devices back.

*nod*  Or even several slots.

> > I don't get why the driver even needs to know about isolation
> > or not. It's not fundamentally different from a bus abort
> > on other systems, just that it lasts longer. 

I think the driver just needs to know if it's ok to do MMIO/IO Port
access to the device or not at any given point in time.
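
One way to picture that, as a userspace model only: a per-device flag that
every register accessor checks before touching the hardware. The names
(struct foo_io, io_ok, foo_io_read, foo_io_write) are invented for the
example, and the sketch ignores locking against interrupt handlers and
other CPUs, which a real driver could not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device handle; regs stands in for an ioremap()ed BAR. */
struct foo_io {
    volatile uint32_t *regs;
    bool io_ok;   /* false while MMIO/IO Port access is not allowed */
};

/* Read a register only if access is currently allowed.
 * Returns false and leaves *val untouched otherwise. */
static bool foo_io_read(struct foo_io *io, unsigned int idx, uint32_t *val)
{
    if (!io->io_ok)
        return false;
    *val = io->regs[idx];
    return true;
}

/* Write a register only if access is currently allowed. */
static bool foo_io_write(struct foo_io *io, unsigned int idx, uint32_t val)
{
    if (!io->io_ok)
        return false;
    io->regs[idx] = val;
    return true;
}
```

Whoever reports the error would clear io_ok, and whoever re-enables the
device would set it again; the driver itself never has to know whether the
cause was isolation or something else.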

A simpler strategy could be to just blow away (PCI Bus reset) the failed
device(s) and reconfigure the PCI bus. Then call back into the drivers
to tell them their devices suffered an "event". But then finer-grained
recovery isn't really possible.
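
The ordering of that simpler strategy, modelled in userspace with invented
names (struct seg_dev, seg_reset_and_notify): every device on the segment
is reset and reconfigured before any driver gets called, and each driver
gets a single "something happened" callback with no further detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seg_dev;
typedef void (*seg_event_fn)(struct seg_dev *dev);

/* Hypothetical view of one device on the affected bus segment. */
struct seg_dev {
    bool failed;          /* this device was implicated in the error */
    bool configured;      /* config space (BARs etc.) is set up */
    seg_event_fn event;   /* driver callback, NULL if no driver is bound */
    int events_seen;      /* bookkeeping for the example callback */
};

/* Reset the whole segment, reconfigure it, then notify the drivers.
 * Returns the number of drivers that were called. */
static size_t seg_reset_and_notify(struct seg_dev *devs, size_t n)
{
    size_t notified = 0;

    /* Pass 1: a bus reset wipes the configuration of every device. */
    for (size_t i = 0; i < n; i++)
        devs[i].configured = false;

    /* Pass 2: reconfigure the bus; failed devices get a fresh start. */
    for (size_t i = 0; i < n; i++) {
        devs[i].configured = true;
        devs[i].failed = false;
    }

    /* Pass 3: only now call back into the drivers. */
    for (size_t i = 0; i < n; i++) {
        if (devs[i].event) {
            devs[i].event(&devs[i]);
            notified++;
        }
    }
    return notified;
}

/* Example driver callback: just count the event. */
static void seg_count_event(struct seg_dev *dev)
{
    dev->events_seen++;
}
```

The callback carries no information about what failed or why, which is
exactly the loss of finer-grained recovery mentioned above.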

grant