[PATCH/RFC] PCI Error Recovery

Tue Mar 15 04:49:06 EST 2005

Hi,

> The problem is that we have
> potentially more than one driver affected. Even if the error was
> triggered by one card/function, several cards/functions may have been
> isolated etc...

To be specific, on PPC64 we have PCI buses that are physical cables
running from one rack-mounted drawer to the rack cage that contains
the CPU (the "CEC", central electronics complex).  Each rack-mounted
cage may hold 4, 8 or 16 PCI cards, and a failure on that bus could
take out multiple PCI cards at once.

Even on a plain-jane desktop system, one is confronted with
"multi-function PCI cards", which can cause multiple drivers to be
loaded for a single slot.

> We need to "notify" all drivers, give them a chance to re-enable device
> & gather diagnostic data, etc... before we try to reset the slot if a
> driver decides it requires that to happen. Also, if a driver is ok after
> just enabling the device() re-initializes itself, but it's sibling
> decides it needs to reset the slot ?

[...]

> Yes, but I think fine grained recovery ends up beeing an API nightmare
> when you start dealing with several drivers on the same segment with
> conflicting requirements for recovery.

I'm thinking of having a way of asking all affected drivers "what can 
you deal with?" and then playing to the lowest common denominator. 
For example, the current reset sequence tries to do a hotplug add-remove
if the driver is ignorant. At this time, I distinguish "ignorant"
from "not ignorant" based on whether the 'state change' callback is null
or not.   I'll try to think of something just a tiny bit more
fine-grained than this.
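
Roughly, the "lowest common denominator" idea could look like the
sketch below.  This is not the patch; every name in it (the
recovery_level enum, struct affected_driver, the what_can_you_deal_with
callback, pick_recovery) is made up for illustration, and the ordering
of the levels is an assumption.  A NULL callback stands for an
"ignorant" driver, which forces the hotplug remove/add path for
everyone on the segment.

```c
#include <stddef.h>

/* Hypothetical recovery levels, ordered from least to most capable. */
enum recovery_level {
	RECOVER_HOTPLUG = 0,	/* ignorant driver: hotplug remove/add */
	RECOVER_SLOT_RESET,	/* driver can cope with a slot reset */
	RECOVER_REENABLE,	/* driver only needs the device re-enabled */
};

/* Hypothetical per-driver record for one affected device/function. */
struct affected_driver {
	const char *name;
	/* NULL means the driver knows nothing about error recovery. */
	enum recovery_level (*what_can_you_deal_with)(void);
};

/* Example callbacks, for illustration only. */
static enum recovery_level wants_reenable(void)
{
	return RECOVER_REENABLE;
}

static enum recovery_level wants_slot_reset(void)
{
	return RECOVER_SLOT_RESET;
}

/*
 * Ask every affected driver what it can handle and return the lowest
 * level reported.  With no drivers at all, nothing constrains the
 * choice, so the most capable level is returned.
 */
static enum recovery_level
pick_recovery(const struct affected_driver *drv, size_t n)
{
	enum recovery_level lowest = RECOVER_REENABLE;
	size_t i;

	for (i = 0; i < n; i++) {
		enum recovery_level level = drv[i].what_can_you_deal_with ?
			drv[i].what_can_you_deal_with() : RECOVER_HOTPLUG;

		if (level < lowest)
			lowest = level;
	}
	return lowest;
}
```

With this, one driver asking for a slot reset drags a sibling that
would have been happy with a re-enable down to a slot reset, and a
single ignorant driver drags the whole segment down to hotplug.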

--linas