[PATCH/RFC] PCI Error Recovery
Linas Vepstas
linas at austin.ibm.com
Tue Mar 15 04:49:06 EST 2005
Hi,
> The problem is that we have
> potentially more than one driver affected. Even if the error was
> triggered by one card/function, several cards/functions may have been
> isolated etc...
To be specific, on PPC64 we have PCI buses that are physical cables
that run from one rack-mounted drawer to the other rack cage that
contains the CPU (the "CEC", or central electronics complex). Each
rack-mounted cage may hold 4 or 8 or 16 PCI cards, and a failure on
that bus could take out multiple PCI cards at once.
Even on a plain-jane desktop system, one is confronted with
"multi-function PCI cards", which can cause multiple drivers to be
loaded.
> We need to "notify" all drivers, give them a chance to re-enable device
> & gather diagnostic data, etc... before we try to reset the slot if a
> driver decides it requires that to happen. Also, if a driver is ok after
> just enabling the device() re-initializes itself, but its sibling
> decides it needs to reset the slot?
[...]
> Yes, but I think fine grained recovery ends up being an API nightmare
> when you start dealing with several drivers on the same segment with
> conflicting requirements for recovery.
I'm thinking of having a way of asking all affected drivers "what can
you deal with?" and then playing to the lowest common denominator.
For example, the current reset sequence tries to do a hotplug add-remove
if the driver is ignorant. At this time, I distinguish "ignorant"
from "not ignorant" based on whether the 'state change' callback is null
or not. I'll try to think of something just a tiny bit more
fine-grained than this.
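
To make the "lowest common denominator" idea concrete, here is a rough
userspace sketch in C. Nothing in it is the actual patch or an existing
kernel interface: the recovery_cap levels, the fake_driver and fake_dev
structures, and the lowest_common_cap() helper are all hypothetical names
made up for illustration. The only thing taken from the description above
is that a driver with a null 'state change' callback is treated as
ignorant, and that the weakest answer on the segment wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recovery levels, ordered from least to most capable. */
enum recovery_cap {
    CAP_IGNORANT = 0,   /* no callback: fall back to hotplug remove/add */
    CAP_NEEDS_RESET,    /* driver can recover, but wants the slot reset */
    CAP_REENABLE_ONLY   /* driver is fine once the device is re-enabled */
};

struct fake_dev;

/* Hypothetical stand-in for a driver; not the kernel's struct pci_driver. */
struct fake_driver {
    const char *name;
    /* NULL means the driver knows nothing about error recovery. */
    enum recovery_cap (*state_change)(struct fake_dev *dev);
};

/* Hypothetical stand-in for one device (function) on the affected bus. */
struct fake_dev {
    const struct fake_driver *drv;
};

/* Example callbacks that two different drivers might supply. */
enum recovery_cap cb_reenable_only(struct fake_dev *dev)
{
    (void)dev;
    return CAP_REENABLE_ONLY;
}

enum recovery_cap cb_needs_reset(struct fake_dev *dev)
{
    (void)dev;
    return CAP_NEEDS_RESET;
}

/* Ask one device's driver what it can deal with. */
static enum recovery_cap query_dev(struct fake_dev *dev)
{
    if (dev->drv == NULL || dev->drv->state_change == NULL)
        return CAP_IGNORANT;
    return dev->drv->state_change(dev);
}

/* Poll every affected device and keep the weakest answer. */
enum recovery_cap lowest_common_cap(struct fake_dev *devs, size_t n)
{
    enum recovery_cap lowest = CAP_REENABLE_ONLY;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        enum recovery_cap c = query_dev(&devs[i]);
        if (c < lowest)
            lowest = c;
    }
    return lowest;
}
```

With something like this, one sibling asking for a slot reset forces a
reset for everyone on the segment, and a single ignorant driver drags the
whole segment down to the hotplug path. Whether three levels is the right
granularity is exactly the open question.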
--linas