pci error recovery procedure

Zhang, Yanmin yanmin_zhang at linux.intel.com
Thu Sep 7 11:56:19 EST 2006

On Thu, 2006-09-07 at 04:01, Linas Vepstas wrote:
> On Wed, Sep 06, 2006 at 09:26:56AM +0800, Zhang, Yanmin wrote:
> > > > The
> > > > error_detected of the drivers in the latest kernel who support err handlers
> > > > always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.
> > > 
> > > Just because the current drivers do it this way does not mean that this is
> > > the best way to do things.
> >
> > If it's not the best way, why did you choose to reset slot for e1000/e100/ipr
> > error handlers? They are typical widely-used devices. To make it easier to
> > add error handlers?
> I did it that way just to get going, get something working. I do not
> have hardware specs for any of these devices, and do not have much of 
> an idea of what they are capable of;
Yes, it's difficult to add fine-grained error handlers for guys who are not
the driver developers.

>  the recovery code I wrote is of
> "brute force, hit it with a hammer"-nature.  Driver writers who 
> know thier hardware well, and are interested in a more refined 
> approach are encouraged to actualy use a more refined approach.
I guess almost no driver developer is happy to spend lots of time to
add refined steps. They would like to focus on normal process (for achievement
feeling? :) ).
In addition, if they use fine-grained steps in error handlers, all these
steps might be rewritten when the device specs is upgraded. Fine-grained steps in
error handlers are more difficut to debug.

It's impossible for you to develop error handlers for all device drivers.

The error handlers look a little like suspend/resume. Of course, it's more
complicated. If we could keep it as simple as suspend/resume, it's more welcomed.

pci error shouldn't happen frequently. And when it happens, I think mostly it's
an endpoint device instead of bridge. When it happens, if we choose always
reset slot, performance could be degraded, but not too much. I just deduce, and 
didn't test it on a machine with hundreds of devices.

More information about the Linuxppc-dev mailing list