EEH/Hotplug (was Re: [PATCH] rpaphp broken in ameslab)

Sat Jul 3 04:12:15 EST 2004

On Fri, Jul 02, 2004 at 12:29:49PM -0500, linas at austin.ibm.com wrote:
> On Fri, Jul 02, 2004 at 03:57:41PM +1000, Paul Mackerras wrote:
> > > > I want to see a notifier list exported by eeh.c as I proposed in a
> > > > previous email before that goes upstream.
> > >
> > > It's currently implemented as a work queue. Is that acceptable?
> > > To keep gregkh happy, I'll move the work queue to
> > > drivers/pci/hotplug/rpaphp_eeh.c. Will this work?
> >
> > It's not the work queue that is the problem, it is that the EEH code
> > is taking a decision about what hotplug should do.  I am saying that
> > the EEH code should offer to provide notifications to any interested
> > code about slot isolation events.  The slot isolation event is a fact,
> > the request to do an unplug operation is policy.  Let's leave the
> > policy up to the rpaphp driver and/or userspace.
>
> I'm not yet convinced that hotplug should be the focal point for
> device driver policy decisions,

Sorry, but you're a bit late to the table for trying to change this
overall kernel design decision :)

> but I'll go ahead and implement the notifier chain for now, and see
> what happens.

Thank you.
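
For anyone following along, here is a minimal userspace sketch of the
split Paul describes: the EEH side only reports the fact that a slot was
isolated, and whoever registered decides what to do about it. This is
not the eeh.c code and not the kernel's notifier API (that lives in
include/linux/notifier.h). Every name below (eeh_register_notifier,
eeh_notify, EEH_SLOT_ISOLATED, rpaphp_listener) is hypothetical and made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event code; not taken from eeh.c. */
enum eeh_event { EEH_SLOT_ISOLATED = 1 };

/* One entry in the notifier list. */
struct eeh_notifier {
	int (*call)(struct eeh_notifier *self, enum eeh_event ev, void *slot);
	struct eeh_notifier *next;
};

static struct eeh_notifier *eeh_chain;

/* Add a listener at the head of the list. */
void eeh_register_notifier(struct eeh_notifier *nb)
{
	assert(nb != NULL && nb->call != NULL);
	nb->next = eeh_chain;
	eeh_chain = nb;
}

/* Remove a listener; returns 0 on success, -1 if it was not registered. */
int eeh_unregister_notifier(struct eeh_notifier *nb)
{
	struct eeh_notifier **pp;

	for (pp = &eeh_chain; *pp; pp = &(*pp)->next) {
		if (*pp == nb) {
			*pp = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/*
 * Report a fact to every listener. No policy here: the caller does not
 * know or care what the listeners do. Returns the number notified.
 */
int eeh_notify(enum eeh_event ev, void *slot)
{
	struct eeh_notifier *nb, *next;
	int n = 0;

	for (nb = eeh_chain; nb; nb = next) {
		next = nb->next;	/* saved in case the callback unregisters */
		nb->call(nb, ev, slot);
		n++;
	}
	return n;
}

/* Stand-in for the hotplug driver: the policy decision lives here. */
static int unplug_requests;

static int rpaphp_listener(struct eeh_notifier *self, enum eeh_event ev,
			   void *slot)
{
	(void)self;
	(void)slot;
	if (ev == EEH_SLOT_ISOLATED)
		unplug_requests++;	/* a real driver would unplug, or ask userspace */
	return 0;
}
```

A real implementation would also need locking around the list, which is
left out here to keep the sketch short.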

> Note that the scsi generic layer implements a bunch of policy for
> almost the same kind of thing, except that it's for the scsi bus,
> and not for the pci bus.   Not all scsi device drivers use the
> scsi-generic layer, but those that do get a reset sequence something
> like the following:
>
> -- if device not responding, reset device
> -- if above failed, retry a few times.
> -- if still failed, reset scsi bus
> -- if still failed, retry a few times ...
> -- if above failed, reset scsi controller
>
> For pci bus disconnection events that affected scsi devices, I was
> going to tap into that 'policy' code.  I'm not sure I want to comment
> more until I try the prototype.

scsi errors and pci errors are quite different things.  For one, I'm
pretty sure the scsi stuff is specified by the spec.  And it's way more
common than pci errors would be.

It's also done in a generic manner, not an arch-specific way, which is a
good thing.
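
The escalation sequence quoted above (device, then bus, then controller,
with retries at each level) can be sketched roughly as follows. This is
a userspace illustration only and not the scsi layer's actual error
handling code. The names (reset_level, escalate_reset, reset_fails,
reset_ok) are hypothetical, and the retry counts are whatever the caller
picks, since the description only says "a few times".

```c
#include <assert.h>
#include <stddef.h>

/* A reset attempt; returns 0 on success, nonzero on failure. */
typedef int (*reset_fn)(void *ctx);

/* One rung of the ladder: what to reset and how often to try. */
struct reset_level {
	const char *name;
	reset_fn reset;
	int max_tries;
};

/*
 * Try each level in order, retrying up to max_tries times before moving
 * on to the next, more drastic level. Returns the index of the level
 * that succeeded, or -1 if every level failed.
 */
int escalate_reset(const struct reset_level *levels, size_t nlevels,
		   void *ctx)
{
	size_t i;
	int t;

	assert(levels != NULL);
	for (i = 0; i < nlevels; i++) {
		for (t = 0; t < levels[i].max_tries; t++) {
			if (levels[i].reset(ctx) == 0)
				return (int)i;
		}
	}
	return -1;
}

/* Stub resets for illustration; ctx points to an int call counter. */
static int reset_fails(void *ctx)
{
	++*(int *)ctx;
	return -1;
}

static int reset_ok(void *ctx)
{
	++*(int *)ctx;
	return 0;
}
```

With a ladder of { device, bus, controller } where only the controller
reset works, escalate_reset() walks through every device and bus retry
before reaching the controller level.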

> I'm not sure if anyone is thinking about i/o fabrics yet, or how
> that policy gets done ... for example, one disk is attached to
> two scsi controllers, and there was an eeh event on one of the
> controllers; where is the failover policy implemented?  Currently,
> I think the device drivers that do this are all proprietary ...

The multipath people are working on this, using dm and userspace stuff.
The kernel drivers that try to do this within the kernel have been
rejected for one reason or another (not the least being that no company
seems to want to release them...)

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/