more eeh

Sat Mar 20 04:45:30 EST 2004

On Thu, Mar 18, 2004 at 04:01:16PM -0800, Greg KH wrote:
>
> But you want userspace to do this.  There are systems with a few
> different PCI Hotplug controller drivers on them.  The different
> controller drivers control different slots.  Userspace is the only place
> that can reliably handle this.

There are several issues here.  First, the only controllers that
have this feature are the ppc64 phb controllers.  That is not
an excuse for sloppy coding, since, yes, there might be other hardware
in the future that does this.

More importantly, you've got to recognize that many (most?) EEH
events are going to be 'transient' i.e. single-shot parity errors
and the like.  If the error occurred e.g. on a scsi controller,
this type of error can be recovered from without any need to unmount
the file system that sits above the block device that sits on the
scsi driver.

In particular, if the EEH error hit the scsi controller that has
the root volume, there would be no way to actually call user-space
code (since this code is probably not paged into memory, and
there can't be any disk access till the error is cleared.)

To reiterate: if there is a *permanent* hardware failure that EEH
cannot recover from, then, yes, the right thing is to bounce it back
up to the user-space scripts that can then deal with the event.
Else, for transient events, it is far more elegant to handle these
in a layer that hides them from the affected block devices/socket layer.
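
To make the split concrete, here is a rough sketch of the policy I
have in mind.  It is not the actual EEH code: every name below is
made up for illustration, and treating a failed in-kernel reset as a
permanent failure is my assumption.

```c
#include <stdbool.h>

/* Hypothetical classification of an EEH event.  These names are
 * illustrative only and are not taken from the ppc64 EEH code. */
enum eeh_sketch_kind {
    EEH_SKETCH_TRANSIENT,   /* e.g. a single-shot parity error */
    EEH_SKETCH_PERMANENT    /* hardware failure EEH cannot recover from */
};

/* Where the event ends up being handled. */
enum eeh_sketch_action {
    EEH_SKETCH_RECOVER_IN_KERNEL,   /* hidden from block/socket layers */
    EEH_SKETCH_NOTIFY_USERSPACE     /* bounced up to user-space scripts */
};

/*
 * Decide who handles the event.  'reset_ok' says whether the in-kernel
 * recovery attempt succeeded.  Assumption: a transient event whose
 * recovery fails is treated the same as a permanent failure.
 */
enum eeh_sketch_action eeh_sketch_dispatch(enum eeh_sketch_kind kind,
                                           bool reset_ok)
{
    if (kind == EEH_SKETCH_TRANSIENT && reset_ok)
        return EEH_SKETCH_RECOVER_IN_KERNEL;

    return EEH_SKETCH_NOTIFY_USERSPACE;
}
```

The point is only that user-space is consulted after the in-kernel
path has given up, never before, so a transient error on the root
volume's controller does not depend on code that may need disk access.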

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/