more eeh

Sat Mar 20 05:42:53 EST 2004

On Fri, Mar 19, 2004 at 11:45:30AM -0600, linas at austin.ibm.com wrote:
> On Thu, Mar 18, 2004 at 04:01:16PM -0800, Greg KH wrote:
> >
> > But you want userspace to do this.  There are systems with a few
> > different PCI Hotplug controller drivers on them.  The different
> > controller drivers control different slots.  Userspace is the only place
> > that can reliably handle this.
>
> There are several issues here.  First, the only controllers that
> have this feature are the ppc64 phb controllers.

For the PPC64 platform today.  I guarantee that this will not be true
this time next year.

> Which is not an excuse for sloppy coding, since, yes, there might be
> other hardware in the future that does this.

There is other hardware shipping next month that does this (or whenever
PCI Express finally makes it out into the real world, should be any day
now...)

So, you are correct, there is no excuse for sloppy coding, or special
casing this kind of stuff.

> More importantly, you've got to recognize that many (most?) EEH
> events are going to be 'transient' i.e. single-shot parity errors
> and the like.

I don't know, is this really true?  Do you have any research showing
this?  I've seen flaky pci cards die horrible deaths all the time in my
testing.

> If the error occurred e.g. on a scsi controller, this type of error
> can be recovered without any need to unmount the file system that sits
> above the block device that sits on the scsi driver.

"transient", yes.  But what determines if this is such an error and not a
more serious one?  Do you have that level of "seriousness" detection in
your hardware controller?
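
To make the question concrete, here is a rough sketch of what a purely
software notion of "seriousness" could look like: count errors per slot
inside a time window and stop calling them transient once a threshold is
crossed.  This is only an illustration under stated assumptions.  The
struct, the function name, the one-hour window and the limit of 3 are all
hypothetical and do not come from the EEH code or any hotplug driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slot error history; none of these names come from the
 * real EEH or PCI hotplug code. */
#define ERR_WINDOW_SECS   3600  /* assumed: look back one hour */
#define ERR_MAX_IN_WINDOW 3     /* assumed: more than this is not transient */

struct slot_err_history {
    long window_start;   /* time (in seconds) the current window began */
    unsigned int count;  /* errors seen in the current window */
};

/* Record an error at time `now` and report whether the slot still looks
 * transient (true) or should be treated as failed (false). */
static bool slot_error_is_transient(struct slot_err_history *h, long now)
{
    assert(h != NULL);

    /* Start a fresh window if there is no history or the old one expired. */
    if (h->count == 0 || now - h->window_start >= ERR_WINDOW_SECS) {
        h->window_start = now;
        h->count = 0;
    }
    h->count++;
    return h->count <= ERR_MAX_IN_WINDOW;
}
```

A policy like this only sees how often a slot fails, not why, so it cannot
tell a single parity error from the first symptom of a dying card.  That
distinction would still have to come from the hardware.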

> In particular, if the EEH error hit the scsi controller that has
> the root volume, there would be no way to actually call user-space
> code (since this code is probably not paged into the kernel, and
> there can't be any disk access till the error is cleared.)

True, but again, it's a rare case, right?  If you are really worried
about this kind of stuff, put your hotplug scripts (and bash) on a ramfs
partition.  I've heard of embedded people doing this all the time to
allow disks to spin down and yet still have a system with good response
times to different events.

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/