more eeh

linas at austin.ibm.com
Sat Mar 20 08:24:07 EST 2004


On Fri, Mar 19, 2004 at 10:42:53AM -0800, Greg KH wrote:
>
> > More importantly, you've got to recognize that many (most?) EEH
> > events are going to be 'transient' i.e. single-shot parity errors
> > and the like.
>
> I don't know, is this really true?  Do you have any research showing
> this?  I've seen flaky pci cards die horrible deaths all the time in my
> testing.

Yes, I wish I had good data on this stuff; I haven't yet found anyone who
has it, and at the moment I'm just getting anecdotal feedback.  Basically,
there are complaints from the field that the Linux kernel panics on error,
and after reboot the hardware is fine, so please fix the panic.   It
would take some sort of data mining of customer support calls to get good
science out of this;  I don't know if that's been done, or who has done it.
Who knows, the IBM research journal might have an article on this.

> > If the error occurred, e.g. on a scsi controller, this type of error
> > can be recovered without any need to unmount the file system that sits
> > above the block device that sits on the scsi driver.
>
> "transient", yes.  But what determines if this is such a error and not a
> more serious one?  Do you have that level of "seriousness" detection in
> your hardware controller?

I'm expecting the one-shot errors to be parity errors; the hardware
detects parity errors.  I don't know PCI well enough to know the other
likely scenarios.  I do remember from my childhood that on the scsi bus,
shorting the bus attn line to ground takes down the whole scsi chain.
I assume there are similar scenarios on pci.

> > In particular, if the EEH error hit the scsi controller that has
> > the root volume, there would be no way to actually call user-space
> > code (since this code is probably not paged into the kernel, and
> > there can't be any disk access till the error is cleared.)
>
> True, but again, it's a rare case, right?  If you are really worried
> about this kind of stuff, put your hotplug scripts (and bash) on a ramfs
> partition.  I've heard of embedded people doing this all the time to
> allow disks to spin down and yet still have a system with good response
> times to different events.

Hmm. That's an idea.  Can I use the scripts to recover the device without
having to unmount the filesystem above it?  I was thinking of a recovery
mechanism similar to the current scsi error-handling strategy of resetting
the device first, then, if that doesn't work, the bus, then the host
adapter, and only if that still fails reporting the error to the higher
layers.  The scsi error resets aren't scripted (or weren't last time I
looked :).  I'm not sure if they should be; I suppose in some wild SAN
fabric one would need to, but I don't know that level of stuff.
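
To make that escalation concrete, here is a rough sketch in C.  All the
names below (recovery_ops, try_recover() and so on) are made up for
illustration; this is not an existing kernel interface, just the shape of
the thing I have in mind.

/*
 * Hypothetical escalating recovery: try the mildest reset first and only
 * bother the upper layers (block device, filesystem) if everything fails.
 * Each hook returns 0 on success; any hook may be left NULL.
 */
struct recovery_ops {
        int (*device_reset)(void *dev);  /* mildest: reset just the device   */
        int (*bus_reset)(void *dev);     /* harsher: reset the whole bus     */
        int (*host_reset)(void *dev);    /* harshest: reset the host adapter */
};

static int try_recover(const struct recovery_ops *ops, void *dev)
{
        if (ops->device_reset && ops->device_reset(dev) == 0)
                return 0;       /* recovered; upper layers never notice */

        if (ops->bus_reset && ops->bus_reset(dev) == 0)
                return 0;

        if (ops->host_reset && ops->host_reset(dev) == 0)
                return 0;

        return -1;              /* give up; report to the higher layers */
}

The point is just the ordering: a transient parity error should be cured by
the first step or two, and the filesystem above never has to be unmounted.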

--linas

