EEH Ethernet [was Re: Unapplied patches?]

Thu Aug 19 06:17:54 EST 2004

On Wed, Aug 18, 2004 at 01:52:51PM +1000, Paul Mackerras was heard to remark:
>
> I did want to get the notifier list for EEH isolation events in but
> you still had the "only if ethernet" policy code in there.  We had a
> short conversation about that, and despite what you said, I still
> don't think it is right.

OK, I'm agnostic.  Recall, the ethernet check is interim scaffolding.
Others are untested/unsupported.  For example, I tried a 4-port USB
device once, and the kernel died a flaming death.  So the current
"check if ethernet and try to recover" is more of a statement about
what's known to work.  I'm hoping to broaden support real soon now.

In the interest of keeping things rolling, can I get you to accept the
patch 'with ethernet check' for now, and if nothing supersedes it in a
month or two, then you can strip it out?

> I think we possibly need something that
> counts pending EEH errors and panics if the count exceeds a threshold
> instead.

Well, it won't work quite like that; the very first error can either be
recovered, or it can't be.  There's no way to 'ignore' eeh errors.

I am planning on counting the number of times that the same hardware
has faulted, and offlining it if the count exceeds a threshold.  This
would prevent an infinite loop of going down, recovering, going down,
etc.  The hard part turns out to be that once a device has been removed,
there aren't any kernel structures left to keep track of "that device",
so it's a little tricky to figure out if it's the same device that keeps
failing over and over.
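
To make the idea concrete, here is a rough userspace sketch of the
counting scheme, not the actual patch.  It assumes each piece of
hardware can be named by some stable string that outlives the device
structures (a slot or location code, say); the key format, the table
size, the threshold and the name record_fault() are all made up for
illustration.

```c
#include <string.h>

#define MAX_TRACKED 16   /* hypothetical: how many locations we remember */
#define MAX_FAULTS   3   /* hypothetical: recoveries allowed per location */
#define KEY_LEN     32   /* hypothetical: max key length, including NUL */

/*
 * One record per hardware location.  The record is keyed by a string
 * rather than by a pointer to a device structure, so it stays valid
 * after the device has been removed and re-added.
 */
struct fault_record {
    char key[KEY_LEN];
    int  count;
};

static struct fault_record fault_table[MAX_TRACKED];
static int n_records;

/*
 * Note one fault at the location named by 'key'.
 * Returns 0 if recovery may be attempted, 1 if the location has now
 * faulted more than MAX_FAULTS times and should be offlined, and -1
 * if the table is full and the fault could not be recorded.
 */
int record_fault(const char *key)
{
    int i;

    for (i = 0; i < n_records; i++) {
        /* Keys are stored truncated, so compare only what is stored. */
        if (strncmp(fault_table[i].key, key, KEY_LEN - 1) == 0) {
            fault_table[i].count++;
            return fault_table[i].count > MAX_FAULTS;
        }
    }

    if (n_records == MAX_TRACKED)
        return -1;

    /* First fault seen at this location: start a new record. */
    strncpy(fault_table[n_records].key, key, KEY_LEN - 1);
    fault_table[n_records].key[KEY_LEN - 1] = '\0';
    fault_table[n_records].count = 1;
    n_records++;
    return 0;
}
```

With these made-up numbers, the first three faults at a given location
return 0 (go ahead and recover) and the fourth returns 1 (give up and
offline it).  The open question from above remains: what to use as the
key, since it has to identify "that device" without relying on any
structure that is freed on removal.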

> It compiles fine if you turn off hotplug PCI. :) It certainly needs to
> be fixed, but I want it fixed properly.

Actually, that part is a patch that gregkh needs to apply.  I guess I
have to harangue him first.

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/