EEH Ethernet [was Re: Unapplied patches?]
Linas Vepstas
linas at austin.ibm.com
Thu Aug 19 06:17:54 EST 2004
On Wed, Aug 18, 2004 at 01:52:51PM +1000, Paul Mackerras was heard to remark:
>
> I did want to get the notifier list for EEH isolation events in but
> you still had the "only if ethernet" policy code in there. We had a
> short conversation about that, and despite what you said, I still
> don't think it is right.
OK, I'm agnostic. Recall, the ethernet check is interim scaffolding.
Others are untested/unsupported. For example, I tried a 4-port USB
device once, and the kernel died a flaming death. So the current
"check if ethernet and try to recover" is more of a statement about
what's known to work. I'm hoping to broaden support real soon now.
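As a rough illustration of that interim policy (a sketch only, not the code in the patch): the helper name `eeh_recovery_known_to_work` is made up for this example, and it assumes the caller passes the 24-bit PCI class code (base class, subclass, programming interface). Ethernet controllers are base class 0x02, subclass 0x00.

```c
#include <assert.h>
#include <stdint.h>

/* PCI base class 0x02 (network), subclass 0x00 (Ethernet). */
#define CLASS_NETWORK_ETHERNET 0x0200

/*
 * Hypothetical policy helper: returns 1 if recovery has been seen to
 * work for this class of device, 0 otherwise. class_code is assumed
 * to be the 24-bit value (base class << 16 | subclass << 8 | prog-if).
 */
static int eeh_recovery_known_to_work(uint32_t class_code)
{
    assert((class_code >> 24) == 0);   /* only 24 bits are meaningful */

    /* Drop the programming-interface byte, compare class + subclass. */
    return (class_code >> 8) == CLASS_NETWORK_ETHERNET;
}
```

Broadening support would then mean adding classes to this one test as they get verified, rather than touching the recovery path itself.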
In the interest of keeping things rolling, can I get you to accept the
patch 'with ethernet check' for now, and if nothing supersedes it in a
month or two, then you can strip it out?
> I think we possibly need something that
> counts pending EEH errors and panics if the count exceeds a threshold
> instead.
Well, it won't work quite like that; the very first error can either be
recovered, or it can't be. There's no way to 'ignore' EEH errors.
I am planning on counting the number of times that the same hardware
has faulted, and offlining it if the count exceeds a threshold. This
would prevent an infinite loop of going down, recovering, going down,
etc. The hard part turns out to be that once a device has been removed,
there aren't any kernel structures left to keep track of "that device",
so it's a little tricky to figure out if it's the same device that keeps
failing over and over.
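One way the counting could be sketched, purely as an assumption about how it might look: keep a small table that outlives the device structures, keyed by some stable identifier for the physical slot. Everything below is hypothetical (the names `eeh_fail_record` and `eeh_record_failure`, the table size, the threshold of 5, and the use of a location string as the key); it is plain C for illustration, not kernel code.

```c
#include <assert.h>
#include <string.h>

#define EEH_MAX_TRACKED 16   /* hypothetical table size */
#define EEH_MAX_FAILS    5   /* hypothetical failure threshold */
#define EEH_LOC_LEN     64   /* hypothetical max key length, with NUL */

/* One record per physical slot; survives removal of the device. */
struct eeh_fail_record {
    char location[EEH_LOC_LEN];
    unsigned int fail_count;
};

static struct eeh_fail_record eeh_fail_table[EEH_MAX_TRACKED];
static int eeh_fail_used;

/*
 * Record one failure for the slot named by 'location'.
 * Returns 1 if the slot should be left offline, 0 if recovery
 * may be attempted again.
 */
static int eeh_record_failure(const char *location)
{
    int i;

    assert(location != NULL);

    /* Same slot seen before: bump its count and test the threshold. */
    for (i = 0; i < eeh_fail_used; i++) {
        if (strncmp(eeh_fail_table[i].location, location,
                    EEH_LOC_LEN - 1) == 0) {
            eeh_fail_table[i].fail_count++;
            return eeh_fail_table[i].fail_count > EEH_MAX_FAILS;
        }
    }

    /* Table full: be conservative and refuse further recovery. */
    if (eeh_fail_used == EEH_MAX_TRACKED)
        return 1;

    /* First failure for this slot: start a new record. */
    strncpy(eeh_fail_table[eeh_fail_used].location, location,
            EEH_LOC_LEN - 1);
    eeh_fail_table[eeh_fail_used].location[EEH_LOC_LEN - 1] = '\0';
    eeh_fail_table[eeh_fail_used].fail_count = 1;
    eeh_fail_used++;
    return 0;
}
```

The point of keying on the slot rather than on the device is that the record does not disappear when the device is removed, which is what breaks the down/recover/down loop. Finding a key that really is stable across removal is the tricky part and is not solved by this sketch.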
> It compiles fine if you turn off hotplug PCI. :) It certainly needs to
> be fixed, but I want it fixed properly.
Actually, that part is a patch that gregkh needs to apply. I guess I
have to harangue him first.
--linas
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/