more eeh

Thu Mar 18 09:51:47 EST 2004

On Wed, Mar 17, 2004 at 09:23:27AM -0800, Greg KH wrote:
>
> On Tue, Mar 16, 2004 at 06:40:53PM -0600, Nathan Fontenot wrote:
> > Hope I don't spoil anyones dinner with another dose of eeh :)
> >
> > I have attached a patch that generates the hotplug event in the
> > kernel. At least it's supposed to do that.  This eliminates the
> > need for any kind of an eeh daemon and any /proc usage (two good
> > things).
>
> No, you aren't generating a hotplug event here, you are instantly
> shutting the power to the device off, after telling the driver bound to
> the device to disconnect.  Is that what you really want to do?  It's
> quite severe, and is a pretty harsh policy.

Yes, it's harsh.  This is the 'short-term' solution.  Hope to have
something better (a lot better) later.

> Think scsi devices with lots of filesystems mounted.  boom.
> Think multiport ethernet devices with loads of network traffic going
> over the other ethernet devices.  boom.

Yes, boom. Currently, it's a kernel panic, which is even worse.
Nathan was trying to at least get rid of the kernel panic, so that
the system can limp along just long enough for the sysadmin to do
something.

> Why not do this
> 	- get eeh event
> 	- determine which pci_dev this happened to.
> 	- switch back to a task context
> 	- call kobject_hotplug for the pci_dev with the action="fault"
> 	- put a script in /etc/hotplug.d/pci/ that catches all
> 	  ACTION=fault events and decides what to do with them.  You

Well, there are some subtle points that make this complicated.

1) They're not 'events' in the sense of being interrupts or messages
   or something like that.  By the time the linux kernel finds out
   about it, in interrupt or task context,  the eeh hardware has
   already off-lined the adapter.

   An adapter that is offlined by eeh hardware returns -1 on reads
   and ignores all writes.   An adapter that has the power turned
   off returns -1 on reads and ignores all writes.  So, in this
   certain narrow sense, turning off the power is a no-op as far as
   hardware behaviour is concerned.

2) You're right, paulus is right: most of the recovery, etc.
   needs to happen in a task context.  For the 'ultimate' solution,
   I was thinking a kernel daemon; but maybe something else is
   possible.

3) We know that some fraction of EEH events are perma-failures
   (hardware is busted), and these need to trickle up to user scripts,
   presumably exactly with the scenario you describe.  We also know
   that some are one-shot parity errors that can be transparently
   recovered from.

   For the latter, I was really hoping for a design that reset/restarted
   in the device driver, so that the higher layers (block device/sockets)
   aren't even aware that there was a momentary interruption of
   service.  But at this time, that's not in this current patch.

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/