more eeh

Paul Mackerras paulus at
Thu Mar 18 09:54:21 EST 2004

Greg KH writes:

> No, you aren't generating a hotplug event here, you are instantly
> shutting the power to the device off, after telling the driver bound to
> the device to disconnect.  Is that what you really want to do?  It's
> quite severe, and is a pretty harsh policy.

It's the hardware and firmware designers that have taken this policy.
The question is: what do you do with the sort of PCI errors that would
normally result in assertion of the SERR# (system error) line, such as
an address parity error?  On a desktop system you can make the SERR#
signal cause a machine check, but you don't want to do that on a
partitioned system since that would stop all the partitions, not just
the one that was using the device in question.

So the scheme that the hardware designers came up with was to add
logic to the PCI-PCI bridges (we have one per slot, to support
hotplug) to allow a slot to be electrically isolated from the rest of
the system.  Then, if the system detects an address parity error on a
DMA transaction initiated by a particular device, it can just abort
that transaction and isolate that device immediately, and thus stop
the error from affecting any other part of the system.

When the slot is in this state, any writes to the device get thrown
away and any reads return all 1's.  There are in fact ways to get
through the bridge, via firmware calls, which the driver can use to
(for example) dump the state of the device.  There are also firmware
calls to reset the device and to restore the normal connection through
the PCI-PCI bridge.

The idea of presenting this to drivers as a hot-unplug event followed
by a hot-plug event (after the device has been reset and reconnected)
was my suggestion as the best way to present to the drivers what the
hardware is doing.  I envisaged three classes of drivers: (a) those
that were very pSeries-specific and could use a pSeries-specific API
to cope with all this; (b) drivers that could cope with asynchronous
plug and unplug events, to which the EEH shenanigans could be
presented as plug/unplug events, and (c) drivers which couldn't cope
at all.

My hope was that a lot of drivers could end up in class (b): that
most hot-plug aware drivers could be hardened sufficiently without
too much effort, and that the hardening would be acceptable to the
driver maintainers (whereas the changes needed to put a driver in
class (a) would, I expect, not be).  The main requirements are that
the device can be unplugged without prior notification and that the
driver must not do anything silly (like spinning forever) if reads
from the device start returning all 1's.

I was thinking that the unplug event generation, resetting and
reconnecting of the device, and plug event generation would be done by
a kernel thread.  I don't think we want to rely on userspace for that,
because userspace may get blocked while the device is gone.

> Think scsi devices with lots of filesystems mounted.  boom.
> Think multiport ethernet devices with loads of network traffic going
> over the other ethernet devices.  boom.

Well yes.  At least with network devices, if they get unplugged, reset
and replugged, we have the chance for the hotplug scripts to restore
the correct addresses and routes, based on the device's MAC address.

For SCSI host adapters, it's less pretty.  There might be an argument
for writing a class (a) driver for the SCSI HBA for your root disk.
Such a driver could present the whole EEH disconnect/reset/reconnect
thing to the SCSI subsystem as a bus reset, for example.

> It's also not going to work, as you are doing this from interrupt
> context, and the pci disconnect sequence is expecting to have a task
> context and will sleep.
> Why not do this (as this is what I think Anton was suggesting you do):
> 	- get eeh event
> 	- determine which pci_dev this happened to.
> 	- switch back to a task context
> 	- call kobject_hotplug for the pci_dev with the action="fault"
> 	- put a script in /etc/hotplug.d/pci/ that catches all
> 	  ACTION=fault events and decides what to do with them.  You
> 	  have a full pointer to the sysfs directory of the pci device
> 	  at this moment in time, so you can see what driver is bound to
> 	  the device, and if you really want to, you can turn the device
> 	  off (after bringing down the network connection or unmounting
> 	  any attached filesystems.)
> This pushes all of your policy to userspace, allows you to fit into the
> proper kernel event notifier, and allows you to write a shell script if
> you want to do so.
> And it makes the kernel code a whole lot smaller and simpler.
> Sound good?
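
For what it's worth, the userspace end of the flow you describe might look something like this; `$ACTION` and `$DEVPATH` are the conventional hotplug environment variables, but the path and script body are a hypothetical sketch, not working code:

```shell
#!/bin/sh
# Hypothetical /etc/hotplug.d/pci/fault.hotplug -- sketch of the
# policy script described above.  Everything beyond the standard
# hotplug environment variables is assumed.

[ "$ACTION" = "fault" ] || exit 0

SYSDEV="/sys$DEVPATH"

# See which driver, if any, is bound to the faulting device.
DRIVER=$(basename "$(readlink -f "$SYSDEV/driver" 2>/dev/null)" 2>/dev/null)

echo "EEH fault on $DEVPATH (driver: ${DRIVER:-none})"

# Policy decisions -- e.g. bringing down a network interface or
# unmounting filesystems before powering the slot off -- would go here.
```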

I would rather get the notification to the driver quickly without
relying on userspace (but of course from task context not interrupt
context).  What happens after that could be driven by userspace,
except that I worry about what happens if userspace gets blocked by
the device being unavailable.

Greg, I would really value your considered thoughts about how to
handle this stuff properly.  EEH is a fact of life for us - I don't
want to defend the approach, but it is in hardware today and we have
to deal with it.


** Sent via the linuxppc64-dev mail list.