more eeh

Paul Mackerras paulus at
Fri Mar 19 11:32:29 EST 2004

Greg KH writes:

> > When the slot is in this state, any writes to the device get thrown
> > away and any reads return all 1's.
> Which is the same as PCMCIA sees when the device is disconnected, right?

Yes, exactly.

> > I was thinking that the unplug event generation, resetting and
> > reconnecting of the device, and plug event generation would be done by
> > a kernel thread.  I don't think we want to rely on userspace for that,
> > because userspace may get blocked while the device is gone.
> But you want userspace to do this.  There are systems with a few
> different PCI Hotplug controller drivers on them.  The different
> controller drivers control different slots.  Userspace is the only place
> that can reliably handle this.

I don't understand this; surely if you get a pointer to a pci_dev you
can get to pci_dev->driver->remove and call that, can't you?  Or are
you saying that there is no consistent API to the drivers for the
hot-plug capable PCI bridges?

> And if you are a kernel thread, you would have the same issues that
> dropping to userspace and doing the disconnect there causes.

Not all of the same issues; a kernel thread doesn't have to worry
about its code or data being paged out, for instance.

> So I still think that my userspace proposal is the proper way to do
> this.  It works with all pci hotplug drivers, and allows userspace to
> implement any type of policy that it wishes to (disconnecting
> filesystems, bringing down network connections, logging the event to the
> proper place, etc.)
> > I would rather get the notification to the driver quickly without
> > relying on userspace (but of course from task context not interrupt
> > context).  What happens after that could be driven by userspace,
> > except that I worry about what happens if userspace gets blocked by
> > the device being unavailable.
> You've never actually timed a hotplug event, have you? :)

Well, I would be concerned about the maximum latency, not the average
latency.  I accept that the average would be milliseconds, but the
maximum could be tens of seconds on a heavily loaded system, couldn't
it?  Especially if it involves execing a new process and that requires
disk I/O.

> Now the issue of putting the hotplug script on a disk that just got an
> error would indicate that you really need a type (a) driver for that
> kind of thing.

Part of my thinking is that I would like the API for type (a) drivers
to be an extension of the PCI hotplug API rather than being completely
disjoint.  In other words, I would like the type (a) driver to get the
unplug event, and then determine (via a special call, or a parameter
to the remove() function) that this is an EEH event and therefore the
absence of the device is likely to be transient.  The driver would
then not report the removal immediately, but would wait (with a
timeout) for the device to come back.  When it came back it would
recognize that this is the same device, reinitialize it and carry on.
If the device didn't come back shortly, then it would do the normal
device removal things.

In any case, whatever the API, we are going to have to have the
infrastructure in the kernel to do the slot reset and reconnect, for
type (a) drivers to use.  Type (a) drivers need to be able to recover
without relying on userspace, obviously.  It doesn't make sense to me
to have the same logic in two places, in the kernel and in userspace,
and use one or the other depending on what sort of driver we have.



** Sent via the linuxppc64-dev mail list.