PCI errors [was Re: "sparse" warnings..]

Wed May 5 10:47:39 EST 2004

On Wed, May 05, 2004 at 08:39:39AM +1000, Paul Mackerras wrote:
>
> (The discussion that Greg KH mentioned was about how to get the unplug
> notification to the driver; Greg advocates that the kernel tells
> userspace about the EEH event, and userspace then drives the recovery
> process: tell the driver the card is gone, reset the slot and card,
> tell the driver there is a new card in there.)

Yes, that's the best we'll be able to do in the short term.
I've got a little test harness that does this; it crashes the
kernel with a null pointer deref about the 5th time around.
It's hopefully some minor bug in the rpaphp hotplug code.

The Greg KH conversation ended with the factoid that if the PCI
event whacks the disk on which the root filesystem sits, all is
lost.  We conclude that the scsi drivers must recover in the
kernel.  I don't think it's hard (fingers crossed); I think
it's a lot like an HBA reset, and a lot of the support is already
in the scsi_generic layer.  The point being that SCSI already
has a cascading chain of resets that it tries out when things
don't work: first it tries a (disk) device reset, then a scsi
bus reset, then the scsi controller reset.  Recovering from
an EEH error should be identical to a controller reset, except
that we need to unfreeze the slot before doing that reset.
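
To make the escalation idea concrete, here is a minimal user-space
sketch of it. It is only a toy model: struct sim_hba, try_reset(),
unfreeze_slot() and recover() are hypothetical names made up for
illustration, not the kernel's SCSI or EEH interfaces. It assumes a
frozen slot blocks every reset until it is unfrozen, and that an EEH
error is then handled as a plain controller reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reset ladder, weakest first. Not a kernel API. */
enum reset_level { RESET_DEVICE, RESET_BUS, RESET_HOST, RESET_LEVELS };

/* Toy model of a SCSI controller sitting in an EEH-capable slot. */
struct sim_hba {
    bool slot_frozen;            /* true once EEH has isolated the slot */
    enum reset_level fixes_at;   /* weakest reset that clears the simulated fault */
    int attempts[RESET_LEVELS];  /* how often each reset level was tried */
};

/* Try one reset level; while the slot is frozen nothing reaches the card. */
static bool try_reset(struct sim_hba *hba, enum reset_level level)
{
    hba->attempts[level]++;
    if (hba->slot_frozen)
        return false;
    return level >= hba->fixes_at;
}

/* Stand-in for whatever platform call re-enables the slot. */
static void unfreeze_slot(struct sim_hba *hba)
{
    hba->slot_frozen = false;
}

/* Returns the reset level that succeeded, or -1 if none did. */
int recover(struct sim_hba *hba)
{
    assert(hba != NULL);

    if (hba->slot_frozen) {
        /* EEH case: unfreeze first, then go straight to a controller reset. */
        unfreeze_slot(hba);
        return try_reset(hba, RESET_HOST) ? RESET_HOST : -1;
    }

    /* Ordinary case: device reset, then bus reset, then controller reset. */
    for (int level = RESET_DEVICE; level < RESET_LEVELS; level++) {
        if (try_reset(hba, (enum reset_level)level))
            return level;
    }
    return -1;
}
```

In this model the only difference between the two paths is the
unfreeze step in front of the controller reset; the lower rungs of
the ladder are skipped because they cannot reach a card behind a
frozen slot.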

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/