PCI errors [was Re: "sparse" warnings..]

linas at austin.ibm.com linas at austin.ibm.com
Wed May 5 10:33:30 EST 2004


Hi,

On Tue, May 04, 2004 at 01:22:12PM -0700, Linus Torvalds wrote:
>
>
> Device drivers that don't know about error handling can't do anything sane
> about it anyway, so for them the errors should just be ignored. Maybe
> logged, but as far as the driver is concerned, it never happened. That's
> why I'd want the error checking to have to be explicit - we should never
> do anything if the driver doesn't explicitly agree with it.

Except that is not how the hardware works.  Once you get the error,
that's it: the device is blown out of the water, it's history.
It's impossible to ignore this error.

Think about it: there was a PCI Address parity error.  What should one
do?  Pretend the read/write to some bogus address succeeded?  Who
knows what might happen next?

In fact, the actual hardware reacts in a fashion very similar to
having a user yank a PCMCIA card out of a slot, and conceptually
this is probably the easiest way to think about it.  It stops all
further I/O between the device and the system, and that's the end
of the story until that hardware slot is reset.  All further reads
by the device driver return all ones (0xffffffff for a 32-bit read),
and all further writes are dropped on the floor.  DMA in the opposite
direction is cut off.  Basically, the device driver is toast at this
point, and one has to do something.
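
To make the frozen-slot behaviour concrete, here is a minimal user-space
sketch.  It is only a model, not kernel code: struct fake_slot,
slot_read32(), slot_write32() and slot_looks_frozen() are hypothetical
names made up for illustration, and it assumes 32-bit register reads
that come back as all ones once the slot is frozen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a real kernel structure. */
struct fake_slot {
    bool     frozen;  /* set once the hardware has detected an error */
    uint32_t reg;     /* a single device register */
};

#define ALL_ONES 0xffffffffu

/* Reads from a frozen slot return all ones, whatever the register holds. */
static uint32_t slot_read32(const struct fake_slot *s)
{
    assert(s != NULL);
    return s->frozen ? ALL_ONES : s->reg;
}

/* Writes to a frozen slot are silently dropped. */
static void slot_write32(struct fake_slot *s, uint32_t val)
{
    assert(s != NULL);
    if (!s->frozen)
        s->reg = val;
}

/*
 * Heuristic a driver could apply: an all-ones read *may* mean the slot
 * is frozen.  All ones can also be a legitimate register value, so a
 * real driver would have to confirm this some other way.
 */
static bool slot_looks_frozen(const struct fake_slot *s)
{
    return slot_read32(s) == ALL_ONES;
}
```

The point of the model is that the driver gets no error return at all:
the only visible symptom is that reads turn into all ones and writes
stop having any effect, which is why a driver that does not check
explicitly cannot tell anything went wrong.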

Note: on the newest hardware, there is no way to turn this error
detection off.  If the error is detected, the slot will freeze.

I've been playing with the idea of treating these as if they were
hotplug events.  This is ugly, though maybe acceptable for ethernet
outages, but it is, well, unworkable if the error happens to affect
the root file system.

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/




