PCI errors [was Re: "sparse" warnings..]

Wed May 5 05:25:59 EST 2004

On Tue, May 04, 2004 at 11:15:01AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 4 May 2004 linas at austin.ibm.com wrote:
> > extended error handling, a way of reporting PCI bus errors that would
> > otherwise cause machine-checks.
>
> So what was wrong with the suggested interface, ie having something like

I missed out on the chain of emails where this was suggested.

> 	pci_clear_error(pdev);
> 	x = pci_inw_check(pdev, port);
> 	pci_outw_check(pdev, x | BIT, port);
> 	error = pci_check_error(pdev);
>
> (or whatever.. I don't care about the names, but what I do _not_ want to
> have is something that checks synchronously with the IO. To me, the
> important part is that there is a separate "check errors" phase _after_
> the IO has been completed, which is the one that ends up possibly waiting
> for the posted writes to have actually gone to the device etc).

The above is fine for any 'fully EEH-aware device driver', and should
be the interface used.  However, it requires modifications to the device
driver.  There are two problems with that.
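
For concreteness, here is a minimal user-space sketch of the
clear / do-I/O / check pattern quoted above.  The four pci_* names are
taken from the quoted proposal and are not an existing kernel API; the
struct toy_pdev device, its sticky error flag, TOY_BIT and toy_set_bit()
are all hypothetical, invented only so the pattern can be compiled and
exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pci_dev: a toy device with one
 * 16-bit register and a sticky error flag.  Not a real kernel type. */
struct toy_pdev {
	uint16_t reg;     /* the single simulated I/O register */
	int      faulty;  /* nonzero: every access to the device fails */
	int      error;   /* sticky: set on failure, cleared explicitly */
};

#define TOY_BIT 0x0004u  /* arbitrary bit, plays the role of BIT above */

/* Forget any earlier error so the next check covers only new I/O. */
static void pci_clear_error(struct toy_pdev *pdev)
{
	pdev->error = 0;
}

/* Read without stopping on failure; just latch the error. */
static uint16_t pci_inw_check(struct toy_pdev *pdev, unsigned port)
{
	(void)port;  /* the toy device has only one register */
	if (pdev->faulty) {
		pdev->error = 1;
		return 0xFFFF;  /* assumed: a failed read yields all ones */
	}
	return pdev->reg;
}

/* Write without stopping on failure; just latch the error. */
static void pci_outw_check(struct toy_pdev *pdev, uint16_t val, unsigned port)
{
	(void)port;
	if (pdev->faulty) {
		pdev->error = 1;
		return;  /* the write never reaches the device */
	}
	pdev->reg = val;
}

/* Separate check phase, run once after all the I/O has been issued. */
static int pci_check_error(struct toy_pdev *pdev)
{
	return pdev->error;
}

/* Driver-side use: a read-modify-write with one deferred error check. */
static int toy_set_bit(struct toy_pdev *pdev, unsigned port)
{
	uint16_t x;

	assert(pdev != NULL);
	pci_clear_error(pdev);
	x = pci_inw_check(pdev, port);
	pci_outw_check(pdev, x | TOY_BIT, port);
	return pci_check_error(pdev) ? -1 : 0;
}
```

The point of the shape is that the driver pays for error detection once
per sequence rather than once per access, and that the driver itself
decides what a nonzero result means.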

First is the cultural problem: If this is perceived to be a ppc64
stunt, no one will be interested.  If this technology shows up on
non-ppc64 platforms, and there is a pronouncement, e.g. from you, that
'yea, verily, device drivers must be written in this way', then maybe
the idea will get some traction and some device drivers will get converted.
I've been told conflicting things about non-ppc64 hardware; I picked
through the next-generation PCI-X spec, but couldn't find anything
comparable.  I'm not a PCI expert, and it's not clear to me what's going
on there.

The second problem is a more nebulous policy question.  If an error
is detected, and the device driver doesn't know how to deal with it,
should one ignore it, and risk potential data corruption, or should
one panic the machine?  The high-availability guys fear pernicious
data corruption so much that they would rather panic the machine
than ignore the error.  So the current policy is to call "panic"
(although I hope to change that soon).

If you buy into the 'panic-on-error' philosophy, then the problem
becomes one of how to detect the error when your device driver is
not EEH-aware.  Unfortunately, a fault doesn't cause an exception,
so the only way to detect the error is to check for it in-line with
the I/O itself, which results in the current design.
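
As a rough illustration of what 'in-line' means here, the sketch below
wraps a read so that the check happens inside the accessor, invisibly
to an unmodified driver.  It is a user-space toy, not the actual ppc64
code: struct toy_slot, its frozen flag, toy_inw_inline_check() and the
recording handler are all hypothetical, and the sketch assumes that a
failed read returns all ones, so that value is used as the trigger for
the more expensive state check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a PCI slot.  "frozen" stands for whatever
 * state the platform reports once it has isolated a failing device. */
struct toy_slot {
	uint16_t reg;     /* value the device would return when healthy */
	int      frozen;  /* nonzero once an error has isolated the slot */
};

/* Policy hook.  The policy described above would panic here; this
 * toy handler only counts calls so the sketch can run to completion. */
typedef void (*toy_fatal_fn)(const char *msg);

static int toy_fatal_count;

static void toy_record_fatal(const char *msg)
{
	(void)msg;
	toy_fatal_count++;
}

/* Raw access: assumed to return all ones when the slot is frozen. */
static uint16_t toy_raw_inw(const struct toy_slot *slot)
{
	return slot->frozen ? 0xFFFF : slot->reg;
}

/* In-line check: the driver calls this in place of the raw read and
 * never learns that a check took place. */
static uint16_t toy_inw_inline_check(const struct toy_slot *slot,
				     toy_fatal_fn fatal)
{
	uint16_t val;

	assert(slot != NULL && fatal != NULL);
	val = toy_raw_inw(slot);

	/* All ones can be legitimate data, so confirm against the slot
	 * state before invoking the policy. */
	if (val == 0xFFFF && slot->frozen)
		fatal("PCI error detected on read");

	return val;
}
```

The cost of this shape is the one objected to above: the check is tied
to the individual access rather than deferred to a separate phase, and
the driver gets no say in what happens once the error is found.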

--linas

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/