Real-life PCI errors (Was: Re: PCI Error Recovery API Proposal (WAS: [PATCH/RFC] PCI Error Recovery))
Benjamin Herrenschmidt
benh at kernel.crashing.org
Sat Mar 19 12:24:07 EST 2005
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
> >
> > Additionally, in "real life", very few errors are cause by known errata.
> > If the drivers know about the errata, they usually already work around
> > them. Afaik, most of the errors are caused by transcient conditions on
> > the bus or the device, like a bit beeing flipped, or thermal
> > conditions...
>
>
> Heh. Let me describe "real life" a bit more accurately.
>
> We've been running with pci error detection enabled here for the last
> two years. Based on this experience, the ballpark figures are:
>
> 90% of all detected errors were device driver bugs coupled to
> pci card hardware errata
Well, this has been in-lab testing to fight driver bugs/errata on early
release kernels; I'm talking about the context of a released solution
with stable drivers/hw.
> 9% poorly seated pci cards (remove/reseat will make problem go away)
>
> 1% transient/other.
Ok.
> We've seen *EVERY*, and I mean *EVERY*, device driver that we've put
> under stress tests (e.g. peak i/o rates for > 72 hours, massive
> tcp/nfs traffic, massive disk i/o traffic, etc.) trip an EEH error
> detect that was traced back to a device driver bug. Not to blame the
> drivers; a lot of these were related to pci card hardware/firmware
> bugs. For example,
> I think grepping for "split completion" and "NAPI" in the
> patches/errata for e100 and e1000 for the last year will reveal
> some of the stuff that was found. As far as I know,
> for every bug found, a patch made it into mainline.
Yah, those are a pain. But then, it isn't the context described by
Nguyen where the driver "knows" about the errata and how to recover.
It's the context of a bug where the driver does not know what's going on
and/or doesn't have the proper workaround. My point was more that there
are very few cases where a driver will have to do recovery of a PCI
error that it actually expects to happen.
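
(For the record, here's roughly what I'd expect the driver-side hook to
look like under such an API. This is only a sketch; the "foo" driver and
the struct, callback and constant names are made up for illustration, not
necessarily what the proposal will end up with. The point is that a
generic handler needs no errata knowledge: it stops touching the
hardware, logs, and asks for a reset.)

#include <linux/pci.h>

/* Sketch only: a generic detection hook, no errata knowledge needed. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
                                           pci_channel_state_t state)
{
        /* The platform has already frozen I/O to the slot; all the
         * driver can do here is stop using the hardware and report. */
        dev_err(&pdev->dev, "PCI channel failure (state %d)\n", state);
        if (state == pci_channel_io_perm_failure)
                return PCI_ERS_RESULT_DISCONNECT;
        return PCI_ERS_RESULT_NEED_RESET;
}

static struct pci_error_handlers foo_err_handlers = {
        .error_detected = foo_error_detected,
        /* .slot_reset and .resume would be filled in as well */
};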
> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days. (Thanks Intel!)
> Ditto for Emulex. For ipr, we had in-house people.
>
> So overall, PCI error detection did have the expected effect
> (protecting the kernel from corruption, e.g. due to DMAs going
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient
> effects.
>
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ...
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and
> things will keep going, and most sysadmins won't notice or
> won't care.
Yes, but at least it will be logged, so we'll spot a lot of these during
our tests.
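
(And for the reset side, again only a sketch with the same made-up names
as above: the dev_* log lines are the only trace the 3-second burp leaves
behind, which is exactly what we'd grep for during tests.)

/* Sketch only: the reset/resume half of the same hypothetical driver.
 * After the slot reset the device is brought back up and traffic
 * resumes; the log lines are all a tester gets to notice. */
static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
        if (pci_enable_device(pdev)) {
                dev_err(&pdev->dev, "recovery failed, giving up on device\n");
                return PCI_ERS_RESULT_DISCONNECT;
        }
        pci_set_master(pdev);
        /* ...reprogram device registers, rebuild rings, etc... */
        return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
        dev_warn(&pdev->dev, "recovered from PCI channel failure\n");
        /* restart queues and re-enable interrupts here */
}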
Ben.