PCI Error Recovery API Proposal. (WAS: [PATCH/RFC] PCI ErrorRecovery)

Thu Mar 17 14:51:49 EST 2005

Nguyen, Tom L writes:

> We need some PCI
> based error flows to understand the details of the flow so we can
> develop an interface compatible with both.

Here is a basic outline of what happens with EEH (Enhanced Error
Handling) on IBM PPC64 platforms.  This applies to PCI, PCI-X and
PCI-Express devices.

We have a PCI-PCI bridge per slot.  The bridge (and the PCI fabric
generally) looks for errors such as address parity errors,
out-of-bounds DMA accesses by the device, or anything that would
normally cause SERR to be set.  If such an error occurs, the bridge
immediately isolates the device, meaning that writes by the CPU to the
device are discarded, reads by the CPU are returned with all 1s data,
and DMA accesses by the device are blocked.
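
As a rough illustration of the isolation behaviour just described,
here is a small C model.  The struct and function names are made up
for this sketch; they are not kernel or firmware interfaces, and a
single register stands in for the whole device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one slot behind its PCI-PCI bridge. */
struct slot_model {
    bool isolated;   /* set once the bridge has detected an error */
    uint32_t reg;    /* one device register, for illustration only */
};

/* The bridge has seen an error and isolates the device. */
static void slot_isolate(struct slot_model *s)
{
    assert(s != NULL);
    s->isolated = true;
}

/* CPU write: silently discarded while the slot is isolated. */
static void slot_write32(struct slot_model *s, uint32_t val)
{
    if (!s->isolated)
        s->reg = val;
}

/* CPU read: returns all 1s while the slot is isolated. */
static uint32_t slot_read32(const struct slot_model *s)
{
    return s->isolated ? UINT32_MAX : s->reg;
}

/* DMA initiated by the device: blocked while the slot is isolated. */
static bool slot_dma_allowed(const struct slot_model *s)
{
    return !s->isolated;
}
```

Note that a read of all 1s is only a hint: a healthy device can
legitimately return 0xFFFFFFFF, so something else has to confirm that
the slot is really isolated.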

What happens at the driver level depends on whether the driver is
EEH-aware or not.  (This describes what we would like to have, which
is not necessarily what is implemented at present.)

If the driver is not EEH-aware but is hot-plug capable, then the
platform code will notice that reads from the device are returning all
1s and query firmware about the state of the slot.  Firmware will
indicate that the slot has been isolated.  Platform code can obtain
more specific information about the error from firmware and log it.
Then, platform code will generate a hot-unplug event for the slot.
After the driver has cleaned up and notified higher levels that its
device has gone away, platform code will call firmware to reset and
unisolate the slot, and then generate a hot-plug event to tell the
driver that it can use the device - but as far as the driver is
concerned, it is a new device.
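
The sequence for a driver that is not EEH-aware can be sketched as
follows.  This is only a model of the ordering of the steps: the enum,
the trace structure and recover_unaware() are hypothetical names, and
the firmware's answer is passed in as a plain flag rather than
obtained through any real firmware call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Steps taken by platform code, in the order described above. */
enum step {
    STEP_QUERY_FW,    /* ask firmware about the state of the slot */
    STEP_LOG_ERROR,   /* fetch and log the detailed error information */
    STEP_HOT_UNPLUG,  /* driver cleans up; device "goes away" */
    STEP_RESET_SLOT,  /* firmware resets the slot */
    STEP_UNISOLATE,   /* firmware unisolates the slot */
    STEP_HOT_PLUG     /* driver sees what looks like a new device */
};

struct recovery_trace {
    enum step steps[8];
    size_t count;
};

static void trace_add(struct recovery_trace *t, enum step s)
{
    assert(t != NULL);
    if (t->count < sizeof(t->steps) / sizeof(t->steps[0]))
        t->steps[t->count++] = s;
}

/*
 * read_val is a value platform code read from the device;
 * fw_says_isolated stands in for the firmware's reply.
 * Returns true if a full unplug/reset/replug cycle was run.
 */
static bool recover_unaware(uint32_t read_val, bool fw_says_isolated,
                            struct recovery_trace *t)
{
    t->count = 0;
    if (read_val != UINT32_MAX)
        return false;            /* nothing suspicious */
    trace_add(t, STEP_QUERY_FW);
    if (!fw_says_isolated)
        return false;            /* all 1s was genuine data */
    trace_add(t, STEP_LOG_ERROR);
    trace_add(t, STEP_HOT_UNPLUG);
    trace_add(t, STEP_RESET_SLOT);
    trace_add(t, STEP_UNISOLATE);
    trace_add(t, STEP_HOT_PLUG);
    return true;
}
```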

If the driver is EEH-aware, then we use the API that Ben has
proposed.  Platform code can either reset the slot (by calling
firmware) or not, depending on what the driver asks for, and also
depending on any other information the platform code has available to
it, such as specific information about the error that has occurred.
Platform code then unisolates the slot and informs the driver
that it can reinitialize the device and restart any transfers that
were in progress.
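
A minimal sketch of that decision, assuming the driver's wish and the
platform's own view of the error are each reduced to a single value.
The names below are invented for illustration and are not taken from
Ben's proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* What the (EEH-aware) driver asks for; hypothetical encoding. */
enum driver_request { DRV_NO_RESET, DRV_WANT_RESET };

struct aware_result {
    bool slot_reset;       /* firmware was asked to reset the slot */
    bool slot_unisolated;  /* slot was unisolated afterwards */
    bool driver_resumed;   /* driver told to reinitialize and restart */
};

/*
 * platform_requires_reset models "any other information the platform
 * code has available", e.g. details of the error that occurred.
 */
static struct aware_result recover_aware(enum driver_request req,
                                         bool platform_requires_reset)
{
    struct aware_result r = { false, false, false };

    assert(req == DRV_NO_RESET || req == DRV_WANT_RESET);

    /* Reset if either the driver or the platform wants it. */
    if (req == DRV_WANT_RESET || platform_requires_reset)
        r.slot_reset = true;

    /* In both cases the slot is unisolated, then the driver resumes. */
    r.slot_unisolated = true;
    r.driver_resumed = true;
    return r;
}
```

The point of the sketch is that the driver's request is an input to
the decision, not the whole decision.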

Ben's API is aimed at supporting the code flows that we need for EEH
as well as those needed for recovery from errors on PCI Express.  Part
of the reason for not just requiring the driver to do everything
itself is that a slot isolation event can affect multiple drivers,
because the card in the slot could have a PCI-PCI bridge with multiple
devices behind it.  Thus the recovery process potentially requires a
degree of coordination between multiple drivers, and Ben's API
addresses that.  The same coordination could be required on PCI
Express, if I understand correctly, because a fault on an upstream
link could affect many devices downstream of that link.
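
To make the coordination problem concrete, here is one possible way
of combining the answers of several drivers that share an isolated
slot: take the most severe request.  This policy and the names used
are assumptions made for the example, not a description of what Ben's
API does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-driver answers, ordered from least to most severe. */
enum vote {
    VOTE_CAN_RECOVER,  /* driver can carry on without a slot reset */
    VOTE_NEED_RESET,   /* driver needs the slot to be reset */
    VOTE_DISCONNECT    /* driver cannot recover its device at all */
};

/*
 * Combine the votes of all drivers behind one slot.  With no drivers
 * at all, the result is VOTE_CAN_RECOVER.
 */
static enum vote merge_votes(const enum vote *votes, size_t n)
{
    enum vote worst = VOTE_CAN_RECOVER;
    size_t i;

    assert(votes != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (votes[i] > worst)
            worst = votes[i];
    }
    return worst;
}
```

With two devices behind a bridge on the card, one driver voting
VOTE_CAN_RECOVER and the other VOTE_NEED_RESET, the slot as a whole
would be reset and both drivers would have to reinitialize.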

Paul.