PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)
Nguyen, Tom L
tom.l.nguyen at intel.com
Fri Mar 18 05:53:46 EST 2005
On Wednesday, March 16, 2005 7:20 PM Benjamin Herrenschmidt wrote:
>> What mechanism (message??) is used to perform the bus and/or link
>> level reset? For PCI Express the reset is performed by the upstream
>> port driver. My API takes this into account. Are you assuming the
>> device on the bus does the reset or will there be a PCI bus driver
>> will do the reset? How will the PCI error handling code initiate a
>The "caller", that is the error management framework. I'm defining the
>API at the driver level, not the implementation at the core level.
>For example, on IBM pSeries with PCI-Express, we will probably not have
>an AER driver. This will be all dealt by the firmware which will mimmic
>that to the existing EEH error management. We'll have the same API to
>the reset that we have today for resetting a slot.
We decided to implement PCI Express error handling based on the PCI
Express specification in a platform independent manner. This allows any
platform that implements PCI Express AER per the PCI SIG specification
can take advantage of the advanced features, much like SHPC hot-plug or
PCI Express hot-plug implementations.
>You may have noticed in general that I didn't either define who is
>callign those callbacks. It's all implicit that this is done by
>error management code. For example, on ppc64, even the recovery step
>requires action from the platform since the slot has been physically
>isolated. After we have notified all drivers with the "error detected"
>callback, if we decide we can try the "recover" step (all drivers
>returned they could try it and we decided the error wasn't too fatal)
>will call the firmware to re-enable IOs on the slot and call the
For PCI Express the endpoint device driver can take recovery action on
its own, depending on the nature of the error so long as it does not
affect the upstream device. This can include endpoint device resets.
We expect the driver to do this upon error notification, if possible.
In PCI Express since the driver will have the most knowledge regarding
the error it will have the best ability to do device dependent recovery
and IO retry. If its recovery fails then the AER driver will ask the
upstream device driver to perform the link reset. Since this is more of
a side effect an explicit call to recover is not necessary. However, we
understand and agree that it is needed to support the general error
recovery cases for PCI.
To support the AER driver calling an upstream device to initiate a reset
of the link we need a specific callback since the driver doing the reset
is not the driver who got the error. In the case of general PCI this
could be useful if a PCI bus driver were available to support the
callback for a bridge device. This would also support specific error
recovery calls to reset an endpoint adapter. We need a call to request
a driver to perform a reset on a link or device.
More information about the Linuxppc64-dev