PCI Error Recovery API Proposal. (WAS: [PATCH/RFC] PCI ErrorRecovery)

Nguyen, Tom L tom.l.nguyen at intel.com
Thu Mar 17 09:55:28 EST 2005


Tuesday, March 15, 2005 3:28 PM Benjamin Herrenschmidt wrote:
> Please, look at my mail describing a different interface. I think your
> model can fit. However, one thing you need to do is what I call
> "synchronous error detection" as well.

It seems we need to define the roles of the error driver (for PCI
Express, what I am calling the AER driver) and the device driver (e.g.
the driver for a PCI SCSI card).  See the PCI Express HOW-TO for how I
defined the roles.  Can you provide insight into the roles for PCI?

We also need some scenarios to walk through that would validate any API
usage model.  We have gone through some to define the PCI Express AER
interface, but they were PCI Express specific.  We need some PCI-based
error flows to understand the details so that we can develop an
interface compatible with both.

Some specific comments regarding the API proposed as it relates to PCI
Express:

1. With message = PCIERR_ERROR_DETECTED
PCI Express has extensive error reporting information and requires more
than just an "int message" to report all the information. The
error_handler interface allows us to notify the driver but does not
allow the driver to report comprehensive error information that the
driver can gather from its device. PCI Express inherently builds in the
severity of the errors so that a query of the device regarding
recoverability/fatality is built into the error data the device sends to
the AER driver. The AER driver then uses this data to guide the error
recovery, because it owns error recovery communication with the
hierarchy in question.
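
To make the point concrete, here is a minimal sketch of an error report
that carries more than an "int message", and of an AER driver choosing a
recovery path from the severity embedded in it.  Every type and function
name below is hypothetical and is not taken from either proposal; the
severity-to-action mapping is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three PCI Express error classes. Names are hypothetical. */
enum pcie_err_severity {
    PCIE_ERR_CORRECTABLE,   /* corrected by hardware */
    PCIE_ERR_NONFATAL,      /* uncorrectable, link still usable */
    PCIE_ERR_FATAL,         /* uncorrectable, link unreliable */
};

/* Hypothetical report handed to the AER driver. */
struct pcie_err_report {
    uint16_t source_id;               /* bus/dev/fn of the reporter */
    enum pcie_err_severity severity;  /* embedded by the device */
    uint32_t status;                  /* raw error status bits */
};

/* Hypothetical recovery paths the AER driver can take. */
enum recovery_action {
    RECOVERY_NONE,           /* log only */
    RECOVERY_NOTIFY_DRIVER,  /* let the device driver recover its HW */
    RECOVERY_RESET_LINK,     /* ask the upstream port to reset the link */
};

/* Assumed policy: the severity alone selects the recovery path. */
enum recovery_action aer_choose_action(const struct pcie_err_report *r)
{
    assert(r != NULL);
    switch (r->severity) {
    case PCIE_ERR_CORRECTABLE:
        return RECOVERY_NONE;
    case PCIE_ERR_NONFATAL:
        return RECOVERY_NOTIFY_DRIVER;
    case PCIE_ERR_FATAL:
        return RECOVERY_RESET_LINK;
    }
    return RECOVERY_RESET_LINK;  /* unknown severity: be conservative */
}
```

The design point is only that the recoverability query does not need a
separate round trip to the driver when the report already carries the
severity.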

2. With message = PCIERR_ERROR_RECOVER
In PCI Express the device driver programs the severity of the errors for
its device.  This programming allows the error information to embed the
recoverability of the error in the error message information forwarded
to the AER driver from the device.  The assumption is that, for every
error, once the AER driver notifies the device driver, the device driver
takes whatever recovery actions on its HW device it deems appropriate
for the severity of the error.  However, the device driver cannot take
any action that affects the bus/link interface.  Under PCI Express this
is prevented because devices cannot reset upstream links.
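
As a rough sketch of "the device driver programs the severity", assume
one bit per uncorrectable error type, where a set bit means the error is
reported as fatal and a clear bit means non-fatal.  The struct below is
a software stand-in for that per-device setting, not real register
access, and all names and bit numbers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device severity setting:
 * bit set = report as fatal, bit clear = report as non-fatal. */
struct err_severity_cfg {
    uint32_t fatal_mask;
};

/* Device driver chooses how one error type (by bit number) is reported. */
void drv_set_severity(struct err_severity_cfg *cfg, unsigned int bit,
                      bool fatal)
{
    assert(cfg != NULL && bit < 32);
    if (fatal)
        cfg->fatal_mask |= (uint32_t)1 << bit;
    else
        cfg->fatal_mask &= ~((uint32_t)1 << bit);
}

/* What the AER driver would see for that error type. */
bool err_reported_fatal(const struct err_severity_cfg *cfg,
                        unsigned int bit)
{
    assert(cfg != NULL && bit < 32);
    return ((cfg->fatal_mask >> bit) & 1u) != 0;
}
```

For example, after drv_set_severity(&cfg, 4, true), the error type at
bit 4 is forwarded as fatal while all other types stay non-fatal.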

I need a better understanding of how PCI works in these scenarios so
that we can come to a common API.  Can you provide some common PCI error
flows, along with the capabilities a device driver may have regarding
error recovery?

3. With message = PCIERR_ERROR_RESTART
This is not necessary for PCI Express because it is a point-to-point
protocol.  However, we can overload it so that PCI Express uses it as a
mechanism to communicate with all downstream devices affected by an
upstream link error/reset.
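
A minimal sketch of that overloaded use, assuming the hierarchy below a
port is kept as a simple child/sibling tree: after an upstream link
error/reset, every device below the port gets the restart notification.
The node layout and function name are hypothetical, and the "notified"
flag stands in for calling each driver's handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node in the hierarchy below a port. */
struct pci_node {
    struct pci_node *child;    /* first device below this one */
    struct pci_node *sibling;  /* next device at the same level */
    int notified;              /* stand-in for calling the handler */
};

/* Notify every device below 'port' (not the port itself).
 * Returns the number of devices notified. */
int notify_downstream(struct pci_node *port)
{
    int count = 0;

    assert(port != NULL);
    for (struct pci_node *n = port->child; n != NULL; n = n->sibling) {
        n->notified = 1;                /* deliver the restart message */
        count++;
        count += notify_downstream(n);  /* descend below switches */
    }
    return count;
}
```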

4. What mechanism (message??) is used to perform the bus and/or link
level reset?  For PCI Express the reset is performed by the upstream
port driver.  My API takes this into account.  Are you assuming the PCI
device on the bus does the reset or will there be a PCI bus driver that
will do the reset?  How will the PCI error handling code initiate a
reset?

5. Is the infrastructure being proposed for device level errors, PCI bus
level errors, or PCI Express link level errors? The connection between
the devices is what the errors are about.  We think that the focus
should be bus/link level errors.  

Do I misunderstand the usage model of this API somehow?

Thanks,
Long


