PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery)

Thu Mar 17 14:20:23 EST 2005

On Wed, 2005-03-16 at 14:55 -0800, Nguyen, Tom L wrote:

> 1. With message = PCIERR_ERROR_DETECTED
> PCI Express has extensive error reporting information and requires more
> than just an "int message" to report all the information. The
> error_handler interface allows us to notify the driver but does not
> allow the driver to report comprehensive error information that the
> driver can gather from its device. PCI Express inherently builds in the
> severity of the errors so that a query of the device regarding
> recoverability/fatality is built into the error data the device sends to
> the AER driver. The AER driver then uses this data to guide the error
> recovery, because it owns error recovery communication with the
> hierarchy in question.

As I explained already, I feel that this additional error information
should be requested explicitly, but that isn't a strong feeling; we
could perfectly well add an io error token (an opaque error
representation) to this callback. Since it seems that several callbacks
are preferred over a switch/case on a message anyway, the message
argument will be gone, and that specific callback will take an io error
token.
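
A minimal sketch of what "several callbacks plus an io error token"
could look like. Everything here is an assumption for illustration:
the struct and callback names, the stand-in device type, and every
result code except PCIERR_RESULT_NEED_RESET, which is the only one
named in this thread.

```c
#include <stddef.h>

/* Opaque error representation; its layout stays private to the platform. */
struct pci_io_error_token;

/* Hypothetical result codes; only PCIERR_RESULT_NEED_RESET appears in the thread. */
enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER,
    PCIERR_RESULT_NEED_RESET,
    PCIERR_RESULT_RECOVERED
};

/* Stand-in for struct pci_dev so the sketch compiles on its own. */
struct pci_dev_sketch {
    int io_enabled;
};

/* One callback per step instead of a single handler with an int message. */
struct pcierr_handlers {
    enum pcierr_result (*error_detected)(struct pci_dev_sketch *dev,
                                         const struct pci_io_error_token *tok);
    enum pcierr_result (*error_recover)(struct pci_dev_sketch *dev);
    enum pcierr_result (*error_reset)(struct pci_dev_sketch *dev);
    void (*error_restart)(struct pci_dev_sketch *dev);
};

/* Example driver: stop issuing IOs, then ask for the "recover" step. */
static enum pcierr_result demo_error_detected(struct pci_dev_sketch *dev,
                                              const struct pci_io_error_token *tok)
{
    (void)tok;              /* this driver does not inspect the token */
    dev->io_enabled = 0;    /* quiesce */
    return PCIERR_RESULT_CAN_RECOVER;
}

static enum pcierr_result demo_error_recover(struct pci_dev_sketch *dev)
{
    (void)dev;              /* a real driver would probe the chip with IOs here */
    return PCIERR_RESULT_RECOVERED;
}

static enum pcierr_result demo_error_reset(struct pci_dev_sketch *dev)
{
    (void)dev;              /* a real driver would re-init the HW here */
    return PCIERR_RESULT_RECOVERED;
}

static void demo_error_restart(struct pci_dev_sketch *dev)
{
    dev->io_enabled = 1;    /* normal operation resumes */
}

const struct pcierr_handlers demo_handlers = {
    .error_detected = demo_error_detected,
    .error_recover  = demo_error_recover,
    .error_reset    = demo_error_reset,
    .error_restart  = demo_error_restart,
};
```

Only the "detected" callback receives the token, so drivers that do not
care about bus-specific detail never have to look at it.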

> 2. With message = PCIERR_ERROR_RECOVER
> In PCI Express the device driver programs the severity of the errors for
> its device.  This programming allows the error information to embed the
> recoverability of the error in the error message information forwarded
> to the AER driver from the device.   It is assumed for all errors when
> the device driver is notified by the AER driver the device driver will
> take recovery actions with its HW device as it deems appropriate based
> on the severity of the error.  However, the device driver cannot take
> any action that affects the bus/link interface.  Under PCI Express this
> is prevented because devices cannot reset upstream links.
> 
> I need a better understanding of how PCI works in these scenarios so
> that we can come to a common API.  Can you provide some common PCI error
> flows with the capabilities a device driver may have regarding error
> recovery.

This is more than just how PCI works. It's also how IBM's EEH works, and
others. I'm trying to set up a model in which we can "fit" everybody.
Basically, if you already know that no recovery is possible just by
doing IOs to the chip, but that a card reset will be possible, you can
go directly to the PCIERR_ERROR_RESET step. Remember that we are trying
to provide a driver-side API that isolates drivers from the underlying
mechanisms and policies.

If the driver knows (thanks to PCI Express stuff) that the error was
fatal (because it knows it's on pci-express and extracted the relevant
information out of the error token), then it could just return
PCIERR_RESULT_NEED_RESET right away from the PCIERR_ERROR_DETECTED
callback. It's up to the platform then to just give up if it decides it
can't bring the link back, or to reset everything and then call
PCIERR_ERROR_RESET.
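
As a rough sketch of that shortcut: the token layout and the helper
pcierr_token_is_fatal() below are hypothetical stand-ins (a real token
would be opaque and queried through platform accessors), and
PCIERR_RESULT_CAN_RECOVER is an assumed name. Only
PCIERR_RESULT_NEED_RESET comes from this thread.

```c
#include <stdbool.h>
#include <stddef.h>

enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER,   /* hypothetical name */
    PCIERR_RESULT_NEED_RESET
};

/* Stand-in layout for the opaque io error token (assumption). */
struct pci_io_error_token {
    bool is_pci_express;   /* error was reported through PCI Express */
    bool fatal;            /* severity carried in the error data */
};

/* Hypothetical accessor: true only if the token says "PCI Express, fatal". */
static bool pcierr_token_is_fatal(const struct pci_io_error_token *tok)
{
    return tok != NULL && tok->is_pci_express && tok->fatal;
}

/* "Detected" callback of an example driver. */
enum pcierr_result demo_error_detected(const struct pci_io_error_token *tok)
{
    /* A real driver would quiesce its device here before deciding. */
    if (pcierr_token_is_fatal(tok))
        return PCIERR_RESULT_NEED_RESET;   /* skip the "recover" step */
    return PCIERR_RESULT_CAN_RECOVER;      /* ask for a try with plain IOs */
}
```

A driver that gets no token, or one it cannot interpret, falls back to
asking for the "recover" step, and the platform remains free to refuse.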

Basically, my model allows drivers to deal with these 3 basic states
(and 2 recovery mechanisms):

 - report the error to the driver and quiesce it
 - if possible, give the driver a chance to recover just by issuing
   IOs (that is, the error wasn't fatal). This is a best effort; it
   can fail.
 - if possible, reset the slot/bus and let drivers re-init the HW
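
The three states above can be sketched as a small transition function
on the platform side. The PCIERR_ERROR_* step names and
PCIERR_RESULT_NEED_RESET are the ones used in this thread; the other
result codes, PCIERR_STEP_GIVE_UP, and the function itself are
assumptions made for illustration only.

```c
enum pcierr_step {
    PCIERR_ERROR_DETECTED,
    PCIERR_ERROR_RECOVER,
    PCIERR_ERROR_RESET,
    PCIERR_ERROR_RESTART,
    PCIERR_STEP_GIVE_UP          /* hypothetical: no further recovery */
};

enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER,   /* hypothetical name */
    PCIERR_RESULT_NEED_RESET,
    PCIERR_RESULT_RECOVERED      /* hypothetical name */
};

/*
 * Pick the next step from the step just completed and the driver's answer.
 * io_recoverable: platform judges that a try with plain IOs is allowed.
 * can_reset:      platform is able to reset the slot/bus.
 */
enum pcierr_step pcierr_next_step(enum pcierr_step step,
                                  enum pcierr_result result,
                                  int io_recoverable, int can_reset)
{
    switch (step) {
    case PCIERR_ERROR_DETECTED:
        /* Honour the driver's wish to recover only if the platform agrees. */
        if (result == PCIERR_RESULT_CAN_RECOVER && io_recoverable)
            return PCIERR_ERROR_RECOVER;
        /* Any other answer is handled conservatively with a reset. */
        return can_reset ? PCIERR_ERROR_RESET : PCIERR_STEP_GIVE_UP;
    case PCIERR_ERROR_RECOVER:
        /* Best effort: a failed recover falls back to a reset. */
        if (result == PCIERR_RESULT_RECOVERED)
            return PCIERR_ERROR_RESTART;
        return can_reset ? PCIERR_ERROR_RESET : PCIERR_STEP_GIVE_UP;
    case PCIERR_ERROR_RESET:
        if (result == PCIERR_RESULT_RECOVERED)
            return PCIERR_ERROR_RESTART;
        return PCIERR_STEP_GIVE_UP;
    default:
        /* Restart and give-up are terminal in this sketch. */
        return PCIERR_STEP_GIVE_UP;
    }
}
```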

I'm confident the PCI Express mechanism can fit nicely into this model,
but you are welcome to prove me wrong. In all the above cases, "if
possible" is a mix of driver and platform knowledge. That is, the
platform tries its best to serve the driver's wishes (like going to the
"recover" state when the driver says it can try to recover, based on the
result code from the "detect" callback), but may just decide it can't
and reset instead. If the error information provides the
"recoverability" information from the very beginning, then the driver
just needs to return the right code from the "detect" callback based on
it.

> 3. With message = PCIERR_ERROR_RESTART
> Not necessary for PCI Express because it is a point to point protocol.
> However, we can overload it so PCI Express uses as a mechanism to
> communicate with all downstream devices affected by an upstream link
> error/reset.

Among others... It is necessary because, again, we are defining a model
that matches everybody. So we want drivers to assume they have to wait
until everybody has been notified, even if a specific implementation
doesn't require it. On PCI Express, we could imagine just calling
restart right away after recover; that isn't a problem.

There may also be implications for interrupt handling. Restart is the
only point where interrupts are guaranteed to be properly operational
(see my note).

> What mechanism (message??) is used to perform the bus and/or link
> level reset?  For PCI Express the reset is performed by the upstream
> port driver.  My API takes this into account.  Are you assuming the PCI
> device on the bus does the reset or will there be a PCI bus driver that
> will do the reset?  How will the PCI error handling code initiate a
> reset?

The "caller", that is the error management framework. I'm defining the
API at the driver level, not the implementation at the core level.

For example, on IBM pSeries with PCI-Express, we will probably not have
an AER driver. This will all be dealt with by the firmware, which will
map it onto the existing EEH error management. We'll have the same API
to do the reset that we have today for resetting a slot.

You may have noticed that, in general, I didn't define who is calling
those callbacks either. It's all implicit that this is done by platform
error management code. For example, on ppc64, even the recovery step
requires action from the platform since the slot has been physically
isolated. After we have notified all drivers with the "error detected"
callback, if we decide we can try the "recover" step (all drivers
returned that they could try it and we decided the error wasn't too
fatal), we will call the firmware to re-enable IOs on the slot and call
the "recover" step.

If my model is implemented on top of classic PCI, the "core" would just
clear SERR/PERR or any other error reporting facility on the host bridge
and other bridges down the path if necessary and call recover. If the
platform has no slot reset capability, the core would give up in any
non-recoverable case.
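
The decision the core takes once every driver has answered the "error
detected" callback could be sketched as below. The function names, the
severity ordering of the result codes, and every code other than
PCIERR_RESULT_NEED_RESET are assumptions, not part of the proposal.

```c
#include <stddef.h>

/* Ordered from mildest to most severe so that votes can be compared. */
enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER = 0,   /* hypothetical name */
    PCIERR_RESULT_NEED_RESET  = 1,
    PCIERR_RESULT_DISCONNECT  = 2    /* hypothetical: driver gives up */
};

/* Combine the answers of all drivers on the slot: the worst one wins. */
enum pcierr_result pcierr_merge_results(const enum pcierr_result *votes,
                                        size_t n)
{
    enum pcierr_result worst = PCIERR_RESULT_CAN_RECOVER;

    for (size_t i = 0; i < n; i++) {
        if (votes[i] > worst)
            worst = votes[i];
    }
    return worst;
}

/*
 * The "recover" step is tried only if every driver asked for it and the
 * platform itself judged the error not too fatal (platform_allows != 0).
 */
int pcierr_may_try_recover(const enum pcierr_result *votes, size_t n,
                           int platform_allows)
{
    return platform_allows &&
           pcierr_merge_results(votes, n) == PCIERR_RESULT_CAN_RECOVER;
}
```

On a platform without slot reset, a merged result other than "can
recover" would be the non-recoverable case in which the core gives up.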

We could provide generic implementations for basic PCI (as described
above) and maybe for PCI-Express, but the platform is the one to choose
which implementation to use and to provide possible alternative
implementations based on that platform's additional capabilities.

> 5. Is the infrastructure being proposed for device level errors, PCI bus
> level errors, or PCI Express link level errors? The connection between
> the devices is what the errors are about.  We think that the focus
> should be bus/link level errors.  

Device level and link. It might be useful to provide a generic function
to extract the level of the error from the error token, but not
mandatory.

> Do I misunderstand the usage model of this API somehow?

Maybe :)

Ben.