PCI Error Recovery API Proposal (updated)

Tue Apr 5 17:15:11 EST 2005

Hi !

I've been away for a while, but here is my latest update of the proposal.
If we all agree with it, it will go to kernel/Documentation somewhere
and we'll start implementing the ppc64 side of it.

The error recovery API support is exposed by the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. The absence of this pointer in pci_driver denotes a
"non-aware" driver; behaviour on these is platform dependent. Platforms
like ppc64 can try to simulate hotplug remove/add.

The definition of "pci_error_token" is not covered here. It is based on
Seto's work on the synchronous error detection. We still need to define
functions for extracting information out of an opaque error token. This is
separate from this API.

This structure has the form:

struct pci_error_handlers
{
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*error_recover)(struct pci_dev *dev);
	int (*error_restart)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

A driver doesn't have to implement all of these callbacks. The only mandatory
one is error_detected. If a callback is not implemented, the corresponding
feature is considered unsupported. For example, if error_recover and
error_restart (they really go together, see the description to understand why)
aren't there, then the driver is assumed not to do any direct recovery and
to require a reset. If link_reset is not implemented, the card is assumed not
to care about link resets, in which case, if recover is supported, the core
can try recover (but not slot_reset unless it really did reset the slot). If
slot_reset is not supported, link_reset can be called instead on a slot reset.
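
As a rough illustration of the rules above, here is a userspace sketch of
how a core might interpret missing callbacks. Only struct pci_error_handlers
comes from this proposal. The stand-in struct pci_dev, the int typedef for
pci_error_token, the pcierr_* helpers and the demo_* driver are assumptions
made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins (assumptions) so the sketch builds outside the kernel. */
struct pci_dev { int id; };
typedef int pci_error_token;	/* really an opaque type, not defined here */

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*error_recover)(struct pci_dev *dev);
	int (*error_restart)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

typedef int (*pcierr_cb)(struct pci_dev *dev);

/* Hypothetical helper: no handlers, or no error_detected, means "non-aware". */
bool pcierr_is_aware(const struct pci_error_handlers *h)
{
	return h != NULL && h->error_detected != NULL;
}

/* Hypothetical helper: direct recovery needs error_recover and error_restart. */
bool pcierr_can_direct_recover(const struct pci_error_handlers *h)
{
	return pcierr_is_aware(h) && h->error_recover != NULL &&
	       h->error_restart != NULL;
}

/*
 * Hypothetical helper: callback to use after a slot reset, falling back to
 * link_reset when slot_reset is not implemented. May return NULL.
 */
pcierr_cb pcierr_after_slot_reset(const struct pci_error_handlers *h)
{
	if (!pcierr_is_aware(h))
		return NULL;
	return h->slot_reset != NULL ? h->slot_reset : h->link_reset;
}

/* Demo driver implementing only error_detected and link_reset. */
static int demo_detected(struct pci_dev *dev, pci_error_token error)
{
	(void)dev;
	(void)error;
	return 0;	/* return value is irrelevant to this sketch */
}

static int demo_link_reset(struct pci_dev *dev)
{
	(void)dev;
	return 0;
}

const struct pci_error_handlers demo_handlers = {
	.error_detected = demo_detected,
	.link_reset = demo_link_reset,
};
```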

At first, the call will always be:

	1) error_detected()

	Error detected. This is sent once after an error has been detected. At
this point, the device might not be accessible anymore depending on the
platform (the slot will be isolated on ppc64). The driver may already
have "noticed" the error because of a failing IO, but this is the proper
"synchronisation point", that is, it gives the driver a chance to
clean up and wait for pending work (timers, whatever, etc...) to
complete. It can take semaphores, schedule, etc... everything but touch
the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See note about interrupts at the end of this doc.

	Result codes:
		- PCIERR_RESULT_CAN_RECOVER:
		  Return this if you think you might be able to recover
		  the HW by just banging IOs or if you want to be given
		  a chance to extract some diagnostic information (see
		  below).
		- PCIERR_RESULT_NEED_RESET:
		  Return this if you think you can't recover unless the
		  slot is reset.
		- PCIERR_RESULT_DISCONNECT:
		  Return this if you think you won't recover at all
		  (this will detach the driver ? or just leave it
		  dangling ? to be decided)
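
To make the "quiesce" idea a bit more concrete, here is a minimal userspace
sketch of an error_detected() implementation. The numeric values of the
result codes, the struct demo_priv state, the driver_data field and
demo_submit_io() are assumptions for the example, not part of the proposal.

```c
#include <stdbool.h>

/* Userspace stand-ins (assumptions) for the kernel types. */
struct pci_dev { void *driver_data; };
typedef int pci_error_token;	/* stand-in for the opaque token */

/* Result codes named in the proposal; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED,
};

/* Hypothetical per-device driver state. */
struct demo_priv {
	bool io_blocked;	/* set once an error has been reported */
	bool has_local_reset;	/* device can be reset by register writes */
	int ios_issued;
};

/* error_detected: quiesce, do not touch the device, then say what we need. */
int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
	struct demo_priv *priv = dev->driver_data;

	(void)error;			/* token contents are not defined yet */
	priv->io_blocked = true;	/* no new IOs from now on */
	/* A real driver would also wait for timers, pending work, etc. here. */

	return priv->has_local_reset ? PCIERR_RESULT_CAN_RECOVER
				     : PCIERR_RESULT_NEED_RESET;
}

/* Hypothetical IO submission path: refuses work while IO is blocked. */
int demo_submit_io(struct pci_dev *dev)
{
	struct demo_priv *priv = dev->driver_data;

	if (priv->io_blocked)
		return -1;
	priv->ios_issued++;
	return 0;
}
```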

So at this point, we have called error_detected() for all drivers
on the segment that had the error. On ppc64, the slot is isolated. What
happens now typically depends on the result from the drivers. If all
drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would
re-enable IOs on the slot (or do nothing special if the platform doesn't
isolate slots) and call 2). If not and we can reset slots, we go to 4).
If neither, we have a dead slot. If it's a hotplug slot, we might
"simulate" reset by triggering HW unplug/replug though.

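The decision described in the previous paragraph could be sketched roughly
as below. This is only one possible platform policy. The struct
pcierr_platform capabilities, the pcierr_next values, pcierr_next_step() and
the numeric values of the result codes are all assumptions, and the
still-undecided handling of PCIERR_RESULT_DISCONNECT is not modelled: any
answer other than CAN_RECOVER simply pushes the segment towards a reset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Result codes named in the proposal; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED,
};

/* Hypothetical description of what the platform can do for this slot. */
struct pcierr_platform {
	bool can_reset_slot;
	bool is_hotplug_slot;
};

/* Hypothetical names for the possible next steps. */
enum pcierr_next {
	PCIERR_NEXT_ERROR_RECOVER,	/* re-enable IOs, go to 2) */
	PCIERR_NEXT_SLOT_RESET,		/* go to 4) */
	PCIERR_NEXT_HOTPLUG_CYCLE,	/* "simulated" reset: unplug/replug */
	PCIERR_NEXT_DEAD_SLOT,
};

/* Combine the error_detected() results of all drivers on the segment. */
enum pcierr_next pcierr_next_step(const enum pcierr_result *results, size_t n,
				  const struct pcierr_platform *plat)
{
	bool all_can_recover = true;
	size_t i;

	for (i = 0; i < n; i++)
		if (results[i] != PCIERR_RESULT_CAN_RECOVER)
			all_can_recover = false;

	if (all_can_recover)
		return PCIERR_NEXT_ERROR_RECOVER;
	if (plat->can_reset_slot)
		return PCIERR_NEXT_SLOT_RESET;
	if (plat->is_hotplug_slot)
		return PCIERR_NEXT_HOTPLUG_CYCLE;
	return PCIERR_NEXT_DEAD_SLOT;
}
```
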
	2) error_recover()

	This is the "early recovery" call. IOs are allowed again, but DMA is
not (hrm... to be discussed, I prefer not), with some restrictions. This
is NOT a callback for the driver to start operations again, only to
peek/poke at the device, extract diagnostic information if any, and
possibly do things like trigger a device local reset or such things,
but not restart operations. This is sent if all drivers on a segment
agree that they can try to recover and no automatic link reset was performed
by the HW. If the platform can't just re-enable IOs without a slot reset or a
link reset, it doesn't call this callback and goes directly to 3) or 4). All IOs
should be done _synchronously_ from within this callback. Errors triggered by
them will be returned via the normal pci_check_whatever() api; no new
error_detected() callback will be issued due to an error happening here. However,
such an error might cause IOs to be re-blocked for the whole segment, and thus
invalidate the recovery that other devices on the same segment might have done,
forcing the whole segment into one of the next states, that is link reset or
slot reset.

	Result codes:
		- PCIERR_RESULT_RECOVERED
		  Return this if you think your device is fully
		  functional and think you are ready to start
		  to do your normal driver job again. There is no
		  guarantee that because you returned that, you'll be
		  allowed to actually proceed, as another driver on the
		  same segment might have failed and thus triggered a
		  slot reset on platforms that support it.

		- PCIERR_RESULT_NEED_RESET
		  Return this if you think your device is not
		  recoverable in its current state and you need a slot
		  reset to proceed.

		- PCIERR_RESULT_DISCONNECT
		  Same as above. Total failure, no recovery even after
		  reset, driver dead. (To be defined more precisely)

	3) link_reset()

	This is called after the link has been reset. This is typically a
PCI Express specific state at this point and is done whenever a non-fatal error
has been detected that can be "solved" by resetting the link. The driver is
informed here of that reset and should check if the device appears to be in
working condition. This function acts a bit like 2) error_recover(), that is
it is not supposed to restart normal driver IO operations right away, just
"probe" the device to check its recoverability status. If all is right, then
the core will call error_restart() once all drivers have ack'd link_reset().

	Result codes:
		(identical to error_recover)

	4) slot_reset()

	This is called after the slot has been hard reset (and PCI BARs
re-configured by the platform). If the platform supports PCI hotplug,
it can implement this by toggling power on the slot off/on. Drivers here
have a chance to re-initialize the hardware (re-download firmware etc...),
but drivers shouldn't restart normal IO processing operations at this point.
(see note about interrupts, they aren't guaranteed to be delivered until the
restart callback has been called). Upon success from this callback, the
platform will call error_restart() to complete the error handling and let
the driver restart normal IO request processing.

However, a driver can still return a critical failure from here in case
it just can't get its device back from reset. There is just nothing we
can do about it though. The driver will just be considered "dead" in this case.

	Result codes:
		- PCIERR_RESULT_DISCONNECT
		  Same as above.

	5) error_restart()

	This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks. That basically
tells the driver to restart activity, everything is back & running. No result
code is taken into account here. If a new error happens, it will restart
a new error handling process.
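
Putting the steps together, the sequence for the common case of a single
driver on the segment might look roughly like the userspace sketch below.
pcierr_handle_one(), the can_reset_slot flag, the trace-recording struct
pci_dev, the demo_* driver and the numeric result values are assumptions.
Step 3) link_reset and the link_reset fallback for a missing slot_reset are
left out to keep it short.

```c
#include <stddef.h>

/* Userspace stand-ins (assumptions); trace records which callbacks ran. */
struct pci_dev { char trace[8]; int len; };
typedef int pci_error_token;

/* Result codes named in the proposal; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED,
};

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*error_recover)(struct pci_dev *dev);
	int (*error_restart)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

/*
 * Hypothetical core routine for a segment holding a single driver.
 * Returns 0 if the driver was restarted, -1 if it is considered dead.
 */
int pcierr_handle_one(struct pci_dev *dev, const struct pci_error_handlers *h,
		      pci_error_token error, int can_reset_slot)
{
	int res;

	if (h == NULL || h->error_detected == NULL)
		return -1;	/* non-aware driver: platform policy, not modelled */

	res = h->error_detected(dev, error);		/* 1) */

	if (res == PCIERR_RESULT_CAN_RECOVER) {
		if (h->error_recover != NULL)
			res = h->error_recover(dev);	/* 2) */
		else
			res = PCIERR_RESULT_NEED_RESET;
	}

	if (res == PCIERR_RESULT_NEED_RESET) {
		if (!can_reset_slot || h->slot_reset == NULL)
			return -1;
		/* The platform would hard reset the slot here. */
		res = h->slot_reset(dev);		/* 4) */
	}

	if (res != PCIERR_RESULT_RECOVERED)
		return -1;

	if (h->error_restart != NULL)
		h->error_restart(dev);			/* 5) result ignored */
	return 0;
}

/* Demo driver: early recovery fails, slot reset succeeds. */
static void demo_note(struct pci_dev *dev, char c)
{
	if (dev->len < (int)sizeof(dev->trace) - 1)
		dev->trace[dev->len++] = c;
}

static int demo_detected(struct pci_dev *dev, pci_error_token error)
{
	(void)error;
	demo_note(dev, 'D');
	return PCIERR_RESULT_CAN_RECOVER;
}

static int demo_recover(struct pci_dev *dev)
{
	demo_note(dev, 'R');
	return PCIERR_RESULT_NEED_RESET;
}

static int demo_slot_reset(struct pci_dev *dev)
{
	demo_note(dev, 'S');
	return PCIERR_RESULT_RECOVERED;
}

static int demo_restart(struct pci_dev *dev)
{
	demo_note(dev, 'G');
	return 0;
}

const struct pci_error_handlers demo_handlers = {
	.error_detected = demo_detected,
	.error_recover = demo_recover,
	.error_restart = demo_restart,
	.slot_reset = demo_slot_reset,
};
```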

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability for example may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

 - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
restart callback is sent, at which point interrupts are expected to be
fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is, a
driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NONE. It's up to the platform to deal with that
condition, typically by masking the irq source during the duration of
the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal
with temporarily disabling that irq number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices
anyway :)
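
On the driver side, the second point could look something like the
userspace sketch below. The irqreturn_t stand-in mirrors the kernel's
IRQ_NONE ("not handled") and IRQ_HANDLED values. struct demo_priv, its
in_error flag and demo_interrupt() are assumptions for the example.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's irqreturn_t values. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

/* Hypothetical per-device driver state. */
struct demo_priv {
	bool in_error;		/* set in error_detected, cleared on restart */
	int irqs_handled;
};

/* Interrupt handler: never touch the device while error handling runs. */
irqreturn_t demo_interrupt(int irq, void *dev_id)
{
	struct demo_priv *priv = dev_id;

	(void)irq;
	if (priv->in_error)
		return IRQ_NONE;	/* cannot ack; platform masks the source */

	/* Normal path: a real driver would read and ack device status here. */
	priv->irqs_handled++;
	return IRQ_HANDLED;
}
```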

Ben.