PCI Error Recovery API Proposal. (WAS: [PATCH/RFC] PCI Error Recovery)

Tue Mar 15 16:32:20 EST 2005

> Is there a long-term philosophy for the Linux kernel on a question like
> this?  That is, when should changes add callbacks to structures,
> as opposed to notifier-chain based events?  The callback is a bit
> simpler, and maybe a tiny bit faster, but it's less flexible in the
> long run (e.g. anyone can listen for the events, but only device 
> drivers can get callbacks). Comments, please?

OK, let me propose what I think is a proper API that is simple enough on
the driver side; if there is complexity, it's in the platform policy.
That should cover all the needs we discussed so far:

I think we need a callback in pci_driver, as I explained all along, with
a very simple semantic:

   int (*error_handler)(struct pci_dev *dev, int message);

At first, message will be :

       1) PCIERR_ERROR_DETECTED

	Error detected. This is sent once after an error has been detected. At
this point, the device might not be accessible anymore, depending on the
platform (the slot will be isolated on ppc64). The driver may already
have "noticed" the error because of a failing IO, but this is the proper
"synchronisation point": it gives the driver a chance to clean up and to
wait for pending work (timers, etc.) to complete. It can take
semaphores, schedule, and so on; it can do everything except touch the
device. Within this function and after it returns, the driver must not
issue any new IOs. Called in task context. This is sort of a "quiesce"
point. See the note about interrupts at the end of this doc.

	Result codes:
		- PCIERR_RESULT_CAN_RECOVER:
		  Return this if you think you might be able to recover
		  the HW by just banging IOs or if you want to be given
		  a chance to extract some diagnostic information (see
		  below).
		- PCIERR_RESULT_NEED_RESET:
		  Return this if you think you can't recover unless the
		  slot is reset.
		- PCIERR_RESULT_DISCONNECT:
		  Return this if you think you won't recover at all
		  (this will detach the driver? or just leave it
		  dangling? to be decided).

So at this point, we have sent PCIERR_ERROR_DETECTED to all drivers on
the segment that had the error. On ppc64, the slot is isolated. What
happens next depends on the results from the drivers. If all drivers on
the segment/slot return PCIERR_RESULT_CAN_RECOVER, we re-enable IOs on
the slot (or do nothing special if the platform doesn't isolate slots)
and go to 2). If not, and we can reset slots, we go to 4). If we can do
neither, we have a dead slot. If it's a hotplug slot, we might
"simulate" a reset by triggering a HW unplug/replug, though.

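To make the decision above concrete, here is a rough userspace sketch,
not part of the proposal itself. The PCIERR_RESULT_* names come from
this mail; their numeric values, enum pcierr_next, the function name
pcierr_after_detected() and the can_reset_slot flag are all made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Result names are from the proposal; the numeric values are made up. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
};

/* Hypothetical names for the platform's next action. */
enum pcierr_next {
	PCIERR_NEXT_RECOVER,	/* go to 2): re-enable IOs, early recovery */
	PCIERR_NEXT_RESET,	/* go to 4): reset the slot */
	PCIERR_NEXT_DEAD,	/* neither is possible: dead slot */
};

/*
 * res[] holds one PCIERR_ERROR_DETECTED result per driver on the segment.
 * can_reset_slot is nonzero if the platform can reset the slot, or
 * simulate a reset through hotplug.
 */
static enum pcierr_next pcierr_after_detected(const enum pcierr_result *res,
					      size_t n, int can_reset_slot)
{
	size_t i;

	assert(res != NULL && n > 0);

	for (i = 0; i < n; i++) {
		if (res[i] != PCIERR_RESULT_CAN_RECOVER)
			return can_reset_slot ? PCIERR_NEXT_RESET
					      : PCIERR_NEXT_DEAD;
	}
	return PCIERR_NEXT_RECOVER;
}
```

A single driver that does not return PCIERR_RESULT_CAN_RECOVER is
enough to push the whole segment off the early recovery path.
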
	2) PCIERR_ERROR_RECOVER

	This is the "early recovery" call. IOs are allowed again, with some
restrictions, but DMA is not (hrm... to be discussed, I prefer not). This
is NOT a callback for the driver to start operations again, only to
peek/poke at the device, extract diagnostic information if any, and
possibly do things like trigger a device-local reset, but not restart
operations. This is sent if all drivers on a segment agree that they can
try to recover. If the platform can't just re-enable IOs without a slot
reset, it doesn't call this callback and goes directly to 4). All IOs
should be done _synchronously_ from within this callback. Errors
triggered by them will be returned via the normal pci_check_whatever()
API; no new PCIERR_ERROR_DETECTED callback will be issued due to an error
happening here, though such an error might cause IOs to be re-blocked
for the whole segment (thus invalidating the recovery of other devices
on the same segment).

	Result codes:
		- PCIERR_RESULT_RECOVERED
		  Return this if you think your device is fully
		  functional and you are ready to resume your normal
		  driver job. Returning this does not guarantee that
		  you'll be allowed to actually proceed: another driver
		  on the same segment might have failed and thus
		  triggered a slot reset on platforms that support it.

		- PCIERR_RESULT_NEED_RESET
		  Return this if you think your device is not
		  recoverable in its current state and you need a slot
		  reset to proceed.

		- PCIERR_RESULT_DISCONNECT
		  Same as above. Total failure: no recovery even after
		  reset, the driver is dead. (To be defined more precisely)

	3) PCIERR_ERROR_RESTART

	This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from the previous callback. That basically tells
the driver to restart activity: everything is back up and running. No
result code is taken into account here. If a new error happens, it will
start a new error handling process.

	4) PCIERR_ERROR_RESET

	This is called after the slot has been reset (and PCI BARs
re-configured by the platform). As for PCIERR_ERROR_RESTART, drivers
here are just supposed to re-init the hardware and restart operations.
However, a driver can still return a critical failure from here in case
it just can't get its device back from reset. There is just nothing we
can do about it, though.

	Result codes:
		- PCIERR_RESULT_DISCONNECT
		  Same as above.
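
For illustration only, the driver side of the four messages could look
roughly like the sketch below. It is a userspace mock-up, not kernel
code: struct pci_dev is a stand-in, struct foo_priv and its fields are
invented, and the numeric values of the PCIERR_* names are arbitrary.
The proposal does not say what a driver returns from PCIERR_ERROR_RESET
on success, so returning PCIERR_RESULT_RECOVERED there is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Message and result names come from the proposal; values are made up. */
enum pcierr_message {
	PCIERR_ERROR_DETECTED,
	PCIERR_ERROR_RECOVER,
	PCIERR_ERROR_RESTART,
	PCIERR_ERROR_RESET,
};

enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER,
	PCIERR_RESULT_RECOVERED,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
};

/* Stand-in for the kernel's struct pci_dev, reduced to driver data. */
struct pci_dev {
	void *driver_data;
};

/* Hypothetical private state of an example driver. */
struct foo_priv {
	int io_stopped;	/* no new IOs may be issued while set */
	int hw_ok;	/* stands in for the driver's own health check */
};

static int foo_error_handler(struct pci_dev *dev, int message)
{
	struct foo_priv *priv;

	assert(dev != NULL && dev->driver_data != NULL);
	priv = dev->driver_data;

	switch (message) {
	case PCIERR_ERROR_DETECTED:
		/* Quiesce: stop issuing IOs, do not touch the device. */
		priv->io_stopped = 1;
		return PCIERR_RESULT_CAN_RECOVER;
	case PCIERR_ERROR_RECOVER:
		/* Peek at the device only; operations stay stopped. */
		return priv->hw_ok ? PCIERR_RESULT_RECOVERED
				   : PCIERR_RESULT_NEED_RESET;
	case PCIERR_ERROR_RESTART:
		/* Everything is back: resume normal operation. */
		priv->io_stopped = 0;
		return PCIERR_RESULT_RECOVERED;	/* result is ignored */
	case PCIERR_ERROR_RESET:
		/* Slot was reset: re-init, restart if the device is back. */
		if (!priv->hw_ok)
			return PCIERR_RESULT_DISCONNECT;
		priv->io_stopped = 0;
		return PCIERR_RESULT_RECOVERED;	/* assumed success code */
	default:
		return PCIERR_RESULT_DISCONNECT;
	}
}
```
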

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability for example may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.
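
As one example of such a policy, the sketch below walks a segment
through the whole sequence. It is only a userspace model under
assumptions: struct seg_entry, pcierr_broadcast(),
pcierr_handle_segment() and the two toy handlers are invented names,
the PCIERR_* values are arbitrary, and the actual slot reset is left as
a comment. A platform without slot reset could just as well disconnect
the failing drivers instead of giving up on the segment.

```c
#include <assert.h>
#include <stddef.h>

/* Names are from the proposal; the numeric values are made up. */
enum { PCIERR_ERROR_DETECTED, PCIERR_ERROR_RECOVER,
       PCIERR_ERROR_RESTART, PCIERR_ERROR_RESET };
enum { PCIERR_RESULT_CAN_RECOVER, PCIERR_RESULT_RECOVERED,
       PCIERR_RESULT_NEED_RESET, PCIERR_RESULT_DISCONNECT };

struct pci_dev;	/* opaque in this sketch */

/* Hypothetical: one device on the segment plus its driver's callback. */
struct seg_entry {
	struct pci_dev *dev;
	int (*error_handler)(struct pci_dev *dev, int message);
};

/* Send one message to every driver; count how many replied with 'code'. */
static size_t pcierr_broadcast(const struct seg_entry *e, size_t n,
			       int message, int code)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (e[i].error_handler(e[i].dev, message) == code)
			hits++;
	return hits;
}

/* Returns 1 if the segment is back in service, 0 if it is dead. */
static int pcierr_handle_segment(const struct seg_entry *e, size_t n,
				 int can_reset_slot)
{
	assert(e != NULL && n > 0);

	if (pcierr_broadcast(e, n, PCIERR_ERROR_DETECTED,
			     PCIERR_RESULT_CAN_RECOVER) == n &&
	    pcierr_broadcast(e, n, PCIERR_ERROR_RECOVER,
			     PCIERR_RESULT_RECOVERED) == n) {
		/* Result codes are ignored for the restart message. */
		pcierr_broadcast(e, n, PCIERR_ERROR_RESTART, 0);
		return 1;
	}
	if (!can_reset_slot)
		return 0;
	/* A real platform would reset the slot and restore BARs here. */
	return pcierr_broadcast(e, n, PCIERR_ERROR_RESET,
				PCIERR_RESULT_DISCONNECT) == 0;
}

/* Two toy drivers for trying the sequence out. */
static int toy_ok_handler(struct pci_dev *dev, int message)
{
	(void)dev;
	return message == PCIERR_ERROR_DETECTED ? PCIERR_RESULT_CAN_RECOVER
						: PCIERR_RESULT_RECOVERED;
}

static int toy_reset_handler(struct pci_dev *dev, int message)
{
	(void)dev;
	return message == PCIERR_ERROR_RESET ? PCIERR_RESULT_RECOVERED
					     : PCIERR_RESULT_NEED_RESET;
}
```
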

Now, there is a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

 - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
restart callback is sent, at which point interrupts are expected to be
fully operational.

 - There is no guarantee that interrupt delivery is stopped. That is, a
driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NONE ("not handled"). It's up to the platform to deal with
that condition, typically by masking the irq source for the duration of
the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal
with temporarily disabling that irq number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices
anyway :) 
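
A driver's interrupt handler following the second rule might be shaped
like the sketch below. Again this is a userspace mock-up under
assumptions: irqreturn_t is redefined locally (IRQ_NONE is the kernel's
"not handled" return value), struct foo_priv and the FOO_STATUS_* bits
are invented, the status field stands in for a register read, and
treating an all-ones read as "device gone" is just one possible check.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's irqreturn_t values. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

#define FOO_STATUS_IRQ	0x1u		/* hypothetical "irq pending" bit */
#define FOO_STATUS_DEAD	0xffffffffu	/* all-ones read: device is gone */

/* Hypothetical driver state. */
struct foo_priv {
	int error_detected;	/* set on error, cleared on restart */
	unsigned int status;	/* stands in for a status register read */
};

static irqreturn_t foo_interrupt(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;

	(void)irq;
	assert(priv != NULL);

	/* Error handling in progress: do not touch the device. */
	if (priv->error_detected)
		return IRQ_NONE;

	/* Error noticed from within the handler: cannot ack the source. */
	if (priv->status == FOO_STATUS_DEAD) {
		priv->error_detected = 1;
		return IRQ_NONE;
	}

	/* Shared line, and this device did not raise the interrupt. */
	if (!(priv->status & FOO_STATUS_IRQ))
		return IRQ_NONE;

	priv->status &= ~FOO_STATUS_IRQ;	/* ack the interrupt */
	return IRQ_HANDLED;
}
```

In both error cases the handler leaves the source un-acked, which is
exactly why the platform has to mask the irq during error handling.
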

Comments welcome. Linas, I'll try coding something up in the coming
days unless you beat me to it.

Ben.