[PATCH/RFC] PCI Error Recovery

Mon Mar 14 23:33:03 EST 2005

Linas Vepstas wrote:
> "enum pci_device_io_state"; BenH was suggesting having
> more of these ... BenH do you want to propose a "final list"?
> 
(snip)
> +/* ---------------------------------------------------------------- */
> +/** PCI error recovery state.  Whenever the PCI bus state changes,
> + *  the io_state_change() callback will be called to notify the 
> + *  device driver of state changes.
> + */
> +
> +enum pci_device_io_state {
> +	pci_device_io_frozen = 1, /* I/O to device is blocked */
> +	pci_device_io_thawed,     /* I/O to device is (re-)enabled */
> +	pci_device_io_perm_failure, /* pci card is dead */
> +};

I'm not BenH, but I think it's valuable to have a list of states.
(Though it seems that what you originally wanted isn't a "state list"
  but an "event list".)

IMHO, going by the current list, there will be at least 3 states:

  - NORMAL:
      Standard, usual, healthy state.
      Strictly speaking, this doesn't mean "everything works well."
      IOW it may be unreliable: "works but occasionally fails."
      You can access the device, but checking the result is recommended.
  - ISOLATED:
      Physically connected, but accesses are temporarily blocked.
      The device may be unstable but is believed to be recoverable.
      Error info on the platform or device may be inaccessible.
      The system can attempt recovery, i.e. change the state back to NORMAL.
  - DEAD:
      Physically connected, but accesses are permanently blocked.
      No further recovery attempt is required.
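
To make the three states concrete, here is a minimal sketch in plain C.
All identifiers are hypothetical (they are not from the patch), and the
ISOLATED -> DEAD transition is my assumption for the case where recovery
is given up:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the three states come from the list above. */
enum pci_io_state {
	PCI_IO_NORMAL,    /* accessible, but results should still be checked */
	PCI_IO_ISOLATED,  /* accesses temporarily blocked, recovery possible */
	PCI_IO_DEAD,      /* accesses permanently blocked */
	PCI_IO_NR_STATES
};

/*
 * Return true if a state change from 'from' to 'to' is allowed.
 * ISOLATED -> DEAD (recovery gave up) is an assumption of this sketch.
 * Staying in the same state is not counted as a state change.
 */
static inline bool pci_io_state_may_change(enum pci_io_state from,
					   enum pci_io_state to)
{
	assert(from < PCI_IO_NR_STATES && to < PCI_IO_NR_STATES);

	switch (from) {
	case PCI_IO_NORMAL:
		return to == PCI_IO_ISOLATED || to == PCI_IO_DEAD;
	case PCI_IO_ISOLATED:
		return to == PCI_IO_NORMAL || to == PCI_IO_DEAD;
	case PCI_IO_DEAD:
	default:
		return false; /* DEAD is final */
	}
}
```

The point is only that DEAD has no way out, while ISOLATED can go either way.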

How many other states will there be?

And I guess you would need at least 3 types of event:

  - ERROR_DETECTED:
      An error was detected.
      The notified driver can test the device, collect advanced/extra error
      info, and log it.
  - STATE_CHANGED:
      The I/O state has changed.
      The new state is passed as a parameter of this event.
  - TRY_RECOVER:
      The OS asks drivers to perform any device-specific recovery they can.
      After gathering all results, the OS decides whether recovery succeeded.
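
A rough sketch of how these events and their parameters could be
represented. Again the names are hypothetical, and the single struct with
one field per payload is just one possible layout, not a proposal for the
final interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the states discussed earlier. */
enum pci_io_state { PCI_IO_NORMAL, PCI_IO_ISOLATED, PCI_IO_DEAD };

/* Hypothetical names for the three event types listed above. */
enum pci_error_event_type {
	PCI_EV_ERROR_DETECTED,
	PCI_EV_STATE_CHANGED,
	PCI_EV_TRY_RECOVER
};

struct pci_error_event {
	enum pci_error_event_type type;
	enum pci_io_state new_state; /* meaningful for STATE_CHANGED only */
	void *data;                  /* meaningful for TRY_RECOVER only */
};

/* An error was seen; no state change is implied. */
static inline struct pci_error_event pci_ev_error_detected(void)
{
	struct pci_error_event ev = { PCI_EV_ERROR_DETECTED, PCI_IO_NORMAL, NULL };
	return ev;
}

/* The I/O state changed; the new state rides along as the parameter. */
static inline struct pci_error_event pci_ev_state_changed(enum pci_io_state s)
{
	struct pci_error_event ev = { PCI_EV_STATE_CHANGED, s, NULL };

	assert(s == PCI_IO_NORMAL || s == PCI_IO_ISOLATED || s == PCI_IO_DEAD);
	return ev;
}

/* Ask the driver to try device-specific recovery with opaque 'data'. */
static inline struct pci_error_event pci_ev_try_recover(void *data)
{
	struct pci_error_event ev = { PCI_EV_TRY_RECOVER, PCI_IO_ISOLATED, data };
	return ev;
}
```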

Depending on the arch's facilities and implementation, system behavior varies
enormously. For example, if we get an error while in the NORMAL state:

case 1) NORMAL -> NORMAL
   The state doesn't change. The error is reported by some kind of exception,
   read() returns broken (or poisoned) data, and write() is ignored.
   Even if subsequent I/O also fails, we can continue to access the device.
   # ex. ia32
case 2) NORMAL -> ISOLATED/DEAD
   Even if it was a temporary soft error, the system isolates the affected bus
   and devices. All subsequent I/O is blocked (or poisoned/ignored).
   # ex. ppc64
case 3) System reset
   Even if it was a temporary soft error, the system reboots immediately.
   All subsequent/pending I/O is discarded.
   # ex. ia64 (too sensitive... so that's what I'm working on now :-p)

Therefore, you will (case 1) get a lot of ERROR_DETECTED events,
or (case 2) get a STATE_CHANGED event with a param indicating "ISOLATED",
or (case 3) get nothing. Again, most archs currently don't use any state
other than NORMAL...
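
The three cases can be put side by side in a small sketch. The
arch_error_behavior enum and events_after_error() are made up for
illustration; a real arch would of course not select its behavior via a
parameter like this:

```c
#include <assert.h>
#include <stddef.h>

enum pci_io_state { PCI_IO_NORMAL, PCI_IO_ISOLATED, PCI_IO_DEAD };

enum pci_error_event_type {
	PCI_EV_ERROR_DETECTED,
	PCI_EV_STATE_CHANGED,
	PCI_EV_TRY_RECOVER
};

struct pci_error_event {
	enum pci_error_event_type type;
	enum pci_io_state new_state; /* meaningful for STATE_CHANGED only */
};

/* Hypothetical: how an arch reacts to an error seen in NORMAL state. */
enum arch_error_behavior {
	ARCH_STAYS_NORMAL, /* case 1, e.g. ia32  */
	ARCH_ISOLATES,     /* case 2, e.g. ppc64 */
	ARCH_RESETS        /* case 3, e.g. ia64  */
};

/*
 * Fill *ev with the event a driver would see right after one error.
 * Returns the number of events delivered (0 or 1).
 */
static inline int events_after_error(enum arch_error_behavior arch,
				     struct pci_error_event *ev)
{
	assert(ev != NULL);

	switch (arch) {
	case ARCH_STAYS_NORMAL:
		ev->type = PCI_EV_ERROR_DETECTED;
		ev->new_state = PCI_IO_NORMAL; /* state is unchanged */
		return 1;
	case ARCH_ISOLATES:
		ev->type = PCI_EV_STATE_CHANGED;
		ev->new_state = PCI_IO_ISOLATED;
		return 1;
	case ARCH_RESETS:
	default:
		return 0; /* the machine reboots; nobody gets notified */
	}
}
```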

Now your intent:
 > +	pci_device_io_frozen = 1, /* I/O to device is blocked */
 > +	pci_device_io_thawed,     /* I/O to device is (re-)enabled */
 > +	pci_device_io_perm_failure, /* pci card is dead */
would be realized by:
  event(STATE_CHANGED,ISOLATED) + event(TRY_RECOVER,*data)
  event(STATE_CHANGED,NORMAL)
  event(STATE_CHANGED,DEAD)
I think the latter style is more generic.
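
As a sketch of that mapping: the enum pci_device_io_state values are taken
from your patch, everything else (the event types, the struct and
io_state_to_events()) is hypothetical and only meant to show the idea:

```c
#include <assert.h>
#include <stddef.h>

/* From the patch under discussion. */
enum pci_device_io_state {
	pci_device_io_frozen = 1,   /* I/O to device is blocked */
	pci_device_io_thawed,       /* I/O to device is (re-)enabled */
	pci_device_io_perm_failure  /* pci card is dead */
};

/* Hypothetical generic states and events. */
enum pci_io_state { PCI_IO_NORMAL, PCI_IO_ISOLATED, PCI_IO_DEAD };

enum pci_error_event_type {
	PCI_EV_ERROR_DETECTED,
	PCI_EV_STATE_CHANGED,
	PCI_EV_TRY_RECOVER
};

struct pci_error_event {
	enum pci_error_event_type type;
	enum pci_io_state new_state; /* meaningful for STATE_CHANGED only */
	void *data;                  /* meaningful for TRY_RECOVER only */
};

/*
 * Express one old-style notification as generic events.
 * Writes at most 2 events to 'out' and returns how many were written.
 */
static inline size_t io_state_to_events(enum pci_device_io_state old,
					void *data,
					struct pci_error_event out[2])
{
	assert(out != NULL);

	switch (old) {
	case pci_device_io_frozen:
		out[0].type = PCI_EV_STATE_CHANGED;
		out[0].new_state = PCI_IO_ISOLATED;
		out[0].data = NULL;
		out[1].type = PCI_EV_TRY_RECOVER;
		out[1].new_state = PCI_IO_ISOLATED;
		out[1].data = data;
		return 2;
	case pci_device_io_thawed:
		out[0].type = PCI_EV_STATE_CHANGED;
		out[0].new_state = PCI_IO_NORMAL;
		out[0].data = NULL;
		return 1;
	case pci_device_io_perm_failure:
		out[0].type = PCI_EV_STATE_CHANGED;
		out[0].new_state = PCI_IO_DEAD;
		out[0].data = NULL;
		return 1;
	default:
		return 0; /* unknown value: nothing to report */
	}
}
```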

Do these ideas help as a starting point?

Thanks,
H.Seto