PCI Error Recovery API Proposal (updated)

Thu Apr 7 04:48:52 EST 2005

On Tue Apr 5 01:43:51 2005 Benjamin Herrenschmidt wrote:
>The error recovery API support is exposed by the driver in the form of
>a structure of function pointers pointed to by a new field in struct
>pci_driver. The absence of this pointer in pci_driver denotes a
>"non-aware" driver; behaviour on these is platform dependent. Platforms
>like ppc64 can try to simulate hotplug remove/add.
>
>This structure has the form:
>
>struct pci_error_handlers
>{
>	int (*error_detected)(struct pci_dev *dev, pci_error_token
>		error);
>	int (*error_recover)(struct pci_dev *dev);
>	int (*error_restart)(struct pci_dev *dev);
>	int (*link_reset)(struct pci_dev *dev);
>	int (*slot_reset)(struct pci_dev *dev);
>};

Agree. When do you plan to have this structure in struct pci_driver?
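
To make the "new field in struct pci_driver" concrete, here is a minimal userspace sketch of how the core could tell an aware driver from a non-aware one. The field name err_handler, the cut-down struct pci_driver, the int placeholder for pci_error_token, and the helper driver_is_error_aware() are all my assumptions for illustration, not part of the proposal.

```c
#include <stddef.h>

struct pci_dev;                 /* opaque stand-in for the kernel type */

typedef int pci_error_token;    /* placeholder: the proposal leaves this undefined */

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*error_recover)(struct pci_dev *dev);
	int (*error_restart)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

/* Cut-down stand-in for struct pci_driver; only the fields needed here. */
struct pci_driver {
	const char *name;
	/* Hypothetical field name. NULL means the driver is "non-aware". */
	const struct pci_error_handlers *err_handler;
};

/* Returns 1 if the driver exposes error handlers, 0 otherwise. */
static int driver_is_error_aware(const struct pci_driver *drv)
{
	return drv != NULL && drv->err_handler != NULL;
}
```

A non-aware driver simply leaves the pointer NULL, and the platform then decides what to do with it (for example, simulated hotplug remove/add on ppc64).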

>The definition of "pci_error_token" is not covered here. 

What is the default type of pci_error_token in API 1)? You said "within
this function and after it returns, the driver shouldn't do any new
IOs." The AER code is required to pass the error severity (fatal or
nonfatal) to a driver when calling API 1). I would prefer this error
token to be defined as an integer type that is passed as either
PCIERR_FATAL_DETECTED or PCIERR_NONFATAL_DETECTED. Please let me know
what you think.
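
A rough sketch of what I have in mind for the integer token, written as a userspace mock-up. The numeric values, the stand-in struct pci_dev with its two fields, the handler name example_error_detected(), and the 0 / -1 return convention are assumptions for illustration only; the proposal does not define them.

```c
typedef int pci_error_token;            /* suggested: plain integer type */

/* Numeric values are hypothetical; only the names come from this mail. */
#define PCIERR_NONFATAL_DETECTED	1
#define PCIERR_FATAL_DETECTED		2

/* Userspace stand-in for struct pci_dev, just enough to record state. */
struct pci_dev {
	int io_stopped;		/* set once the driver must not start new IO */
	int last_severity;	/* last severity token seen by the driver */
};

/*
 * Example error_detected() handler: stops new IO, then records the
 * severity passed in by the AER code. Returns 0 for a known severity
 * and -1 for an unknown token (return convention assumed here).
 */
static int example_error_detected(struct pci_dev *dev, pci_error_token error)
{
	dev->io_stopped = 1;	/* no new IO within or after this call */

	switch (error) {
	case PCIERR_NONFATAL_DETECTED:
	case PCIERR_FATAL_DETECTED:
		dev->last_severity = error;
		return 0;
	default:
		return -1;
	}
}
```

With this, the AER code would call the handler with one of the two tokens and the driver can branch on severity without any extra type.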

>	3) link_reset()
>
>	This is called after the link has been reset. This is typically
>a PCI Express specific state at this point and is done whenever a non-fatal
>error has been detected that can be "solved" by resetting the link. The
>driver is informed here of that reset and should check if the device
>appears to be in working condition. This function acts a bit like 2)
>error_recover(), that is it is not supposed to restart normal driver IO
>operations right away, just "probe" the device to check its
>recoverability status. If all is right, then the core will call
>error_restart() once all drivers have ack'd link_reset().

API 3) is not like error_recover(). It is basically a PCI Express
specific case that applies when a fatal error has been reported to the
Root Port. This fatal error can be "solved" by resetting the link at the
upstream port associated with the hierarchy in question. The upstream
port driver is informed here to reset its link so that the link returns
to a reliable state. After the link reset completes, we go to 4) and 5).
Please change your description accordingly.
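
The ordering I am describing could be sketched roughly as below. This is a userspace mock-up under my own assumptions: fatal_error_flow(), port_link_reset(), count_step(), and the stand-in struct pci_dev are hypothetical names, and steps 4) and 5) are passed in as opaque callbacks because they are not spelled out in this mail.

```c
#include <stddef.h>

/* Userspace stand-in for struct pci_dev, just enough to count calls. */
struct pci_dev {
	int link_resets;	/* times the link was reset via this port */
	int steps_run;		/* later recovery steps run on this device */
};

typedef int (*pci_err_step)(struct pci_dev *dev);

/* Stand-in for the upstream port driver's link_reset() callback. */
static int port_link_reset(struct pci_dev *port)
{
	port->link_resets++;
	return 0;
}

/* Stand-in for a later recovery step (4 or 5) on the affected device. */
static int count_step(struct pci_dev *dev)
{
	dev->steps_run++;
	return 0;
}

/*
 * On a fatal error reported to the Root Port: first ask the upstream
 * port driver to reset its link, and only after that completes
 * successfully move on to steps 4) and 5). Returns 0 on success.
 */
static int fatal_error_flow(struct pci_dev *port, pci_err_step link_reset,
			    struct pci_dev *dev, pci_err_step step4,
			    pci_err_step step5)
{
	int rc;

	if (port == NULL || link_reset == NULL)
		return -1;

	rc = link_reset(port);		/* 3) reset link at the upstream port */
	if (rc != 0)
		return rc;

	if (step4 != NULL) {
		rc = step4(dev);	/* 4) */
		if (rc != 0)
			return rc;
	}
	if (step5 != NULL)
		rc = step5(dev);	/* 5) */

	return rc;
}
```

The point of the sketch is only the ordering: the link reset is an action taken by the upstream port driver, not a "probe" by the endpoint driver.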

Thanks,
Long