PCI Error Recovery API Proposal. (WAS: [PATCH/RFC] PCI Error Recovery)

Wed Mar 16 09:14:57 EST 2005

On Tuesday, March 15, 2005 1:44 PM Linas Vepstas wrote:
> Also, I have not completely read (or understood) what Long Nguyen 
> just sent in... and haven't heard him make any remarks.  Sounds
> to me like his AER patch is a pcie-specific version of what we are
> talking about. It would be nice to hear for Long about his thoughts on
> this.

I apologize for taking so long to respond. To give you some PCI Express
AER context in general terms: a PCI Express component that detects an
error sends an error message to the Root Port. The Root Port processes
this error message internally and generates an interrupt. The AER
driver's ISR, which services this interrupt, identifies the error from
the information logged by the Root Port in the Root Error Status
Register and the Error Source Identification Register. Once the error
is identified, the AER driver uses the AER-aware callback handle
defined below:

struct pcie_aer_handle {
   /*
    * Notify the PCI Express device driver of an error sent by its
    * device. Also, obtain error information from the driver to
    * identify the error type and severity.
    */
   int (*notify) (unsigned short requestor_id, union aer_error *error);

   /*
    * Obtain the TLP header log, which may be logged along with certain
    * uncorrectable errors.
    */
   int (*get_header) (unsigned short requestor_id, union aer_error *error,
		struct header_log_regs *log);

   /*
    * Notify the driver to abort any existing transactions and prepare
    * for uncorrectable fatal error recovery. This occurs only if the
    * PCI Express Port implements a link reset in its hardware.
    */
   int (*link_rec_prepare) (unsigned short requestor_id);

   /*
    * Notify the driver when link reset is completed and active.
    */	
   int (*link_rec_restart) (unsigned short requestor_id);

   /*
    * Ask the driver to perform a link reset. This occurs only if this
    * PCI Express Port implements a link reset in its hardware.
    */
   int (*link_reset) (unsigned short requestor_id);
};

The AER Root driver uses this handle to coordinate with the PCI Express
AER-aware driver and determine more precisely the error type and
severity, so that it can log and report the error to the user. If the
error type is uncorrectable and the severity is fatal, the hardware
link is no longer reliable. Returning the link to a reliable state
requires a link reset implemented on the PCI Express port. If the port
implements one, the PCI Express AER Root driver performs the link
reset. Otherwise, it assumes the user has their own error policy.
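
The reset decision described above can be sketched roughly as follows.
This is only an illustration, not code from the AER patch: the enum
names and aer_should_reset_link() are hypothetical, and logging and
reporting are left out.

```c
#include <stdbool.h>

/* Hypothetical names, used only for this sketch. */
enum aer_err_type { AER_ERR_CORRECTABLE, AER_ERR_UNCORRECTABLE };
enum aer_err_severity { AER_SEV_NONFATAL, AER_SEV_FATAL };

/*
 * Return true when the AER Root driver would do a link reset:
 * the error is uncorrectable and fatal, and the port implements
 * a link reset in hardware. In every other case the decision is
 * left to the user's own error policy.
 */
static bool aer_should_reset_link(enum aer_err_type type,
				  enum aer_err_severity severity,
				  bool port_has_link_reset)
{
	if (type != AER_ERR_UNCORRECTABLE)
		return false;
	if (severity != AER_SEV_FATAL)
		return false;
	return port_has_link_reset;
}
```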

However, we acknowledge that the feedback on LKML favors a generic
interface for error handling. I like the current proposal of three
callback functions in pci_driver:

void (*frozen) (struct pci_dev *);	/* called when dev is first frozen */
void (*thawed) (struct pci_dev *);	/* called after card is reset */
void (*perm_failure) (struct pci_dev *);	/* called if card is dead */
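
As a rough sketch of how I read the ordering of these three callbacks
(an assumption on my part, not settled API): the struct name
err_callbacks, the stand-in struct pci_dev, recover_device() and the
demo_* helpers are all hypothetical, and the reset step is passed in as
a plain function pointer that returns 0 on success.

```c
#include <stddef.h>

/* Stand-in for the kernel type; records the last callback seen. */
struct pci_dev { int state; };
enum { DEV_OK, DEV_FROZEN, DEV_THAWED, DEV_DEAD };

/* The three proposed callbacks, grouped in a hypothetical struct. */
struct err_callbacks {
	void (*frozen) (struct pci_dev *);
	void (*thawed) (struct pci_dev *);
	void (*perm_failure) (struct pci_dev *);
};

/* Freeze the device, try the reset, then thaw it or declare it dead. */
static void recover_device(struct pci_dev *dev,
			   const struct err_callbacks *cb,
			   int (*do_reset) (struct pci_dev *))
{
	if (cb->frozen)
		cb->frozen(dev);
	if (do_reset(dev) == 0) {
		if (cb->thawed)
			cb->thawed(dev);
	} else {
		if (cb->perm_failure)
			cb->perm_failure(dev);
	}
}

/* Demo callbacks and reset hooks, only to exercise the sketch. */
static void demo_frozen(struct pci_dev *d) { d->state = DEV_FROZEN; }
static void demo_thawed(struct pci_dev *d) { d->state = DEV_THAWED; }
static void demo_perm_failure(struct pci_dev *d) { d->state = DEV_DEAD; }
static int demo_reset_ok(struct pci_dev *d) { (void)d; return 0; }
static int demo_reset_fail(struct pci_dev *d) { (void)d; return -1; }
```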

I can use frozen to replace link_rec_prepare and thawed to replace
link_rec_restart. In addition, I would like to add two other functions,
as below:

/* notify driver that a correctable/uncorrectable error occurred on its device */
int (*notify) (struct pci_dev *, union error_src *);
/* called to reset downstream bus(es) if the PCI Express Port supports link reset */
int (*reset) (struct pci_dev *);

where the error_src data structure is defined as below:

union error_src {
	unsigned int type;	/* AER_CORRECTABLE|AER_UNCORRECTABLE */
	struct {
		unsigned int type;	/* AER_CORRECTABLE|AER_NONFATAL|AER_FATAL */
		unsigned int flags;	/* TLP log valid and whether reset is supported or not */
		unsigned int status;	/* Particular Error Status */
		struct header_log_regs *log;	/* PCI Express TLP Header log */
	} pcie_aer;
};
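
To make the intended use of error_src more concrete, here is a minimal
sketch of what a driver's notify callback might look like. Much of it
is assumed for illustration: the numeric values of the AER_* constants,
the two flag bit names, the four-DWORD layout of header_log_regs, the
stand-in struct pci_dev, and the return convention (0 means the driver
can carry on, 1 means it wants a reset) are not defined by the proposal.

```c
#include <stddef.h>

struct pci_dev { int unused; };			/* stand-in for the kernel type */
struct header_log_regs { unsigned int dw[4]; };	/* assumed layout */

/* Assumed values; the proposal names these constants but not their values. */
#define AER_CORRECTABLE		0
#define AER_NONFATAL		1
#define AER_FATAL		2
/* Hypothetical flag bits for the "flags" field. */
#define AER_FLAG_TLP_LOG_VALID	0x1
#define AER_FLAG_RESET_SUPP	0x2

union error_src {
	unsigned int type;
	struct {
		unsigned int type;
		unsigned int flags;
		unsigned int status;
		struct header_log_regs *log;
	} pcie_aer;
};

/* First DWORD of the most recent valid TLP header log seen by the demo. */
static unsigned int example_last_tlp_dw0;

/* Hypothetical driver callback matching the proposed notify signature. */
static int example_notify(struct pci_dev *dev, union error_src *err)
{
	(void)dev;

	/* Correctable errors need no driver action in this sketch. */
	if (err->pcie_aer.type == AER_CORRECTABLE)
		return 0;

	/* Only look at the header log when the flags say it is valid. */
	if ((err->pcie_aer.flags & AER_FLAG_TLP_LOG_VALID) &&
	    err->pcie_aer.log != NULL)
		example_last_tlp_dw0 = err->pcie_aer.log->dw[0];

	/* Ask for a reset only on fatal errors. */
	return err->pcie_aer.type == AER_FATAL ? 1 : 0;
}
```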

Please let us know what you think.

Thanks,
Long