PCI Error Recovery API Proposal (updated)

Thu Apr 7 09:16:08 EST 2005

> Agree. When do you plan to have this structure in struct pci_driver?

As soon as everybody agrees on them, which I hope is soon.

> >The definition of "pci_error_token" is not covered here. 
> 
> What is the default type of pci_error_token in API 1)? You said "within
> this function and after it returns, the driver shouldn't do any new
> IOs." AER code is required to pass error severity (fatal or nonfatal) to
> a driver when calling API 1). I prefer that this error token be defined
> as an integer type, which is passed with either PCIERR_FATAL_DETECTED or
> PCIERR_NONFATAL_DETECTED. Please let me know what you think.

The token should be an opaque type with accessors. You could define a
pci_error_get_severity(token) to return the severity. The idea is to
define accessors which return an error when the data requested isn't
present in the error info. The actual content of the token is still to
be defined; I was thinking of a type field plus a union. I was hoping
Seto could provide something here ...
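
A rough sketch of what such a token and accessor could look like, in
userspace C for illustration. Only pci_error_get_severity() and the two
PCIERR_* names come from this thread; the enum values, the
pci_error_type discriminator, the union layout, the out-parameter
signature and the -ENODATA return are all assumptions, not a settled
design.

```c
#include <errno.h>
#include <stddef.h>

/* Severity names come from the thread; the numeric values are assumed. */
enum pci_error_severity {
	PCIERR_NONFATAL_DETECTED = 1,
	PCIERR_FATAL_DETECTED    = 2,
};

/* Hypothetical discriminator: tells which union member is valid. */
enum pci_error_type {
	PCI_ERROR_TYPE_GENERIC = 0,	/* no extra data attached */
	PCI_ERROR_TYPE_PCIE_AER,	/* PCI Express AER report */
};

/* Hypothetical layout: private to the core, drivers use accessors only. */
struct pci_error_token {
	enum pci_error_type type;
	union {
		struct {
			enum pci_error_severity severity;
		} aer;
	} u;
};

/*
 * Store the severity in *severity and return 0, or return -ENODATA
 * when this kind of error report carries no severity information.
 */
int pci_error_get_severity(const struct pci_error_token *token,
			   enum pci_error_severity *severity)
{
	if (token == NULL || severity == NULL)
		return -EINVAL;
	if (token->type != PCI_ERROR_TYPE_PCIE_AER)
		return -ENODATA;
	*severity = token->u.aer.severity;
	return 0;
}
```

With something like this, AER code fills in the severity and a driver
that gets -ENODATA back simply knows the platform did not provide it.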

> >	3) link_reset()
> >
> >	This is called after the link has been reset. This is typically
> >a PCI Express specific state at this point and is done whenever a non fatal
> >error has been detected that can be "solved" by resetting the link. The
> >driver is informed here of that reset and should check if the device
> >appears to be in working condition. This function acts a bit like 2)
> >error_recover(), that is it is not supposed to restart normal driver IO
> >operations right away, just "probe" the device to check its
> >recoverability status. If all is right, then the core will call
> >error_restart() once all drivers have ack'd link_reset().
> 
> API 3) is not like error_recover(). This is basically PCI Express
> specific, for when a fatal error has been reported to the Root Port. This
> fatal error can be "solved" by resetting the link at the upstream port
> associated with the hierarchy in question. An upstream port driver is
> informed here to reset its link so that it becomes reliable again. After
> the link reset completes, we go to 4) and 5). Please change your
> description accordingly.

Wait ... Once you have reset the link, you call 3). At this point, the
card should be operational again, right? That is, the next callback
should be 5), not 4), unless the driver decides here that it can't
recover and needs a full hard reset of the slot (which is a different
thing), in which case you end up power cycling the slot and go to 4).

That is, in this regard, the action of a driver in 3) is similar to the
action of a driver in "recover", in the sense that the link has been
reset, the card might not have been (whether the link reset triggers a
card reset is device specific; the driver will know what to do), and the
driver can recover from it. The next step to expect is 5). Did I get
something wrong?
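
To make the flow I have in mind concrete, here is a small sketch of how
the core could pick the next step once every driver has answered
link_reset(). The result codes, the step names and the function itself
are hypothetical; I am also assuming that 5) is the error_restart()
callback and that 4) is whatever runs after the slot has been power
cycled.

```c
#include <stddef.h>

/* Hypothetical verdict a driver returns from its link_reset() callback. */
enum link_reset_result {
	LINK_RESET_RECOVERED,		/* device probes fine after link reset */
	LINK_RESET_NEED_SLOT_RESET,	/* driver can't recover without hard reset */
};

/* Hypothetical names for the two possible follow-up steps. */
enum next_step {
	NEXT_STEP_4_AFTER_SLOT_POWER_CYCLE,
	NEXT_STEP_5_ERROR_RESTART,
};

/*
 * Go straight to 5) only if all drivers report the device as working;
 * a single driver asking for a hard reset sends the whole slot to 4).
 */
enum next_step next_step_after_link_reset(const enum link_reset_result *results,
					  size_t ndrivers)
{
	size_t i;

	for (i = 0; i < ndrivers; i++) {
		if (results[i] == LINK_RESET_NEED_SLOT_RESET)
			return NEXT_STEP_4_AFTER_SLOT_POWER_CYCLE;
	}
	return NEXT_STEP_5_ERROR_RESTART;
}
```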

Ben.