pci error recovery procedure

Mon Sep 4 15:47:30 EST 2006

On Sat, 2006-09-02 at 05:25, Linas Vepstas wrote:
> On Fri, Sep 01, 2006 at 11:33:49AM +0800, Zhang, Yanmin wrote:
> > On Fri, 2006-09-01 at 01:50, Linas Vepstas wrote:
> > > On Thu, Aug 31, 2006 at 03:10:12PM +0800, Zhang, Yanmin wrote:
> > > > Linas,
> > > > 
> > > > I am reviewing the error handlers of e1000 driver and got some ideas. My
> > > > startpoint is to simplify the err handler implementations for drivers, or
> > > > driver developers are *not willing* to add it if it's too complicated.
> > > 
> > > I don't see that its to complicated ... 
> > Originally, I didn't think so, but after I try to add err_handlers to some
> > drivers, I feel it's too complicated.
> 
> Which drivers are you working on?
I worked out error handlers for tg3 NIC driver. I'm also checking e1000 driver
to try to move all I/O operations from e1000_io_error_detected to e1000_io_slot_reset.

> 
> > > > 1) Callback mmio_enabled looks useless. Documentation/pci-error-recovery.txt
> > > > says the current powerpc implementation does not implement this callback.
> > > 
> > > I don't know if its useless or not. I have not needed it yet for the
> > > symbios, ipr and e1000 drivers, but its possible that some more
> > > sophisticated device may want it. I'm tempted to keep it a while 
> > > longer befoe discarding it.
> > > 
> > > The scenario is this: the device driver decides that, rather than asking
> > > for a full electical reset of the card, instead, it wants to perform 
> > > its own recovery. It can do this as follows:
> > > 
> > > a) enable MMIO
> > > b) issue reset command to adapter
> > > c) enable DMA.
> > > 
> > > If we enabled both DMA and MMIO at the same time, there are mnay cases
> > > where the card will immediately trap again -- for example, if its
> > > DMA'ing to some crazy address. Thus, typically, one wants DMA disabled 
> > > until after the card reset.
I think most drivers' error_detected callbacks return PCI_ERS_RESULT_NEED_RESET,
so the slot will be reset. Then, the example that one wants DMA disabled
until after the card reset is not reasonable.

>   Withouth the mmio_enabled() reset, there
> > > is no way of doing this.
> > The new error_resume, or the old slot_reset could take care of it. The specific
> > device driver knows all the details about how to initiate the devices. The 
> > error_resume could call the step a) b) c) sequencially while doing checking among
> > steps.
> 
> Again, consider the multi-function cards. On pSeries, I can  only enable 
> DMA on a er-slot basis, not a per-function basis. So if one driver
> enables DMA before some other driver has reset appropriately, everything
> breaks.
Does here 'reset' mean hardware slot reset? In error_detected, driver 
needs cancel all pending request and don't start any I/O operations. 
Then, if the slot is always reset, there will be no the problem. Function
handle_eeh_events always resets slot except hard failure. See beow more comments.

As you know, all functions of a device share the same bus number and 5 bit dev number.
They just have different 3 bit function number. We could deduce if functions are in the same
device (slot).

If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
if a platform really need it, we could add it, so we could keep the simplied codes.

> 
> > If there is really a device having specific requirement to reinitiate it (very rarely),
> > it could use walkaround, such like schedule a WORKER. No need to provide a generic
> > mmio_enabled.
> 
> I don't understand. Enabling MMIO and enabling DMA both require specific
> commands to be sent to the PCI-host bridge. These commands are not a
> part of the PCI spec.
Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
to map to pSeries platform hardware operation steps. I know little about pSeries hardware,
but is it possible to merge such hardware steps from software point of view?

I checked the source codes of pSeries eeh_driver. Function handle_eeh_events does nothing
between pci_walk_bus(frozen_bus, eeh_report_reset, NULL) and
pci_walk_bus(frozen_bus, eeh_report_resume, NULL), that is, it doesn't enable DMA after
slot_reset. handle_eeh_events always resets slot except hard failure. So, slot_reset
could be merged with resume.

>  
> 
> > > > 2) Callback slot_reset could be merged with resume. The new resume could be:
> > > > int (*error_resume)(struct pci_dev *dev); I checked e1000 and e100 drivers and
> > > > think there is no actual reason to have both slot_reset and resume.
> > > 
> > > The idea here was to handle multi-function cards.  On a multi-function card, 
> > > *all* devices need to indicate that they were able to reset. Once all devices 
> > > have been successfuly reset, then operation can be resumed. If the reset 
> > > of one function fails, then operation is not resumed for any f the
> > > functions.
> > I don't think we need slot_reset to coordinate multi-function devices. The new
> > error_resume could take care of multi-function card. 
> 
> How? 
> 
> > 'reset' here means driver
> > need do I/O to detect if the device (function) still works well. If a function
> > of a multi-function device couldn't reset while other functions could reset,
> > other functions could just go on to reinitiate. In the end, the error recovery
> > procedure (handle_eeh_events in PowerPC implementation) could check all the
> > returning values of error_resume. If there is a failure value, then removes
> > all the functions' pci_dev of the device from the bus.
> 
> I can only enable or disable an entire PCI slot, and not individual PCI
> functions. If there are some pins that are shorted, or parity errors or
> whatever, I can only turn off the whole card.
It doesn't matter with the simplification. I don't mean that a device function
should be disabled immediately after the error_resume of the function driver
returns.The disable operation could be delayed till all error_resume return. 

>  
> 
> > > > During
> > > > our last discussion on LKML, you said PowerPC will block further I/O if the platform captures
> > > > a pci error, so the all I/O in e1000_down will be blocked. Later on, e1000_io_slot_reset
> > > > will reenable pci device and initiate NIC. I guess late initiate might fail because prior
> > > > e1000_down I/O don't reach NIC.
> > > 
> > > Why would it fail? The e1000_down serves primarily to get the Linux
> > > kernel into a known state. It doesn't matter what happens to the card,
> > > since the next step will be to perform an electrical reset of the card.
> > Who will perform the electrical reset of the card? Function e1000_reset or the platform?
> 
> The platform. By "electrical reset", I mean "dropping the #RST pin low
> for 200mS". Only the platform can do this.
Thanks for your explanation. I assume after the electrical reset, all device
functions of the device slot will go back to the initial status before
attaching their drivers.

>  
> > If it's the platform, I agree with you, but if it's e1000_reset, it might not work because
> > e1000_reset uses a e1000-specific approach to reset the card.
> 
> The driver has to choices: it can ask for the electrical reset, by 
> returning PCI_ERS_RESULT_NEED_RESET. But if the driver doesn't need
> the electrical reset, then it can return PCI_ERS_RESULT_CAN_RECOVER,
> and issue whatever device-specific commands it needs to reset.
> > I'm not sure if the e1000_reset
> > will restore the NIC to fresh system power-on state. At least, from the source codes, e1000_reset
> > couldn't.
> 
> I have no idea. That's why this driver issues PCI_ERS_RESULT_NEED_RESET,
> which will get it into a fresh system power-on state. Its easy, its
> brute-force, it works.
I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
its RX/TX packets numbers are crazy. Now, I think it's a bug of function e1000_reset, not
the error handlers. Sorry for bothering you on e1000.

I copy another email below, so we could keep the discussion in one thread.
On Fri, Sep 01, 2006 at 05:04:09PM +0800, Zhang, Yanmin wrote:
> > One more comment: The second parameter of error_detected also could be deleted
> > because recovery procedures will save error to pci_dev->error_state.
> 
> Yes, I beleive so.
Thanks.

> 
> > So, the err_handler pci_error_handlers could be:
> > struct pci_error_handlers
> > {
> >         pci_ers_result_t (*error_detected)(struct pci_dev *dev);
> >         pci_ers_result_t (*error_resume)(struct pci_dev *dev);
> > };
> 
> No, as per other email, we still need a multi-step process for
> multi-function cards,
As above discussion, reset slot could resolve it like you did for pSeries.

>  and for cards that may not want to get
> a full electrical reset.
So I think slot is reset only when a error_detected returns
PCI_ERS_RESULT_NEED_RESET.

> Finally, there might be platforms 
> that cannot perform a per-slot electrical reset, and would 
> therefore require drivers that can recover on thier own.
The new pci_error_handlers could process it easily. The driver's error_resume just
need schedule a driver-specific worker and returns PCI_ERS_RESULT_RECOVERED. The
worker could do recover on the driver own later on.

By checking drivers who support err_handler in the latest kernel, we could find
they all returns PCI_ERS_RESULT_NEED_RESET. They all could be converted to
use the new simplified pci_error_handlers. The new pci_error_handlers also gives
drivers flexibility to have more control on error recovery.

It's hard to look for a perfect solution. I mean, it's a trade-off. As long as
it could finish most functionality, the simpler, the better.

Yanmin