pci error recovery procedure

Linas Vepstas linas at austin.ibm.com
Wed Sep 6 05:17:39 EST 2006


On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> > 
> > Again, consider the multi-function cards. On pSeries, I can  only enable 
> > DMA on a per-slot basis, not a per-function basis. So if one driver
> > enables DMA before some other driver has reset appropriately, everything
> > breaks.
> Does here 'reset' mean hardware slot reset? 

I should have said: If one driver of a multi-function card enables DMA before 
another driver has stabilized its harware, then everything breaks.

> Then, if the slot is always reset, there will be no the problem. 

But that assumes that a hardware #RST will always be done. The API
was designed to get away from this requirement.

> If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> if a platform really need it, we could add it, so we could keep the simplied codes.

It would be very difficult to add it later. And it would be especially
silly, given that someone would find this discussion in the mailing list 
archives.

> Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> to map to pSeries platform hardware operation steps. I know little about pSeries hardware,

The hardware was designed that way because the hardware engineers
thought that this is what the device driver writers would need. 
Thay are there to map to actual recovery steps that actual device
drivers might want to do.

> but is it possible to merge such hardware steps from software point of view?

The previous email explained why this would be a bad idea. 

> > The platform. By "electrical reset", I mean "dropping the #RST pin low
> > for 200mS". Only the platform can do this.
> Thanks for your explanation. I assume after the electrical reset, all device
> functions of the device slot will go back to the initial status before
> attaching their drivers.

Maybe. Depends on what yur BIOS does. On pSeries, I also need to
set up the adress BARs

> I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
> its RX/TX packets numbers are crazy.

Hmm. There is a patch to prevent this from happening. I thought
it was applied a long time ago. e1000_update_stats() should include the
lines:

   if (pdev->error_state && pdev->error_state != pci_channel_io_normal)
      return;

which is enough to prevent crazy stats on my machine.

--linas



More information about the Linuxppc-dev mailing list