Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

Sat Apr 2 01:27:22 EST 2005

Grant Grundler wrote:
>>>You want everything moved back to the "queued" state or failed
>>>(flush pending IO so upper layers can retry if they want).
>>
>>Upper layer is the linux block device; my understanding is that it does
>>not retry, nor do the filesystems above that.  Passing errors upwards
>>seems to be pretty darned fatal.  My goal is to limit retries to the
>>driver.
> 
> 
> That's a bad idea. Been there done that.
> 
> Upper layers can be a lot smarter about retries than the driver ever
> could be. While the driver knows more about the transport and why
> something might fail, upper layers will know alternate paths
> to the same devices or to the same data on different devices.
> Upper layers also set the recovery policy for particular storage.
> 
> Trying to do recovery transparently in the drivers is also going to
> mess up other high level SW like Service Guard or LifeKeeper.
> They want to know when a path has failed, log it, and make sure
> someone gets sent to service the HW if thresholds are exceeded.
> 
> Let higher layers like dm, VxFS, LVM worry about recovery.

The sym2 driver should fail everything back with DID_ERROR.
In most cases, the scsi midlayer will retry if the upper layer allows
retries and you will get the behavior you desire. If retries are not
allowed, like for a tape device, the command will get failed back to the
upper layer driver.
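
To make the flow concrete, here is a rough userspace sketch of it. This is
not the actual sym2 or midlayer code: struct toy_cmd, lld_fail_cmd() and
midlayer_should_retry() are made-up names for illustration, and the retry
test is simplified. The only real kernel detail is that DID_ERROR is 0x07
and sits in the host byte (bits 16-23) of the command result.

```c
#include <stdbool.h>

/* Host byte status code, same value as in the Linux SCSI headers. */
#define DID_ERROR 0x07

/* Hypothetical stand-in for struct scsi_cmnd; only the fields used here. */
struct toy_cmd {
    int result;   /* host byte lives in bits 16..23 */
    int retries;  /* retries already attempted */
    int allowed;  /* retries permitted by the upper layer driver */
};

/* What the low-level driver does on a PCI error: fail the command back. */
void lld_fail_cmd(struct toy_cmd *cmd)
{
    cmd->result = DID_ERROR << 16;
}

/*
 * Simplified midlayer decision (an assumption, not the real logic):
 * requeue the command if retries remain, otherwise hand the error
 * to the upper layer driver.
 */
bool midlayer_should_retry(struct toy_cmd *cmd)
{
    int host = (cmd->result >> 16) & 0xff;

    if (host != DID_ERROR)
        return false;           /* other statuses are not modelled here */
    if (cmd->retries >= cmd->allowed)
        return false;           /* out of retries: error goes upward */
    cmd->retries++;
    return true;                /* midlayer requeues the command */
}
```

With allowed set to a nonzero value (a disk, say) the command is requeued
after the driver fails it; with allowed set to 0 (the tape case) the error
goes straight up to the upper layer driver.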

-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center