[PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)

Fri Feb 25 10:14:55 EST 2005

Hi Hide,

I am very glad to hear from you.

On Thu, Feb 24, 2005 at 11:04:39PM +0900, Hidetoshi Seto was heard to remark:
> 
> It's also important to remember that PPC64 already has special
> infrastructure. PPC64 always uses quite cautious eeh_readX(),
> so it can detect every error almost synchronously in the affected
> context, and maybe can react to the error on the time of happening.
> AFAIK it's special. Most of archs don't like neither doing nervous
> check nor heavy firmcall in golden route such as read().

The reason ppc64 checks for a possible PCI error on every read is that
this was the only thing we could think of that did not require modifying
any device drivers.  **If** a device driver is modified, then the check for
errors can be made much less frequent.  However, we thought that most
device driver maintainers would reject a ppc64-only patch, and so
we picked the simplest/dumbest thing that would work.

> called ia64 don't have such... well, anyway I think that recovery
> on PPC64 is blessed with such nice environment.

Thank you :)

> Unfortunately or fortunately, your approach to PCI error recovery

Let us distinguish the terms "error recovery" and "error detection":

-- "detection" is finding out that an error occurred
-- "recovery" is the sequence of steps taken to make the PCI
   device usable again.

> Still now I use conservative designed API like:
>   {
>     iochk_clear(cookie,dev);
>       io,io,io...
>     if(iochk_read(cookie)) return -EAGAIN;
>   }
> It allows drivers to make IO-critical section. Based on tradition that
> error checking is too heavy to do so frequently, frequency of check
> is flexibly adjustable. For example, impatient driver will put io into
> the section as many as possible, to reduce the overhead of error check.
> Cautious driver will put only one io, to reduce the damage of an error.

Yes, this interface for "detection" would be good.  I could (Ben could?)
easily code up this kind of interface, once we agree on what the
names and arguments of the subroutines are (iochk_clear()?  pci_iochk_clear()?
pci_ioblock_begin()/pci_ioblock_end()?)
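
To make the discussion concrete, here is a rough user-space sketch of the
clear/check pattern quoted above.  Everything in it is an assumption made
for illustration: the cookie layout, the mock device, mock_readl() and the
error counter are invented, and only the names iochk_clear() and
iochk_read() come from the quoted proposal.  It is not kernel code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mock device: error_count is bumped whenever an I/O fails. */
struct mock_dev {
    unsigned long error_count;
    uint32_t reg;
    int fail_next;              /* test hook: make the next read fail */
};

/* Hypothetical cookie: remembers the device and its error count at "clear". */
typedef struct {
    const struct mock_dev *dev;
    unsigned long seen;
} iocookie;

/* Start of an I/O-critical section: snapshot the current error state. */
static void iochk_clear(iocookie *cookie, const struct mock_dev *dev)
{
    cookie->dev = dev;
    cookie->seen = dev->error_count;
}

/* End of the section: nonzero if any error happened since iochk_clear(). */
static int iochk_read(const iocookie *cookie)
{
    return cookie->dev->error_count != cookie->seen;
}

/* Mock register read; a failed PCI read commonly returns all ones. */
static uint32_t mock_readl(struct mock_dev *dev)
{
    if (dev->fail_next) {
        dev->fail_next = 0;
        dev->error_count++;
        return 0xffffffffu;
    }
    return dev->reg;
}

/* A driver-style section that groups two reads under a single check. */
static int read_two_regs(struct mock_dev *dev, uint32_t *a, uint32_t *b)
{
    iocookie cookie;

    iochk_clear(&cookie, dev);
    *a = mock_readl(dev);
    *b = mock_readl(dev);
    if (iochk_read(&cookie))
        return -EAGAIN;         /* caller may retry the whole section */
    return 0;
}
```

Putting more I/O between the two calls lowers the cost of checking;
putting less there limits how much work is thrown away when an error
does occur.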

The hard part is to start converting device drivers to use this
interface; the other hard part (for you) is to decide what to do about
the device drivers that have not converted to this interface.

> You have let me realize that:
> "The most cautious arch I know, PPC64, would not need to use this API."
> I had already code prototype of ia64 specific part with this API, so
> it's too bad if you are disappointed at them.

I am not disappointed.  It's a good idea in general.  We should talk
about the detection API details.  Care to propose these details?
(i.e. what is a "cookie", how do you get a cookie, etc.?)

> Imagine - possible mix:
>   - RAS-aware driver registers callbacks to some struct on init

Yes.  Which structure? struct pci_driver?
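
As one possible shape for this, the user-space sketch below hangs an
optional table of error callbacks off a driver structure.  All names
(mock_pci_driver, mock_err_handlers, error_detected, resume,
notify_error) are hypothetical and chosen only for illustration; nothing
here is an existing kernel interface.

```c
#include <stddef.h>

struct mock_pci_dev;

/* Hypothetical callback table that a RAS-aware driver would fill in. */
struct mock_err_handlers {
    int (*error_detected)(struct mock_pci_dev *dev);
    int (*resume)(struct mock_pci_dev *dev);
};

/* Hypothetical stand-in for a driver structure such as struct pci_driver. */
struct mock_pci_driver {
    const char *name;
    const struct mock_err_handlers *err;   /* NULL: driver not converted */
};

struct mock_pci_dev {
    const struct mock_pci_driver *driver;
    int quiesced;
};

/*
 * Tell the driver about an error.  Returns 1 if the driver had a
 * callback, 0 if it did not, in which case the caller has to choose
 * some fallback policy for the unconverted driver.
 */
static int notify_error(struct mock_pci_dev *dev)
{
    const struct mock_err_handlers *h =
        dev->driver ? dev->driver->err : NULL;

    if (!h || !h->error_detected)
        return 0;
    h->error_detected(dev);
    return 1;
}

/* Example callback: stop touching the device until recovery is done. */
static int demo_error_detected(struct mock_pci_dev *dev)
{
    dev->quiesced = 1;
    return 0;
}

static const struct mock_err_handlers demo_handlers = {
    .error_detected = demo_error_detected,
};

static const struct mock_pci_driver aware_driver  = { "aware",  &demo_handlers };
static const struct mock_pci_driver legacy_driver = { "legacy", NULL };
```

The NULL case is the interesting one: it is exactly the "driver has not
been converted" situation that still needs a policy.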

>   - check before IOs (ex. block if bus recovery is processing...)

Many drivers do i/o in an interrupt context; we cannot block
that i/o without hanging the kernel.   What happens if iochk_clear()
blocks, waiting for the bus to reset, while the device driver tries to 
do i/o from a timer interrupt?  
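
To illustrate the concern, the user-space sketch below shows a "begin"
call that never sleeps: it fails with -EBUSY while a reset is in
progress, and the simulated timer path skips its I/O instead of waiting.
The names mock_bus, iochk_try_begin() and timer_tick() are invented for
this example; whether a non-blocking variant is the right answer is an
open question, not something settled in this thread.

```c
#include <errno.h>

/* Hypothetical bus state; "resetting" is set while recovery is running. */
struct mock_bus {
    int resetting;
};

/* Never sleeps, so it could be called from an interrupt-like path. */
static int iochk_try_begin(const struct mock_bus *bus)
{
    return bus->resetting ? -EBUSY : 0;
}

/* Simulated timer handler: do one unit of I/O, or skip this tick. */
static int timer_tick(const struct mock_bus *bus, int *io_done)
{
    int rc = iochk_try_begin(bus);

    if (rc)
        return rc;      /* bus is being reset: try again on a later tick */
    (*io_done)++;       /* stands in for the real device I/O */
    return 0;
}
```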

>   - do IOs... (ex. shut up device on error etc.)
>   - check after IOs (ex. IO rendezvous, recover, return result...)

Yes.

>   - master-recovery-thread handles extra more...

How should the master recovery thread be invoked?

> Is this sounds good for generic purposes?

Yes.  I'd like to discuss specifics of the actual names and arguments
and descriptions of the subroutines as soon as possible.

> Ah... I might have wrote too much :-p

No.

--linas