[PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)

Fri Feb 25 01:04:39 EST 2005

Linas Vepstas wrote:
> I *really* would like to hear from Seto or anyone else working
> on this for PCI Express.

Sorry for my late reply.
I've been stuck on other things, and it took me a long time
to read this code. It would be easier to understand if you
could split the patch into parts, for example arch/kernel
changes and driver changes.

I also agree with Greg's remark. However, I know that PCI recovery
cannot be implemented without arch-specific code, at least for
now. So I think what we have to do is design generic front-end
interfaces and implement the arch-specific back-end code behind them.

Your code looks good: three callbacks, a master recovery thread...
they are great, I believe. But as you know, a basic question
remains: "Are they also good enough for other platforms?"

It's also important to remember that PPC64 already has special
infrastructure. PPC64 always uses the quite cautious eeh_readX(),
so it can detect every error almost synchronously in the affected
context, and may be able to react to the error at the time it happens.
AFAIK that is special. Most archs like neither such nervous
checking nor a heavy firmware call in a hot path such as read().
On top of that, PPC64 has an "automatic PCI-bus isolation" system,
which sounds very high-tech and efficient. Even the expensive magical
box called ia64 doesn't have one... well, anyway, I think that recovery
on PPC64 is blessed with a very nice environment.
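
As a rough illustration of why a checked read can be cheap in the
common case and heavy only when something looks wrong, here is a
minimal userspace sketch. It assumes that an isolated slot returns
all ones on a read and that the slot state can be queried through an
expensive call. Every name in it (mock_slot, raw_readl(),
firmware_slot_is_frozen(), checked_readl()) is hypothetical and is
not the real ppc64 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock slot; not the real ppc64 data structures. */
struct mock_slot {
    uint32_t reg;        /* value the device returns when healthy */
    int frozen;          /* 1 once the (mock) bus has isolated the slot */
    int firmware_calls;  /* counts the expensive state queries */
};

/* Raw read: assume an isolated slot returns all ones. */
uint32_t raw_readl(const struct mock_slot *slot)
{
    return slot->frozen ? 0xFFFFFFFFu : slot->reg;
}

/* Stand-in for the expensive firmware query of the slot state. */
int firmware_slot_is_frozen(struct mock_slot *slot)
{
    slot->firmware_calls++;
    return slot->frozen;
}

/*
 * Checked read: query firmware only when the value looks suspicious.
 * Returns 0 on success, -1 if the slot turns out to be frozen.
 */
int checked_readl(struct mock_slot *slot, uint32_t *val)
{
    assert(slot != NULL && val != NULL);
    *val = raw_readl(slot);
    if (*val == 0xFFFFFFFFu && firmware_slot_is_frozen(slot))
        return -1;
    return 0;
}
```

In this sketch a healthy read never pays for the firmware query; only
an all-ones value does, whether it is a legitimate register value or
the sign of an isolated slot.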

Fortunately or unfortunately, your approach to PCI error recovery
and mine are significantly different, so it may be good to compare
them. At present I use a conservatively designed API like:
   {
     iochk_clear(cookie, dev);
       /* io, io, io... */
     if (iochk_read(cookie))
       return -EAGAIN;
   }
It allows drivers to define an IO-critical section. Based on the
traditional view that error checking is too heavy to do frequently,
the frequency of checks is flexibly adjustable. For example, an
impatient driver will put as many IOs as possible into the section,
to reduce the overhead of error checking. A cautious driver will put
only one IO into it, to limit the damage of an error.
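
The two styles can be sketched in userspace as below. Only the names
iochk_clear() and iochk_read() come from the snippet above. The
iocookie type, passing the cookie by pointer, the mock device,
mock_readl(), and the two read helpers are assumptions made for
illustration, not the ia64 prototype.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock device. 'fail_at' is the index of the register
 * whose read fails (-1 means no read fails). */
struct mock_dev {
    uint32_t regs[4];
    int fail_at;
    int error_latched;   /* set by the mock "hardware" on a failed read */
    int checks;          /* counts how many times iochk_read() ran */
};

/* Hypothetical cookie type: just remembers which device to check. */
typedef struct { struct mock_dev *dev; } iocookie;

/* Open a critical section: forget any previously latched error. */
void iochk_clear(iocookie *cookie, struct mock_dev *dev)
{
    assert(cookie != NULL && dev != NULL);
    cookie->dev = dev;
    dev->error_latched = 0;
}

/* Close a critical section: nonzero if an error happened inside it. */
int iochk_read(iocookie *cookie)
{
    cookie->dev->checks++;
    return cookie->dev->error_latched;
}

/* Mock MMIO read: a failed read latches an error and returns all ones. */
uint32_t mock_readl(struct mock_dev *dev, int idx)
{
    if (idx == dev->fail_at) {
        dev->error_latched = 1;
        return 0xFFFFFFFFu;
    }
    return dev->regs[idx];
}

/* "Impatient" driver: many reads, one check at the end. */
int read_all_batched(struct mock_dev *dev, uint32_t out[4])
{
    iocookie cookie;

    iochk_clear(&cookie, dev);
    for (int i = 0; i < 4; i++)
        out[i] = mock_readl(dev, i);
    if (iochk_read(&cookie))
        return -EAGAIN;
    return 0;
}

/* "Cautious" driver: one check per read, stops at the first error. */
int read_all_cautious(struct mock_dev *dev, uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        iocookie cookie;

        iochk_clear(&cookie, dev);
        out[i] = mock_readl(dev, i);
        if (iochk_read(&cookie))
            return -EAGAIN;
    }
    return 0;
}
```

With a failure on the second register, the batched variant still
performs all four reads and pays for one check, while the cautious
variant stops after two reads and two checks.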

You have made me realize that:
"The most cautious arch I know, PPC64, would not need to use this API."
I already have prototype code for the ia64-specific part using this
API, so it would be too bad if you were disappointed with it.
But at the same time, I'm afraid that currently some archs would have
neither a proper opportunity nor enough infrastructure to call the
callbacks. Could my API be used as such infrastructure?

Imagine a possible mix:
   - RAS-aware driver registers callbacks in some struct on init
   - check before IOs (e.g. block if bus recovery is in progress...)
   - do IOs... (e.g. quiesce the device on error, etc.)
   - check after IOs (e.g. IO rendezvous, recover, return result...)
   - master recovery thread handles whatever else is needed...
     :

Does this sound good for generic purposes?
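
A minimal userspace sketch of that mix follows. All names in it
(ras_callbacks, ras_dev, ras_register(), ras_io_begin(), ras_io_end(),
ras_recover(), the drv_* functions) are hypothetical and taken from
neither patch, and the master recovery thread is reduced to a plain
function call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ras_dev;

/* Callbacks a RAS-aware driver registers at init time (hypothetical). */
struct ras_callbacks {
    void (*on_error)(struct ras_dev *dev);   /* e.g. quiesce the device */
    void (*on_resume)(struct ras_dev *dev);  /* e.g. restart queued IO */
};

struct ras_dev {
    const struct ras_callbacks *cb;
    int recovering;     /* set while bus recovery is in progress */
    int error_pending;  /* latched by the mock hardware on a failed IO */
    int quiesced;       /* bookkeeping for the example callbacks */
};

/* Register callbacks on init. */
void ras_register(struct ras_dev *dev, const struct ras_callbacks *cb)
{
    assert(dev != NULL && cb != NULL);
    dev->cb = cb;
    dev->recovering = 0;
    dev->error_pending = 0;
    dev->quiesced = 0;
}

/* Check before IOs. This sketch refuses; a real one might block. */
int ras_io_begin(struct ras_dev *dev)
{
    if (dev->recovering)
        return -EBUSY;
    dev->error_pending = 0;
    return 0;
}

/* Check after IOs: on error, notify the driver and start recovery. */
int ras_io_end(struct ras_dev *dev)
{
    if (!dev->error_pending)
        return 0;
    dev->recovering = 1;
    if (dev->cb->on_error)
        dev->cb->on_error(dev);
    return -EAGAIN;
}

/* What a master recovery thread would do once the bus is usable again. */
void ras_recover(struct ras_dev *dev)
{
    dev->error_pending = 0;
    dev->recovering = 0;
    if (dev->cb->on_resume)
        dev->cb->on_resume(dev);
}

/* Example driver callbacks. */
void drv_on_error(struct ras_dev *dev)  { dev->quiesced = 1; }
void drv_on_resume(struct ras_dev *dev) { dev->quiesced = 0; }

/* The IO itself, wrapped in the two checks. 'inject_error' stands in
 * for a failed MMIO access. */
int drv_do_io(struct ras_dev *dev, int inject_error)
{
    int ret = ras_io_begin(dev);

    if (ret)
        return ret;
    if (inject_error)
        dev->error_pending = 1;
    return ras_io_end(dev);
}
```

In this sketch the checks around the IO play the role of the critical
section, and the callbacks are only invoked from the check after the
IO or from the recovery step, so an arch without synchronous error
detection still has a defined place to call them.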

Ah... I might have written too much :-p
Lastly, I'll do my best, but I guess I will not be able to reply very
often. However, I'd be glad if you could keep me in cc and let me take
part in this discussion.

Thanks,
H.Seto