[PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)
Linas Vepstas
linas at austin.ibm.com
Fri Feb 25 10:14:55 EST 2005
Hi Hide,
I am very glad to hear from you.
On Thu, Feb 24, 2005 at 11:04:39PM +0900, Hidetoshi Seto was heard to remark:
>
> It's also important to remember that PPC64 already has special
> infrastructure. PPC64 always uses quite cautious eeh_readX(),
> so it can detect every error almost synchronously in the affected
> context, and maybe can react to the error on the time of happening.
> AFAIK it's special. Most of archs don't like neither doing nervous
> check nor heavy firmcall in golden route such as read().
The reason ppc64 checks for a possible PCI error every time is because
this was the only thing we could think of without actually modifying any
device drivers. **If** a device driver is modified, then the check for
errors can be made much less frequent. However, we thought that most
device driver maintainers would reject a ppc64-only patch, and so
we picked the simplest/dumbest thing that would work.
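To make the check-on-every-read idea concrete, here is a minimal user-space sketch. It is not the actual ppc64 eeh_readX() code. It assumes that a read from a frozen slot returns all ones, so the expensive firmware query is only needed when a read returns that value. The names raw_readl(), slot_is_frozen() and checked_readl(), and the simulated state variables, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for hardware and firmware state, so that the
 * sketch can run in user space. */
static bool     slot_frozen;                    /* slot has been isolated */
static uint32_t device_register = 0x12345678u;  /* simulated register */
static int      firmware_calls;                 /* counts costly queries */

/* Simulated MMIO read: assumed to return all ones when the slot is frozen. */
static uint32_t raw_readl(void)
{
    return slot_frozen ? 0xFFFFFFFFu : device_register;
}

/* Simulated firmware query; on real hardware this is the heavy step. */
static bool slot_is_frozen(void)
{
    firmware_calls++;
    return slot_frozen;
}

/* Checked read: only an all-ones value triggers the firmware query.
 * *err is set to true when the slot really is frozen. */
static uint32_t checked_readl(bool *err)
{
    uint32_t val = raw_readl();

    assert(err != NULL);
    *err = false;
    if (val == 0xFFFFFFFFu && slot_is_frozen())
        *err = true;
    return val;
}
```

In this sketch the common case costs one extra compare per read; the firmware is consulted only when the value looks suspicious.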
> called ia64 don't have such... well, anyway I think that recovery
> on PPC64 is blessed with such nice environment.
Thank you :)
> Unfortunately or fortunately, your approach to PCI error recovery
Let us distinguish the terms "error recovery" and "error detection"
-- "detection" is finding out that an error occurred
-- "recovery" is the sequence of steps taken to make the PCI
device useable again.
> Still now I use conservative designed API like:
> {
> iochk_clear(cookie,dev);
> io,io,io...
> if(iochk_read(cookie)) return -EAGAIN;
> }
> It allows drivers to make IO-critical section. Based on tradition that
> error checking is too heavy to do so frequently, frequency of check
> is flexibly adjustable. For example, impatient driver will put io into
> the section as many as possible, to reduce the overhead of error check.
> Cautious driver will put only one io, to reduce the damage of an error.
Yes, this interface for "detection" would be good. I could (Ben could?)
easily code up this kind of interface, once we agree on what the
names and arguments of the subroutines are (iochk_clear()? pci_iochk_clear()?
pci_ioblock_begin()/pci_ioblock_end()?)
The hard part is to start converting device drivers to use this
interface; the other hard part (for you) is to decide what to do about
the device drivers that have not converted to this interface.
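As a starting point for that discussion, here is one possible shape for the detection interface, written as a user-space sketch. Everything in it is an assumption: the cookie is taken to be a snapshot of a per-device error counter, and struct sim_dev, struct iocookie and do_io_section() are hypothetical names. Only the call pattern (clear, do i/o, read, return -EAGAIN) comes from the quoted example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device state: a counter bumped whenever an error is seen. */
struct sim_dev {
    unsigned long error_count;
};

/* Hypothetical cookie: remembers the device and the count at section start. */
struct iocookie {
    const struct sim_dev *dev;
    unsigned long seen;
};

/* Begin an i/o section: record the error count observed so far. */
static void iochk_clear(struct iocookie *cookie, const struct sim_dev *dev)
{
    assert(cookie != NULL && dev != NULL);
    cookie->dev = dev;
    cookie->seen = dev->error_count;
}

/* End an i/o section: nonzero if an error arrived since iochk_clear(). */
static int iochk_read(const struct iocookie *cookie)
{
    assert(cookie != NULL && cookie->dev != NULL);
    return cookie->dev->error_count != cookie->seen;
}

/* A driver routine that performs several accesses under a single check. */
static int do_io_section(struct sim_dev *dev, int inject_error)
{
    struct iocookie cookie;

    iochk_clear(&cookie, dev);
    /* io, io, io ... (simulated; optionally inject a bus error) */
    if (inject_error)
        dev->error_count++;
    if (iochk_read(&cookie))
        return -EAGAIN;
    return 0;
}
```

A driver that wants fewer checks would put more accesses between iochk_clear() and iochk_read(); a cautious one would put only one.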
> You have let me realize that:
> "The most cautious arch I know, PPC64, would not need to use this API."
> I had already code prototype of ia64 specific part with this API, so
> it's too bad if you are disappointed at them.
I am not disappointed. It's a good idea in general. We should talk
about the detection API details. Care to propose these details?
(i.e. what's "cookie", how do you get a cookie, etc?)
> Imagine - possible mix:
> - RAS-aware driver registers callbacks to some struct on init
Yes. Which structure? struct pci_driver?
> - check before IOs (ex. block if bus recovery is processing...)
Many drivers do i/o in an interrupt context; we cannot block
that i/o without hanging the kernel. What happens if iochk_clear()
blocks, waiting for the bus to reset, while the device driver tries to
do i/o from a timer interrupt?
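One way the check could avoid blocking, purely as an illustration of the problem and not a proposal anyone in this thread has made: let the check fail immediately while the bus is being reset, and let the caller defer its work. The names struct sim_bus, iochk_try_begin() and timer_handler() are hypothetical, and the sketch runs in user space.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-bus state for the sketch. */
struct sim_bus {
    bool resetting;   /* true while recovery is resetting the bus */
    int  deferred;    /* i/o requests postponed until the reset is done */
};

/* Non-blocking check, usable where sleeping is not allowed.
 * Returns 0 if i/o may proceed, -EBUSY if the bus is being reset. */
static int iochk_try_begin(const struct sim_bus *bus)
{
    assert(bus != NULL);
    return bus->resetting ? -EBUSY : 0;
}

/* Timer/interrupt-style handler: it never waits, it only defers. */
static int timer_handler(struct sim_bus *bus)
{
    int rc = iochk_try_begin(bus);

    if (rc) {
        bus->deferred++;   /* to be retried later from a safe context */
        return rc;
    }
    /* io ... */
    return 0;
}
```

This only moves the question: something still has to replay the deferred i/o once the reset has finished.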
> - do IOs... (ex. shut up device on error etc.)
> - check after IOs (ex. IO rendezvous, recover, return result...)
Yes.
> - master-recovery-thread handles extra more...
How should the master recovery thread be invoked?
> Is this sounds good for generic purposes?
Yes. I'd like to discuss specifics of the actual names and arguments
and descriptions of the subroutines as soon as possible.
> Ah... I might have wrote too much :-p
No.
--linas