[PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)

Fri Feb 25 14:35:50 EST 2005

Hi, Linas.

Linas Vepstas wrote:
> The reason ppc64 checks for a possible PCI error every time is because
> this was the only thing we could think of without actually modifying any
> device drivers.  **If** a device driver is modified, then the check for
> errors can be made much less frequent.  However, we thought that most 
> device driver maintainers would reject a ppc64-only patch, and so
> we picked the simplest/dumbest thing that would work.

Of course driver maintainers will be happy if everything works without
modifying their drivers. Maybe we can get up to a certain point that way,
but to go beyond it (e.g. re-enabling a device) we would have to ask for
driver changes, callbacks or something similar.

> Let us distinguish the terms "error recovery" and "error detection"
> 
> -- "detection" is finding out that an error occurred
> -- "recovery" is the sequence of steps taken to make the PCI 
>    device usable again.

OK.

> The hard part is to start converting device drivers to use this
> interface; the other hard part (for you) is to decide what to do about
> the device drivers that have not converted to this interface.

In other words, we have to support normal (non-aware) drivers to some degree.
The trouble so far is that some architectures simply bring the system down
on an error such as PERR, so there was no chance to recover from the error.
PPC64 no longer has that problem.

BTW, how do ppc64 drivers deal with '~0' (all 1s) data after bus isolation?
Does the bogus data reach the user application?

>>Imagine - possible mix:
>>  - RAS-aware driver registers callbacks to some struct on init
> 
> Yes.  Which structure? struct pci_driver?

pci_driver would be the major candidate, I think.
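
To make the idea a bit more concrete, here is a rough user-space sketch of
callbacks hanging off the driver structure. Everything in it is an assumption
for discussion only: struct ras_callbacks, its members and ras_notify_error()
are hypothetical names, and struct pci_dev / struct pci_driver are cut-down
mock-ups, not the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

struct pci_dev { int id; };            /* mock-up, not the kernel struct */

/* Hypothetical callback set a RAS-aware driver would fill in on init. */
struct ras_callbacks {
	/* return 0 if the driver believes it can recover the device */
	int  (*error_detected)(struct pci_dev *dev);
	/* called once the bus/device is usable again */
	void (*resume)(struct pci_dev *dev);
};

struct pci_driver {                    /* mock-up: only the fields used here */
	const char *name;
	const struct ras_callbacks *ras;   /* NULL for non-aware drivers */
};

/*
 * Hypothetical core-side helper: returns 0 if the driver handled the
 * error, -1 if the driver is not aware or refuses to recover.
 */
static int ras_notify_error(const struct pci_driver *drv, struct pci_dev *dev)
{
	if (!drv->ras || !drv->ras->error_detected)
		return -1;
	if (drv->ras->error_detected(dev) != 0)
		return -1;
	if (drv->ras->resume)
		drv->ras->resume(dev);
	return 0;
}

/* A sample aware driver and a sample legacy (non-aware) driver. */
static int resumed;

static int sample_error_detected(struct pci_dev *dev)
{
	(void)dev;
	return 0;                          /* "yes, I can recover" */
}

static void sample_resume(struct pci_dev *dev)
{
	(void)dev;
	resumed++;
}

static const struct ras_callbacks sample_ras = {
	.error_detected = sample_error_detected,
	.resume         = sample_resume,
};

static const struct pci_driver aware_driver  = { .name = "aware",  .ras = &sample_ras };
static const struct pci_driver legacy_driver = { .name = "legacy", .ras = NULL };
```

With this shape, a non-aware driver needs no change at all (its pointer stays
NULL), and the core can fall back to whatever default handling the arch has.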

>>  - check before IOs (ex. block if bus recovery is processing...)
> 
> Many drivers do i/o in an interrupt context; we cannot block
> that i/o without hanging the kernel.   What happens if iochk_clear()
> blocks, waiting for the bus to reset, while the device driver tries to 
> do i/o from a timer interrupt?  

"Block" was a bad word... spin? That would be bad too.
Anyway, something will be required to control subsequent i/o.
How would you solve this problem?
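
Just to show what I mean by "control", one naive possibility is a flag that
makes i/o fail fast instead of waiting. This is only a user-space sketch and
I am not sure it is sufficient: struct sample_bus, the sample_* functions and
SAMPLE_EBUSY are all hypothetical names, and C11 atomics stand in for
whatever primitive the kernel side would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SAMPLE_EBUSY (-16)             /* hypothetical "try again later" code */

struct sample_bus {                    /* mock-up of a bus under recovery control */
	atomic_int recovering;             /* nonzero while recovery is in progress */
	uint32_t   reg;                    /* stands in for a device register */
};

/* Called by the recovery code (hypothetical) around a bus reset. */
static void sample_recovery_begin(struct sample_bus *bus)
{
	atomic_store(&bus->recovering, 1);
}

static void sample_recovery_end(struct sample_bus *bus)
{
	atomic_store(&bus->recovering, 0);
}

/*
 * Never blocks and never spins, so it is callable from interrupt-like
 * context: if recovery is running, the caller gets an error at once.
 * Note the check and the read are not atomic together; a real design
 * would still need an error check after the access.
 */
static int sample_try_read(struct sample_bus *bus, uint32_t *out)
{
	if (atomic_load(&bus->recovering))
		return SAMPLE_EBUSY;
	*out = bus->reg;
	return 0;
}
```

The driver would then have to retry or fail the request by itself, which again
means some driver change.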

>>  - master-recovery-thread handles extra more...
> 
> How should the master recovery thread be invoked?

I have no clear idea yet. How about using daemonize?

> I'd like to discuss specifics of the actual names and arguments
> and descriptions of the subroutines as soon as possible.

The following is the latest "generic" part of my "iomap-check" code.
All comments are welcome.
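
For reference, this is roughly how I expect a driver to use the pair. It is
a user-space mock-up, not kernel code: the iochk_clear()/iochk_read()
prototypes follow the patch below, but their bodies here are stand-ins that
model an arch which really reports errors (the generic versions always
return 0), and struct pci_dev, mock_ioread32() and sample_read_status() are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned long iocookie;

struct pci_dev { int error_pending; }; /* mock-up, not the kernel struct */

static struct pci_dev *watched;        /* device of the current checked section */

/* Stand-in for iochk_clear(): start a checked section for 'dev'. */
static void iochk_clear(iocookie *cookie, struct pci_dev *dev)
{
	*cookie = 0;
	watched = dev;
}

/* Stand-in for iochk_read(): nonzero if an error hit the watched device. */
static int iochk_read(iocookie *cookie)
{
	(void)cookie;
	return watched && watched->error_pending;
}

/* Mock register read: returns all 1s when the device has an error. */
static uint32_t mock_ioread32(struct pci_dev *dev, uint32_t reg_value)
{
	return dev->error_pending ? 0xffffffffu : reg_value;
}

/*
 * Hypothetical driver routine: the i/o is bracketed by the clear/read
 * pair, and the value is passed up only if no error was reported.
 */
static int sample_read_status(struct pci_dev *dev, uint32_t reg_value,
			      uint32_t *out)
{
	iocookie cookie;
	uint32_t val;

	iochk_clear(&cookie, dev);
	val = mock_ioread32(dev, reg_value);
	if (iochk_read(&cookie))
		return -1;                 /* do not hand 'val' to upper layers */
	*out = val;
	return 0;
}
```

The point is that the driver checks once per group of accesses, not once per
access, and poisoned data never leaves the driver.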

Thanks,
H.Seto

-----

diff -Nur linux-2.6.10-iomap-0/include/asm-generic/iomap.h linux-2.6.10-iomap-1/include/asm-generic/iomap.h
--- linux-2.6.10-iomap-0/include/asm-generic/iomap.h    2005-02-15 15:27:27.000000000 +0900
+++ linux-2.6.10-iomap-1/include/asm-generic/iomap.h    2005-02-21 14:40:45.000000000 +0900
@@ -60,4 +60,20 @@
  extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
  extern void pci_iounmap(struct pci_dev *dev, void __iomem *);

+/*
+ * IOMAP_CHECK provides additional interfaces that let drivers detect
+ * some I/O errors, for drivers that are able to recover from them.
+ *
+ * Everything in iomap-check depends on the design of the "iocookie"
+ * structure. Every architecture that has its own iomap-check is free to
+ * define the actual layout of iocookie to fit its own needs.
+ */
+#ifndef HAVE_ARCH_IOMAP_CHECK
+typedef unsigned long iocookie;
+#endif
+
+extern void    iochk_init(void);
+extern void    iochk_clear(iocookie *cookie, struct pci_dev *dev);
+extern int     iochk_read(iocookie *cookie);
+
  #endif
diff -Nur linux-2.6.10-iomap-0/lib/iomap.c linux-2.6.10-iomap-1/lib/iomap.c
--- linux-2.6.10-iomap-0/lib/iomap.c    2005-02-15 15:27:27.000000000 +0900
+++ linux-2.6.10-iomap-1/lib/iomap.c    2005-02-21 14:38:17.000000000 +0900
@@ -210,3 +210,28 @@
  }
  EXPORT_SYMBOL(pci_iomap);
  EXPORT_SYMBOL(pci_iounmap);
+
+/*
+ * Note that the default iochk_clear/iochk_read pair can be used
+ * simply as a replacement for the traditional local_irq_save/restore
+ * pair. By default they perform no real error check, but some
+ * high-reliability platforms can provide useful information through them.
+ */
+#ifndef HAVE_ARCH_IOMAP_CHECK
+#include <asm/system.h>
+void iochk_init(void) { ; }
+
+void iochk_clear(iocookie *cookie, struct pci_dev *dev)
+{
+       local_irq_save(*cookie);
+}
+
+int iochk_read(iocookie *cookie)
+{
+       local_irq_restore(*cookie);
+       return 0;
+}
+EXPORT_SYMBOL(iochk_init);
+EXPORT_SYMBOL(iochk_clear);
+EXPORT_SYMBOL(iochk_read);
+#endif /* HAVE_ARCH_IOMAP_CHECK */