[RFC/PATCH] Re: How to block pci config-reads during device self-test?

Linas Vepstas linas at denx.de
Wed Jul 14 09:03:28 EST 2004


Hi,

On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King was heard to remark:
> I've been doing some talking with various hardware folks to see if
> there is a way to get this fixed and so far the answer I have been
> getting is no. It is how the hardware works and there isn't much that
> can be done about it ...

Actually, the hardware isn't broken ... its working as designed, more
or less per pci spec.  See below for techninical details.

> I don't like the idea of "learning to live with this". People do
> run into this problem and telling them it can't be fixed is not an
> acceptable answer.
>
> linas at austin.ibm.com wrote:
> >
> >Am having trouble with PCI config-space reads ... I have a device
> >(actualy Brian King has it) that can perform a built-in-self test
> >(BIST). However, if anything does a PCI config-read during BIST, then
> >the device does something crazy that makes the PCI controller chip
> >take it offline.

actually, the device didn't do anything 'crazy' actually, it just
sat there silently, ignoring the i/o.

> >I'm not sure what's doing the config-spcae reads ... seems
> >to be some user-space tool or daemon. I'm wondering if
> >there is any practical way to block such reads to a given
> >device until its self-test sequence is completed. I could
> >try to modify the architecture-specific pci files to do this
> >(arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is
> >there another way? or do we have to just learn to live with this
> >hardware?

Here's a patch w/ a recap description of the problem.

The following patch addresses a peculiar PCI i/o problem.
This patch is  specific to ppc64, although one could argue that
a better patch should probably be arch indep.   But there's also
a different way to fix this problem, which is why this is an RFC.

The problem:
Some device drivers use the PCI BIST (built-in self-test) to reset
the PCI card if the card hangs in some error condition.  While the
BIST is running, the card, almost by definiition, is unable to respond
to pci i/o.  If the pci controller tries to do any i/o, the card simply
won't respond, and the pci controller will do a 'master abort' a few
pci bus cycles later.  This is normal and expected behaviour for both
card and the controller.

The problem that I'm having is that I have a pci controller that
thinks master-abort==dead card, and so offlines the card.  Which
is actually reasonable too ... no one should be doing i/o during BIST
anyway. Unfortunately various user-space daemons & tools can
do pci config-space i/o at anytime (typically to perform a hardware
inventory) ... and invariably one of these user-land tools runs while
the card is doing a BIST; the pci controller then off-lines this "dead
card", and we are unhappy.

There are two possible solutions.  (1) block pci config-space i/o
during a BIST, and (2) let the device driver detect that the card
has been off-lined, and so reset the pci controller chip.

I'm starting to think (2) is better, but this patch implements (1).
Opinions?  Comments?

The patch below  uses rw semaphores to block i/o access during
BIST.  All i/o access is locked with a read semaphore: thus, access
remains fast-n-cheap.  If the driver needs to BIST, it will then need
to do a down_write()/up_write() surrounding the BIST.

--linas
-------------- next part --------------
===== arch/ppc64/kernel/pSeries_pci.c 1.39 vs edited =====
--- 1.39/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 12 18:29:16 2004
+++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Jul 13 15:46:08 2004
@@ -76,14 +76,22 @@

 	addr = (dn->busno << 16) | (dn->devfn << 8) | where;
 	buid = dn->phb->buid;
+
+	if (in_interrupt()) {
+		/* Driver should unplug interrupts during BIST */
+		BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
+	} else {
+		down_read (&dn->bist_lock);
+	}
 	if (buid) {
 		ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval,
 				addr, buid >> 32, buid & 0xffffffff, size);
 	} else {
 		ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size);
 	}
-	*val = returnval;
+	up_read (&dn->bist_lock);

+	*val = returnval;
 	if (ret)
 		return PCIBIOS_DEVICE_NOT_FOUND;

@@ -129,11 +137,19 @@

 	addr = (dn->busno << 16) | (dn->devfn << 8) | where;
 	buid = dn->phb->buid;
+
+	if (in_interrupt()) {
+		/* Driver should unplug interrupts during BIST */
+		BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
+	} else {
+		down_read (&dn->bist_lock);
+	}
 	if (buid) {
 		ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val);
 	} else {
 		ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val);
 	}
+	up_read (&dn->bist_lock);

 	if (ret)
 		return PCIBIOS_DEVICE_NOT_FOUND;
===== arch/ppc64/kernel/prom.c 1.89 vs edited =====
--- 1.89/arch/ppc64/kernel/prom.c	Tue Jul 13 15:26:58 2004
+++ edited/arch/ppc64/kernel/prom.c	Tue Jul 13 16:03:01 2004
@@ -31,6 +31,7 @@
 #include <linux/stringify.h>
 #include <linux/delay.h>
 #include <linux/initrd.h>
+#include <linux/rwsem.h>
 #include <asm/prom.h>
 #include <asm/rtas.h>
 #include <asm/lmb.h>
@@ -3021,6 +3022,7 @@
 		kfree(np);
 		return err;
 	}
+	init_rwsem (&np->bist_lock);

 	write_lock(&devtree_lock);
 	np->sibling = np->parent->child;
===== include/asm-ppc64/prom.h 1.20 vs edited =====
--- 1.20/include/asm-ppc64/prom.h	Thu May 27 20:37:20 2004
+++ edited/include/asm-ppc64/prom.h	Tue Jul 13 15:45:28 2004
@@ -16,6 +16,7 @@
  */
 #include <linux/proc_fs.h>
 #include <asm/atomic.h>
+#include <asm/rwsem.h>

 #define PTRRELOC(x)     ((typeof(x))((unsigned long)(x) - offset))
 #define PTRUNRELOC(x)   ((typeof(x))((unsigned long)(x) + offset))
@@ -151,6 +152,12 @@
 	int	busno;			/* for pci devices */
 	int	bussubno;		/* for pci devices */
 	int	devfn;			/* for pci devices */
+
+	/* Lock to prevent 3rd-party PCI config space i/o while device
+	 * driver is running BIST on the controller; driver should obtain
+	 * down_write()/up_write() in this lock for the duration of BIST. */
+	struct rw_semaphore bist_lock;
+
 #define DN_STATUS_BIST_FAILED (1<<0)
 	int	status;			/* Current device status (non-zero is bad) */
 	int	eeh_mode;		/* See eeh.h for possible EEH_MODEs */


More information about the Linuxppc64-dev mailing list