[PATCH 2.6.13-rc1 09/10] IOCHK interface for I/O error handling/detecting

Hidetoshi Seto seto.hidetoshi at jp.fujitsu.com
Wed Jul 6 15:20:15 EST 2005

[This is 9 of 10 patches, "iochk-09-cpeh.patch"]

- SAL behavior doesn't affect only MCA. There are other
   chances to call SAL_GET_STATE_INFO, that's when CMC, CPE,
   and INIT is happen.

    - CMC(Corrected Machine Check) is for non-fatal, processor
      local errors. Fortunately, calling SAL_GET_STATE_INFO for
      CMC only collect data from a processor issued it, without
      touching any bridge and its status. So, this is safe.

    - CPE(Corrected Platform Error) is for non-fatal, platform
      related errors. Even it says corrected, but calling
      SAL procedure for CPE touchs every bridge on the platform,
      and "correct" bridge status that's bad for iochk works.

    - INIT is a kind of system reset request, as far as I know.
      So restarting from INIT is out of design, also iochk after
      INIT is not required at this time.

   In short, only MCA and CPE have the problem of SAL behavior.
   One of the difference from MCA is that SAL will not gather
   data before OS actually request it.

       1) SAL gathers data and keep it internally
       2) OS gets control
       3) if OS requests, SAL returns data gathered at beginning.
       1) OS gets control
       2) OS request to SAL
       3) SAL gathers data and return it to OS

   Therefore, we can make CPE handler to care bridge states,
   to check states before calling SAL procedure.

Changes from previous one for
   - (non)

Signed-off-by: Hidetoshi Seto <seto.hidetoshi at jp.fujitsu.com>


  arch/ia64/kernel/mca.c      |   21 +++++++++++++++++++++
  arch/ia64/lib/iomap_check.c |   17 +++++++++++++++++
  2 files changed, 38 insertions(+)

Index: linux-2.6.13-rc1/arch/ia64/lib/iomap_check.c
--- linux-2.6.13-rc1.orig/arch/ia64/lib/iomap_check.c
+++ linux-2.6.13-rc1/arch/ia64/lib/iomap_check.c
@@ -19,6 +19,7 @@ static int have_error(struct pci_dev *de

  void notify_bridge_error(struct pci_dev *bridge);
  void clear_bridge_error(struct pci_dev *bridge);
+void save_bridge_error(void);

  void iochk_init(void)
@@ -145,6 +146,22 @@ void clear_bridge_error(struct pci_dev *

+void save_bridge_error(void)
+	iocookie *cookie;
+	if (list_empty(&iochk_devices))
+		return;
+	/* mark devices if its root bus bridge have errors */
+	list_for_each_entry(cookie, &iochk_devices, list) {
+		if (cookie->error)
+			continue;
+		if (have_error(cookie->host))
+			notify_bridge_error(cookie->host);
+	}
  EXPORT_SYMBOL(iochk_devices);	/* for MCA driver */
Index: linux-2.6.13-rc1/arch/ia64/kernel/mca.c
--- linux-2.6.13-rc1.orig/arch/ia64/kernel/mca.c
+++ linux-2.6.13-rc1/arch/ia64/kernel/mca.c
@@ -80,6 +80,8 @@
  #include <linux/pci.h>
  extern void notify_bridge_error(struct pci_dev *bridge);
+extern void save_bridge_error(void);
+extern spinlock_t iochk_lock;

  #if defined(IA64_MCA_DEBUG_INFO)
@@ -288,11 +290,30 @@ ia64_mca_cpe_int_handler (int cpe_irq, v
  	IA64_MCA_DEBUG("%s: received interrupt vector = %#x on CPU %d\n",
  		       __FUNCTION__, cpe_irq, smp_processor_id());

  	/* SAL spec states this should run w/ interrupts enabled */

  	/* Get the CPE error record and log it */
+	/*
+	 * Because SAL_GET_STATE_INFO for CPE might clear bridge states
+	 * in process of gathering error information from the system,
+	 * we should check the states before clearing it.
+	 * While OS and SAL are handling bridge status, we have to protect
+	 * the states from changing by any other I/Os running simultaneously,
+	 * so this should be handled w/ lock and interrupts disabled.
+	 */
+	spin_lock(&iochk_lock);
+	save_bridge_error();
+	ia64_mca_log_sal_error_record(SAL_INFO_TYPE_CPE);
+	spin_unlock(&iochk_lock);
+	/* Rests can go w/ interrupt enabled as usual */
+	local_irq_enable();

  	if (!cpe_poll_enabled && cpe_vector >= 0) {

More information about the Linuxppc64-dev mailing list