machine check exception

Wed Feb 5 01:50:01 EST 2003

OK, with input from Dave Altobelli I rearranged the code slightly
(sorry, this version is relative to 2.4).  He suggested we not kill off
pids 0 and 1, which is a very good idea.  Note that we will panic instead
of signaling pid 0 or 1, which isn't any more productive, but it is
perhaps more obvious as to what happened.  He also suggested that we
only handle uncorrectable ECC errors.

Here is the new code.  Again, the patch is not terribly readable but I
can send it to anyone who wants to see it.  I removed
power4_handle_mce() from the 2.5 code and replaced it with
recover_mce().  The panic path in 2.5 will be slightly different.

-todd

/* See if we can recover from a machine check exception.
 * This is only called on POWER4 (or above) and only via
 * the Firmware Non-Maskable Interrupts (fwnmi) handler
 * which provides the error analysis for us.
 *
 * Return 1 if corrected (or delivered a signal).
 * Return 0 if there is nothing we can do.
 */
static int
recover_mce(struct pt_regs *regs, struct rtas_error_log err)
{
	siginfo_t info;

	if (err.disposition == DISP_FULLY_RECOVERED) {
		/* Platform corrected itself */
		return 1;
	} else if ((regs->msr & MSR_RI) &&
		   user_mode(regs) &&
		   err.severity == SEVERITY_ERROR_SYNC &&
		   err.disposition == DISP_NOT_RECOVERED &&
		   err.target == TARGET_MEMORY &&
		   err.type == TYPE_ECC_UNCORR &&
		   !(current->pid == 0 || current->pid == 1)) {
		/* Kill off a user process with an ECC error */
		info.si_signo = SIGBUS;
		info.si_errno = 0;
		info.si_code = BUS_ECCERR;
		info.si_addr = (void *)regs->nip;
		printk(KERN_ERR "MCE: uncorrectable ecc error for pid %d\n",
		       current->pid);
		_exception(SIGBUS, &info, regs);
		return 1;
	}
	return 0;
}
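
For anyone who wants to poke at the recovery conditions outside the kernel, here is a minimal userspace sketch of the same decision.  It is not the patch itself: struct mce_summary, enum mce_action and classify_mce() are hypothetical names, and the enum values are placeholders rather than the real RTAS error-log encodings.  It only mirrors the tests made in recover_mce() above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the RTAS error-log fields tested in
 * recover_mce(); the values are illustrative, not the kernel's.
 */
enum disposition { DISP_FULLY_RECOVERED, DISP_NOT_RECOVERED, DISP_OTHER };
enum severity    { SEVERITY_ERROR_SYNC, SEVERITY_OTHER };
enum target      { TARGET_MEMORY, TARGET_OTHER };
enum type        { TYPE_ECC_UNCORR, TYPE_OTHER };

/* Hypothetical flattened view of what the handler looks at. */
struct mce_summary {
	bool ri_set;		/* MSR_RI was set, state is restartable */
	bool from_user;		/* exception was taken in user mode */
	enum severity severity;
	enum disposition disposition;
	enum target target;
	enum type type;
	int pid;
};

enum mce_action { MCE_PANIC, MCE_RESUME, MCE_SIGBUS };

/* Mirror of the recover_mce() tests: resume if the platform fixed it,
 * signal an ordinary user process on an uncorrectable memory ECC
 * error, and fall through to the panic path for everything else.
 */
enum mce_action classify_mce(const struct mce_summary *m)
{
	if (m->disposition == DISP_FULLY_RECOVERED)
		return MCE_RESUME;

	if (m->ri_set && m->from_user &&
	    m->severity == SEVERITY_ERROR_SYNC &&
	    m->disposition == DISP_NOT_RECOVERED &&
	    m->target == TARGET_MEMORY &&
	    m->type == TYPE_ECC_UNCORR &&
	    m->pid > 1)		/* never signal pid 0 or 1 */
		return MCE_SIGBUS;

	return MCE_PANIC;
}
```

Note that pid 1 hitting an otherwise recoverable error comes out as MCE_PANIC here, which matches the behavior described at the top of this mail.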

/* Handle a machine check.
 *
 * Note that on POWER4 and beyond Firmware Non-Maskable Interrupts (fwnmi)
 * should be present.  If so the handler which called us tells us if the
 * error was recovered (never true if RI=0).
 *
 * On hardware prior to POWER4 these exceptions were asynchronous which
 * means we can't tell exactly where it occurred and so we can't recover.
 *
 * Note that the debugger should test RI=0 and warn the user that system
 * state has been corrupted.
 */
void
MachineCheckException(struct pt_regs *regs)
{
	struct rtas_error_log err, *errp;

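	if (fwnmi_active) {
		/* Copy the log before FWNMI_release_errinfo() invalidates
		 * errp; afterwards only the NULL test on errp is valid.
		 */
		errp = FWNMI_get_errinfo(regs);
		if (errp)
			err = *errp;
		FWNMI_release_errinfo();	/* frees errp */
		if (errp && recover_mce(regs, err))
			return;
	}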

	if (debugger_fault_handler) {
		debugger_fault_handler(regs);
		return;
	}
	if (debugger)
		debugger(regs);

	printk("Machine check in kernel mode.\n");
	printk("Caused by (from SRR1=%lx): ", regs->msr);
	show_regs(regs);
#if defined(CONFIG_XMON) || defined(CONFIG_KGDB)
	debugger(regs);
#endif
#ifdef CONFIG_KDB
	if (kdb(KDB_REASON_FAULT, 0, regs))
		return;
#endif
	print_backtrace((unsigned long *)regs->gpr[1]);
	panic("Unrecoverable machine check");
}
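
On the receiving side, a user process that wants to see the SIGBUS (rather than just die from it) could install a handler along the lines of the sketch below.  This is an illustration only, using the standard sigaction() interface; on_sigbus(), install_sigbus_handler() and the two globals are hypothetical names.  With the code above, si_addr would carry the NIP at the time of the exception.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; names are hypothetical, for illustration. */
static volatile sig_atomic_t sigbus_seen;
static void *volatile sigbus_addr;

/* Record that SIGBUS arrived and the address reported with it. */
static void on_sigbus(int signo, siginfo_t *info, void *ucontext)
{
	(void)signo;
	(void)ucontext;
	sigbus_addr = info->si_addr;
	sigbus_seen = 1;
}

/* Install the handler with SA_SIGINFO so siginfo_t is passed in.
 * Returns 0 on success, -1 on failure (as sigaction() does).
 */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A handler for a real memory error should probably log and exit rather than return, since returning would most likely just resume at the same failing access.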

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/