[Skiboot] [RFC PATCH] opal/xstop: Use nvram param to enable/disable sw checkstop.

Balbir Singh bsingharora at gmail.com
Fri Dec 15 12:02:00 AEDT 2017

On Thu, 14 Dec 2017 21:06:09 +0530
Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:

> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> Add a mechanism to enable/disable sw checkstop by looking at nvram option
> opal-sw-xstop=<enable/disable>.

Thanks for doing this, we've had a bit of a mid-air collision. The approach
is the same, but the code paths and names are different.

From b209b538c9e85244eb598ba3be93c342dcf9f46c Mon Sep 17 00:00:00 2001
From: Balbir Singh <bsingharora at gmail.com>
Date: Fri, 15 Dec 2017 11:48:17 +1100
Subject: [PATCH] core/platform: Disable software initiated checkstop

Software-initiated checkstops are hard to debug and are becoming
more prevalent with coherent memory devices being attached
via the NPU. We do the necessary logging as before and introduce
a new nvram parameter "reboot-on-plat-error", which can be
set to true if the old behaviour is desired:

nvram -p ibm,skiboot --update-config reboot-on-plat-error=true

The advantage of doing this in skiboot is that this works
for existing distros and allows for better handling where
the OS and driver (for coherent memory if present) can do
a better job of logging errors and reporting broken state
more accurately.

Signed-off-by: Balbir Singh <bsingharora at gmail.com>
 core/platform.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/core/platform.c b/core/platform.c
index 732f67e5..d3f75b63 100644
--- a/core/platform.c
+++ b/core/platform.c
@@ -91,7 +91,21 @@ static int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag)
 			prerror("OPAL: failed to log an error\n");
 		disable_fast_reboot("Reboot due to Platform Error");
-		return xscom_trigger_xstop();
+		/*
+		 * Trigger a checkstop only if reboot-on-plat-error
+		 * is set to true. We've already logged the error
+		 * above, the platform has a record of the error.
+		 * Initiating a software checkstop makes it hard
+		 * to debug issues, specifically with coherent memory
+		 * devices attached via the NPU, where any change
+		 * in link status can trigger an HMI/MCE due to several
+		 * reasons. We want to drop back to the OS and debug
+		 */
+		if (nvram_query_eq("reboot-on-plat-error", "true"))
+			return xscom_trigger_xstop();
+		else
 		disable_fast_reboot("full IPL reboot requested");
 		return opal_cec_reboot();
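
As a standalone illustration of the behaviour the commit message describes
(checkstop only when reboot-on-plat-error=true, otherwise drop back to the
OS), here is a minimal sketch that can be compiled outside skiboot. It is
not skiboot code: struct mock_nvram, mock_nvram_query_eq(), enum
plat_error_action and plat_error_action() are hypothetical names made up
for this example, and the lookup is assumed to be an exact string match
against a single key/value pair.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the nvram config partition: one key/value pair. */
struct mock_nvram {
	const char *key;
	const char *value;
};

/* True only when the key is present and its value matches exactly (assumed semantics). */
static bool mock_nvram_query_eq(const struct mock_nvram *nv,
				const char *key, const char *value)
{
	if (!nv || !nv->key || !nv->value)
		return false;
	return strcmp(nv->key, key) == 0 && strcmp(nv->value, value) == 0;
}

/* What the firmware would do after logging a platform error. */
enum plat_error_action {
	ACTION_RETURN_TO_OS,	/* new default: no software checkstop */
	ACTION_CHECKSTOP,	/* old behaviour, opt-in via nvram */
};

/* Checkstop only if the option is explicitly set to "true". */
static enum plat_error_action plat_error_action(const struct mock_nvram *nv)
{
	if (mock_nvram_query_eq(nv, "reboot-on-plat-error", "true"))
		return ACTION_CHECKSTOP;
	return ACTION_RETURN_TO_OS;
}
```

With the option unset, or set to anything other than "true", this sketch
returns ACTION_RETURN_TO_OS. Only an explicit "true" selects
ACTION_CHECKSTOP.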
