[Skiboot] [PATCH 1/2] opal: Fix hang in time_wait* calls on HMI for TB errors.

Mahesh J Salgaonkar mahesh at linux.vnet.ibm.com
Mon Sep 14 21:09:44 AEST 2015


From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>

On TOD/TB errors timebase register stops/freezes until HMI error recovery
gets TOD/TB back into running state. However, while HMI recovery is in
progress there are chances where some code path may invoke time_wait*()
calls which depends on running TB value. In an event of TB not moving,
time_wait* calls would keep looping resulting into a hang on that CPU.

On OpenPower systems we are seeing system hang on TOD/TB errors. The hang
is seen inside OPAL HMI handler while invoking prlog/perror(). The reason
is, on OpenPower systems prlog/perror() depends on LPC UART console
driver to flush log messages to the console. UART read/write calls invoke
time_wait_nopoll() inside opb_[read|write]() functions. When TB is in
stopped state this causes a hang in prlog/perror() calls.

This patch fixes this issue by modifying time_wait_[no]poll() to check
for TB validity and return immediately.

Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
---
 core/hmi.c      |    8 ++++++++
 core/timebase.c |   10 ++++++++++
 include/cpu.h   |    1 +
 3 files changed, 19 insertions(+)

diff --git a/core/hmi.c b/core/hmi.c
index cbd35e6..f4453c5 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -610,6 +610,12 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	pre_recovery_cleanup();
 
 	lock(&hmi_lock);
+	/*
+	 * Not all HMIs would move TB into invalid state. Set the TB state
+	 * looking at TFMR register. TFMR will tell us correct state of
+	 * TB register.
+	 */
+	this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID);
 	printf("HMI: Received HMI interrupt: HMER = 0x%016llx\n", hmer);
 	if (hmi_evt)
 		hmi_evt->hmer = hmer;
@@ -697,6 +703,8 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	 */
 	mtspr(SPR_HMER, hmer);
 	hmi_exit();
+	/* Set the TB state looking at TFMR register before we head out. */
+	this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID);
 	unlock(&hmi_lock);
 	return recover;
 }
diff --git a/core/timebase.c b/core/timebase.c
index b1d8196..4fcfae5 100644
--- a/core/timebase.c
+++ b/core/timebase.c
@@ -25,6 +25,11 @@ static void time_wait_poll(unsigned long duration)
 	unsigned long end = mftb() + duration;
 	unsigned long period = msecs_to_tb(5);
 
+	if (this_cpu()->tb_invalid) {
+		cpu_relax();
+		return;
+	}
+
 	while (tb_compare(mftb(), end) != TB_AAFTERB) {
 		/* Call pollers periodically but not continually to avoid
 		 * bouncing cachelines due to lock contention. */
@@ -57,6 +62,11 @@ void time_wait_nopoll(unsigned long duration)
 {
 	unsigned long end = mftb() + duration;
 
+	if (this_cpu()->tb_invalid) {
+		cpu_relax();
+		return;
+	}
+
 	while(tb_compare(mftb(), end) != TB_AAFTERB)
 		cpu_relax();
 }
diff --git a/include/cpu.h b/include/cpu.h
index d2c1825..03a51f9 100644
--- a/include/cpu.h
+++ b/include/cpu.h
@@ -85,6 +85,7 @@ struct cpu_thread {
 	uint32_t			*core_hmi_state_ptr;
 	/* Mask to indicate thread id in core. */
 	uint8_t				thread_mask;
+	bool				tb_invalid;
 };
 
 /* This global is set to 1 to allow secondaries to callin,



More information about the Skiboot mailing list