[Skiboot] [PATCH] Fix hang in pnv_platform_error_reboot path due to TOD failure.

Sat Mar 23 01:58:00 AEDT 2019

From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>

On TOD failure, with TB stuck, when linux heads down to
pnv_platform_error_reboot() path due to unrecoverable hmi event, the panic
cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, rest
all other cpus are in smp_handle_nmi_ipi() waiting for panic cpu to proceed.
But with panic cpu stuck inside OPAL, linux never recovers/reboot.

p0 c1 t0
NIA : 0x000000003001dd3c <.time_wait+0x64>
CFAR : 0x000000003001dce4 <.time_wait+0xc>
MSR : 0x9000000002803002
LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

STACK: SP NIA
0x0000000031c236e0 0x0000000031c23760 (big-endian)
0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

This is because, there is a while loop towards the end of
ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match
with "msg". It loops over time_wait_ms() until exit condition is met. In
normal scenario time_wait_ms() calls run pollers so that ipmi backend gets
a chance to check ipmi response and set sync_msg to NULL.

	while (sync_msg == msg)
                time_wait_ms(10);

But in the event when TB is in failed state time_wait_ms()->time_wait_poll()
returns immediately without calling pollers and hence we end up looping
forever. This patch fixes this hang by calling opal_run_pollers() in TB
failed state as well.

Fixes: 1764f2452 ("opal: Fix hang in time_wait* calls on HMI for TB errors.")
Cc: skiboot-stable at lists.ozlabs.org
Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
---
 core/timebase.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/core/timebase.c b/core/timebase.c
index 0ae11d253..b1a02a4a1 100644
--- a/core/timebase.c
+++ b/core/timebase.c
@@ -29,6 +29,16 @@ static void time_wait_poll(unsigned long duration)
 	unsigned long period = msecs_to_tb(5);
 
 	if (this_cpu()->tb_invalid) {
+		/*
+		 * Run pollers to allow some backends to process response.
+		 *
+		 * In TOD failure case where TOD is unrecoverable, running
+		 * pollers allows ipmi backend to deal with ipmi response
+		 * from bmc and helps ipmi_queue_msg_sync() to get un-stuck.
+		 * Thus it avoids linux kernel to hang during panic due to
+		 * TOD failure.
+		 */
+		opal_run_pollers();
 		cpu_relax();
 		return;
 	}