[Skiboot] [PATCH] Fix hang in pnv_platform_error_reboot path due to TOD failure.

Wed Mar 27 16:01:44 AEDT 2019

On 03/22/2019 08:28 PM, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> 
> On TOD failure, with TB stuck, when linux heads down to
> pnv_platform_error_reboot() path due to unrecoverable hmi event, the panic
> cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, rest
> all other cpus are in smp_handle_nmi_ipi() waiting for panic cpu to proceed.
> But with panic cpu stuck inside OPAL, linux never recovers/reboot.
> 
> p0 c1 t0
> NIA : 0x000000003001dd3c <.time_wait+0x64>
> CFAR : 0x000000003001dce4 <.time_wait+0xc>
> MSR : 0x9000000002803002
> LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
> 
> STACK: SP NIA
> 0x0000000031c236e0 0x0000000031c23760 (big-endian)
> 0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
> 0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
> 0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
> 0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
> 0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
> 0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
> 0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
> 0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
> 0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
> 0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
> 0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
> 0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
> 0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
> 0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
> 0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
> 0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
> 0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
> 0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
> 0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
> 0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
> 0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>
> 
> This is because, there is a while loop towards the end of
> ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match
> with "msg". It loops over time_wait_ms() until exit condition is met. In
> normal scenario time_wait_ms() calls run pollers so that ipmi backend gets
> a chance to check ipmi response and set sync_msg to NULL.
> 
> 	while (sync_msg == msg)
>                  time_wait_ms(10);
> 
> But in the event when TB is in failed state time_wait_ms()->time_wait_poll()
> returns immediately without calling pollers and hence we end up looping
> forever. This patch fixes this hang by calling opal_run_pollers() in TB
> failed state as well.
> 
> Fixes: 1764f2452 ("opal: Fix hang in time_wait* calls on HMI for TB errors.")
> Cc: skiboot-stable at lists.ozlabs.org
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>

Looks good to me.

Reviewed-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>

-Vasant