[PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Gautham R Shenoy ego at linux.vnet.ibm.com
Fri Dec 7 23:03:46 AEDT 2018


On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
> Hi Thiago,
> 
> 
> Sure. I will test the patch and report back.

I added the following debug patch on top of your patch, and after an
hour's run, the system crashed. Appending the log at the end.

I suppose we still need to increase the number of tries since we wait
only for 2.5ms looping before giving up.

----------------------------x8-----------------------------------------
>From 9e9e1e459cdc63540a73575af359a7bc0d43e455 Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <ego at linux.vnet.ibm.com>
Date: Fri, 7 Dec 2018 16:47:25 +0530
Subject: [PATCH] Add debug on top of Thiago's patch

Signed-off-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index fad3ceb..2545aab 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,6 +214,9 @@ static void pseries_cpu_die(unsigned int cpu)
 			msleep(1);
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+		u64 tb_before = mftb();
+		u64 tb_after;
+
 		/*
 		 * If the current state is not offline yet, it means that the
 		 * dying CPU (which is in pseries_mach_cpu_die) didn't have a
@@ -230,6 +233,11 @@ static void pseries_cpu_die(unsigned int cpu)
 				break;
 			udelay(100);
 		}
+
+		tb_after = mftb();
+
+		printk("[DEBUG] Waited for CPU %d to enter rtas: tries=%d, time=%llu\n", cpu, tries,
+			div_u64(tb_after - tb_before, tb_ticks_per_usec));
 	}
 
 	if (cpu_status != 0) {
-- 
1.9.4
----------------------------x8-----------------------------------------

cpu 112 (hwid 112) Ready to die...
[DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
cpu 113 (hwid 113) Ready to die...
[DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
cpu 114 (hwid 114) Ready to die...
[DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
cpu 115 (hwid 115) Ready to die...
[DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
cpu 116 (hwid 116) Ready to die...
[DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
cpu 117 (hwid 117) Ready to die...
[DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
cpu 118 (hwid 118) Ready to die...
[DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
cpu 119 (hwid 119) Ready to die...
[DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
cpu 104 (hwid 104) Ready to die...
[DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40  

******* RTAS CALL BUFFER CORRUPTION *******
393: rtas32_call_buff_ptr=
    0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
    0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
    0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
    0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
    2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
    0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
    4800 0008 .... .... .... .... .... .... [H...........0nur]
394: rtas64_map_buff_ptr=
    0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
    0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
    0000 0000 0000 0000 0000 0000 0000 0000 [................]
    0000 0000 0000 0005 0000 0000 0000 0001 [................]
    0000 0000 1A00 0000 0000 0000 0000 0000 [................]
    0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
    0000 0000 .... .... .... .... .... .... [......c.........]
cpu 105 (hwid 105) Ready to die...
Bad kernel stack pointer 1fafb6c0 at 0
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
NIP:  0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
REGS: c00000001e63bd30 TRAP: 0700   Not tainted  (4.20.0-rc5-thiago+)
MSR:  8000000000081000 <SF,ME>  CR: 28000004  XER: 00000010
CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
NIP [0000000000000000]           (null)
LR [0000000000000000]           (null)
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
---[ end trace 1aa3b4936949457e ]---
Bad kernel stack pointer 1fafb4b0 at 1ec15004
rcu: INFO: rcu_sched detected stalls on CPUs/tasks: 
rcu:    88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
rcu:    (detected by 72, t=10866 jiffies, g=180529, q=2526)
Sending NMI from CPU 72 to CPUs 88:
CPU 88 didn't respond to backtrace IPI, inspecting paca.
irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
Call Trace:
[c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
[c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
[c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
[c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
[c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
[c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
[c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
[c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
[c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
[c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
[c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
[c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
[c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
[c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
[c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
[c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
[c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70
rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
rcu: RCU grace-period kthread stack dump:
rcu_sched       I    0    11      2 0x00000808
Call Trace:
[c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
[c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
[c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
[c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
[c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
[c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
[c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
[c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64







More information about the Linuxppc-dev mailing list