rcu_sched self-detected stall on CPU

Paul E. McKenney paulmck at kernel.org
Thu Apr 7 03:00:12 AEST 2022


On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> Hi
> 
> I can reproduce it in a ppc virtual cloud server provided by Oregon
> State University.  Following is what I do:
> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> -o linux-5.18-rc1.tar.gz
> 2) tar zxf linux-5.18-rc1.tar.gz
> 3) cp config linux-5.18-rc1/.config
> 4) cd linux-5.18-rc1
> 5) make vmlinux -j 8
> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> -smp 2 (QEMU 4.2.1)
> 7) after 12 rounds, the bug got reproduced:
> (http://154.223.142.244/logs/20220406/qemu.log.txt)

Just to make sure, are you both seeing the same thing?  Last I knew,
Zhouyi was chasing an RCU-tasks issue that appears only in kernels
built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
I miss something?

Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
kthread slept for three milliseconds, but did not wake up for more than
20 seconds.  This kthread would normally have awakened on CPU 1, but
CPU 1 looks to me to be very unhealthy, as can be seen in your console
output below (but maybe my idea of what is healthy for powerpc systems
is outdated).  Please see also the inline annotations.

Thoughts from the PPC guys?

							Thanx, Paul

------------------------------------------------------------------------

[   21.186912] rcu: INFO: rcu_sched self-detected stall on CPU
[   21.187331] rcu: 	1-...!: (4712629 ticks this GP) idle=2c1/0/0x3 softirq=8/8 fqs=0 
[   21.187529] 	(t=21000 jiffies g=-1183 q=3)
[   21.187681] rcu: rcu_sched kthread timer wakeup didn't happen for 20997 jiffies! g-1183 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402

	The grace-period kthread is still asleep (->state=0x402).
	This indicates that the three-jiffy timer has somehow been
	prevented from expiring for almost a full 21 seconds.  Of course,
	if timers don't work, RCU cannot work.

[   21.187770] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=1
[   21.187927] rcu: rcu_sched kthread starved for 21001 jiffies! g-1183 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[   21.188019] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   21.188087] rcu: RCU grace-period kthread stack dump:
[   21.188196] task:rcu_sched       state:I stack:    0 pid:   10 ppid:     2 flags:0x00000800
[   21.188453] Call Trace:
[   21.188525] [c0000000061e78a0] [c0000000061e78e0] 0xc0000000061e78e0 (unreliable)
[   21.188900] [c0000000061e7a90] [c000000000017210] __switch_to+0x250/0x310
[   21.189210] [c0000000061e7b00] [c0000000003ed660] __schedule+0x210/0x660
[   21.189315] [c0000000061e7b80] [c0000000003edb14] schedule+0x64/0x110
[   21.189387] [c0000000061e7bb0] [c0000000003f6648] schedule_timeout+0x1d8/0x390
[   21.189473] [c0000000061e7c80] [c00000000011111c] rcu_gp_fqs_loop+0x2dc/0x3d0
[   21.189555] [c0000000061e7d30] [c0000000001144ec] rcu_gp_kthread+0x13c/0x160
[   21.189633] [c0000000061e7dc0] [c0000000000c1770] kthread+0x110/0x120
[   21.189714] [c0000000061e7e10] [c00000000000c9e4] ret_from_kernel_thread+0x5c/0x64

	The above stack trace is expected behavior when the RCU
	grace-period kthread is waiting to do its next FQS scan.

[   21.189938] rcu: Stack dump where RCU GP kthread last ran:

	And here is the stalled CPU, which also happens to be the CPU
	that RCU last ran on:

[   21.189992] Task dump for CPU 1:
[   21.190059] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x00000804
[   21.190169] Call Trace:
[   21.190194] [c0000000061ef2d0] [c0000000000c9a40] sched_show_task+0x180/0x1c0 (unreliable)
[   21.190278] [c0000000061ef340] [c000000000116ca0] rcu_check_gp_kthread_starvation+0x16c/0x19c
[   21.190370] [c0000000061ef3c0] [c000000000114f7c] rcu_sched_clock_irq+0x7ec/0xaf0
[   21.190448] [c0000000061ef4b0] [c000000000120fdc] update_process_times+0xbc/0x140
[   21.190524] [c0000000061ef4f0] [c000000000136a24] tick_nohz_handler+0xf4/0x1b0
[   21.190608] [c0000000061ef540] [c00000000001c828] timer_interrupt+0x148/0x2d0
[   21.190699] [c0000000061ef590] [c0000000000098e8] decrementer_common_virt+0x208/0x210
[   21.190837] --- interrupt: 900 at arch_local_irq_restore+0x168/0x170

	Up through this point is just the stack trace of the the
	code doing the stack dump that the RCU CPU stall warning code
	asked for.

[   21.190941] NIP:  c000000000013608 LR: c0000000003f8114 CTR: c0000000000dc630

	This NIP does not look at all good to me.  But I freely confess
	that I am out of date on what Power machines do.

[   21.191031] REGS: c0000000061ef600 TRAP: 0900   Not tainted  (5.18.0-rc1)
[   21.191109] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000202  XER: 00000000
[   21.191274] CFAR: 0000000000000000 IRQMASK: 0 
[   21.191274] GPR00: c00000000009c368 c0000000061ef8a0 c00000000116a700 0000000000000000 
[   21.191274] GPR04: 0000000000000000 0000000000000000 000000001ee30000 ffffffffffffffff 
[   21.191274] GPR08: 000000001ee30000 0000000000000000 0000000000008002 7265677368657265 
[   21.191274] GPR12: c0000000000dc630 c00000001ffe5800 0000000000000000 0000000000000000 
[   21.191274] GPR16: 0000000000000282 0000000000000000 0000000000000000 c0000000061eff00 
[   21.191274] GPR20: 0000000000000000 0000000000000001 c0000000061b9f80 c000000001195a10 
[   21.191274] GPR24: c000000001193a00 00000000fffb6cc4 000000000000000a c0000000010721e8 
[   21.191274] GPR28: c000000001076800 c000000001070380 c0000000010716d8 c0000000061b9f80 
[   21.191932] NIP [c000000000013608] arch_local_irq_restore+0x168/0x170
[   21.192024] LR [c0000000003f8114] __do_softirq+0xd4/0x2ec
[   21.192118] --- interrupt: 900
[   21.192158] [c0000000061ef8a0] [c0000000061b9f80] 0xc0000000061b9f80 (unreliable)
[   21.192227] [c0000000061ef9b0] [c00000000009c368] irq_exit+0xc8/0x110
[   21.192307] [c0000000061ef9d0] [c00000000001c858] timer_interrupt+0x178/0x2d0
[   21.192397] [c0000000061efa20] [c0000000000098e8] decrementer_common_virt+0x208/0x210
[   21.192495] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
[   21.192566] NIP:  c000000000072988 LR: c000000000074fa8 CTR: c000000000074f10
[   21.192615] REGS: c0000000061efa90 TRAP: 0900   Not tainted  (5.18.0-rc1)
[   21.192659] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28000202  XER: 00000000
[   21.192755] CFAR: 0000000000000000 IRQMASK: 0 
[   21.192755] GPR00: 0000000028000202 c0000000061efd30 c00000000116a700 0000000000000000 
[   21.192755] GPR04: c00000001fea0280 ffffffffffffffff 0000000001f40000 000000019d088fcf 
[   21.192755] GPR08: 000000001ee30000 c00000001ffe5400 0000000000000001 0000000100000000 
[   21.192755] GPR12: c000000000074f10 c00000001ffe5800 0000000000000000 0000000000000000 
[   21.192755] GPR16: 0000000000000000 0000000000000000 0000000000000000 c0000000061eff00 
[   21.192755] GPR20: c00000000003d440 0000000000000001 c000000001195b30 c000000001195a10 
[   21.192755] GPR24: 0000000000080000 c0000000061ba000 c000000001195a98 0000000000000001 
[   21.192755] GPR28: 0000000000000001 c0000000010716d0 c0000000010716d8 c0000000010716d0 
[   21.193290] NIP [c000000000072988] plpar_hcall_norets_notrace+0x18/0x2c
[   21.193363] LR [c000000000074fa8] pseries_lpar_idle+0x98/0x1b0
[   21.193428] --- interrupt: 900
[   21.193457] [c0000000061efd30] [0000000000000001] 0x1 (unreliable)
[   21.193512] [c0000000061efdb0] [c000000000018b54] arch_cpu_idle+0x44/0x180
[   21.193590] [c0000000061efde0] [c0000000003f75bc] default_idle_call+0x4c/0x7c
[   21.193679] [c0000000061efe00] [c0000000000e1384] do_idle+0x114/0x1e0
[   21.193747] [c0000000061efe60] [c0000000000e1664] cpu_startup_entry+0x34/0x40
[   21.193901] [c0000000061efe90] [c00000000003f044] start_secondary+0x624/0xa00
[   21.194002] [c0000000061eff90] [c00000000000cd54] start_secondary_prolog+0x10/0x14
[   21.194245] Task dump for CPU 1:
[   21.194284] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x00000804
[   21.194374] Call Trace:
[   21.194400] [c0000000061ef2b0] [c0000000000c9a40] sched_show_task+0x180/0x1c0 (unreliable)
[   21.194479] [c0000000061ef320] [c000000000116df8] rcu_dump_cpu_stacks+0x128/0x188
[   21.194567] [c0000000061ef3c0] [c000000000114f9c] rcu_sched_clock_irq+0x80c/0xaf0
[   21.194642] [c0000000061ef4b0] [c000000000120fdc] update_process_times+0xbc/0x140
[   21.194712] [c0000000061ef4f0] [c000000000136a24] tick_nohz_handler+0xf4/0x1b0
[   21.194828] [c0000000061ef540] [c00000000001c828] timer_interrupt+0x148/0x2d0
[   21.194942] [c0000000061ef590] [c0000000000098e8] decrementer_common_virt+0x208/0x210
[   21.195035] --- interrupt: 900 at arch_local_irq_restore+0x168/0x170
[   21.195104] NIP:  c000000000013608 LR: c0000000003f8114 CTR: c0000000000dc630
[   21.195152] REGS: c0000000061ef600 TRAP: 0900   Not tainted  (5.18.0-rc1)
[   21.195199] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000202  XER: 00000000
[   21.195296] CFAR: 0000000000000000 IRQMASK: 0 
[   21.195296] GPR00: c00000000009c368 c0000000061ef8a0 c00000000116a700 0000000000000000 
[   21.195296] GPR04: 0000000000000000 0000000000000000 000000001ee30000 ffffffffffffffff 
[   21.195296] GPR08: 000000001ee30000 0000000000000000 0000000000008002 7265677368657265 
[   21.195296] GPR12: c0000000000dc630 c00000001ffe5800 0000000000000000 0000000000000000 
[   21.195296] GPR16: 0000000000000282 0000000000000000 0000000000000000 c0000000061eff00 
[   21.195296] GPR20: 0000000000000000 0000000000000001 c0000000061b9f80 c000000001195a10 
[   21.195296] GPR24: c000000001193a00 00000000fffb6cc4 000000000000000a c0000000010721e8 
[   21.195296] GPR28: c000000001076800 c000000001070380 c0000000010716d8 c0000000061b9f80 
[   21.195850] NIP [c000000000013608] arch_local_irq_restore+0x168/0x170
[   21.195944] LR [c0000000003f8114] __do_softirq+0xd4/0x2ec
[   21.196027] --- interrupt: 900
[   21.196056] [c0000000061ef8a0] [c0000000061b9f80] 0xc0000000061b9f80 (unreliable)
[   21.196119] [c0000000061ef9b0] [c00000000009c368] irq_exit+0xc8/0x110
[   21.196192] [c0000000061ef9d0] [c00000000001c858] timer_interrupt+0x178/0x2d0
[   21.196282] [c0000000061efa20] [c0000000000098e8] decrementer_common_virt+0x208/0x210
[   21.196373] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
[   21.196439] NIP:  c000000000072988 LR: c000000000074fa8 CTR: c000000000074f10
[   21.196489] REGS: c0000000061efa90 TRAP: 0900   Not tainted  (5.18.0-rc1)
[   21.196534] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28000202  XER: 00000000
[   21.196627] CFAR: 0000000000000000 IRQMASK: 0 
[   21.196627] GPR00: 0000000028000202 c0000000061efd30 c00000000116a700 0000000000000000 
[   21.196627] GPR04: c00000001fea0280 ffffffffffffffff 0000000001f40000 000000019d088fcf 
[   21.196627] GPR08: 000000001ee30000 c00000001ffe5400 0000000000000001 0000000100000000 
[   21.196627] GPR12: c000000000074f10 c00000001ffe5800 0000000000000000 0000000000000000 
[   21.196627] GPR16: 0000000000000000 0000000000000000 0000000000000000 c0000000061eff00 
[   21.196627] GPR20: c00000000003d440 0000000000000001 c000000001195b30 c000000001195a10 
[   21.196627] GPR24: 0000000000080000 c0000000061ba000 c000000001195a98 0000000000000001 
[   21.196627] GPR28: 0000000000000001 c0000000010716d0 c0000000010716d8 c0000000010716d0 
[   21.197168] NIP [c000000000072988] plpar_hcall_norets_notrace+0x18/0x2c
[   21.197239] LR [c000000000074fa8] pseries_lpar_idle+0x98/0x1b0
[   21.197305] --- interrupt: 900
[   21.197333] [c0000000061efd30] [0000000000000001] 0x1 (unreliable)
[   21.197390] [c0000000061efdb0] [c000000000018b54] arch_cpu_idle+0x44/0x180
[   21.197470] [c0000000061efde0] [c0000000003f75bc] default_idle_call+0x4c/0x7c
[   21.197556] [c0000000061efe00] [c0000000000e1384] do_idle+0x114/0x1e0
[   21.197620] [c0000000061efe60] [c0000000000e1664] cpu_startup_entry+0x34/0x40
[   21.197696] [c0000000061efe90] [c00000000003f044] start_secondary+0x624/0xa00
[   21.197820] [c0000000061eff90] [c00000000000cd54] start_secondary_prolog+0x10/0x1I4


More information about the Linuxppc-dev mailing list