offlining cpus breakage

Shreyas B Prabhu shreyas at linux.vnet.ibm.com
Wed Jan 14 22:03:00 AEDT 2015



On Wednesday 07 January 2015 03:07 PM, Alexey Kardashevskiy wrote:
> Hi!
> 
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream kernel
> (sha1 bdec419):
> 
> NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]
> 
> or
> 
> INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
> 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31} (detected by 6, t=2102
> jiffies, g=1617, c=1616, q=1441)
> 
> and many others, all about lockups
> 
> I did bisecting and found out that reverting these helps:
> 
> 77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle
> support for offline cpus
> 7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle
> states management
> 8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline
> CPUs to enter deep idle states
> 
> btw reverting just two of them produces a compile error.
> 
> It is pseries_le_defconfig, POWER8 machine:
> timebase        : 512000000
> platform        : PowerNV
> model           : palmetto
> machine         : PowerNV palmetto
> firmware        : OPAL v3
> 
> 

The bug scenario is as follows:

In fastsleep, the decrementer state is not maintained, so a cpu entering
fastsleep offloads its timer to a different cpu (let's call it the
broadcast cpu). If this broadcast cpu is then offlined, it hands the task
of broadcasting over to a new cpu.

If this new cpu is one of the cpus that had entered fastsleep, its
decrementer will be in an invalid state. This cpu has been woken up by a
need resched ipi (to take up the task of broadcasting) as opposed to a
broadcast ipi. The decrementer state is fixed only on a broadcast ipi and
not on a need resched ipi. Because of this, its timers don't fire.
Consequently, it cannot wake up any cpu relying on a broadcast ipi.
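
The sequence above can be sketched as a small toy model. This is only an
illustration of the described state transitions, not kernel code: the Cpu
class, its fields, and the function names are all hypothetical, and the
model assumes the only thing that matters is which kind of ipi woke the cpu.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Hypothetical stand-in for a cpu; not a kernel data structure."""
    cpu_id: int
    online: bool = True
    in_fastsleep: bool = False
    decrementer_valid: bool = True

    def enter_fastsleep(self):
        # Fastsleep does not maintain decrementer state, so the cpu
        # relies on the broadcast cpu for its timer wakeups.
        self.in_fastsleep = True
        self.decrementer_valid = False

    def wake(self, reason):
        # In this model, only a broadcast ipi fixes the decrementer.
        self.in_fastsleep = False
        if reason == "broadcast_ipi":
            self.decrementer_valid = True

    def can_fire_timers(self):
        return self.online and not self.in_fastsleep and self.decrementer_valid


def offline_broadcast_cpu(old, new):
    """Offline `old` and hand the broadcast duty to `new`."""
    old.online = False
    if new.in_fastsleep:
        # The new broadcast cpu is woken by a need resched ipi,
        # which leaves its decrementer in an invalid state.
        new.wake("need_resched_ipi")
    return new


def run_scenario():
    cpus = [Cpu(i) for i in range(3)]
    broadcast = cpus[0]
    cpus[1].enter_fastsleep()
    cpus[2].enter_fastsleep()
    broadcast = offline_broadcast_cpu(broadcast, cpus[1])
    # cpus[2] still depends on a broadcast ipi from the new broadcast cpu.
    return broadcast.can_fire_timers(), cpus[2].in_fastsleep
```

In this model, run_scenario() returns (False, True): the new broadcast cpu
cannot fire timers, so the remaining cpu in fastsleep is never woken.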

This scenario, where the cpu that takes up the task of broadcasting is
itself in fastsleep, is a corner case. It almost never happens on machines
with a larger number of cores, which explains why Alexey was able to hit it
easily on palmetto.

We'll be posting a fix for this soon.

Thanks,
Shreyas


