[RFC] powerpc/pseries: Increase busy loop in pseries_cpu_die

Michael Ellerman mpe at ellerman.id.au
Tue Feb 7 13:10:22 AEDT 2017


Thiago Jung Bauermann <bauerman at linux.vnet.ibm.com> writes:

> When testing DLPAR CPU add/remove on a system under stress, pseries_cpu_die
> doesn't wait long enough for a CPU to die and the kernel ends up crashing:
>
> [  446.143648] cpu 152 (hwid 152) Ready to die...
> [  446.464057] cpu 153 (hwid 153) Ready to die...
> [  446.473525] cpu 154 (hwid 154) Ready to die...
> [  446.474077] cpu 155 (hwid 155) Ready to die...
> [  446.483529] cpu 156 (hwid 156) Ready to die...
> [  446.493532] cpu 157 (hwid 157) Ready to die...
> [  446.494078] cpu 158 (hwid 158) Ready to die...
> [  446.503527] cpu 159 (hwid 159) Ready to die...
> [  446.664534] cpu 144 (hwid 144) Ready to die...
> [  446.964113] cpu 145 (hwid 145) Ready to die...
> [  446.973525] cpu 146 (hwid 146) Ready to die...
> [  446.974094] cpu 147 (hwid 147) Ready to die...
> [  446.983944] cpu 148 (hwid 148) Ready to die...
> [  446.984062] cpu 149 (hwid 149) Ready to die...
> [  446.993518] cpu 150 (hwid 150) Ready to die...
> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
> [  446.994098] cpu 151 (hwid 151) Ready to die...
> [  447.133726] cpu 136 (hwid 136) Ready to die...
> [  447.403532] cpu 137 (hwid 137) Ready to die...
> [  447.403772] cpu 138 (hwid 138) Ready to die...
> [  447.403839] cpu 139 (hwid 139) Ready to die...
> [  447.403887] cpu 140 (hwid 140) Ready to die...
> [  447.403937] cpu 141 (hwid 141) Ready to die...
> [  447.403979] cpu 142 (hwid 142) Ready to die...
> [  447.404038] cpu 143 (hwid 143) Ready to die...
> [  447.513546] cpu 128 (hwid 128) Ready to die...
> [  447.693533] cpu 129 (hwid 129) Ready to die...
> [  447.693999] cpu 130 (hwid 130) Ready to die...
> [  447.703530] cpu 131 (hwid 131) Ready to die...
> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
> [  447.704102] cpu 132 (hwid 132) Ready to die...
> [  447.713534] cpu 133 (hwid 133) Ready to die...
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
>     pc: 000000001ec3072c
>     lr: 000000001ec2fee0
>     sp: 1faf6bd0
>    msr: 8000000102801000
>    dar: 212d6c1a2a20c
>  dsisr: 42000000
>   current = 0xc000000474c6d600
>   paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
>     pid   = 0, comm = swapper/134
> Linux version 4.8.0-34-generic (buildd at bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
> WARNING: exception is not recoverable, can't continue
>
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.

We shouldn't be crashing.

So we need to fix that.

We may also need to increase the timeout, though it's pretty gross TBH.

But step one is to make sure we don't crash.
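
To illustrate the distinction between "the CPU stopped" and "we gave up
waiting", here is a minimal user-space sketch. It is not the actual
pseries_cpu_die code: the status codes, the query callback and every
function name below are hypothetical, chosen only for this example. The
point is that the polling loop reports a timeout to its caller rather than
letting the caller carry on as if the CPU were dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the kernel's definitions. */
enum cpu_status { CPU_STOPPED = 0, CPU_NOT_STOPPED = 2 };

/* Hypothetical query callback: returns the current status of `cpu`. */
typedef enum cpu_status (*query_fn)(int cpu, void *ctx);

/*
 * Poll until the CPU reports stopped or the retry budget runs out.
 * Returns true only if the CPU was actually observed stopped, so the
 * caller can refuse to tear down per-CPU state after a timeout.
 */
static bool wait_for_cpu_stopped(int cpu, int max_tries,
                                 query_fn query, void *ctx)
{
    assert(max_tries > 0 && query != NULL);

    for (int tries = 0; tries < max_tries; tries++) {
        if (query(cpu, ctx) == CPU_STOPPED)
            return true;
        /* A real implementation would relax or sleep between polls. */
    }
    return false; /* timed out: the CPU may still be running */
}

/* Simulated CPU that reports "not stopped" for *ctx more polls. */
static enum cpu_status stops_after(int cpu, void *ctx)
{
    int *remaining = ctx;

    (void)cpu;
    if (*remaining > 0) {
        (*remaining)--;
        return CPU_NOT_STOPPED;
    }
    return CPU_STOPPED;
}
```

With a structure like this, raising max_tries only changes how long we are
willing to wait. Whether a timeout is handled safely is a separate question,
decided by what the caller does when the function returns false.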

cheers

