[rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's
Balbir Singh
bsingharora at gmail.com
Wed Dec 6 17:13:50 AEDT 2017
On Wed, Dec 6, 2017 at 4:07 PM, Haren Myneni <haren at linux.vnet.ibm.com> wrote:
> On 12/05/2017 08:29 PM, Balbir Singh wrote:
>> On Mon, Dec 4, 2017 at 2:10 PM, Nicholas Piggin <npiggin at gmail.com> wrote:
>>> On Mon, 4 Dec 2017 11:37:01 +1100
>>> Balbir Singh <bsingharora at gmail.com> wrote:
>>>
>>>> On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin <npiggin at gmail.com> wrote:
>>>>> Seems like a reasonable approach. Why do we only do this for
>>>>> powernv? It seems like a good idea in general to pull all
>>>>> offlined CPUs out and into the same state for all platforms
>>>>> and for all shutdown/restart/crash paths.
>>>>>
>>>>
>>>> The reason is largely wake-up related: do we expect offline CPUs to wake
>>>> up in the kdump kernel? Largely the infrastructure allows us to selectively
>>>> decide what platforms need this support. I did not want to break the world
>>>> by enabling it across platforms (pseries for example) without good reason.
>>>
>>> What happens if a pseries offlined CPU gets an exception for some reason
>>> though? It seems like it would return into pseries_mach_cpu_die of the
>>> old kernel which will go wrong.
>>>
>>> Maybe the platform has stronger guarantees that it won't wake up there,
>>> like requiring a specific hcall or something?
>>>
>>> I was just thinking trying to move all platforms in general to the same
>>> scheme would be preferable, unless there is a good reason not to. Just
>>> for sharing code and behaviour.
>>>
>>
>> I am all for it, can I propose we start with powernv, since I've tested that
>> and as I test I can start enabling other platforms with follow-on patches.
>>
>>>>
>>>>> Also I wonder if there is anything we should do on the other
>>>>> side of the equation for the kdump kernel to pull CPUs into a
>>>>> known state rather than rely on the crash kernel to do it for
>>>>> us. We might have a better ability to do that with system
>>>>> reset IPIs now.
>>>>>
>>>>
>>>> Yes, but do we need to do that or quickly dump the vmcore to a file
>>>> and exit?
>>>
>>> Well if the previous kernel did not shut them down properly, we need
>>> to do that. Don't we? My point is the previous kernel crashed somehow,
>>> we should be trying to fix everything up rather than hoping it crashed
>>> "nicely" for us.
>>>
>>> Yes, we should disturb things as little as possible, but we've booted
>>> an entire new kernel in its own reserved memory, so I'm not sure if
>>> it's such a concern to try fixing up wayward CPUs.
>>
>> I think it might be a little late to fix them up, since their stack traces won't
>> show up in the crash. We can of course revisit this if required. Consider
>> for example a crash I saw where the kernel crashed and held a spinlock
>> at the time of crash, other CPUs were stuck spinning on that lock and did
>> not report back on either side of the crash. I think we'd want our dump to
>> show that. In my case I'm waking up offline CPUs to prevent them from
>> waking up and doing processing that would otherwise break the system.
>> I'm open to doing the same thing on the other side, but I think the logic
>> is more complex on the new kernel side.
>
> We do not need to collect stack traces for offline CPUs at the time of crash anyway. Even if these CPUs are to be brought online, that has to happen after collecting the current CPU states and just before the kdump boot.
>
> In the case of CPUs stuck with IRQs disabled, they will respond to the NMI anyway. Before Nick's NMI patches, these CPUs' states were not collected with the IPI.
>
> Why do we need to bring offline CPUs online in the kdump boot? I thought we always boot the kdump kernel with a single CPU.
The reason is described in the patch (changelog)
Balbir Singh