[PATCH 0/2] Disabling NMI watchdog during LPM's memory transfer

Thu Jun 9 19:09:54 AEST 2022

On 09/06/2022, 09:45:49, Michael Ellerman wrote:
> Nathan Lynch <nathanl at linux.ibm.com> writes:
>> Laurent Dufour <ldufour at linux.ibm.com> writes:
> ...
>>
>>> There are  ongoing investigations to clarify where and how this latency is
>>> happening. I'm not excluding any other issue in the Linux kernel, but right
>>> now, this looks to be the best option to prevent system crash during
>>> LPM.
>>
>> It will prevent the likely crash mode for enterprise distros with
>> default watchdog tunables that our internal test environments happen to
>> use. But if someone were to run the same scenario with softlockup_panic
>> enabled, or with the RCU stall timeout lower than the watchdog
>> threshold, the failure mode would be different.
>>
>> Basically I'm saying:
>> * Some users may actually want the OS to panic when it's in this state,
>>   because their applications can't work correctly.
>> * But if we're going to inhibit one watchdog, we should inhibit them
>>   all.
> 
> I'm sympathetic to both of your arguments.
> 
> But I think there is a key difference between the NMI watchdog and other
> watchdogs, which is that the NMI watchdog will use the unsafe NMI to
> interrupt other CPUs, and that can cause the system to crash when other
> watchdogs would just print a backtrace.
> 
> We had the same problem with the rcu_sched stall detector until we
> changed it to use the "safe" NMI, see:
>   5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
> 
> 
> So even if the NMI watchdog is disabled there are still the other
> watchdogs enabled, which should print backtraces by default, and if
> desired can also be configured to cause a panic.
> 
> Instead of disabling the NMI watchdog, can we instead increase the
> timeout (by how much?) during LPM, so that it is less likely to fire in
> normal usage, but is still there as a backup if the system is completely
> clogged.

That's probably doable, tweaking wd_smp_panic_timeout_tb and
wd_panic_timeout_tb when the LPM is in progress.

I'll add a new sysctl value, so administrator will have the capability to
change that and also fully disable the NMI watchdog during LPM if he want.

I've no idea what should be the default factor, I guess this will be a bit
empiric.

I'll rework my patch in that way.

cheers,
Laurent.