[PATCH v1] powerpc/watchdog: help remote CPUs to flush NMI printk output

Fri Nov 19 20:15:59 AEDT 2021

Excerpts from Daniel Axtens's message of November 12, 2021 4:08 pm:
> Hi,
> 
>> The printk layer at the moment does not seem to have a good way to force
>> flush printk messages that are created in NMI context, except in the
>> panic path.
>>
>> NMI-context printk messages normally get to the console with irq_work,
>> but that won't help if the CPU is stuck with irqs disabled, as can be
>> the case for hard lockup watchdog messages.
>>
>> The watchdog currently flushes the printk buffers after detecting a
>> lockup on remote CPUs, but they may not have processed their NMI IPI
>> yet by that stage, or they may have self-detected a lockup in which
>> case they won't go via this NMI IPI path.
>>
>> Improve the situation by having NMI-context mark a flag if it called
>> printk, and have watchdog timer interrupts check if that flag was set
>> and try to flush if it was. Latency is not a big problem because we
>> were already stuck for a while, just need to try to make sure the
>> messages eventually make it out.

Sorry just coming back to this now the printk fix was merged upstream.

> Initially I was surprised that this doesn't affect the printk code
> itself, just the powerpc code...

Yeah I'm actually not sure how other watchdogs handle this. If they
use nmi_trigger_cpumask_backtrace() then that does have a flush.

>> Cc: Laurent Dufour <ldufour at linux.ibm.com>
>> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
>> ---
>> This patch is actually based on top of this one which is planned to go
>> upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2
>>
>> Prior to commit 93d102f094be that is fixed by the above, we had a printk
>> flush function with a different name but basically does the same job, so
>> this patch can be backported, just needs some care. I'm posting it for
>> review now for feedback so it's ready to go when the printk patch is
>> upstream.
>>
>> Thanks,
>> Nick
>>
>>  arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------
>>  1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>> index b6533539386b..a7b6b0691203 100644
>> --- a/arch/powerpc/kernel/watchdog.c
>> +++ b/arch/powerpc/kernel/watchdog.c
>> @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb);
>>  /* SMP checker bits */
>>  static unsigned long __wd_smp_lock;
>>  static unsigned long __wd_reporting;
>> +static unsigned long __wd_nmi_output;
>>  static cpumask_t wd_smp_cpus_pending;
>>  static cpumask_t wd_smp_cpus_stuck;
>>  static u64 wd_smp_last_reset_tb;
>> @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs)
>>  	else
>>  		dump_stack();
>>  
>> +	/*
>> +	 * We printk()ed from NMI context, the contents may not get flushed
>> +	 * if we return to a context with interrupts disabled because
>> +	 * printk uses irq_work to schedule flushes of NMI output.
>> +	 * __wd_nmi_output says there has been output from NMI context, so
>> +	 * other CPUs are recruited to help flush it.
>> +	 *
>> +	 * xchg is not needed here (it could be a simple atomic store), but
>> +	 * it gives the memory ordering and atomicity required.
>> +	 */
>> +	xchg(&__wd_nmi_output, 1);
>> +
>>  	/* Do not panic from here because that can recurse into NMI IPI layer */
>>  }
> 
> I think, looking at this and the other site where __wd_nmi_output is
> set, that this works because you set the flag only when you are done
> printing from the non-panic lockup context on this CPU. I was initially
> worried that you set this flag part way through printing, and then it
> might get cleared by another CPU while you're still trying to print.
> However, in this function it's right at the end - there's nothing else
> left to do, and ...
> 
>>  DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>> @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>>  		print_irqtrace_events(current);
>>  		show_regs(regs);
>>  
>> +		xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi
>> +
>>  		if (sysctl_hardlockup_all_cpu_backtrace)
>>  			trigger_allbutself_cpu_backtrace();
> 
> in this one, the only things that can happen afterwards are
>  - a panic, which does its own flushing, and
> 
> - trigger_allbutself_cpu_backtrace(), which seems to just send IPIs, not
>  do any printing of its own.

Yeah, on powerpc that actually ends up not using NMI IPIs because that's 
dangerous and core code uses it for less severe things than a hard 
lockup so we disabled it, so if you can take a regular IPI then you can
run irq work when it returns and print things. If we changed it to use
NMI IPIs I guess we might want to do something, maybe.

> All of which is fine, but I wonder if we need a more descriptive
> variable name or if the comment needs to specify that the flag should
> only be set at certain times.

I did try to add some explanation in that comment. I'll see if that can 
be made a bit more readable and expand on it a bit.

Thanks,
Nick