[PATCH] powerpc/stacktrace: fix raise_backtrace_ipi() logic
Michael Ellerman
mpe at ellerman.id.au
Wed Jun 2 21:08:20 AEST 2021
Nathan Lynch <nathanl at linux.ibm.com> writes:
> When smp_send_safe_nmi_ipi() indicates that the target CPU has
> responded to the IPI, skip the remote paca inspection
> fallback. Otherwise both the sending and target CPUs attempt the
> backtrace, usually creating a misleading ("didn't respond to backtrace
> IPI" is wrong) and interleaved mess:
Thanks for fixing my bugs for me :)
> [ 1658.929157][ C7] rcu: Stack dump where RCU GP kthread last ran:
> [ 1658.929223][ C7] Sending NMI from CPU 7 to CPUs 1:
> [ 1658.929303][ C1] NMI backtrace for cpu 1
> [ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca.
> [ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46
> [ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H)
> [ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup]
> [ 1658.929549][ C7] Back trace of paca->saved_r1 (0xc0000000057fb400) (possibly stale):
> [ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0
>
> Verified using the test_lockup module, e.g.
>
> $ echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> $ insmod test_lockup.ko time_secs=1 iterations=10 state=R lock_rcu \
> touch_softlockup all_cpus
>
> Fixes: 5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
> Signed-off-by: Nathan Lynch <nathanl at linux.ibm.com>
> ---
> arch/powerpc/kernel/stacktrace.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
> index 1deb1bf331dd..e0ccc5a46d7e 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -174,11 +174,14 @@ static void raise_backtrace_ipi(cpumask_t *mask)
>  {
>  	unsigned int cpu;
> 
> +	if (cpumask_test_cpu(smp_processor_id(), mask)) {
> +		handle_backtrace_ipi(NULL);
> +		cpumask_clear_cpu(smp_processor_id(), mask);
> +	}
> +
>  	for_each_cpu(cpu, mask) {
> -		if (cpu == smp_processor_id())
> -			handle_backtrace_ipi(NULL);
> -		else
> -			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
> +		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
> +			cpumask_clear_cpu(cpu, mask);
I think there's still a race here, but instead of causing us to emit a
spurious "didn't respond" trace, it could lead to us failing to emit a
proper trace when we should.
It's hard to follow this code, but mask above is backtrace_mask from
lib/nmi_backtrace.c, because of:
  void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
                                     bool exclude_self,
                                     void (*raise)(cpumask_t *mask))
  {
          int i, this_cpu = get_cpu();

          cpumask_copy(to_cpumask(backtrace_mask), mask);
                                  ^^^^^^^^^^^^^^
          ...
          if (!cpumask_empty(to_cpumask(backtrace_mask))) {
                  pr_info("Sending NMI from CPU %d to CPUs %*pbl:\n",
                          this_cpu, nr_cpumask_bits, to_cpumask(backtrace_mask));
                  raise(to_cpumask(backtrace_mask));
                  ^^^^^^^^^^^^^^
And raise there is raise_backtrace_ipi() (the function we're patching).
On the receiving CPU we end up executing:
  int smp_handle_nmi_ipi(struct pt_regs *regs)
  {
          ...
          nmi_ipi_lock_start(&flags);
          if (cpumask_test_cpu(me, &nmi_ipi_pending_mask)) {
                  cpumask_clear_cpu(me, &nmi_ipi_pending_mask);
                  fn = READ_ONCE(nmi_ipi_function);
                  ...
          }
          nmi_ipi_unlock_end(&flags);

          if (fn)
                  fn(regs);
The key detail is that we drop the nmi lock before calling fn, which
means the sending CPU can return from smp_send_safe_nmi_ipi() back into
raise_backtrace_ipi() before fn has actually run on the target.
In our case fn is handle_backtrace_ipi() which just calls nmi_cpu_backtrace().
Which does:
  bool nmi_cpu_backtrace(struct pt_regs *regs)
  {
          int cpu = smp_processor_id();

          if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
                  ...
                  pr_warn("NMI backtrace for cpu %d\n", cpu);
                  if (regs)
                          show_regs(regs);
                  else
                          dump_stack();
ie. if the CPU has been cleared from backtrace_mask it doesn't emit a
stack trace.
So we could end up with the following interleaving:
  CPU0                                          CPU1
  ====                                          ====
  if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, ...
                                                // smp_handle_nmi_ipi()
                                                fn = READ_ONCE(nmi_ipi_function);
                                                ...
                                                nmi_ipi_unlock_end(&flags);
  // in smp_send_safe_nmi_ipi()
  nmi_ipi_lock();
  while (!cpumask_empty(&nmi_ipi_pending_mask)) {
  ...
  nmi_ipi_unlock_end(&flags);
  return ret;
  cpumask_clear_cpu(cpu, mask);
                                                fn(regs)
                                                // -> nmi_cpu_backtrace()
                                                if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
But like I said, it's not easy to follow, so maybe I missed something
along the way.
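
The same pattern is easy to reproduce in miniature outside the kernel,
which may make the window easier to see. A userspace sketch by way of
analogy only, not kernel code - "pending" and "traced" are hypothetical
stand-ins for nmi_ipi_pending_mask and backtrace_mask:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  static atomic_int pending = 1;  /* stand-in for nmi_ipi_pending_mask */
  static atomic_int traced  = 1;  /* stand-in for backtrace_mask */

  /* Receiver: clears "pending" first, then does the real work,
   * mirroring smp_handle_nmi_ipi() calling fn() only after unlocking. */
  static void *receiver(void *arg)
  {
          atomic_store(&pending, 0);      /* sender's wait completes here */
          usleep(1000);                   /* widen the race window */
          if (atomic_load(&traced))       /* like the backtrace_mask test */
                  printf("trace emitted\n");
          else
                  printf("trace suppressed\n");   /* the failure mode */
          return NULL;
  }

  int main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, receiver, NULL);

          while (atomic_load(&pending))   /* smp_send_safe_nmi_ipi()'s wait */
                  ;
          atomic_store(&traced, 0);       /* premature cpumask_clear_cpu() */

          pthread_join(t, NULL);
          return 0;
  }

Built with -pthread, this almost always prints "trace suppressed": the
waiter sees "completion" and clears the flag before the work that
depends on the flag has run.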
To solve it I think we want to avoid clearing a CPU from the mask unless
we know that the IPI failed for that CPU. That way there's no risk of
suppressing a trace from a CPU that successfully handles the IPI, and we
know we've waited 5 seconds for CPUs that fail to handle the IPI.
I don't think we want to allocate a whole new cpumask to track which
CPUs have failed to respond, but I don't think we need to. We can just
synchronously handle them.
Something like below.
cheers
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..980e87f7ae7a 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -172,17 +172,19 @@ static void handle_backtrace_ipi(struct pt_regs *regs)
 
 static void raise_backtrace_ipi(cpumask_t *mask)
 {
+	struct paca_struct *p;
 	unsigned int cpu;
 
 	for_each_cpu(cpu, mask) {
-		if (cpu == smp_processor_id())
+		if (cpu == smp_processor_id()) {
 			handle_backtrace_ipi(NULL);
-		else
-			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
-	}
+			continue;
+		}
 
-	for_each_cpu(cpu, mask) {
-		struct paca_struct *p = paca_ptrs[cpu];
+		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+			continue;
+
+		p = paca_ptrs[cpu];
 
 		cpumask_clear_cpu(cpu, mask);
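
For clarity, here's roughly what the whole function looks like with
that applied - a sketch reconstructed from the diff above, not the
verbatim result; the paca inspection at the end is the existing body of
the old second loop, elided here:

  static void raise_backtrace_ipi(cpumask_t *mask)
  {
          struct paca_struct *p;
          unsigned int cpu;

          for_each_cpu(cpu, mask) {
                  if (cpu == smp_processor_id()) {
                          /* Backtrace ourselves directly, no IPI needed. */
                          handle_backtrace_ipi(NULL);
                          continue;
                  }

                  /*
                   * If the IPI is delivered, the target clears itself from
                   * backtrace_mask in nmi_cpu_backtrace(), so don't touch
                   * the mask here - that avoids suppressing its trace.
                   */
                  if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
                          continue;

                  /*
                   * The IPI failed or timed out; only now is it safe to
                   * clear the CPU from the mask and fall back to
                   * inspecting its paca remotely.
                   */
                  p = paca_ptrs[cpu];
                  cpumask_clear_cpu(cpu, mask);

                  pr_warn("CPU %d didn't respond to backtrace IPI, inspecting paca.\n", cpu);
                  /* ... existing paca inspection ... */
          }
  }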