[PATCH] powerpc/stacktrace: fix raise_backtrace_ipi() logic
Michael Ellerman
mpe at ellerman.id.au
Wed Jun 2 21:08:20 AEST 2021
Nathan Lynch <nathanl at linux.ibm.com> writes:
> When smp_send_safe_nmi_ipi() indicates that the target CPU has
> responded to the IPI, skip the remote paca inspection
> fallback. Otherwise both the sending and target CPUs attempt the
> backtrace, usually creating a misleading ("didn't respond to backtrace
> IPI" is wrong) and interleaved mess:
Thanks for fixing my bugs for me :)
> [ 1658.929157][ C7] rcu: Stack dump where RCU GP kthread last ran:
> [ 1658.929223][ C7] Sending NMI from CPU 7 to CPUs 1:
> [ 1658.929303][ C1] NMI backtrace for cpu 1
> [ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca.
> [ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46
> [ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H)
> [ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup]
> [ 1658.929549][ C7] Back trace of paca->saved_r1 (0xc0000000057fb400) (possibly stale):
> [ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0
>
> Verified using the test_lockup module, e.g.
>
> $ echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> $ insmod test_lockup.ko time_secs=1 iterations=10 state=R lock_rcu \
> touch_softlockup all_cpus
>
> Fixes: 5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
> Signed-off-by: Nathan Lynch <nathanl at linux.ibm.com>
> ---
> arch/powerpc/kernel/stacktrace.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
> index 1deb1bf331dd..e0ccc5a46d7e 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -174,11 +174,14 @@ static void raise_backtrace_ipi(cpumask_t *mask)
>  {
>  	unsigned int cpu;
> 
> +	if (cpumask_test_cpu(smp_processor_id(), mask)) {
> +		handle_backtrace_ipi(NULL);
> +		cpumask_clear_cpu(smp_processor_id(), mask);
> +	}
> +
>  	for_each_cpu(cpu, mask) {
> -		if (cpu == smp_processor_id())
> -			handle_backtrace_ipi(NULL);
> -		else
> -			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
> +		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
> +			cpumask_clear_cpu(cpu, mask);
I think there's still a race here, but instead of causing us to emit a
spurious "didn't respond" trace, it could lead to us failing to emit a
proper trace when we should.
It's hard to follow this code, but mask above is backtrace_mask from
lib/nmi_backtrace.c, because of:
  void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
                                     bool exclude_self,
                                     void (*raise)(cpumask_t *mask))
  {
          int i, this_cpu = get_cpu();

          cpumask_copy(to_cpumask(backtrace_mask), mask);
                                  ^^^^^^^^^^^^^^
          ...
          if (!cpumask_empty(to_cpumask(backtrace_mask))) {
                  pr_info("Sending NMI from CPU %d to CPUs %*pbl:\n",
                          this_cpu, nr_cpumask_bits, to_cpumask(backtrace_mask));
                  raise(to_cpumask(backtrace_mask));
                  ^^^^^^^^^^^^^^
And raise there is raise_backtrace_ipi() (the function we're patching).
On the receiving CPU we end up executing:
  int smp_handle_nmi_ipi(struct pt_regs *regs)
  {
          ...
          nmi_ipi_lock_start(&flags);
          if (cpumask_test_cpu(me, &nmi_ipi_pending_mask)) {
                  cpumask_clear_cpu(me, &nmi_ipi_pending_mask);
                  fn = READ_ONCE(nmi_ipi_function);
                  ...
          }
          nmi_ipi_unlock_end(&flags);

          if (fn)
                  fn(regs);
The key detail is that we drop the nmi lock before calling fn, which
means the sending CPU can return from smp_send_safe_nmi_ipi() back into
raise_backtrace_ipi() before fn has actually run on the target.
In our case fn is handle_backtrace_ipi() which just calls nmi_cpu_backtrace().
Which does:
  bool nmi_cpu_backtrace(struct pt_regs *regs)
  {
          int cpu = smp_processor_id();

          if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
                  ...
                  pr_warn("NMI backtrace for cpu %d\n", cpu);
                  if (regs)
                          show_regs(regs);
                  else
                          dump_stack();
ie. if the CPU has been cleared from backtrace_mask it doesn't emit a
stack trace.
So we could end up with the following interleaving:
  CPU0                                          CPU1
  ====                                          ====
  if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, ...
                                                // smp_handle_nmi_ipi()
                                                fn = READ_ONCE(nmi_ipi_function);
                                                ...
                                                nmi_ipi_unlock_end(&flags);
  // in smp_send_safe_nmi_ipi()
  nmi_ipi_lock();
  while (!cpumask_empty(&nmi_ipi_pending_mask)) {
  ...
  nmi_ipi_unlock_end(&flags);
  return ret;
  cpumask_clear_cpu(cpu, mask);
                                                fn(regs)
                                                // -> nmi_cpu_backtrace()
                                                if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
But like I said, it's not easy to follow, so maybe I missed something
along the way.
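
The same pattern is easy to reproduce in miniature outside the kernel,
which may make the window easier to see. A userspace sketch by way of
analogy only, not kernel code - "pending" and "traced" are hypothetical
stand-ins for nmi_ipi_pending_mask and backtrace_mask:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  static atomic_int pending = 1;  /* stand-in for nmi_ipi_pending_mask */
  static atomic_int traced  = 1;  /* stand-in for backtrace_mask */

  /* Receiver: clears "pending" first, then does the real work,
   * mirroring smp_handle_nmi_ipi() calling fn() only after unlocking. */
  static void *receiver(void *arg)
  {
          atomic_store(&pending, 0);      /* sender's wait completes here */
          usleep(1000);                   /* widen the race window */
          if (atomic_load(&traced))       /* like the backtrace_mask test */
                  printf("trace emitted\n");
          else
                  printf("trace suppressed\n");   /* the failure mode */
          return NULL;
  }

  int main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, receiver, NULL);

          while (atomic_load(&pending))   /* smp_send_safe_nmi_ipi()'s wait */
                  ;
          atomic_store(&traced, 0);       /* premature cpumask_clear_cpu() */

          pthread_join(t, NULL);
          return 0;
  }

Built with -pthread, this almost always prints "trace suppressed": the
waiter sees "completion" and clears the flag before the work that
depends on the flag has run.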
To solve it I think we want to avoid clearing a CPU from the mask unless
we know that the IPI failed for that CPU. That way there's no risk of
suppressing a trace from a CPU that successfully handles the IPI, and we
know we've waited 5 seconds for CPUs that fail to handle the IPI.
I don't think we want to allocate a whole new cpumask to track which
CPUs have failed to respond, but I don't think we need to. We can just
synchronously handle them.
Something like below.
cheers
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..980e87f7ae7a 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -172,17 +172,19 @@ static void handle_backtrace_ipi(struct pt_regs *regs)
 
 static void raise_backtrace_ipi(cpumask_t *mask)
 {
+	struct paca_struct *p;
 	unsigned int cpu;
 
 	for_each_cpu(cpu, mask) {
-		if (cpu == smp_processor_id())
+		if (cpu == smp_processor_id()) {
 			handle_backtrace_ipi(NULL);
-		else
-			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
-	}
+			continue;
+		}
 
-	for_each_cpu(cpu, mask) {
-		struct paca_struct *p = paca_ptrs[cpu];
+		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+			continue;
+
+		p = paca_ptrs[cpu];
 
 		cpumask_clear_cpu(cpu, mask);
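
For clarity, here's roughly what the whole function looks like with
that applied - a sketch reconstructed from the diff above, not the
verbatim result; the paca inspection at the end is the existing body of
the old second loop, elided here:

  static void raise_backtrace_ipi(cpumask_t *mask)
  {
          struct paca_struct *p;
          unsigned int cpu;

          for_each_cpu(cpu, mask) {
                  if (cpu == smp_processor_id()) {
                          /* Backtrace ourselves directly, no IPI needed. */
                          handle_backtrace_ipi(NULL);
                          continue;
                  }

                  /*
                   * If the IPI is delivered, the target clears itself from
                   * backtrace_mask in nmi_cpu_backtrace(), so don't touch
                   * the mask here - that avoids suppressing its trace.
                   */
                  if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
                          continue;

                  /*
                   * The IPI failed or timed out; only now is it safe to
                   * clear the CPU from the mask and fall back to
                   * inspecting its paca remotely.
                   */
                  p = paca_ptrs[cpu];
                  cpumask_clear_cpu(cpu, mask);

                  pr_warn("CPU %d didn't respond to backtrace IPI, inspecting paca.\n", cpu);
                  /* ... existing paca inspection ... */
          }
  }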