[PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
Michael Ellerman
mpe at ellerman.id.au
Thu Nov 11 17:14:03 AEDT 2021
Hari Bathini <hbathini at linux.ibm.com> writes:
> In panic path, fadump is triggered via a panic notifier function.
> Before calling panic notifier functions, smp_send_stop() gets called,
> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
> started marking CPUs as offline while stopping them. So, if a kernel
> has either of the above commits, vmcore captured with fadump via panic
> path would show only the panic'ing CPU as online. Sample output of
> crash-utility with such vmcore:
>
> # crash vmlinux vmcore
> ...
> KERNEL: vmlinux
> DUMPFILE: vmcore [PARTIAL DUMP]
> CPUS: 1
> DATE: Wed Nov 10 09:56:34 EST 2021
> UPTIME: 00:00:42
> LOAD AVERAGE: 2.27, 0.69, 0.24
> TASKS: 183
> NODENAME: XXXXXXXXX
> RELEASE: 5.15.0+
> VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
> MACHINE: ppc64le (2500 Mhz)
> MEMORY: 8 GB
> PANIC: "Kernel panic - not syncing: sysrq triggered crash"
> PID: 3394
> COMMAND: "bash"
> TASK: c0000000150a5f80 [THREAD_INFO: c0000000150a5f80]
> CPU: 1
> STATE: TASK_RUNNING (PANIC)
>
> crash> p -x __cpu_online_mask
> __cpu_online_mask = $1 = {
> bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
> }
> crash>
> crash>
> crash> p -x __cpu_active_mask
> __cpu_active_mask = $2 = {
> bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
> }
> crash>
>
> While this has been the case since fadump was introduced, the issue
> was not identified for two probable reasons:
>
> - In general, the bulk of the vmcores analyzed were from crash
> due to exception.
>
> - The above did change since commit 8341f2f222d7 ("sysrq: Use
> panic() to force a crash") started using panic() instead of
> deferencing NULL pointer to force a kernel crash. But then
> commit de6e5d38417e ("powerpc: smp_send_stop do not offline
> stopped CPUs") stopped marking CPUs as offline till kernel
> commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
> reverted that change.
>
> To avoid vmcore from showing only one CPU as online in panic path,
> skip marking non panic'ing CPUs as offline while stopping them
> during fadump crash.
Is this really worth the added complexity/bug surface?
Why does it matter if the vmcore shows only one CPU online?
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index c23ee842c4c3..20555d5d5966 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -61,6 +61,7 @@
> #include <asm/cpu_has_feature.h>
> #include <asm/ftrace.h>
> #include <asm/kup.h>
> +#include <asm/fadump.h>
>
> #ifdef DEBUG
> #include <asm/udbg.h>
> @@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
> /*
> * IRQs are already hard disabled by the smp_handle_nmi_ipi.
> */
> - set_cpu_online(smp_processor_id(), false);
> + if (!(oops_in_progress && should_fadump_crash()))
> + set_cpu_online(smp_processor_id(), false);
>
> spin_begin();
> while (1)
> @@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
> * to know other CPUs are offline before it breaks locks to flush
> * printk buffers, in case we panic()ed while holding the lock.
> */
> - set_cpu_online(smp_processor_id(), false);
> + if (!(oops_in_progress && should_fadump_crash()))
> + set_cpu_online(smp_processor_id(), false);
The comment talks about printk_safe_flush_on_panic(), and this change
would presumably break that. Except that printk_safe_flush_on_panic() no
longer exists.
So do we need to set the CPU online here at all?
ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
now that printk_safe_flush_on_panic() no longer exists?
cheers
More information about the Linuxppc-dev
mailing list