[resend-without-rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's
Nicholas Piggin
npiggin at gmail.com
Fri Dec 8 15:53:00 AEDT 2017
On Fri, 8 Dec 2017 14:35:33 +1100
Balbir Singh <bsingharora at gmail.com> wrote:
> Certain HMI's such as malfunction error propagate through
> all threads/core on the system. If a thread was offline
> prior to us crashing the system and jumping to the kdump
> kernel, bad things happen when it wakes up due to an HMI
> in the kdump kernel.
>
> There are several possible ways to solve this problem
>
> 1. Put the offline cores in a state such that they are
> not woken up for machine check and HMI errors. This
> does not work, since we might need to wake up offline
> threads occasionally to handle TB errors
> 2. Ignore HMI errors, setup HMEER to mask HMI errors,
> but this still leaves the window open for any MCEs
> and masking them for the duration of the dump might
> be a concern
> 3. Wake up offline CPUs, as in send them to crash_ipi_callback
> (not wake them up as in mark them online as seen by
> the scheduler). kexec does a wake_online_cpus() call,
> this patch does something similar, but instead sends
> an IPI and forces them to crash_ipi_callback
>
> Care is taken to enable this only for powernv platforms
> via crash_wake_offline (a global value set at setup
> time). The crash code sends out IPI's to all CPU's
> which then move to crash_ipi_callback and kexec_smp_wait().
> We don't grab the pt_regs for offline CPU's.
>
> Signed-off-by: Balbir Singh <bsingharora at gmail.com>
> ---
>
> Nick reviewed the patches and asked if
>
> 1. We need to do anything on the other side of the kernel?
> The answer is not clear at this point, but I don't want
> to block this patch as it fixes a critical problem with
> kdump in SMT=2/1 mode
> 2. We should do this for other platforms
> The answer is same as above, other platforms require testing
> and I can selectively enable them as needed as I test them
Yeah I didn't intend those as a nack for the patch... It's
a bit annoying to have these selections between online cpus
and present cpus depending on kdump.
We don't want to do a full CPU online in the kdump path of
course, but what if the crash code has a call that can IPI
offline CPUs to get them into the crash callback, rather than
put it in the general NMI IPI code?
> @@ -187,6 +188,14 @@ static void pnv_smp_cpu_kill_self(void)
> WARN_ON(lazy_irq_pending());
>
> /*
> + * For kdump kernels, we process the ipi and jump to
> + * crash_ipi_callback. For more details see the description
> + * at crash_wake_offline
> + */
> + if (kdump_in_progress())
> + crash_ipi_callback(NULL);
> +
> + /*
> * If the SRR1 value indicates that we woke up due to
> * an external interrupt, then clear the interrupt.
> * We clear the interrupt before checking for the
I think you need to do this _after_ clearing the interrupt,
otherwise you get a lost wakeup window, don't you?
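
A minimal sketch of the suggested ordering, not the actual kernel code. The helpers clear_wakeup_interrupt(), crash_callback() and offline_wakeup() are hypothetical stubs that only record the order in which they are called; in the real path crash_ipi_callback() would not return.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Records the order of the steps taken on one wakeup. */
static char trace[64];

static void record(const char *step)
{
	strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

/* Hypothetical stand-in for clearing the pending external interrupt. */
static void clear_wakeup_interrupt(void)
{
	record("clear;");
}

/* Hypothetical stand-in for crash_ipi_callback(NULL). */
static void crash_callback(void)
{
	record("crash;");
}

/*
 * Models one wakeup of an offline thread: the interrupt that woke the
 * thread is cleared first, and only then is the kdump check made.
 */
static const char *offline_wakeup(bool woke_on_external_irq, bool kdump_active)
{
	trace[0] = '\0';

	if (woke_on_external_irq)
		clear_wakeup_interrupt();

	if (kdump_active)
		crash_callback();

	return trace;
}
```

With this ordering the interrupt is always acknowledged before the thread heads into the crash callback, whichever event woke it.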
Thanks,
Nick