[PATCH] powernv: Avoid checkstop on HMI and MCE
Michael Ellerman
mpe at ellerman.id.au
Wed Oct 25 21:16:30 AEDT 2017
Michael Neuling <mikey at neuling.org> writes:
> On an unrecoverable HMI or MCE only generate a checkstop (via
> PLATFORM ERROR opal reboot call) when panic_on_oops is set.
>
> We currently generate a checkstop as an attempt for the FSP to grab a
> dump and then reboot us. Unfortunately this never works and no one
Never? WT#.
> I've talked to has ever seen a resulting dump, let alone got useful
> information from it.
>
> Even worse, the checkstop gets in the way of debugging real
> problems. If we hit a software bug that results in this, we get no
> opportunity to debug it live. Similarly if the bug is due to hardware
> that is not in the dump (say PCI or NVLINK GPU), we get no information
> in the dump about that hardware.
>
> So let's remove it unless someone sets panic_on_oops.
Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
make sure you're not stepping on each other.
> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
> index c9e1a4ff29..23780970d0 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
>  		print_hmi_event_info(hmi_evt);
>  	}
>  
> +	if (!panic_on_oops) {
> +		die("Unrecoverable HMI exception", NULL, SIGBUS);
> +		return;
I don't think we should return here. If we carry on running after an
unrecoverable error we risk persisting corrupt data to disk and so on.
If we're getting unrecoverable HMI/MCEs that are not actually indicative
of something bad happening then we need to filter those out somewhere.
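
One possible reading of the "don't return" suggestion, as a userspace
model only: report the oops when panic_on_oops is clear, then fall through
to the platform error reboot instead of returning early. Everything named
model_* and the hmi_outcome enum are hypothetical stand-ins for
illustration, not kernel APIs, and this is not the actual patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of the control flow; not kernel code. */
enum hmi_outcome {
	HMI_RECOVERED,        /* nothing unrecoverable was seen */
	HMI_OOPS_THEN_REBOOT, /* oops reported, then platform error reboot */
	HMI_REBOOT            /* platform error reboot only */
};

static int die_calls;    /* times the die() stand-in was invoked */
static int reboot_calls; /* times the reboot stand-in was invoked */

/* Stand-in for die(): only records and logs the oops. */
static void model_die(const char *msg)
{
	die_calls++;
	fprintf(stderr, "Oops: %s\n", msg);
}

/* Stand-in for the PLATFORM ERROR opal reboot path (assumed name). */
static void model_platform_error_reboot(const char *msg)
{
	reboot_calls++;
	fprintf(stderr, "platform error reboot: %s\n", msg);
}

static enum hmi_outcome model_handle_hmi(bool unrecoverable, bool panic_on_oops)
{
	if (!unrecoverable)
		return HMI_RECOVERED;

	if (!panic_on_oops)
		model_die("Unrecoverable HMI exception");

	/*
	 * No early return: the system is never left running after an
	 * unrecoverable error, so corrupt data cannot reach disk.
	 */
	model_platform_error_reboot("Unrecoverable HMI exception");

	return panic_on_oops ? HMI_REBOOT : HMI_OOPS_THEN_REBOOT;
}
```

In this model the oops output is still available for debugging, but the
machine does not keep running in a possibly corrupt state.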
cheers