[PATCH] powernv: Avoid checkstop on HMI and MCE

Michael Ellerman mpe at ellerman.id.au
Wed Oct 25 21:16:30 AEDT 2017


Michael Neuling <mikey at neuling.org> writes:

> On an unrecoverable HMI or MCE only generate a checkstop (via
> PLATFORM ERROR opal reboot call) when panic_on_oops is set.
>
> We currently generate a checkstop as an attempt for the FSP to grab a
> dump and then reboot us. Unfortunately this never works and no one

Never? WT#.

> I've talked to has ever seen a resulting dump, let alone got useful
> information from it.
>
> Even worse, the checkstop gets in the way of debugging real
> problems. If we hit a software bug that results in this, we get no
> opportunity to debug it live. Similarly if the bug is due to hardware
> that is not in the dump (say PCI or NVLINK GPU), we get no information
> in the dump about that hardware.
>
> So let's remove it unless someone sets panic_on_oops.

Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
make sure you're not stepping on each other.

> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
> index c9e1a4ff29..23780970d0 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
>  			print_hmi_event_info(hmi_evt);
>  		}
>  
> +		if (!panic_on_oops) {
> +			die("Unrecoverable HMI exception", NULL, SIGBUS);
> +			return;

I don't think we should return.

Otherwise we risk persisting corrupt data to disk and so on.

If we're getting unrecoverable HMI/MCEs that are not actually indicative
of something bad happening then we need to filter those out somewhere.
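
To make the control-flow concern concrete, here is a rough userspace
model, not kernel code. Everything in it is a stand-in invented for
illustration: model_die() and model_platform_error_reboot() only count
calls, panic_on_oops_model mimics the sysctl, and the assumption that
the code following the quoted hunk ends in the platform error reboot is
mine, since the hunk above is trimmed.

```c
#include <stdbool.h>
#include <stdio.h>

/* What happens to the machine after the handler runs (model only). */
enum outcome { CONTINUED, HALTED };

/* Hypothetical stand-in for the kernel's panic_on_oops setting. */
static bool panic_on_oops_model;

/* Call counters so the two variants can be compared. */
static int die_calls;
static int reboot_calls;

/* Hypothetical stand-in for die(): just records and logs the oops. */
static void model_die(const char *msg)
{
	die_calls++;
	fprintf(stderr, "Oops: %s\n", msg);
}

/* Hypothetical stand-in for the platform error reboot path. */
static void model_platform_error_reboot(const char *msg)
{
	reboot_calls++;
	fprintf(stderr, "Platform error reboot: %s\n", msg);
}

/* Shape of the hunk as posted: oops, then return to normal operation. */
static enum outcome handle_unrecoverable_as_posted(void)
{
	if (!panic_on_oops_model) {
		model_die("Unrecoverable HMI exception");
		return CONTINUED;	/* system keeps running after the error */
	}

	model_platform_error_reboot("Unrecoverable HMI exception");
	return HALTED;
}

/* Same check without the early return: the machine is always stopped. */
static enum outcome handle_unrecoverable_no_return(void)
{
	if (!panic_on_oops_model)
		model_die("Unrecoverable HMI exception");

	model_platform_error_reboot("Unrecoverable HMI exception");
	return HALTED;
}
```

With panic_on_oops_model left at false, the first variant reports
CONTINUED and the second HALTED. The first case is the one where the
system carries on after an error it was told is unrecoverable.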

cheers

