[PATCH] powernv: Avoid checkstop on HMI and MCE

Michael Neuling mikey at neuling.org
Wed Oct 25 21:59:42 AEDT 2017


On Wed, 2017-10-25 at 12:16 +0200, Michael Ellerman wrote:
> Michael Neuling <mikey at neuling.org> writes:
> 
> > On an unrecoverable HMI or MCE only generate a checkstop (via
> > PLATFORM ERROR opal reboot call) when panic_on_oops is set.
> > 
> > We currently generate a checkstop as an attempt for the FSP to grab a
> > dump and then reboot us. Unfortunately this never works and no one
> 
> Never? WT#.

Well, no one I've talked to, but I'm posting this so that anyone who does want
it will stand up and say so.

> > I've talked to has ever seen a resulting dump, let alone got useful
> > information from it.
> > 
> > Even worse, the checkstop gets in the way of debugging real
> > problems. If we hit a software bug that results in this, we get no
> > opportunity to debug it live. Similarly if the bug is due to hardware
> > that is not in the dump (say PCI or NVLINK GPU), we get no information
> > in the dump about that hardware.
> > 
> > So let's remove it unless someone sets panic_on_oops.
> 
> Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
> make sure you're not stepping on each other.

OK, will do.

> > diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c
> > b/arch/powerpc/platforms/powernv/opal-hmi.c
> > index c9e1a4ff29..23780970d0 100644
> > --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> > +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> > @@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
> >  			print_hmi_event_info(hmi_evt);
> >  		}
> >  
> > +		if (!panic_on_oops) {
> > +			die("Unrecoverable HMI exception", NULL, SIGBUS);
> > +			return;
> 
> I don't think we should return.
> 
> Otherwise we risk persisting corrupt data to disk and so on.

ok
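
For reference, a rough user-space sketch of the flow with the early return
dropped. The stubs (stub_die, stub_platform_error_reboot) and the
handle_unrecoverable_hmi name are made up for illustration and are not kernel
functions. The sketch also assumes die() returns to the caller, which may not
hold in the real handler, so treat this as a model of the control flow only,
not the actual next revision:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins so the control flow can be exercised in user space. */
enum action { ACT_NONE = 0, ACT_DIE = 1, ACT_CHECKSTOP = 2 };

static unsigned int actions_taken;	/* bitmask of enum action values */
static bool panic_on_oops;		/* models the kernel knob of the same name */

/* Stub: records that we would have oopsed; the real die() does much more. */
static void stub_die(const char *msg)
{
	printf("die: %s\n", msg);
	actions_taken |= ACT_DIE;
}

/* Stub: records that we would have asked firmware for a checkstop. */
static void stub_platform_error_reboot(const char *msg)
{
	printf("checkstop: %s\n", msg);
	actions_taken |= ACT_CHECKSTOP;
}

/* Model of the unrecoverable-HMI tail of the handler, without the return. */
static void handle_unrecoverable_hmi(void)
{
	if (!panic_on_oops)
		stub_die("Unrecoverable HMI exception");

	/*
	 * No early return: control falls through to the existing
	 * platform-error path, so the system does not keep running
	 * after an unrecoverable error.
	 */
	stub_platform_error_reboot("Unrecoverable HMI exception");
}
```

With panic_on_oops clear, the model reports the oops first and then takes the
checkstop path. With it set, it goes straight to the checkstop path, as it did
before the patch.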

> If we're getting unrecoverable HMI/MCEs that are not actually indicative
> of something bad happening then we need to filter those out somewhere.

We hit this with some new HMIs for NVLINK and the Vector Load one, so we need to
handle them, and we have code that does (or is coming).

In the meantime, it's very hard to debug them once we xstop.

Mikey

