[PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

Michael Ellerman mpe at ellerman.id.au
Thu Aug 9 00:56:00 AEST 2018


Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> Introduce a recovery action for recovered memory errors (MCEs). There are
> soft memory errors like SLB Multihit, which can be the result of bad
> hardware OR a software BUG. The kernel can easily recover from these soft
> errors by flushing the SLB contents. After the recovery the kernel can
> continue to function without any issue. But in some scenarios we may keep
> getting these soft errors until the root cause is fixed. To be able to
> analyze and find the root cause, the best way is to gather enough data and
> system state at the time of the MCE. Hence this patch introduces a sysctl
> knob where the user can decide either to continue after recovery or to
> panic the kernel to capture the dump.

I'm not convinced we want this.

As we've discovered, it's often not possible to reconstruct what happened
based on a dump anyway.

The key thing you need is the content of the SLB and that's not included
in a dump.

So I think we should dump the SLB content when we get the MCE (which
this series does) and any other useful info, and then if we can recover
we should.
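
To make the difference between the two approaches concrete, here is a rough
userspace sketch, not kernel code. Every name in it (mce_event, dump_slb,
knob_policy, recover_policy) is made up for illustration and none of it is
taken from the series. Treating an unrecovered error as a panic is a
simplifying assumption as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this only models the decision logic. */
enum mce_outcome { MCE_CONTINUE, MCE_PANIC };

struct mce_event {
    bool recovered;   /* did the handler recover, e.g. by flushing the SLB? */
    bool slb_dumped;  /* set once the SLB content has been logged */
};

/* Stand-in for logging the SLB entries at the time of the error. */
static void dump_slb(struct mce_event *ev)
{
    assert(ev != NULL);
    ev->slb_dumped = true;
}

/*
 * Policy from the patch description: a knob lets the user force a panic
 * (to capture a dump) even though recovery succeeded.
 */
static enum mce_outcome knob_policy(struct mce_event *ev, bool panic_on_recovered)
{
    dump_slb(ev);
    if (!ev->recovered || panic_on_recovered)
        return MCE_PANIC;
    return MCE_CONTINUE;
}

/*
 * Policy suggested in this reply: always log the SLB content, then
 * continue whenever recovery succeeded. There is no knob.
 */
static enum mce_outcome recover_policy(struct mce_event *ev)
{
    dump_slb(ev);
    return ev->recovered ? MCE_CONTINUE : MCE_PANIC;
}
```

In both variants the SLB content is logged before any decision is made. The
only difference is whether a recovered error can still end in a panic.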

cheers

