[PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

Fri Aug 10 21:04:26 AEST 2018

Michal Suchánek <msuchanek at suse.de> writes:
> On Wed, 8 Aug 2018 21:07:11 +0530
> "Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> wrote:
>> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> > Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:  
>> >> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>> >>
>> >> Introduce recovery action for recovered memory errors (MCEs).
>> >> There are soft memory errors like SLB Multihit, which can be a
>> >> result of a bad hardware OR software BUG. Kernel can easily
>> >> recover from these soft errors by flushing SLB contents. After the
>> >> recovery kernel can still continue to function without any issue.
>> >> But in some scenario's we may keep getting these soft errors until
>> >> the root cause is fixed. To be able to analyze and find the root
>> >> cause, best way is to gather enough data and system state at the
>> >> time of MCE. Hence this patch introduces a sysctl knob where user
>> >> can decide either to continue after recovery or panic the kernel
>> >> to capture the dump.  
>> > 
>> > I'm not convinced we want this.
>> > 
>> > As we've discovered it's often not possible to reconstruct what
>> > happened based on a dump anyway.
>> > 
>> > The key thing you need is the content of the SLB and that's not
>> > included in a dump.
>> > 
>> > So I think we should dump the SLB content when we get the MCE (which
>> > this series does) and any other useful info, and then if we can
>> > recover we should.
>> 
>> The reasoning there is what if we got multi-hit due to some
>> corruption in slb_cache_ptr. ie. some part of kernel is wrongly
>> updating the paca data structure due to wrong pointer. Now that is
>> far fetched, but then possible right?. Hence the idea that, if we
>> don't have much insight into why a slb multi-hit occur from the dmesg
>> which include slb content, slb_cache contents etc, there should be an
>> easy way to force a dump that might assist in further debug.
>
> Nonetheless this turns all MCEs into crashes. Are there any MCEs that
> could happen during normal operation and should be handled by default?

An MCE should always be an indication of an abnormal condition, but
the exact set of things that are reported as MCEs is CPU specific, and
potentially even configurable at the hardware level.

However we only "handle" certain types of MCEs, so if we get an MCE for
something we don't understand then we'll panic already.

SLB multi-hit / parity error is one that we do handle (on bare metal),
because there is a well defined recovery action.

cheers