[PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

Thu Aug 9 18:09:45 AEST 2018

On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> On Thu, 09 Aug 2018 16:34:07 +1000
> Michael Ellerman <mpe at ellerman.id.au> wrote:
> 
> > "Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> writes:
> > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> > >> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:  
> > >>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> > >>>
> > >>> Introduce recovery action for recovered memory errors (MCEs). There are
> > >>> soft memory errors like SLB Multihit, which can be a result of a bad
> > >>> hardware OR software BUG. Kernel can easily recover from these soft errors
> > >>> by flushing SLB contents. After the recovery kernel can still continue to
> > >>> function without any issue. But in some scenario's we may keep getting
> > >>> these soft errors until the root cause is fixed. To be able to analyze and
> > >>> find the root cause, best way is to gather enough data and system state at
> > >>> the time of MCE. Hence this patch introduces a sysctl knob where user can
> > >>> decide either to continue after recovery or panic the kernel to capture the
> > >>> dump.  
> > >> 
> > >> I'm not convinced we want this.
> > >> 
> > >> As we've discovered it's often not possible to reconstruct what happened
> > >> based on a dump anyway.
> > >> 
> > >> The key thing you need is the content of the SLB and that's not included
> > >> in a dump.
> > >> 
> > >> So I think we should dump the SLB content when we get the MCE (which
> > >> this series does) and any other useful info, and then if we can recover
> > >> we should.  
> > >
> > > The reasoning there is what if we got multi-hit due to some corruption 
> > > in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> > > data structure due to wrong pointer. Now that is far fetched, but then 
> > > possible right?. Hence the idea that, if we don't have much insight into 
> > > why a slb multi-hit occur from the dmesg which include slb content, 
> > > slb_cache contents etc, there should be an easy way to force a dump that 
> > > might assist in further debug.  
> > 
> > If you're debugging something complex that you can't determine from the
> > SLB dump then you should be running a debug kernel anyway. And if
> > anything you want to drop into xmon and sit there, preserving the most
> > state, rather than taking a dump.
> 
> I'm not saying for a dump specifically, just some form of crash. And we
> really should have an option to xmon on panic, but that's another story.

That's fine during development or in a lab, not something we could
enforce in a customer environment, could we?

> I think HA/failover kind of environments use options like this too. If
> anything starts going bad they don't want to try limping along but stop
> ASAP.

Right. And in this particular case, can we guarantee no corruption
(leading to or post the multihit recovery) when running a customer workload,
is the question...

Ananth