[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER

Nicholas Piggin npiggin at gmail.com
Wed Jun 28 17:04:15 AEST 2017


On Wed, 28 Jun 2017 12:02:55 +0530
Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:

> On 06/28/2017 10:11 AM, Nicholas Piggin wrote:
> > On Wed, 28 Jun 2017 09:00:05 +0530
> > Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:
> >   
> >> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:  
> >>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:    
> >>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:    
> >>>>> It has been observed the xscom bit in HMER gets stuck (as-yet    
> >>>>
> >>>> We see that stuck because opal never clears it after scom read/write.
> >>>> The bit is cleared just before the next scom read/write. I am not sure
> >>>> what it was left uncleared until next scom read/write kicks in.    
> >>>
> >>> Because we don't care ?     
> >>
> >> looking at the code it looks like we didn't care. I sent out a patch
> >> that clears them once scom operation is complete.
> >>  
> >>> It should not be enabled in HMEER...    
> >>
> >> Yes, we don't enable them in HMEER.
> >>  
> >>>>    
> >>>>> unkonwn root cause -- HMEER should disable those exceptions).
> >>>>> This causes HMIs to be continually taken.
> >>>>>
> >>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
> >>>>>
> >>>>> Add some attempt to handle this by clearing the HMER and HMEER.
> >>>>>
> >>>>> Try to clear HMER for other unknown HMIs (alternative is to not
> >>>>> recover).    
> >>>>
> >>>> I think we should be just ok with clearing out and masking them again.    
> >>>
> >>> Right but we need to understand why we are taking the HMI in the first
> >>> place since it's not enabled in HMEER unless something's wrong there.
> >>> Is that reproduceable ?    
> >>
> >> We did debug it yesterday and found the reason. Akshay sent out a patch
> >> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/  
> > 
> > Given that this bug was caused by Linux, and not due to an actual
> > HMI (and therefore would not be fixed by clearing the HMER/HMEER
> > bits), I wonder if this patch is still warranted. HMEER could be
> > messed up somehow, so maybe a simplified version that just notes
> > the unexpected HMI and masks out HMEER.
> > 
> > Any opinions?  
> 
> Yeah I agree with having simplified version so that it will help us to
> detect if we at all mess up with HMEER in future.

How about something very simple?

[PATCH] hmi: report HMEER and fail when encountering unknown HMI

In the interest of minimising corruption and improving debugging, when
encountering an unknown HMI, report HMEER and HMER with PR_ERR, and do
not recover.

This is an improvement on existing behaviour, which in the best case
will just continue to take HMIs because the exception remains asserted
and this will cause the system to become unstable.

Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
---
 core/hmi.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/core/hmi.c b/core/hmi.c
index 84f2c2d6..3329d055 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -823,6 +823,15 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		}
 	}
 
+	if (hmer) {
+		uint64_t hmeer = mfspr(SPR_HMEER);
+
+		prlog(PR_ERR, "HMI: Unexpected exception HMEER = 0x%016llx "
+			      "HMER = 0x%016llx (not recovering)\n",
+			      hmeer, hmer);
+		recover = 0;
+	}
+
 	if (recover == 0)
 		disable_fast_reboot("Unrecoverable HMI");
 	/*
-- 
2.11.0



More information about the Skiboot mailing list