[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER

Tue Jun 27 22:32:45 AEST 2017

On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:
> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
> > It has been observed the xscom bit in HMER gets stuck (as-yet
> 
> We see that stuck because opal never clears it after scom read/write.
> The bit is cleared just before the next scom read/write. I am not sure
> why it was left uncleared until the next scom read/write kicks in.

Because we don't care? It should not be enabled in HMEER...
> 
> > unknown root cause -- HMEER should disable those exceptions).
> > This causes HMIs to be continually taken.
> > 
> > HMI: Received HMI interrupt: HMER = 0x0040000000000000
> > 
> > Add some attempt to handle this by clearing the HMER and HMEER.
> > 
> > Try to clear HMER for other unknown HMIs (alternative is to not
> > recover).
> 
> I think we should be just ok with clearing out and masking them again.

Right but we need to understand why we are taking the HMI in the first
place since it's not enabled in HMEER unless something's wrong there.
Is that reproducible?

> > 
> > There seems to be no point in continually taking an HMI that will
> > never be handled. By not handling it we already implicitly are
> > trying to "continue" without solving anything aren't we?
> 
> We do handle the ones that could harm system functioning. The rest we
> mask. Other than the xscom-related bits we also mask bits 6, 16 and 17,
> which do not look harmful. I think we should just mask them again in
> HMEER if we get HMIs for the bits that we already masked.

Ben.

> > 
> > ---
> >  core/hmi.c          | 26 ++++++++++++++++++++++++++
> >  hw/xscom.c          |  5 +----
> >  include/processor.h |  7 +++++++
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> > 
> > diff --git a/core/hmi.c b/core/hmi.c
> > index 84f2c2d6..7ab5810d 100644
> > --- a/core/hmi.c
> > +++ b/core/hmi.c
> > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
> >  		}
> >  	}
> > 
> > +	if (hmer & SPR_HMER_XSCOM_MASK) {
> > +		hmer &= ~SPR_HMER_XSCOM_MASK;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
> > +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> > +		sync();
> > +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
> > +							SPR_HMER_XSCOM_DONE));
> > +		isync();
> > +
> > +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
> > +	}
> > +
> > +	if (hmer) {
> > +		hmer = 0;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_WARNING;
> > +			hmi_evt->type = 0; /* Anything sane we can put here? */
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> 
> This one is also unexpected, should we clear and mask this as well ?
> Otherwise we would keep getting this HMI and warnings would flood host
> kernel.
> 
> Thanks,
> -Mahesh.
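
For reference, the "clear and mask again" idea discussed above can be sketched in plain C. This is only an illustration and is not skiboot code: the registers are simulated with an ordinary struct (`fake_sprs`), the helper names are hypothetical, and the XSCOM_FAIL bit position is an assumption. The XSCOM_DONE position is inferred from the logged value HMER = 0x0040000000000000, which is bit 9 in IBM (MSB = bit 0) numbering. Real firmware would use mfspr/mtspr with the appropriate sync/isync, and the real HMER has its own write semantics, none of which is modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(n)	(0x8000000000000000ULL >> (n))

/* Bit 9 matches the HMER value logged in this thread; bit 8 is an assumption. */
#define HMER_XSCOM_FAIL	PPC_BIT(8)
#define HMER_XSCOM_DONE	PPC_BIT(9)
#define HMER_XSCOM_MASK	(HMER_XSCOM_FAIL | HMER_XSCOM_DONE)

/* Hypothetical stand-in for the two SPRs; real code would use mfspr/mtspr. */
struct fake_sprs {
	uint64_t hmer;	/* pending HMI causes */
	uint64_t hmeer;	/* which causes are allowed to raise an HMI */
};

/*
 * Clear every pending cause that is not in 'handled' and mask it in HMEER
 * so that it cannot raise another HMI. Returns the bits that were
 * unexpected, so the caller can log them or queue an event.
 */
uint64_t clear_and_mask_unexpected(struct fake_sprs *s, uint64_t handled)
{
	uint64_t unexpected = s->hmer & ~handled;

	s->hmeer &= ~unexpected;	/* mask first, so it cannot re-fire */
	s->hmer &= ~unexpected;		/* then drop the pending bits */
	return unexpected;
}

/* Small self-check reproducing the case reported in this thread. */
void hmi_sketch_selfcheck(void)
{
	struct fake_sprs s = { 0x0040000000000000ULL, ~0ULL };

	assert(clear_and_mask_unexpected(&s, 0) == HMER_XSCOM_DONE);
	assert(s.hmer == 0);
	assert((s.hmeer & HMER_XSCOM_DONE) == 0);
}
```

Masking before clearing is a design choice of this sketch: a bit that is stuck or gets set again immediately can then no longer trigger a fresh HMI. It does not answer the question of why the HMI is taken when the bit should already be disabled in HMEER.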