[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER

Tue Jun 27 15:39:12 AEST 2017

On Tue, 27 Jun 2017 10:46:34 +0530
Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:

> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
> > It has been observed the xscom bit in HMER gets stuck (as-yet  
> 
> We see it stuck because OPAL never clears it after a scom read/write.
> The bit is only cleared just before the next scom read/write. I am not
> sure why it was left uncleared until the next scom read/write kicks in.

Right, but did we work out why it's taking a HMI or getting enabled
in the HMEER?

> > unknown root cause -- HMEER should disable those exceptions).
> > This causes HMIs to be continually taken.
> > 
> > HMI: Received HMI interrupt: HMER = 0x0040000000000000
> > 
> > Add some attempt to handle this by clearing the HMER and HMEER.
> > 
> > Try to clear HMER for other unknown HMIs (alternative is to not
> > recover).  
> 
> I think we should be fine just clearing them and masking them again.
> 
> > 
> > There seems to be no point in continually taking an HMI that will
> > never be handled. By not handling it we already implicitly are
> > trying to "continue" without solving anything aren't we?  
> 
> We do handle the ones that could harm system functioning. The rest we
> mask. Other than the xscom-related bits we also mask bits 6, 16 and 17,
> which do not look harmful. I think we should just mask them again in
> HMEER if we get HMIs for the bits that we already masked.

Okay. My thinking is that if there was a hw or sw error that causes
some unknown HME, would it be better to stop and crash ASAP?

Either way seems better than our current approach of trying to
continue without masking. So it's your call.
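
For reference, a minimal sketch of the "clear and mask again" behaviour
for the xscom bits. This is not the patch itself: struct fake_sprs and
clear_and_mask_xscom() are made-up stand-ins for the real SPR accessors,
and the bit positions (IBM bits 8 and 9) are an assumption, chosen so that
the HMER value from the log above (0x0040000000000000, i.e. only IBM bit 9
set) is covered.

```c
#include <stdint.h>

/* IBM (big-endian) bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(n)	(1ULL << (63 - (n)))

/* Assumed bit positions, for illustration only. */
#define HMER_XSCOM_FAIL	PPC_BIT(8)
#define HMER_XSCOM_DONE	PPC_BIT(9)
#define HMER_XSCOM_MASK	(HMER_XSCOM_FAIL | HMER_XSCOM_DONE)

/* Plain variables standing in for the real HMER/HMEER registers. */
struct fake_sprs {
	uint64_t hmer;
	uint64_t hmeer;
};

/*
 * If an xscom bit is pending, disable both xscom bits in HMEER and
 * clear them from HMER. Returns the HMER bits still pending afterwards.
 */
static uint64_t clear_and_mask_xscom(struct fake_sprs *s)
{
	if (s->hmer & HMER_XSCOM_MASK) {
		s->hmeer &= ~HMER_XSCOM_MASK;
		s->hmer &= ~HMER_XSCOM_MASK;
	}
	return s->hmer;
}
```

With hmer = 0x0040000000000000 and everything enabled in hmeer, this
leaves no bits pending and the two xscom bits disabled, so the same
condition should not raise another HMI.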

> 
> > 
> > ---
> >  core/hmi.c          | 26 ++++++++++++++++++++++++++
> >  hw/xscom.c          |  5 +----
> >  include/processor.h |  7 +++++++
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> > 
> > diff --git a/core/hmi.c b/core/hmi.c
> > index 84f2c2d6..7ab5810d 100644
> > --- a/core/hmi.c
> > +++ b/core/hmi.c
> > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
> >  		}
> >  	}
> > 
> > +	if (hmer & SPR_HMER_XSCOM_MASK) {
> > +		hmer &= ~SPR_HMER_XSCOM_MASK;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
> > +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> > +		sync();
> > +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
> > +							SPR_HMER_XSCOM_DONE));
> > +		isync();
> > +
> > +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
> > +	}
> > +
> > +	if (hmer) {
> > +		hmer = 0;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_WARNING;
> > +			hmi_evt->type = 0; /* Anything sane we can put here? */
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}  
> 
> This one is also unexpected; should we clear and mask it as well?

Probably yes. Either that or set recover = 0. What do you think?
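
To make the two options concrete, here is a rough sketch of how they
differ for the unknown bits. The names (enum unknown_policy, struct
hmi_result, handle_unknown_bits()) are hypothetical and only model the
decision; the real handler would also have to write the SPRs and queue
the event.

```c
#include <stdint.h>

/* The two alternatives discussed for unknown HMER bits. */
enum unknown_policy {
	MASK_AND_CONTINUE,	/* disable the bits in HMEER, keep going */
	DO_NOT_RECOVER,		/* leave HMEER alone, report recover = 0 */
};

struct hmi_result {
	uint64_t hmeer;		/* HMEER value to write back */
	int recover;		/* 1 = continue, 0 = do not recover */
};

/*
 * 'unknown' holds the HMER bits left over after all known causes were
 * handled. With no leftover bits, nothing changes under either policy.
 */
static struct hmi_result handle_unknown_bits(uint64_t unknown, uint64_t hmeer,
					     enum unknown_policy policy)
{
	struct hmi_result r = { .hmeer = hmeer, .recover = 1 };

	if (!unknown)
		return r;

	if (policy == MASK_AND_CONTINUE)
		r.hmeer &= ~unknown;	/* never take an HMI for these again */
	else
		r.recover = 0;		/* stop instead of masking */

	return r;
}
```

Either branch avoids the current behaviour, where the HMI keeps firing
for a bit that nobody handles.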

Thanks,
Nick