[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER

Mahesh Jagannath Salgaonkar mahesh at linux.vnet.ibm.com
Wed Jun 28 13:30:05 AEST 2017


On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:
>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
>>> It has been observed the xscom bit in HMER gets stuck (as-yet
>>
>> We see it stuck because OPAL never clears it after a scom read/write.
>> The bit is cleared just before the next scom read/write. I am not sure
>> why it was left uncleared until the next scom read/write kicks in.
> 
> Because we don't care ? 

Looking at the code, it looks like we didn't care. I sent out a patch
that clears them once the scom operation is complete.
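
As a rough sketch of the difference (this is not the actual patch): a plain
variable stands in for the HMER SPR, the helper names are hypothetical, and
the bit positions are assumptions chosen so that XSCOM_DONE matches the
0x0040000000000000 value in the log quoted below.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(n)		(1ULL << (63 - (n)))

/* Assumed bit positions, for illustration only. */
#define HMER_XSCOM_FAIL		PPC_BIT(8)
#define HMER_XSCOM_DONE		PPC_BIT(9)
#define HMER_XSCOM_MASK		(HMER_XSCOM_FAIL | HMER_XSCOM_DONE)

/* Stand-in for the real SPR so the sketch runs on any machine. */
static uint64_t fake_hmer;

/* Hypothetical model of the hardware flagging a completed access. */
static void fake_hw_complete(void)
{
	fake_hmer |= HMER_XSCOM_DONE;
}

/* Clear only before the access: DONE stays set until the next access. */
static void scom_access_clear_before(void)
{
	fake_hmer &= ~HMER_XSCOM_MASK;
	fake_hw_complete();
}

/* Also clear once the access is complete: nothing is left behind. */
static void scom_access_clear_after(void)
{
	fake_hmer &= ~HMER_XSCOM_MASK;
	fake_hw_complete();
	fake_hmer &= ~HMER_XSCOM_MASK;
}
```

With the first variant the register keeps showing the DONE bit between
accesses, which is the "stuck" value people have been seeing.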

> It should not be enabled in HMEER...

Yes, we don't enable them in HMEER.
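
For illustration, spotting an HMER bit that is raised although it is not
enabled in HMEER, and masking it again, could look roughly like this. The
function names are hypothetical and plain integers stand in for the SPRs.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(n)	(1ULL << (63 - (n)))

/* HMER bits that are set even though HMEER does not enable them. */
static uint64_t hmer_unexpected_bits(uint64_t hmer, uint64_t hmeer)
{
	return hmer & ~hmeer;
}

/* New HMEER value with the given bits disabled (masked) again. */
static uint64_t hmeer_mask_bits(uint64_t hmeer, uint64_t bits)
{
	return hmeer & ~bits;
}
```

Feeding in the HMER value from the log with an HMEER that leaves bit 9
disabled reports exactly that one bit as unexpected.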

>>
>>> unknown root cause -- HMEER should disable those exceptions).
>>> This causes HMIs to be continually taken.
>>>
>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>
>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>
>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>> recover).
>>
>> I think we should be fine just clearing them and masking them again.
> 
> Right but we need to understand why we are taking the HMI in the first
> place since it's not enabled in HMEER unless something's wrong there.
> Is that reproducible ?

We did debug it yesterday and found the reason. Akshay sent out a patch
that fixes the issue. http://patchwork.ozlabs.org/patch/781434/

Thanks,
-Mahesh.

> 
>>>
>>> There seems to be no point in continually taking an HMI that will
>>> never be handled. By not handling it we already implicitly are
>>> trying to "continue" without solving anything aren't we?
>>
>> We do handle the ones that could harm system functioning. The rest
>> we mask. Other than the xscom related bits we also mask bits 6, 16 and
>> 17, which do not look harmful. I think we should just mask them again in
>> HMEER if we get HMIs for the bits that we already masked.
> 
> Ben.
> 
>>>
>>> ---
>>>  core/hmi.c          | 26 ++++++++++++++++++++++++++
>>>  hw/xscom.c          |  5 +----
>>>  include/processor.h |  7 +++++++
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/core/hmi.c b/core/hmi.c
>>> index 84f2c2d6..7ab5810d 100644
>>> --- a/core/hmi.c
>>> +++ b/core/hmi.c
>>> @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>>>  		}
>>>  	}
>>>
>>> +	if (hmer & SPR_HMER_XSCOM_MASK) {
>>> +		hmer &= ~SPR_HMER_XSCOM_MASK;
>>> +		if (hmi_evt) {
>>> +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
>>> +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
>>> +			queue_hmi_event(hmi_evt, recover);
>>> +		}
>>> +		sync();
>>> +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
>>> +							SPR_HMER_XSCOM_DONE));
>>> +		isync();
>>> +
>>> +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
>>> +	}
>>> +
>>> +	if (hmer) {
>>> +		hmer = 0;
>>> +		if (hmi_evt) {
>>> +			hmi_evt->severity = OpalHMI_SEV_WARNING;
>>> +			hmi_evt->type = 0; /* Anything sane we can put here? */
>>> +			queue_hmi_event(hmi_evt, recover);
>>> +		}
>>
>> This one is also unexpected; should we clear and mask it as well ?
>> Otherwise we would keep getting this HMI and the warnings would flood
>> the host kernel.
>>
>> Thanks,
>> -Mahesh.
> 


