[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER
Mahesh Jagannath Salgaonkar
mahesh at linux.vnet.ibm.com
Wed Jun 28 13:30:05 AEST 2017
On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:
>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
>>> It has been observed the xscom bit in HMER gets stuck (as-yet
>>
>> We see that stuck because opal never clears it after scom read/write.
>> The bit is cleared just before the next scom read/write. I am not sure
>> what it was left uncleared until next scom read/write kicks in.
>
> Because we don't care ?
looking at the code it looks like we didn't care. I sent out a patch
that clears them once scom operation is complete.
> It should not be enabled in HMEER...
Yes, we don't enable them in HMEER.
>>
>>> unkonwn root cause -- HMEER should disable those exceptions).
>>> This causes HMIs to be continually taken.
>>>
>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>
>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>
>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>> recover).
>>
>> I think we should be just ok with clearing out and masking them again.
>
> Right but we need to understand why we are taking the HMI in the first
> place since it's not enabled in HMEER unless something's wrong there.
> Is that reproduceable ?
We did debug it yesterday and found the reason. Akshay sent out a patch
that fixes the issue. http://patchwork.ozlabs.org/patch/781434/
Thanks,
-Mahesh.
>
>>>
>>> There seems to be no point in continually taking an HMI that will
>>> never be handled. By not handling it we already implicitly are
>>> trying to "continue" without solving anything aren't we?
>>
>> We do handle the ones that could cause harm to system functioning. Rest
>> we mask it. Other than xscom related bits we also mask bit 6, 16 and 17
>> which does not look harmful. I think we should just mask them again in
>> HMEER if we get HMIs for the bits that we already masked.
>
> Ben.
>
>>>
>>> ---
>>> core/hmi.c | 26 ++++++++++++++++++++++++++
>>> hw/xscom.c | 5 +----
>>> include/processor.h | 7 +++++++
>>> 3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/core/hmi.c b/core/hmi.c
>>> index 84f2c2d6..7ab5810d 100644
>>> --- a/core/hmi.c
>>> +++ b/core/hmi.c
>>> @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>>> }
>>> }
>>>
>>> + if (hmer & SPR_HMER_XSCOM_MASK) {
>>> + hmer &= ~SPR_HMER_XSCOM_MASK;
>>> + if (hmi_evt) {
>>> + hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
>>> + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
>>> + queue_hmi_event(hmi_evt, recover);
>>> + }
>>> + sync();
>>> + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
>>> + SPR_HMER_XSCOM_DONE))
>>> + isync();
>>> +
>>> + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
>>> + }
>>> +
>>> + if (hmer) {
>>> + hmer = 0;
>>> + if (hmi_evt) {
>>> + hmi_evt->severity = OpalHMI_SEV_WARNING;
>>> + hmi_evt->type = 0; /* Anything sane we can put here? */
>>> + queue_hmi_event(hmi_evt, recover);
>>> + }
>>
>> This one is also unexpected, should we clear and mask this as well ?
>> Otherwise we would keep getting this HMI and warnings would flood host
>> kernel.
>>
>> Thanks,
>> -Mahesh.
>
More information about the Skiboot
mailing list