[Skiboot] [RFC][PATCH] hmi: clear xscom and unknown bits from HMER

Wed Jun 28 20:40:45 AEST 2017

On 06/28/2017 12:34 PM, Nicholas Piggin wrote:
> On Wed, 28 Jun 2017 12:02:55 +0530
> Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:
> 
>> On 06/28/2017 10:11 AM, Nicholas Piggin wrote:
>>> On Wed, 28 Jun 2017 09:00:05 +0530
>>> Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:
>>>   
>>>> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:  
>>>>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:    
>>>>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:    
>>>>>>> It has been observed the xscom bit in HMER gets stuck (as-yet    
>>>>>>
>>>>>> We see that stuck because opal never clears it after scom read/write.
>>>>>> The bit is cleared just before the next scom read/write. I am not sure
>>>>>> what it was left uncleared until next scom read/write kicks in.    
>>>>>
>>>>> Because we don't care ?     
>>>>
>>>> looking at the code it looks like we didn't care. I sent out a patch
>>>> that clears them once scom operation is complete.
>>>>  
>>>>> It should not be enabled in HMEER...    
>>>>
>>>> Yes, we don't enable them in HMEER.
>>>>  
>>>>>>    
>>>>>>> unkonwn root cause -- HMEER should disable those exceptions).
>>>>>>> This causes HMIs to be continually taken.
>>>>>>>
>>>>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>>>>>
>>>>>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>>>>>
>>>>>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>>>>>> recover).    
>>>>>>
>>>>>> I think we should be just ok with clearing out and masking them again.    
>>>>>
>>>>> Right but we need to understand why we are taking the HMI in the first
>>>>> place since it's not enabled in HMEER unless something's wrong there.
>>>>> Is that reproduceable ?    
>>>>
>>>> We did debug it yesterday and found the reason. Akshay sent out a patch
>>>> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/  
>>>
>>> Given that this bug was caused by Linux, and not due to an actual
>>> HMI (and therefore would not be fixed by clearing the HMER/HMEER
>>> bits), I wonder if this patch is still warranted. HMEER could be
>>> messed up somehow, so maybe a simplified version that just notes
>>> the unexpected HMI and masks out HMEER.
>>>
>>> Any opinions?  
>>
>> Yeah I agree with having simplified version so that it will help us to
>> detect if we at all mess up with HMEER in future.
> 
> How about something very simple?
> 
> [PATCH] hmi: report HMEER and fail when encountering unknown HMI
> 
> In the interest of minimising corruption and improving debugging, when
> encountering an unknown HMI, report HMEER and HMER with PR_ERR, and do
> not recover.
> 
> This is an improvement on existing behaviour, which in the best case
> will just continue to take HMIs because the exception remains asserted
> and this will cause the system to become unstable.
> 
> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
> ---
>  core/hmi.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/core/hmi.c b/core/hmi.c
> index 84f2c2d6..3329d055 100644
> --- a/core/hmi.c
> +++ b/core/hmi.c
> @@ -823,6 +823,15 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>  		}
>  	}
> 
> +	if (hmer) {

Should we check for messed up HMEER before treating it as unexpected HMI
? By the time we reach here, we would have cleared all the bits in the
hmer variable that we care about. For remaining bits that are still on
in hmer, if corresponding bit in HMEER is also set then we for sure know
that this is an unexpected HMI and HMEER has been messed up.

How about this check ?

	if (hmer & mfspr(SPR_HMEER)) {

> +		uint64_t hmeer = mfspr(SPR_HMEER);
> +
> +		prlog(PR_ERR, "HMI: Unexpected exception HMEER = 0x%016llx "
> +			      "HMER = 0x%016llx (not recovering)\n",
> +			      hmeer, hmer);
> +		recover = 0;
> +	}
> +
>  	if (recover == 0)
>  		disable_fast_reboot("Unrecoverable HMI");
>  	/*
> 

Thanks,
-Mahesh.