[v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

Wed Jun 13 14:06:54 AEST 2018

On 06/13/2018 09:36 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> writes:
>> On 06/12/2018 07:17 PM, Michael Ellerman wrote:
>>> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
>>>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>>>> index 2edc673be137..e56759d92356 100644
>>>> --- a/arch/powerpc/platforms/pseries/ras.c
>>>> +++ b/arch/powerpc/platforms/pseries/ras.c
>>>> @@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>>>>    	return 0; /* need to perform reset */
>>>>    }
>>>>    
>>>> +static int mce_handle_error(struct rtas_error_log *errp)
>>>> +{
>>>> +	struct pseries_errorlog *pseries_log;
>>>> +	struct pseries_mc_errorlog *mce_log;
>>>> +	int disposition = rtas_error_disposition(errp);
>>>> +	uint8_t error_type;
>>>> +
>>>> +	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>>>> +	if (pseries_log == NULL)
>>>> +		goto out;
>>>> +
>>>> +	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>>>> +	error_type = rtas_mc_error_type(mce_log);
>>>> +
>>>> +	if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
>>>> +			(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
>>>> +		slb_dump_contents();
>>>> +		slb_flush_and_rebolt();
>>>
>>> Aren't we back in virtual mode here?
>>>
>>> Don't we need to do the flush in real mode before turning the MMU back
>>> on? Otherwise we'll just take another multi-hit.
>>
>> slb_flush_and_rebolt does slbia, which keeps slb index 0. So kernel code
>> should not get another slb miss. We also make sure we don't touch stack
>> in slb_flush_and_rebolt(). So we flush everything and put vmalloc and
>> stack back. That should be ok with MMU on?
> 
> I don't think so.
> 
> Imagine we take a multi-hit accessing the paca. The machine check is
> delivered in real mode, so we can run and access the paca by its real
> address. But as soon as we turn the MMU back on, we'll take another
> multi-hit when we access the paca.
> 
> If I'm reading the code right we are turning the MMU back on essentially
> straight away when we rfid to machine_check_common().
> 

Yes, for the linear-mapped first 1TB we will take a multi-hit again.
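
To make the ordering concrete, here is a rough user-space toy model of the
scenario above. It is only a sketch: every toy_* name is made up for
illustration and none of this is kernel code. It assumes what was described
in this thread, namely that an access in real mode does not consult the SLB,
that two valid entries matching the same ESID cause a multi-hit once
translation is on, and that the flush keeps SLB index 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLB_ENTRIES 8	/* arbitrary size for the toy model */

struct toy_slbe {
	bool valid;
	uint64_t esid;
};

struct toy_cpu {
	bool mmu_on;		/* false: real mode, true: virtual mode */
	struct toy_slbe slb[TOY_SLB_ENTRIES];
};

enum toy_result { TOY_OK, TOY_SLB_MISS, TOY_MULTI_HIT };

/* Install an entry in a given slot (hypothetical helper). */
static void toy_slb_insert(struct toy_cpu *cpu, size_t slot, uint64_t esid)
{
	assert(slot < TOY_SLB_ENTRIES);
	cpu->slb[slot].valid = true;
	cpu->slb[slot].esid = esid;
}

/* Model one memory access to the segment identified by esid. */
static enum toy_result toy_access(const struct toy_cpu *cpu, uint64_t esid)
{
	size_t i, hits = 0;

	/* Real mode: the SLB is not consulted, so bad entries are harmless. */
	if (!cpu->mmu_on)
		return TOY_OK;

	for (i = 0; i < TOY_SLB_ENTRIES; i++)
		if (cpu->slb[i].valid && cpu->slb[i].esid == esid)
			hits++;

	if (hits == 0)
		return TOY_SLB_MISS;
	return hits == 1 ? TOY_OK : TOY_MULTI_HIT;
}

/* Toy flush: invalidate every entry except slot 0, as described above. */
static void toy_flush_keep_slot0(struct toy_cpu *cpu)
{
	size_t i;

	for (i = 1; i < TOY_SLB_ENTRIES; i++)
		cpu->slb[i].valid = false;
}
```

With a duplicate entry for the segment holding the paca, toy_access() is
fine while mmu_on is false and returns TOY_MULTI_HIT as soon as mmu_on is
set. Calling toy_flush_keep_slot0() first, while still in "real mode",
makes the same access succeed afterwards.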

-aneesh