[PATCH 05/13] powerpc/mce: Allow notifier callback to handle MCE

Fri Jun 21 17:05:08 AEST 2019

On 6/21/19 6:27 AM, Santosh Sivaraj wrote:
> From: Reza Arbab <arbab at linux.ibm.com>
> 
> If a notifier returns NOTIFY_STOP, consider the MCE handled, just as we
> do when machine_check_early() returns 1.
> 
> Signed-off-by: Reza Arbab <arbab at linux.ibm.com>
> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S      |  3 +++
>  arch/powerpc/kernel/mce.c                 | 28 ++++++++++++++++-------
>  3 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
> index f66f26ef3ce0..49ee8f08de2a 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -72,7 +72,7 @@ void machine_check_exception(struct pt_regs *regs);
>  void emulation_assist_interrupt(struct pt_regs *regs);
>  long do_slb_fault(struct pt_regs *regs, unsigned long ea);
>  void do_bad_slb_fault(struct pt_regs *regs, unsigned long ea, long err);
> -void machine_check_notify(struct pt_regs *regs);
> +long machine_check_notify(struct pt_regs *regs);
>  
>  /* signals, syscalls and interrupts */
>  long sys_swapcontext(struct ucontext __user *old_ctx,
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 2e56014fca21..c83e38a403fd 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -460,6 +460,9 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
>  
>  	addi	r3,r1,STACK_FRAME_OVERHEAD
>  	bl	machine_check_notify
> +	ld	r11,RESULT(r1)
> +	or	r3,r3,r11
> +	std	r3,RESULT(r1)
>  
>  	ld	r12,_MSR(r1)
>  BEGIN_FTR_SECTION
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index 0ab171b41ede..912efe58e0b1 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -647,16 +647,28 @@ long hmi_exception_realmode(struct pt_regs *regs)
>  	return 1;
>  }
>  
> -void machine_check_notify(struct pt_regs *regs)
> +long machine_check_notify(struct pt_regs *regs)
>  {
> -	struct machine_check_event evt;
> +	int index = __this_cpu_read(mce_nest_count) - 1;
> +	struct machine_check_event *evt;
> +	int rc;
>  
> -	if (!get_mce_event(&evt, MCE_EVENT_DONTRELEASE))
> -		return;
> +	if (index < 0 || index >= MAX_MC_EVT)
> +		return 0;
> +
> +	evt = this_cpu_ptr(&mce_event[index]);
>  
> -	blocking_notifier_call_chain(&mce_notifier_list, 0, &evt);
> +	rc = blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
> +	if (rc & NOTIFY_STOP_MASK) {
> +		evt->disposition = MCE_DISPOSITION_RECOVERED;
> +		regs->msr |= MSR_RI;

What is the reason for setting MSR_RI? I don't think this is a good
idea. MSR_RI = 0 means the system took the MCE interrupt while the SRR0
and SRR1 contents were still live, and the MCE interrupt overwrote them.
Hence this interrupt is unrecoverable, irrespective of whether the
machine check handler recovers from the error or not.
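
To illustrate the point, here is a minimal user-space sketch of a notifier
result check that leaves MSR_RI untouched. It is only a model under stated
assumptions, not kernel code and not a proposed replacement hunk: the
model_* types and model_mce_notify_result() are hypothetical stand-ins for
struct pt_regs, struct machine_check_event and machine_check_notify(), and
the MSR_RI and NOTIFY_STOP_MASK values are restated locally.

```c
#include <assert.h>
#include <stddef.h>

/* Restated locally for this model; the kernel headers define the real ones. */
#define MSR_RI           0x2UL    /* recoverable-interrupt bit in the MSR */
#define NOTIFY_STOP_MASK 0x8000   /* set in a notifier return of NOTIFY_STOP */

/* Hypothetical stand-ins for struct pt_regs and struct machine_check_event. */
struct model_regs {
	unsigned long msr;
};

enum model_disposition {
	MODEL_NOT_RECOVERED,
	MODEL_RECOVERED,
};

struct model_event {
	enum model_disposition disposition;
};

/*
 * Returns 1 if the MCE may be treated as handled, 0 otherwise.
 * regs is const on purpose: the saved MSR is only read, never modified.
 */
static long model_mce_notify_result(const struct model_regs *regs,
				    struct model_event *evt,
				    int notifier_rc)
{
	assert(regs != NULL && evt != NULL);

	/* No notifier claimed the event. */
	if (!(notifier_rc & NOTIFY_STOP_MASK))
		return 0;

	/*
	 * MSR_RI clear: SRR0/SRR1 were live when the MCE hit and have been
	 * overwritten, so the interrupted context cannot be resumed even if
	 * a notifier dealt with the underlying error.
	 */
	if (!(regs->msr & MSR_RI))
		return 0;

	evt->disposition = MODEL_RECOVERED;
	return 1;
}
```

In this model a notifier returning NOTIFY_STOP only upgrades the
disposition when the saved MSR already had RI set. Whether the real handler
should bail out at this point or elsewhere is a separate question.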

Thanks,
-Mahesh.