[PATCH 1/3] powerpc/64s: fix handling of non-synchronous machine checks
Nicholas Piggin
npiggin at gmail.com
Tue Feb 28 19:43:34 AEDT 2017
On Tue, 28 Feb 2017 11:27:29 +0530
Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> wrote:
> On 02/28/2017 07:30 AM, Nicholas Piggin wrote:
> > A synchronous machine check is an exception raised by the attempt to
> > execute the current instruction. If the error can't be corrected, it
> > can make sense to SIGBUS the currently running process.
> >
> > In other cases, the error condition is not related to the current
> > instruction, so killing the current process is not the right thing to
> > do.
> >
> > Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> > practical change. It will be used to handle POWER9 asynchronous
> > machine checks.
> >
> > Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
> > ---
> > arch/powerpc/platforms/powernv/opal.c | 21 ++++++---------------
> > 1 file changed, 6 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> > index 86d9fde93c17..e0f856bfbfe8 100644
> > --- a/arch/powerpc/platforms/powernv/opal.c
> > +++ b/arch/powerpc/platforms/powernv/opal.c
> > @@ -395,7 +395,6 @@ static int opal_recover_mce(struct pt_regs *regs,
> > struct machine_check_event *evt)
> > {
> > int recovered = 0;
> > - uint64_t ea = get_mce_fault_addr(evt);
> >
> > if (!(regs->msr & MSR_RI)) {
> > /* If MSR_RI isn't set, we cannot recover */
> > @@ -404,26 +403,18 @@ static int opal_recover_mce(struct pt_regs *regs,
> > } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
> > /* Platform corrected itself */
> > recovered = 1;
> > - } else if (ea && !is_kernel_addr(ea)) {
> > + } else if (evt->severity == MCE_SEV_FATAL) {
> > + /* Fatal machine check */
> > + pr_err("Machine check interrupt is fatal\n");
> > + recovered = 0;
>
> Setting recovered = 0 would trigger kernel panic. Should we panic the
> kernel for asynchronous errors ?
If it's not recoverable, I don't see what other option we have. SRR0 is
meaningless for async machine checks, so this is much the same as what
we already do when a synchronous MCE occurs and we either have no
process to kill or were running in the kernel.
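
To make the cases concrete, here is a rough standalone model of the
decision order as I'd describe it. It is only a sketch, not the kernel
code: the type and function names (mce_model, classify_mce, the OUTCOME_*
values) are made up for illustration, and the final "sync error in user
mode gets SIGBUS" branch is assumed from the changelog text rather than
taken from the hunk quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's MCE fields; not real definitions. */
enum mce_disposition { DISPOSITION_NOT_RECOVERED, DISPOSITION_RECOVERED };
enum mce_severity { SEV_NO_ERROR, SEV_WARNING, SEV_ERROR_SYNC, SEV_FATAL };

struct mce_model {
	bool msr_ri;		/* MSR_RI was set when the interrupt was taken */
	bool user_mode;		/* interrupted context was a user process */
	enum mce_disposition disposition;
	enum mce_severity severity;
};

enum mce_outcome { OUTCOME_CONTINUE, OUTCOME_SIGBUS, OUTCOME_PANIC };

/* Illustrative only: mirrors the order of checks discussed in this thread. */
static enum mce_outcome classify_mce(const struct mce_model *m)
{
	/* Without MSR_RI the interrupted state is lost: cannot recover. */
	if (!m->msr_ri)
		return OUTCOME_PANIC;

	/* Platform corrected the error itself. */
	if (m->disposition == DISPOSITION_RECOVERED)
		return OUTCOME_CONTINUE;

	/* Fatal severity: nothing to blame on the current instruction. */
	if (m->severity == SEV_FATAL)
		return OUTCOME_PANIC;

	/*
	 * Assumed branch: a synchronous error caused by the current
	 * instruction in user mode can be contained by killing the process.
	 */
	if (m->severity == SEV_ERROR_SYNC && m->user_mode)
		return OUTCOME_SIGBUS;

	/* Synchronous error in kernel, or no process to kill. */
	return OUTCOME_PANIC;
}
```

The point is that the fatal case and the "sync MCE in kernel" case fall
out the same way: there is no context we can meaningfully sacrifice, so
recovered stays 0.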
Thanks,
Nick