[PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

Christophe LEROY christophe.leroy at c-s.fr
Sat Oct 13 19:56:24 AEDT 2018



On 13/10/2018 at 10:48, Nicholas Piggin wrote:
> On Sat, 13 Oct 2018 08:29:48 +0000
> Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> 
>> On 10/11/2018 02:31 PM, Christophe LEROY wrote:
>>>
>>>
>>> On 09/10/2018 at 13:16, Nicholas Piggin wrote:
>>>> On Tue, 9 Oct 2018 09:36:18 +0000
>>>> Christophe Leroy <christophe.leroy at c-s.fr> wrote:
>>>>   
>>>>> On 10/09/2018 05:30 AM, Nicholas Piggin wrote:
>>>>>> On Tue, 9 Oct 2018 06:46:30 +0200
>>>>>> Christophe LEROY <christophe.leroy at c-s.fr> wrote:
>>>>>>> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
>>>>>>>> On Mon, 8 Oct 2018 17:39:11 +0200
>>>>>>>> Christophe LEROY <christophe.leroy at c-s.fr> wrote:
>>>>>>>>> Hi Nick,
>>>>>>>>>
>>>>>>>>> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
>>>>>>>>>> Use nmi_enter similarly to system reset interrupts. This uses the NMI
>>>>>>>>>> printk buffers and turns off various debugging facilities, which
>>>>>>>>>> helps avoid tripping on ourselves or other CPUs.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
>>>>>>>>>> ---
>>>>>>>>>>       arch/powerpc/kernel/traps.c | 9 ++++++---
>>>>>>>>>>       1 file changed, 6 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>>>>>>>>>> index 2849c4f50324..6d31f9d7c333 100644
>>>>>>>>>> --- a/arch/powerpc/kernel/traps.c
>>>>>>>>>> +++ b/arch/powerpc/kernel/traps.c
>>>>>>>>>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
>>>>>>>>>>       void machine_check_exception(struct pt_regs *regs)
>>>>>>>>>>       {
>>>>>>>>>> -    enum ctx_state prev_state = exception_enter();
>>>>>>>>>>           int recover = 0;
>>>>>>>>>> +    bool nested = in_nmi();
>>>>>>>>>> +    if (!nested)
>>>>>>>>>> +        nmi_enter();
>>>>>>>>>
>>>>>>>>> This alters preempt_count, so when die() is called,
>>>>>>>>> in_interrupt() returns true although the trap didn't happen in
>>>>>>>>> an interrupt, and oops_end() panics with "Fatal exception in interrupt"
>>>>>>>>> instead of gently sending SIGBUS to the faulting app.
>>>>>>>>
>>>>>>>> Thanks for tracking that down.
>>>>>>>>> Any idea on how to fix this ?
>>>>>>>>
>>>>>>>> I would say we have to deliver the sigbus by hand.
>>>>>>>>
>>>>>>>>         if ((user_mode(regs)))
>>>>>>>>             _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
>>>>>>>>         else
>>>>>>>>             die("Machine check", regs, SIGBUS);
>>>>>>>
>>>>>>> And what about all the other things done by 'die()' ?
>>>>>>>
>>>>>>> And what if it is a kernel thread ?
>>>>>>>
>>>>>>> In one of my boards, I have a kernel thread regularly checking the HW,
>>>>>>> and if it gets a machine check I expect it to gently stop and the die
>>>>>>> notification to be delivered to all registered notifiers.
>>>>>>>
>>>>>>> Before this patch, it was working well.
>>>>>>
>>>>>> I guess the alternative is we could check regs->trap for machine
>>>>>> check in the die test. Complication is having to account for MCE
>>>>>> in an interrupt handler.
>>>>>>
>>>>>>           if (in_interrupt()) {
>>>>>>                    if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)))
>>>>>>                        panic("Fatal exception in interrupt");
>>>>>>           }
>>>>>>
>>>>>> Something like that might work for you? We need a ppc64 macro for the
>>>>>> MCE, and can probably add something like in_nmi_from_interrupt() for
>>>>>> the second part of the test.
>>>>>
>>>>> Don't know, I'm away from home on business trip so I won't be able to
>>>>> test anything before next week. However it looks more or less like a
>>>>> hack, doesn't it ?
>>>>
>>>> I thought it seemed okay (with the right functions added). Actually it
>>>> could be a bit nicer to do this, then it works generally :
>>>>
>>>>            if (in_interrupt()) {
>>>>                     if (!in_nmi() || in_nmi_from_interrupt())
>>>>                         panic("Fatal exception in interrupt");
>>>>            }
>>>>   
>>>>>
>>>>> What about the following ?
>>>>
>>>> Hmm, in some ways maybe it's nicer. One complication is I would like the
>>>> same thing to be available for platform specific machine check
>>>> handlers, so then you need to pass is_in_interrupt to them. Which you
>>>> can do without any problem... But is it cleaner than the above?
>>>
>>> For me it looks cleaner than twiddling the preempt_count depending on
>>> whether or not we were already in NMI.
>>>
>>> Let's draft something and see what it looks like.
>>
>> Ok, finally I went with your solution, see below, as it avoids having to
>> modify all subarch and platform specific machine check handlers.
>>
>> Unfortunately it doesn't solve the issue, it only delays it:
>>
>> oops_end() calls do_exit(), which has the following test:
>>
>> 	if (unlikely(in_interrupt()))
>> 		panic("Aiee, killing interrupt handler!");
>>
>>
>> So for the time being I still have no idea how to fix that. Do you?
> 
> Huh, I'm not sure. x86's MCE handling looks like it does this:
> 
>                  /*
>                   * We might have interrupted pretty much anything.  In
>                   * fact, if we're a machine check, we can even interrupt
>                   * NMI processing.  We don't want in_nmi() to return true,
>                   * but we need to notify RCU.
>                   */
>                  rcu_nmi_enter();
> 
> But I don't see why they don't want the full NMI treatment there. I
> thought the whole point was to do everything so you would get e.g.,
> the NMI-safe printk and so on.
> 
> The reason the in_interrupt checks work below is because the synchronous
> trap handlers e.g., for BUG do not enter interrupt context so the
> question is about the context they interrupted. Maybe the right way to
> go is to call nmi_exit just before deciding to oops.

Yes, I arrived at the same conclusion. I tested it just now and it works
for me. Thanks.
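
To illustrate why calling nmi_exit before deciding to oops helps, here is a minimal user-space sketch. It is a simplified model, not the kernel patch: every name ending in _model is hypothetical, the mask and offset values are assumptions that only imitate the preempt_count layout, and die_model() stands in for the in_interrupt() tests in die()/oops_end()/do_exit().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout, loosely imitating preempt_count (not kernel headers). */
#define SOFTIRQ_MASK   0x0000ff00u
#define HARDIRQ_OFFSET 0x00010000u
#define HARDIRQ_MASK   0x000f0000u
#define NMI_OFFSET     0x00100000u
#define NMI_MASK       0x00100000u

/* Hypothetical stand-in for the per-task preempt_count. */
static unsigned int preempt_count_model;

static bool in_nmi_model(void)
{
	return (preempt_count_model & NMI_MASK) != 0;
}

static bool in_interrupt_model(void)
{
	return (preempt_count_model &
		(SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0;
}

/* Like nmi_enter(): bumps both the NMI and the hardirq counts. */
static void nmi_enter_model(void)
{
	assert(!in_nmi_model());
	preempt_count_model += NMI_OFFSET + HARDIRQ_OFFSET;
}

static void nmi_exit_model(void)
{
	assert(in_nmi_model());
	preempt_count_model -= NMI_OFFSET + HARDIRQ_OFFSET;
}

enum outcome { OUTCOME_RECOVERED, OUTCOME_OOPS, OUTCOME_PANIC };

/* Stand-in for die(): panics if we appear to be in interrupt context. */
static enum outcome die_model(void)
{
	return in_interrupt_model() ? OUTCOME_PANIC : OUTCOME_OOPS;
}

/*
 * Model of the machine check handler. interrupted_count is the
 * preempt_count of the context that took the machine check.
 */
static enum outcome machine_check_model(unsigned int interrupted_count,
					bool recovered,
					bool exit_nmi_before_die)
{
	enum outcome result;
	bool nested;

	preempt_count_model = interrupted_count;
	nested = in_nmi_model();
	if (!nested)
		nmi_enter_model();

	if (recovered) {
		result = OUTCOME_RECOVERED;
	} else if (exit_nmi_before_die && !nested) {
		/* Leave NMI context first, so die() sees the interrupted context. */
		nmi_exit_model();
		return die_model();
	} else {
		result = die_model();
	}

	if (!nested)
		nmi_exit_model();
	return result;
}
```

In this model, machine_check_model(0, false, false) panics even though the trap came from process context, while machine_check_model(0, false, true) only oopses. A machine check taken inside a real interrupt handler (interrupted_count = HARDIRQ_OFFSET) still panics in both cases.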

Christophe

> 
> Perhaps we could ask lkml.
> 
> Thanks,
> Nick
> 

