MCE handler gets NIP wrong on MPC8378

Thu Feb 20 09:39:47 AEDT 2020

On 02/19/2020 at 4:21 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> > Radu Rendec <radu.rendec at gmail.com> wrote:
> >> On 02/19/2020 at 10:11 AM Radu Rendec <radu.rendec at gmail.com> wrote:
> >>> On 02/18/2020 at 1:08 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> >>>> On 18/02/2020 at 18:07, Radu Rendec wrote:
> >>>> > The saved NIP seems to be broken inside machine_check_exception() on
> >>>> > MPC8378, running Linux 4.9.191. The value is 0x900 most of the time,
> >>>> > but I have seen other weird values.
> >>>> >
> >>>> > I've been able to track down the entry code to head_32.S (vector 0x200),
> >>>> > but I'm not sure where/how the NIP value (where the exception occurred)
> >>>> > is captured.
> >>>>
> >>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> >>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
> >>>>
> >>>> Can something clobber r12 at some point?
> >>>>
> >>>
> >>> I did something even simpler: I added the following
> >>>
> >>>      lis r12,0x1234
> >>>
> >>> ... right after
> >>>
> >>>      mfspr r12,SPRN_SRR0
> >>>
> >>> ... and now the NIP value I see in the crash dump is 0x12340000. This
> >>> means r12 is not clobbered and most likely the NIP value I normally see
> >>> is the actual SRR0 value.
> >>
> >> I apologize for the noise. I just found out accidentally that the saved
> >> NIP value is correct if interrupts are disabled at the time when the
> >> faulty access that triggers the MCE occurs. This seems to happen
> >> consistently.
> >>
> >> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
> >> it's basically enough to wrap ioread32 to get the NIP value right.
> >>
> >> Does this make any sense? Maybe it's not a silicon bug after all, or
> >> maybe it is and I just found a workaround. Could this happen on other
> >> PowerPC CPUs as well?
> >
> > Interesting.
> >
> > 0x900 is the address of the timer interrupt.
> >
> > Could the MCE be occurring just after the timer interrupt?

I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

        bad_addr_base = ioremap(0xf0000000, 0x100);
        x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).
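
As a side note on why 0x900 looks suspicious: on classic 32-bit PowerPC
cores the exception vectors sit at fixed offsets, with the machine check
at 0x200 and the decrementer (timer) at 0x900. Below is a minimal
user-space sketch of that mapping. It is not kernel code, and the names
ppc_vector and ppc_vector_name are made up for illustration. It only
helps to classify a saved NIP as "first instruction of a vector" versus
an ordinary address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one entry per exception vector offset. */
struct ppc_vector {
	uint32_t offset;
	const char *name;
};

/* Fixed vector offsets used by classic 32-bit PowerPC cores. */
static const struct ppc_vector ppc_vectors[] = {
	{ 0x100, "system reset" },
	{ 0x200, "machine check" },
	{ 0x300, "data storage" },
	{ 0x400, "instruction storage" },
	{ 0x500, "external interrupt" },
	{ 0x600, "alignment" },
	{ 0x700, "program" },
	{ 0x800, "floating-point unavailable" },
	{ 0x900, "decrementer" },
	{ 0xc00, "system call" },
	{ 0xd00, "trace" },
};

/*
 * Return the vector name if nip is exactly the first instruction of a
 * known vector, or NULL if it is any other address.
 */
static const char *ppc_vector_name(uint32_t nip)
{
	size_t i;

	for (i = 0; i < sizeof(ppc_vectors) / sizeof(ppc_vectors[0]); i++) {
		if (ppc_vectors[i].offset == nip)
			return ppc_vectors[i].name;
	}
	return NULL;
}
```

With this, ppc_vector_name(0x900) yields the decrementer entry, while an
address such as 0x12340000 yields NULL. In other words, the NIP I see is
the very first instruction of the timer vector and not the lwbrx itself.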

> >
> > Can you tell me how your I/O buses are configured, etc.?

Nothing special. The device tree is mostly similar to mpc8379_rdb.dts,
but I can provide the actual dts if you think it's relevant.

> And what's the value of SERSR after the machine check ?

I'm assuming you're talking about the IPIC SERSR register. I modified
machine_check_exception and added a call to ipic_get_mcp_status, which
seems to read IPIC_SERSR. The value is 0, both with interrupts enabled
and disabled (which makes sense, since disabling/enabling interrupts is
local to the CPU core).

> Do you use the local bus monitoring driver ?

I don't. In fact, I'm not even aware of it. What driver is that?

Best regards,
Radu