MCE handler gets NIP wrong on MPC8378

Thu Feb 20 06:46:10 AEDT 2020

On 02/19/2020 at 10:11 AM Radu Rendec <radu.rendec at gmail.com> wrote:
> On 02/18/2020 at 1:08 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> > Le 18/02/2020 à 18:07, Radu Rendec a écrit :
> > > The saved NIP seems to be broken inside machine_check_exception() on
> > > MPC8378, running Linux 4.9.191. The value is 0x900 most of the time,
> > > but I have seen other weird values.
> > >
> > > I've been able to track down the entry code to head_32.S (vector 0x200),
> > > but I'm not sure where/how the NIP value (where the exception occurred)
> > > is captured.
> >
> > NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> > saved into _NIP(r11) in transfer_to_handler in entry_32.S
> >
> > Can something clobber r12 at some point ?
> >
>
> I did something even simpler: I added the following
>
>       lis r12,0x1234
>
> ... right after
>
>       mfspr r12,SPRN_SRR0
>
> ... and now the NIP value I see in the crash dump is 0x12340000. This
> means r12 is not clobbered and most likely the NIP value I normally see
> is the actual SRR0 value.

I apologize for the noise. I accidentally found out that the saved NIP
value is correct if interrupts are disabled when the faulty access that
triggers the MCE occurs. This seems to happen consistently.

By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
wrapping the ioread32 call in that pair is basically enough to get the
right NIP value.
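
A minimal sketch of such a wrapper follows. The helper name
ioread32_irqs_off is hypothetical, and the definitions at the top are
userspace stand-ins (assumptions, not the kernel implementations) so the
snippet compiles on its own. In a driver they would be dropped in favor of
<linux/irqflags.h> and <linux/io.h>, and the pointer would carry the
__iomem annotation.

```c
#include <stdint.h>

/*
 * Userspace stand-ins, for illustration only. They model "interrupts
 * masked" with a plain flag instead of touching MSR[EE].
 */
static int irqs_off;                  /* 1 while interrupts are "masked" */
static int irqs_off_during_read = -1; /* state seen by the last read */

#define local_irq_save(flags) \
	do { (flags) = (unsigned long)irqs_off; irqs_off = 1; } while (0)
#define local_irq_restore(flags) \
	do { irqs_off = (int)(flags); } while (0)

static uint32_t ioread32(const volatile void *addr)
{
	/* Record whether interrupts were masked at the time of the access. */
	irqs_off_during_read = irqs_off;
	return *(const volatile uint32_t *)addr;
}

/* Hypothetical helper: do the MMIO read with local interrupts masked. */
static uint32_t ioread32_irqs_off(const volatile void *addr)
{
	unsigned long flags;
	uint32_t val;

	local_irq_save(flags);
	val = ioread32(addr);	/* the access that may raise the machine check */
	local_irq_restore(flags);

	return val;
}
```

The save/restore pair (rather than a plain disable/enable) keeps the
wrapper safe to call from a context where interrupts are already off.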

Does this make any sense? Maybe it's not a silicon bug after all, or
maybe it is and I just found a workaround. Could this happen on other
PowerPC CPUs as well?

Best regards,
Radu