MCE handler gets NIP wrong on MPC8378

Radu Rendec radu.rendec at gmail.com
Thu Feb 20 02:11:27 AEDT 2020


On 02/18/2020 at 1:08 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> On 18/02/2020 at 18:07, Radu Rendec wrote:
> > The saved NIP seems to be broken inside machine_check_exception() on
> > MPC8378, running Linux 4.9.191. The value is 0x900 most of the time,
> > but I have seen other weird values.
> >
> > I've been able to track down the entry code to head_32.S (vector 0x200),
> > but I'm not sure where/how the NIP value (where the exception occurred)
> > is captured.
>
> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> saved into _NIP(r11) in transfer_to_handler in entry_32.S

Thank you so much for the information, it is extremely helpful!

> Can something clobber r12 at some point ?
>
> Maybe add the following at some place to trap when it happens ?
>
> tweqi r12, 0x900
>
> If you put it just after reading SRR0, and just before writing into
> NIP(r11), you'll see if it's wrong from the beginning or if it is
> overwritten later.

I did something even simpler: I added the following

        lis r12,0x1234

... right after

        mfspr r12,SPRN_SRR0

... and now the NIP value I see in the crash dump is 0x12340000. This
means r12 is not clobbered and most likely the NIP value I normally see
is the actual SRR0 value.

Just to be sure that SRR0 is not clobbered before it's even saved to r12
(very unlikely though) I changed the code to save SRR0 to r8 at the very
beginning of the handler (first instruction, at address 0x200) and then
load r12 from r8 later. This of course clobbers r8, but it's good for
testing. Now in the crash dump I see 0x900 in both NIP and r8.
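
The logic of the two experiments above can be restated as a small C
model. This is only a sketch of the reasoning, not kernel code: the
struct, the enum and the function names are hypothetical, and the
"mr r12,r8" and "stw r12,_NIP(r11)" steps named in the comments are an
assumption about what the copy and the store look like, based on the
description earlier in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file: only what the experiments touch (names are illustrative). */
struct regs {
    uint32_t srr0; /* set by hardware on exception entry */
    uint32_t r8;
    uint32_t r12;
    uint32_t nip;  /* stands in for the _NIP(r11) slot of the exception frame */
};

enum experiment { BASELINE, FORCE_R12, EARLY_R8 };

/* Models "lis rD,imm": immediate goes to the upper halfword, lower half is 0. */
static uint32_t lis(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* Models the path from exception entry to the point where NIP is stored. */
static void run_entry(struct regs *r, enum experiment e)
{
    if (e == EARLY_R8)
        r->r8 = r->srr0;      /* SRR0 copied to r8 as the very first instruction */

    r->r12 = r->srr0;         /* mfspr r12,SPRN_SRR0 */

    if (e == FORCE_R12)
        r->r12 = lis(0x1234); /* lis r12,0x1234 right after the mfspr */
    if (e == EARLY_R8)
        r->r12 = r->r8;       /* assumed: mr r12,r8 */

    r->nip = r->r12;          /* assumed: stw r12,_NIP(r11) */
}

/* Replays both experiments with SRR0 = 0x900, the value seen in the dumps. */
static void replay(void)
{
    struct regs a = { .srr0 = 0x900 };
    run_entry(&a, FORCE_R12);
    assert(a.nip == 0x12340000u); /* r12 survives until the store */

    struct regs b = { .srr0 = 0x900 };
    run_entry(&b, EARLY_R8);
    assert(b.nip == 0x900 && b.r8 == 0x900); /* SRR0 was already 0x900 on entry */
}
```

In this model, the first experiment shows that whatever is in r12 after
the mfspr is what ends up in NIP, and the second shows that NIP and r8
can only both be 0x900 if SRR0 already held 0x900 at the first
instruction of the handler.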

So I think I have ruled out any problem in the Linux MCE handler. MPC8378
has an e300 core, and I double-checked against the e300 core reference
manual (e300coreRM.pdf from NXP). I couldn't find anything weird there
either. Quoting from the RM:

| 5.5.2.1 Machine Check Interrupt Enabled (MSR[ME] = 1)
|
| When a machine check interrupt is taken, registers are updated as
| shown in Table 5-14.
|
| Table 5-14. Machine Check Interrupt—Register Settings
|
| SRR0 Set to the address of the next instruction that would have been
|      completed in the interrupted instruction stream. Neither this
|      instruction nor any others beyond it will have been completed.
|      All preceding instructions will have been completed.

At this point I'm assuming a silicon bug, although I couldn't find
anything interesting in the Errata provided by NXP.

Best regards,
Radu

