MCE handler gets NIP wrong on MPC8378

Wed Feb 26 11:01:30 AEDT 2020

On 02/20/2020 at 12:48 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> On 20/02/2020 at 18:34, Radu Rendec wrote:
> > On 02/20/2020 at 11:25 AM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> >> On 20/02/2020 at 17:02, Radu Rendec wrote:
> >>> On 02/20/2020 at 3:38 AM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> >>>> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> >>>>> On 02/19/2020 at 4:21 PM Christophe Leroy <christophe.leroy at c-s.fr> wrote:
> >>>>>>> Interesting.
> >>>>>>>
> >>>>>>> 0x900 is the address of the timer interrupt.
> >>>>>>>
> >>>>>>> Would the MCE occur just after the timer interrupt?
> >>>>>
> >>>>> I doubt that. I'm using a small test module to artificially trigger the
> >>>>> MCE. Basically it's just this (the full code is in my original post):
> >>>>>
> >>>>>            bad_addr_base = ioremap(0xf0000000, 0x100);
> >>>>>            x = ioread32(bad_addr_base);
> >>>>>
> >>>>> I find it hard to believe that every time I load the module the lwbrx
> >>>>> instruction that triggers the MCE is executed exactly after the timer
> >>>>> interrupt (or that the timer interrupt always occurs close to the lwbrx
> >>>>> instruction).
> >>>>
> >>>> Can you try to see how much time there is between your read and the MCE?
> >>>> The below should allow it: you'll see the first value in r13 and the
> >>>> other in r14 (mce.c is your test code).
> >>>>
> >>>> Please also provide the timebase frequency as reported in /proc/cpuinfo.
> >>>
> >>> I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.
> >>>
> >>> # cat /proc/cpuinfo
> >>> processor       : 0
> >>> cpu             : e300c4
> >>> clock           : 800.000004MHz
> >>> revision        : 1.1 (pvr 8086 1011)
> >>> bogomips        : 200.00
> >>> timebase        : 100000000
> >>>
> >>> The difference between r14 and r13 is 0x20000b. Assuming TB is
> >>> incremented at the 'timebase' frequency, that is 2097163 ticks at
> >>> 100 MHz, or 20.97 milliseconds (although the e300 manual says TB is
> >>> "incremented once every four core input clock cycles").
> >>
> >> I wouldn't be surprised if the internal CPU clock were twice the input
> >> clock.
> >>
> >> So that's long enough to be almost sure of getting a timer interrupt
> >> during every bad access.
> >>
> >> Now we have to understand why SRR0 contains the address of the timer
> >> exception entry and not the address of the bad access.
> >>
> >> The value of SRR1 confirms that it comes from 0x900, as MSR[IR] and
> >> MSR[DR] are cleared when an interrupt is taken.
> >>
> >> Maybe you should file a support case at NXP. They are usually quite
> >> professional in their responses.
> >
> > I already did (quite some time ago), but it started off as "why does the
> > MCE occur in the first place". That part has already been figured out,
> > but unfortunately I don't have a viable solution to it. Like you said,
> > now the focus has shifted to understanding why the SRR0 value is not
> > what we expect.
>
> Yes, now the point is to understand why it starts processing the timer
> interrupt at 0x900 (with IR and DR cleared as observed in SRR1) just
> before taking the Machine Check.
>
> Although the decrementer interrupt is queued for execution after the
> failing memory access completes, I'd expect the Machine Check to take
> priority.
>
> Note that I have never observed such a behaviour on MPC8321 which has an
> e300c2 core.

I apologize for the silence during the past few days; I was diverted by
something else. This is the feedback that I got from NXP:

| The e300 core uses SRR0/1 for both non-critical interrupts and machine
| check interrupts and if they happen simultaneously a problem can occur
| where the return address from the first exception is lost when handling
| the second exception concurrently. This only occurs in the rare case
| when the software ISR hasn't had the time to save SRR0/1 to the sw stack.
|
| If the ability to nest interrupts is desired, software then saves off
| enough state (i.e. the contents of SRR0, SRR1, etc) that will allow it
| to recover (i.e. resume handling the current interrupt) if another
| interrupt occurs.

So basically what they describe is a race condition between the MCE and
a regular interrupt, where the regular interrupt (the timer interrupt,
in our case) kicks in after the MCE handler is entered but before it
saves SRR0. This not only requires very precise timing, but would also
end up with a saved SRR0 value that points back somewhere inside the MCE
handler, not to 0x900 as we observe.

But I've thought about something else. We already timed it and we know
it consistently takes around 20 ms between the faulty read and the MCE
handler execution. I'm thinking that the faulty read is essentially a
failed transaction on the internal bus, because no peripheral replies
to the access on the bad address. The 20 ms is probably the bus timeout.
How does this scenario look to you?

- The faulty read starts to execute. A new internal bus transaction is
  started, the bad address is put on the bus and the CPU waits for a
  peripheral to reply.
- The timer interrupt kicks in. The CPU saves NIP to SRR0 and NIP
  becomes 0x900. But the CPU cannot start executing immediately from
  address 0x900 because the bus is blocked.
- Nobody replies and eventually the bus transaction fails. An MCE is
  triggered to handle the failed bus transaction.
- The MCE has higher priority than the timer interrupt, so it's handled
  immediately. The CPU saves NIP to SRR0 and NIP becomes 0x200.
- The CPU starts executing the MCE handler with 0x900 in SRR0.
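
The sequence above can be written down as a toy model in C. This is only
a sketch of the hypothesis, not a description of how the e300 core
actually works: struct toy_cpu, take_exception() and the two helper
functions are made-up names, and the faulting address passed in is
arbitrary. The second helper models the race NXP described, for
comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: only the two registers that matter for this discussion. */
struct toy_cpu {
    uint32_t nip;   /* next instruction pointer */
    uint32_t srr0;  /* save/restore register 0 */
};

/* Exception vectors mentioned in the thread. */
enum { VEC_MCE = 0x200, VEC_DEC = 0x900 };

/* Exception entry as described above: SRR0 <- NIP, NIP <- vector. */
static void take_exception(struct toy_cpu *cpu, uint32_t vector)
{
    assert(vector == VEC_MCE || vector == VEC_DEC);
    cpu->srr0 = cpu->nip;
    cpu->nip = vector;
}

/*
 * Hypothesis from this mail: the timer exception is entered while the
 * bad load is still pending, then the MCE is taken before any
 * instruction at 0x900 has executed.
 */
static uint32_t srr0_seen_by_mce_hypothesis(uint32_t faulting_nip)
{
    struct toy_cpu cpu = { .nip = faulting_nip, .srr0 = 0 };

    take_exception(&cpu, VEC_DEC);  /* SRR0 = faulting_nip, NIP = 0x900 */
    take_exception(&cpu, VEC_MCE);  /* SRR0 = 0x900, NIP = 0x200 */
    return cpu.srr0;
}

/*
 * Race as described by NXP: the MCE is taken first, and the timer
 * interrupt arrives a few instructions into the MCE handler, before
 * SRR0 has been saved.
 */
static uint32_t srr0_after_nxp_race(uint32_t faulting_nip,
                                    uint32_t insns_into_handler)
{
    struct toy_cpu cpu = { .nip = faulting_nip, .srr0 = 0 };

    take_exception(&cpu, VEC_MCE);      /* SRR0 = faulting_nip */
    cpu.nip += 4 * insns_into_handler;  /* 4-byte instructions */
    take_exception(&cpu, VEC_DEC);      /* SRR0 = inside MCE handler */
    return cpu.srr0;
}
```

With an arbitrary faulting address of 0xc0001000,
srr0_seen_by_mce_hypothesis() returns 0x900, which matches what the MCE
handler reports, while srr0_after_nxp_race() with two instructions
executed returns 0x208, an address inside the MCE handler.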

This is pure speculation and I have absolutely no idea about the e300
core internal architecture. But it's my best guess. I've sent something
similar to NXP support. Let's see what they come up with.

By the way, I have successfully tested a fix that uses __do_inl instead
of ioread32 and disables interrupts around the __do_inl call. If I was
even close with my speculation above, then I guess the only thing we
could fix in the kernel would be to modify __do_inl and co. to disable
interrupts around the potentially dangerous access. The benefit would be
that the MCE could be recovered from. For ioread32, there is no real
benefit in doing that (other than printing the correct NIP address in
the crash dump) because it doesn't instrument the exception tables
anyway so it's non-recoverable.
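
The pattern I have in mind looks roughly like the sketch below. It is a
userspace mock, not the kernel change itself: mock_irq_save() and
mock_irq_restore() stand in for the kernel's local_irq_save() and
local_irq_restore(), mock_do_inl() stands in for __do_inl and touches no
hardware, and guarded_inl() is a made-up name for the wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MSR[EE]: 1 = external/decrementer interrupts enabled. */
static int irqs_enabled = 1;

/* Userspace stand-in for the kernel's local_irq_save(). */
static unsigned long mock_irq_save(void)
{
    unsigned long flags = (unsigned long)irqs_enabled;

    irqs_enabled = 0;
    return flags;
}

/* Userspace stand-in for the kernel's local_irq_restore(). */
static void mock_irq_restore(unsigned long flags)
{
    irqs_enabled = (int)flags;
}

/*
 * Hypothetical stand-in for the recoverable port read (__do_inl in the
 * kernel). It touches no hardware and returns an arbitrary value.
 */
static uint32_t mock_do_inl(unsigned long port)
{
    (void)port;
    /* The point of the pattern: the access runs with interrupts masked. */
    assert(!irqs_enabled);
    return 0xffffffffu;
}

/* Hypothetical wrapper: mask interrupts only around the risky access. */
static uint32_t guarded_inl(unsigned long port)
{
    unsigned long flags = mock_irq_save();
    uint32_t val = mock_do_inl(port);

    mock_irq_restore(flags);  /* restore the caller's previous state */
    return val;
}
```

Saving and restoring the previous state, rather than unconditionally
re-enabling, keeps the wrapper safe to call from code that already runs
with interrupts disabled.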

Best regards,
Radu