Machine Check in P2010(e500v2)

Fri Sep 22 04:53:16 AEST 2017

> -----Original Message-----
> From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> Sent: Wednesday, September 20, 2017 11:45 AM
> To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>; York Sun
> <york.sun at nxp.com>
> Subject: Re: Machine Check in P2010(e500v2)
> 
> On Sat, 2017-09-09 at 14:45 +0200, Joakim Tjernlund wrote:
> > On Fri, 2017-09-08 at 22:27 +0000, Leo Li wrote:
> > > > -----Original Message-----
> > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > > > Sent: Friday, September 08, 2017 7:51 AM
> > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>;
> > > > York Sun <york.sun at nxp.com>
> > > > Subject: Re: Machine Check in P2010(e500v2)
> > > >
> > > > On Fri, 2017-09-08 at 11:54 +0200, Joakim Tjernlund wrote:
> > > > > On Thu, 2017-09-07 at 18:54 +0000, Leo Li wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Joakim Tjernlund
> > > > > > > [mailto:Joakim.Tjernlund at infinera.com]
> > > > > > > Sent: Thursday, September 07, 2017 3:41 AM
> > > > > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li
> > > > > > > <leoyang.li at nxp.com>; York Sun <york.sun at nxp.com>
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > >
> > > > > > > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > > > > > > On Wed, 2017-09-06 at 21:13 +0000, Leo Li wrote:
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Joakim Tjernlund
> > > > > > > > > > [mailto:Joakim.Tjernlund at infinera.com]
> > > > > > > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > > > > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li
> > > > > > > > > > <leoyang.li at nxp.com>; York Sun <york.sun at nxp.com>
> > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > >
> > > > > > > > > > On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Joakim Tjernlund
> > > > > > > > > > > > [mailto:Joakim.Tjernlund at infinera.com]
> > > > > > > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > > > > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li
> > > > > > > > > > > > <leoyang.li at nxp.com>; York Sun <york.sun at nxp.com>
> > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: York Sun
> > > > > > > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > > > > > > To: Joakim Tjernlund
> > > > > > > > > > > > > > <Joakim.Tjernlund at infinera.com>;
> > > > > > > > > > > > > > linuxppc-dev at lists.ozlabs.org; Leo Li
> > > > > > > > > > > > > > <leoyang.li at nxp.com>
> > > > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > > > > > > > > >          if (is_in_pci_mem_space(addr)) {
> > > > > > > > > > > > > > >                  if (user_mode(regs)) {
> > > > > > > > > > > > > > >                          pagefault_disable();
> > > > > > > > > > > > > > > -                       ret = get_user(regs->nip, &inst);
> > > > > > > > > > > > > > > +                       ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > > > > > > > > >                          pagefault_enable();
> > > > > > > > > > > > > > >                  } else {
> > > > > > > > > > > > > > >                          ret = probe_kernel_address(regs->nip, inst);
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > > > > > > The routine will not really fixup the insn,
> > > > > > > > > > > > > > > just return 0xffffffff for the failing read
> > > > > > > > > > > > > > > and then advance the process NIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > You are right.  The code here only gives 0xffffffff
> > > > > > > > > > > > > to the load instruction and continues with the next
> > > > > > > > > > > > > instruction when the load instruction is what caused
> > > > > > > > > > > > > the machine check.  This prevents a system lockup
> > > > > > > > > > > > > when reading from a PCI/RapidIO device whose link is
> > > > > > > > > > > > > down.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't know what the actual problem is in your case.
> > > > > > > > > > > > > Maybe it is a write instruction instead of a read?
> > > > > > > > > > > > > Or the code is in an infinite loop waiting for a
> > > > > > > > > > > > > valid read result?  Are you able to do some further
> > > > > > > > > > > > > debugging with the NIP correctly printed?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > According to the MC it is a Read and the NIP also
> > > > > > > > > > > > leads to a read in the program.
> > > > > > > > > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > > > > > > > > Question, is it safe to add a small printk when this
> > > > > > > > > > > > MC happens (after fixing up)? I need to see that it
> > > > > > > > > > > > has happened as the error is somewhat random.
> > > > > > > > > > >
> > > > > > > > > > > I think it is safe to add printk as the current
> > > > > > > > > > > machine check handlers are also using printk.
> > > > > > > > > >
> > > > > > > > > > I hope so, but if the fixup fires there is no printk
> > > > > > > > > > at all so I was a bit unsure.
> > > > > > > > > > Don't like this fixup though, is there not a better
> > > > > > > > > > way than faking a read to user space (or kernel for that matter)?
> > > > > > > > >
> > > > > > > > > I don't have a better idea.  Without the fixup, the
> > > > > > > > > offending load instruction will never finish if there is
> > > > > > > > > anything wrong with the backing device and freeze the
> > > > > > > > > whole system.  Do you have any suggestion in mind?
> > > > > > > > >
> > > > > > > >
> > > > > > > > But it never finishes the load, it just fakes a load of
> > > > > > > > 0xffffffff. For user space I would rather have it signal a
> > > > > > > > SIGBUS, but that does not seem to work either, at least not
> > > > > > > > for us, but that could be a bug in general MC code maybe.
> > > > > > > > This fixup might be valid for kernel only as it has never
> > > > > > > > worked for user space due to the bug I found.
> > > > > > > >
> > > > > > > > Where can I read about this errata ?
> > > > > > >
> > > > > > > I have looked high and low and cannot find an erratum which
> > > > > > > maps to this fixup.
> > > > > > > The closest I get is A-005125 which seems to have another
> > > > > > > workaround. I cannot find any evidence that this workaround
> > > > > > > has been applied in Linux, can you?
> > > > > >
> > > > > > This is not A-005125.  There was an erratum for this issue
> > > > > > with older silicon (e.g. erratum PCI-ex 3 for MPC8572).
> > > > > > "When its link goes down, the PCI Express controller clears
> > > > > > all outstanding transactions with an error indicator and sends
> > > > > > a link down exception to the interrupt controller if
> > > > > > PEX_PME_MES_DISR[LDDD] = 0. If, however, any transactions are
> > > > > > sent to the controller after the link down event, they are
> > > > > > accepted by the controller and wait for the link to come back
> > > > > > up before starting any timeout counters (for example,
> > > > > > completion timeout). There is no mechanism to cancel the
> > > > > > new transactions short of a device HRESET."
> > > > > >
> > > > > > But it was removed in newer silicon like P2020/P2010, probably
> > > > > > because a Machine Check will be triggered in this situation to
> > > > > > deal with the stalled instruction, so it is no longer
> > > > > > considered a hardware issue.
> > > > > >
> > > > >
> > > > > Maybe this fixup should be configurable then?
> > >
> > > No.  My point is that the problem was no longer considered a hardware
> > > issue because the machine check mechanism is in place to handle it.
> > > If there is no handling of this special case, we would still
> > > experience a system hang if this situation really occurs.
> > >
> > > > >
> > > > > > The A-005125 is dealt with in u-boot.
> > > > > > https://lists.denx.de/pipermail/u-boot/2013-August/161185.html
> > > > >
> > > > > Yes, I found it eventually :)
> > > > >
> > > > > However, I cannot return to normal execution. I can follow the
> > > > > code to returning from machine_check_exception() and moving into
> > > > > the ASM handler for returning from a ME, but then I am a bit lost.
> > > > > There does not seem to be any problem executing, it feels more
> > > > > like a SW bug dealing with machine checks. Don't know how to
> > > > > diagnose this further and could use some pointers.
> > >
> > > Is the execution returned to the user application?  I doubt the system
> > > hang is caused by the machine check handling.
> > > You can try to comment out the machine check handling code and check
> > > if there is any improvement, to see if this is related to the machine
> > > check handling.
> >
> > It tries to return to user app but I cannot see what happens as the
> > system locks up when the MC returns.
> > How do you mean comment out MC handling? The simplest path is the PCI
> > fixup which will just do regs->nip += 4; and then return to user
> > space. That still does not work: as soon as MC handling returns, the
> > system is locked up.
> >
> > >
> > > Machine check is a serious situation and it is not always possible to
> > > recover from it.
> >
> > This one should at least not kill the whole system. It is a simple bus
> > error in user space and the app should get SIGBUS and the system should
> > carry on.
> >
> > > I would focus more on debugging why the machine check is triggered by
> > > the user space application.
> > > Can you locate what code is causing this machine check from user space?
> > > Is it accessing some hardware related space which is not ready?
> > > Or is it accessing address that it shouldn't have accessed?
> >
> > Of course, this is ongoing and getting closer to a solution. The MC
> > locking the machine completely does not make this any easier though.
> > These are 2 separate things: fixing the cause, and not having a simple
> > bus error lock up the machine.
> > I am focusing on fixing the lockup.
> >
> > I have been following the execution in the kernel and I always end up
> > in the ASM returning from the MC.
> > The other day we got a similar PCI MC(bus error) on T1042
> > CPU(e5500/e500mc) and there the system survived. The one thing I see
> > different there is that MSR RI is set when entering MC, why is that?
> >
> >  Jocke
> 
> Got some more info now, this is a new erratum I think. Adding EDAC to the
> mix yields:
> [   28.372574] LTSSM:16
> [   28.377197] Machine check in kernel mode.
> [   28.381201] Caused by (from MCSR=10008, MCAR:0x8003e000): Bus - Read Data Bus Error
> [   28.388861] Oops: Machine check, sig: 7 [#1]
> [   28.393125] P2010 E500v2
> [   28.395651] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) linux_kernel_bde(PO)
> [   28.403842] CPU: 0 PID: 485 Comm: emxp2_hw_bl Tainted: P           O    4.1.43+ #19
> [   28.411499] task: db13a0f0 ti: df17c000 task.ti: df17c000
> [   28.416894] NIP: 10a66954 LR: 10a66a88 CTR: 0f9e7f44
> [   28.421855] REGS: df17df10 TRAP: 0204   Tainted: P           O     (4.1.43+)
> [   28.428901] MSR: 0002d000 <CE,EE,PR,ME>  CR: 44002428  XER: 20000000
> [   28.435267] DEAR: b73cc000 ESR: 00000000
> GPR00: 10a66a88 bfc21bc0 b7eee4a0 136eb4a0 00000000 00000000 00000000 00000000
> GPR08: 0002d000 0003e000 b738e000 00000000 24002422 11db7334 00000000 00000000
> GPR16: 10f8b054 10f895e5 10f8a8bf 0000b541 0000b541 11ddd380 00000011 00000001
> GPR24: 01a9985e 136f1010 07000000 136eb4a0 00006000 07006000 00000000 00000000
> [   28.467506] NIP [10a66954] 0x10a66954
> [   28.471162] LR [10a66a88] 0x10a66a88
> [   28.474730] Call Trace:
> [   28.477170] ---[ end trace b25436dea505b49d ]---
> [   28.481781]
> [   28.483267] PCIe error(s) detected
> [   28.486662] PCIe ERR_DR register: 0x00800000
> [   28.490927] PCIe ERR_CAP_STAT register: 0x00000023
> [   28.495713] PCIe ERR_CAP_R0 register: 0x00000000
> [   28.500324] PCIe ERR_CAP_R1 register: 0x00000000
> [   28.504936] PCIe ERR_CAP_R2 register: 0x00000000
> [   28.509548] PCIe ERR_CAP_R3 register: 0x00000000
> 
> I logged LTSSM and it is 16 (link up), and the Ref. manual says this about
> ERR_DR = 0x00800000:
> 
> PCIe ERR_DR: PCT bit
> PCI Express completion time-out. A completion time-out condition was detected
> for a non-posted, outbound PCI Express transaction. An error response is sent
> back to the requestor. Note that a completion timeout counter only starts when
> the non-posted request was able to send to the link partner.
> -
> A completion time-out on the PCI Express link was detected. Note that a
> completion timeout error is a fatal error. If a completion timeout error is
> detected, the system has become unstable. Hot reset is recommended to
> restore stability of the system.
> 
> This error is not described in any errata I can find. How to work around this?

Adding some PCIe experts to the loop.

Regards,
Leo
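
For anyone reading the register dump quoted above, here is a minimal sketch
in C of how the two values discussed in the thread break down. It is not
kernel code. The MSR masks are the ones matching the "<CE,EE,PR,ME>" decode
printed by the oops; the RI mask is an assumption (the bit position Linux
uses for MSR[RI]); and the PCT mask is simply the ERR_DR value the mail
identifies as PCT. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* MSR bit masks for the flags named in the oops line "<CE,EE,PR,ME>". */
#define MSR_CE 0x00020000u /* critical interrupts enabled */
#define MSR_EE 0x00008000u /* external interrupts enabled */
#define MSR_PR 0x00004000u /* problem state, i.e. user mode */
#define MSR_ME 0x00001000u /* machine check enabled */
/* Assumption: RI (recoverable interrupt) at the position Linux uses for it. */
#define MSR_RI 0x00000002u

/* Taken from the thread: ERR_DR = 0x00800000 was read as the PCT bit. */
#define ERR_DR_PCT 0x00800000u

/* Hypothetical helpers, named for this example only. */
static int msr_user_mode(uint32_t msr)   { return (msr & MSR_PR) != 0; }
static int msr_recoverable(uint32_t msr) { return (msr & MSR_RI) != 0; }

static int err_dr_completion_timeout(uint32_t err_dr)
{
    return (err_dr & ERR_DR_PCT) != 0;
}

/* Sanity check: the four named flags add up to the MSR value in the log. */
static void check_log_values(void)
{
    assert((MSR_CE | MSR_EE | MSR_PR | MSR_ME) == 0x0002d000u);
}

/* Print a short summary for a saved MSR and an ERR_DR value. */
static void print_decode(uint32_t msr, uint32_t err_dr)
{
    check_log_values();
    printf("MSR %08x: %s mode, RI %s\n", (unsigned)msr,
           msr_user_mode(msr) ? "user" : "kernel",
           msr_recoverable(msr) ? "set" : "clear");
    printf("ERR_DR %08x: completion timeout %s\n", (unsigned)err_dr,
           err_dr_completion_timeout(err_dr) ? "flagged" : "not flagged");
}
```

Calling print_decode(0x0002d000, 0x00800000) with the values from the log
reports the saved MSR as user mode (PR set, consistent with the user-space
NIP) with RI clear, and ERR_DR as flagging a completion timeout.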