Machine Check in P2010(e500v2)

Fri Sep 8 19:54:25 AEST 2017

On Thu, 2017-09-07 at 18:54 +0000, Leo Li wrote:
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > Sent: Thursday, September 07, 2017 3:41 AM
> > To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>; York Sun
> > <york.sun at nxp.com>
> > Subject: Re: Machine Check in P2010(e500v2)
> > 
> > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > On Wed, 2017-09-06 at 21:13 +0000, Leo Li wrote:
> > > > > -----Original Message-----
> > > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>;
> > > > > York Sun <york.sun at nxp.com>
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > 
> > > > > On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > To: linuxppc-dev at lists.ozlabs.org; Leo Li
> > > > > > > <leoyang.li at nxp.com>; York Sun <york.sun at nxp.com>
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > 
> > > > > > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: York Sun
> > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > To: Joakim Tjernlund <Joakim.Tjernlund at infinera.com>;
> > > > > > > > > linuxppc- dev at lists.ozlabs.org; Leo Li
> > > > > > > > > <leoyang.li at nxp.com>
> > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > 
> > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > > 
> > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > > > >          if (is_in_pci_mem_space(addr)) {
> > > > > > > > > >                  if (user_mode(regs)) {
> > > > > > > > > >                          pagefault_disable();
> > > > > > > > > > -                       ret = get_user(regs->nip, &inst);
> > > > > > > > > > +                       ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > > > >                          pagefault_enable();
> > > > > > > > > >                  } else {
> > > > > > > > > >                          ret = probe_kernel_address(regs->nip, inst);
> > > > > > > > > > 
> > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > The routine will not really fixup the insn, just return
> > > > > > > > > > 0xffffffff for the failing read and then advance the process NIP.
> > > > > > > > 
> > > > > > > > You are right.  The code here only gives 0xffffffff to the load
> > > > > > > > instruction and continues with the next instruction when the load
> > > > > > > > instruction is causing the machine check.  This will prevent a
> > > > > > > > system lockup when reading from a PCI/RapidIO device whose link
> > > > > > > > is down.
> > > > > > > > 
> > > > > > > > I don't know what the actual problem is in your case.  Maybe it
> > > > > > > > is a write instruction instead of a read?  Or the code is in an
> > > > > > > > infinite loop waiting for a valid read result?  Are you able to
> > > > > > > > do some further debugging with the NIP correctly printed?
> > > > > > > > 
> > > > > > > 
> > > > > > > According to the MC it is a read, and the NIP also leads to a
> > > > > > > read in the program.
> > > > > > > ATM, I have disabled the fixup but I will enable it again.
> > > > > > > Question: is it safe to add a small printk when this MC happens
> > > > > > > (after fixing up)? I need to see that it has happened, as the
> > > > > > > error is somewhat random.
> > > > > > 
> > > > > > I think it is safe to add printk as the current machine check
> > > > > > handlers are also using printk.
> > > > > 
> > > > > I hope so, but if the fixup fires there is no printk at all so I was a bit unsure.
> > > > > Don't like this fixup though, is there not a better way than
> > > > > faking a read to user space(or kernel for that matter) ?
> > > > 
> > > > I don't have a better idea.  Without the fixup, the offending load
> > > > instruction will never finish if there is anything wrong with the
> > > > backing device, and it will freeze the whole system.  Do you have any
> > > > suggestion in mind?
> > > > 
> > > 
> > > But it never finishes the load, it just fakes a load of 0xffffffff.
> > > For user space I would rather have it signal a SIGBUS, but that does
> > > not seem to work either, at least not for us; that could be a bug in
> > > the general MC code, maybe.
> > > This fixup might be valid for kernel only, as it has never worked for
> > > user space due to the bug I found.
> > > 
> > > Where can I read about this erratum?
> > 
> > I have looked high and low and cannot find an erratum which maps to this fixup.
> > The closest I get is A-005125, which seems to have another workaround. I cannot
> > find any evidence that this workaround has been applied in Linux, can you?
> 
> This is not A-005125.  There was an erratum for this issue with older silicon (e.g. erratum PCI-ex 3 for MPC8572).
> " When its link goes down, the PCI Express controller clears all outstanding transactions with an
> error indicator and sends a link down exception to the interrupt controller if
> PEX_PME_MES_DISR[LDDD] = 0. If, however, any transactions are sent to the controller after
> the link down event, they are accepted by the controller and wait for the link to come back up
> before starting any timeout counters (for example, completion timeout). There is no mechanism to
> cancel the new transactions short of a device HRESET. "
>
> But it was removed for newer silicon like P2020/P2010, probably because a Machine Check is
> triggered in this situation to deal with the stalled instruction, so it is no longer
> considered a hardware issue.
> 

Maybe this fixup should be configurable then?
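
To make it concrete what "fixup" means here, below is a minimal user-space
sketch of the behaviour described earlier in the thread: the failing load gets
0xffffffff in its destination register and the NIP is moved past it. This is
not the kernel code. struct fake_regs and fake_failed_load() are hypothetical
names made up for illustration, and only the plain lwz form is decoded.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (illustration only). */
struct fake_regs {
	uint32_t gpr[32];	/* general purpose registers */
	uint32_t nip;		/* next instruction pointer */
};

#define PPC_OP_LWZ 32u		/* primary opcode of "lwz rT,D(rA)" */

/* Primary opcode: the top 6 bits of a 32-bit PowerPC instruction. */
static unsigned int ppc_primary_op(uint32_t inst)
{
	return inst >> 26;
}

/* Destination register field (rT) of a D-form load. */
static unsigned int ppc_rt(uint32_t inst)
{
	return (inst >> 21) & 0x1f;
}

/*
 * Fake the result of a load that machine-checked: put all-ones in the
 * destination register and step over the instruction.
 * Returns 1 if inst was an lwz and was faked, 0 if it was left alone.
 */
static int fake_failed_load(struct fake_regs *regs, uint32_t inst)
{
	if (ppc_primary_op(inst) != PPC_OP_LWZ)
		return 0;

	regs->gpr[ppc_rt(inst)] = 0xffffffffu;
	regs->nip += 4;		/* the load itself is never re-executed */
	return 1;
}
```

Note that the load is never retried, so the program only sees the all-ones
value; nothing tells it that the read actually failed.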

> The A-005125 is dealt with in u-boot.   https://lists.denx.de/pipermail/u-boot/2013-August/161185.html

Yes, I found it eventually :)

However, I cannot return to normal execution. I can follow the code to returning from
machine_check_exception() and moving into the ASM handler for returning from a ME, but then I
am a bit lost. There does not seem to be any problem executing; it feels more like a SW bug
in dealing with machine checks. I don't know how to diagnose this further and could use some pointers.

 Jocke