Machine Check in P2010(e500v2)

Thu Sep 7 07:13:28 AEST 2017

> -----Original Message-----
> From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> Sent: Wednesday, September 06, 2017 3:54 PM
> To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>; York Sun
> <york.sun at nxp.com>
> Subject: Re: Machine Check in P2010(e500v2)
> 
> On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>; York
> > > Sun <york.sun at nxp.com>
> > > Subject: Re: Machine Check in P2010(e500v2)
> > >
> > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > -----Original Message-----
> > > > > From: York Sun
> > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > To: Joakim Tjernlund <Joakim.Tjernlund at infinera.com>; linuxppc-
> > > > > dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > >
> > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > >
> > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > So after some debugging I found this bug:
> > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs
> *regs)
> > > > > >          if (is_in_pci_mem_space(addr)) {
> > > > > >                  if (user_mode(regs)) {
> > > > > >                          pagefault_disable();
> > > > > > -                       ret = get_user(regs->nip, &inst);
> > > > > > +                       ret = get_user(inst, (__u32 __user
> > > > > > + *)regs->nip);
> > > > > >                          pagefault_enable();
> > > > > >                  } else {
> > > > > >                          ret = probe_kernel_address(regs->nip,
> > > > > > inst);
> > > > > >
> > > > > > However, the kernel still locked up after fixing that.
> > > > > > Now I wonder why this fixup is there in the first place? The
> > > > > > routine will not really fixup the insn, just return 0xffffffff
> > > > > > for the failing read and then advance the process NIP.
> > > >
> > > > You are right.  The code here only gives 0xffffffff to the load
> > > > instruction and continues with the next instruction when the load
> > > > instruction is causing the machine check.  This prevents a system
> > > > lockup when reading from a PCI/RapidIO device whose link is down.
> > > >
> > > > I don't know what the actual problem is in your case.  Maybe it is a
> > > > write instruction instead of a read?  Or the code is in an infinite
> > > > loop waiting for a valid read result?  Are you able to do some further
> > > > debugging with the NIP correctly printed?
> > > >
> > >
> > > According to the MC it is a read, and the NIP also leads to a read in
> > > the program.  ATM, I have disabled the fixup but I will enable it again.
> > > Question: is it safe to add a small printk when this MC happens (after
> > > fixing up)?  I need to see that it has happened, as the error is
> > > somewhat random.
> >
> > I think it is safe to add printk as the current machine check handlers
> > are also using printk.
> 
> I hope so, but if the fixup fires there is no printk at all, so I was a bit
> unsure.  I don't like this fixup though; is there not a better way than
> faking a read to user space (or kernel, for that matter)?

I don't have a better idea.  Without the fixup, the offending load instruction will never finish if there is anything wrong with the backing device, and it will freeze the whole system.  Do you have any suggestions in mind?
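
For illustration, the effect of the fixup discussed above can be sketched in user-space C.  This is a minimal sketch and not the kernel code: struct fake_regs and emulate_failed_load() are hypothetical names invented here, and only three D-form loads (lwz, lhz, lbz) are handled as an assumption to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (illustration only). */
struct fake_regs {
    uint32_t gpr[32];   /* general-purpose registers r0..r31 */
    uint32_t nip;       /* address of the faulting instruction */
};

/* PowerPC primary opcodes for three D-form load instructions. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/*
 * Pretend the faulting load completed and returned all-ones, then step
 * past it.  Returns 1 if the instruction was a recognised load and was
 * skipped, 0 if it was left alone (for example, a store).
 */
int emulate_failed_load(struct fake_regs *regs, uint32_t inst)
{
    assert(regs != NULL);

    uint32_t op = inst >> 26;            /* primary opcode: top 6 bits */
    uint32_t rt = (inst >> 21) & 0x1f;   /* target register field */

    switch (op) {
    case OP_LWZ:
        regs->gpr[rt] = 0xffffffffu;     /* word load reads all-ones */
        break;
    case OP_LHZ:
        regs->gpr[rt] = 0xffffu;         /* halfword load, zero-extended */
        break;
    case OP_LBZ:
        regs->gpr[rt] = 0xffu;           /* byte load, zero-extended */
        break;
    default:
        return 0;                        /* not a load we emulate */
    }

    regs->nip += 4;                      /* continue with the next instruction */
    return 1;
}
```

The caller would treat a return value of 0 as "not handled" and fall through to the normal machine check path, which is why stores are deliberately left untouched in this sketch.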

Regards,
Leo