Machine Check in P2010(e500v2)
Joakim Tjernlund
Joakim.Tjernlund at infinera.com
Thu Sep 7 06:53:43 AEST 2017
On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:Joakim.Tjernlund at infinera.com]
> > Sent: Wednesday, September 06, 2017 3:17 PM
> > To: linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>; York Sun
> > <york.sun at nxp.com>
> > Subject: Re: Machine Check in P2010(e500v2)
> >
> > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > -----Original Message-----
> > > > From: York Sun
> > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > To: Joakim Tjernlund <Joakim.Tjernlund at infinera.com>; linuxppc-
> > > > dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>
> > > > Subject: Re: Machine Check in P2010(e500v2)
> > > >
> > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > >
> > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > So after some debugging I found this bug:
> > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > >          if (is_in_pci_mem_space(addr)) {
> > > > >                  if (user_mode(regs)) {
> > > > >                          pagefault_disable();
> > > > > -                        ret = get_user(regs->nip, &inst);
> > > > > +                        ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > >                          pagefault_enable();
> > > > >                  } else {
> > > > >                          ret = probe_kernel_address(regs->nip, inst);
> > > > >
> > > > > However, the kernel still locked up after fixing that.
> > > > > Now I wonder why this fixup is there in the first place? The
> > > > > routine will not really fixup the insn, just return 0xffffffff for
> > > > > the failing read and then advance the process NIP.
> > >
> > > You are right. The code here only gives 0xffffffff to the load instruction and
> > > continues with the next instruction when the load instruction is causing the
> > > machine check. This prevents a system lockup when reading from a
> > > PCI/RapidIO device whose link is down.
> > >
> > > I don't know what the actual problem is in your case. Maybe it is a write
> > > instruction instead of a read? Or the code is in an infinite loop waiting for a valid
> > > read result? Are you able to do some further debugging with the NIP correctly
> > > printed?
> > >
> >
> > According to the MC it is a read, and the NIP also leads to a read in the program.
> > ATM, I have disabled the fixup, but I will enable it again.
> > Question: is it safe to add a small printk when this MC happens (after fixing up)? I
> > need to see that it has happened, as the error is somewhat random.
>
> I think it is safe to add printk as the current machine check handlers are also using printk.
I hope so, but when the fixup fires there is currently no printk at all, so I was a bit unsure.
I don't like this fixup, though. Is there no better way than faking a read
result to user space (or to the kernel, for that matter)?
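
To make clear what I mean by "faking a read", here is a minimal user-space
sketch of the idea as I understand it from the discussion above: fetch the
instruction at NIP, and if it is a load, put all-ones in the target register
and step NIP past it. This is not the kernel code. struct fake_regs,
fake_failed_load() and emulate_one() are hypothetical names, text[] stands in
for instruction memory, and only three D-form loads are decoded.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few pt_regs fields the fixup touches. */
struct fake_regs {
    uint32_t gpr[32];
    uint32_t nip;   /* here: byte offset into a simulated text[] array */
};

/* PowerPC primary opcodes of the D-form loads decoded in this sketch. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/*
 * Pretend the failed load returned all-ones and skip the instruction.
 * Returns 1 if inst was a recognised load, 0 if nothing was fixed up.
 */
int fake_failed_load(struct fake_regs *regs, uint32_t inst)
{
    unsigned int rt = (inst >> 21) & 0x1f;   /* destination register field */

    switch (inst >> 26) {                    /* primary opcode */
    case OP_LWZ: regs->gpr[rt] = 0xffffffffu; break;
    case OP_LHZ: regs->gpr[rt] = 0xffffu;     break;
    case OP_LBZ: regs->gpr[rt] = 0xffu;       break;
    default:     return 0;                   /* e.g. a store: no fixup */
    }
    regs->nip += 4;                          /* continue after the load */
    return 1;
}

/*
 * Fetch the instruction that nip points at, then try the fixup.
 * Note the direction of the fetch: inst is read FROM the address in nip,
 * which is what the corrected get_user() call in the patch does.
 */
int emulate_one(struct fake_regs *regs, const uint32_t *text, size_t n_insns)
{
    size_t idx = regs->nip / 4;

    if ((regs->nip & 3) != 0 || idx >= n_insns)
        return 0;                            /* cannot fetch: no fixup */
    return fake_failed_load(regs, text[idx]);
}
```

With text[] = { 0x812a0000, 0x912a0000 }, i.e. lwz r9,0(r10) followed by
stw r9,0(r10), and nip = 0, the first emulate_one() call sets gpr[9] to
0xffffffff and moves nip to 4. The second call returns 0 because stw is not a
load, so in that case nothing is faked and nip stays where it is.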
Jocke