Machine Check in P2010(e500v2)

Thu Sep 7 05:31:16 AEST 2017

> -----Original Message-----
> From: York Sun
> Sent: Wednesday, September 06, 2017 10:38 AM
> To: Joakim Tjernlund <Joakim.Tjernlund at infinera.com>;
> linuxppc-dev at lists.ozlabs.org; Leo Li <leoyang.li at nxp.com>
> Subject: Re: Machine Check in P2010(e500v2)
> 
> Scott is no longer with Freescale/NXP. Adding Leo.
> 
> On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > So after some debugging I found this bug:
> > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> >          if (is_in_pci_mem_space(addr)) {
> >                  if (user_mode(regs)) {
> >                          pagefault_disable();
> > -                       ret = get_user(regs->nip, &inst);
> > +                       ret = get_user(inst, (__u32 __user *)regs->nip);
> >                          pagefault_enable();
> >                  } else {
> >                          ret = probe_kernel_address(regs->nip, inst);
> >
> > However, the kernel still locked up after fixing that.
> > Now I wonder why this fixup is there in the first place? The routine
> > will not really fixup the insn, just return 0xffffffff for the failing
> > read and then advance the process NIP.

You are right.  The code here only supplies 0xffffffff as the result of the load instruction and continues with the next instruction when a load instruction causes the machine check.  This prevents a system lockup when reading from a PCI/RapidIO device whose link is down.
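
To make the intent of that fixup concrete, here is a rough user-space sketch of the idea, not the kernel code itself.  It only covers the D-form integer loads (the indexed forms under primary opcode 31 are left out), and struct fake_regs, fake_failed_load() and the insn_*() helpers are hypothetical names made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (illustration only). */
struct fake_regs {
    uint32_t gpr[32];
    uint32_t nip;
};

/* Primary opcodes of the PowerPC D-form integer loads. */
enum {
    OP_LWZ = 32, OP_LWZU = 33, OP_LBZ = 34, OP_LBZU = 35,
    OP_LHZ = 40, OP_LHZU = 41, OP_LHA = 42, OP_LHAU = 43
};

/* Field extraction for a 32-bit D-form instruction word. */
static unsigned int insn_op(uint32_t inst) { return inst >> 26; }
static unsigned int insn_rt(uint32_t inst) { return (inst >> 21) & 0x1f; }
static unsigned int insn_ra(uint32_t inst) { return (inst >> 16) & 0x1f; }
static int32_t insn_d(uint32_t inst) { return (int16_t)(inst & 0xffff); }

/*
 * If inst is one of the loads above, pretend it read all ones, apply the
 * base-register update for the "u" forms, step nip past the instruction
 * and return 1.  Anything else (a store, for example) is left alone and
 * 0 is returned, so the caller falls back to normal machine check handling.
 */
static int fake_failed_load(struct fake_regs *regs, uint32_t inst)
{
    unsigned int rt = insn_rt(inst);
    unsigned int ra = insn_ra(inst);
    uint32_t val;
    int update = 0;

    switch (insn_op(inst)) {
    case OP_LWZU: update = 1; /* fall through */
    case OP_LWZ:  val = 0xffffffffu; break;
    case OP_LBZU: update = 1; /* fall through */
    case OP_LBZ:  val = 0xffu; break;
    case OP_LHZU: update = 1; /* fall through */
    case OP_LHZ:  val = 0xffffu; break;
    case OP_LHAU: update = 1; /* fall through */
    case OP_LHA:  val = 0xffffffffu; break; /* 0xffff sign-extended */
    default:
        return 0; /* not a load handled by this sketch */
    }

    regs->gpr[rt] = val;
    if (update)
        regs->gpr[ra] += (uint32_t)insn_d(inst);
    regs->nip += 4; /* resume at the instruction after the load */
    return 1;
}
```

With this sketch, feeding it 0x81280000 (lwz r9,0(r8)) leaves 0xffffffff in r9 and moves nip forward by 4, while 0x91280000 (stw r9,0(r8)) is rejected.  That is why a failing store, or a loop that keeps retrying until it reads something other than all ones, would not be helped by this kind of fixup.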

I don't know what the actual problem is in your case.  Maybe it is a write instruction instead of a read?  Or the code is in an infinite loop waiting for a valid read result?  Are you able to do some further debugging now that the NIP is printed correctly?

Regards,
Leo

> >
> > Removing the fixup does not help either, kernel still locks up:
> > [   28.170532] Machine check in kernel mode.
> > [   28.174538] Caused by (from MCSR=10008):
> > [   28.182804] Bus - Read Data Bus Error: DAR:b7013000
> > [   28.197079] Oops: Machine check, sig: 7 [#1]
> > [   28.201343] P1010 RDB
> > [   28.203608] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) linux_kernel_bde(PO)
> > [   28.211796] CPU: 0 PID: 470 Comm: emxp2_hw_bl Tainted: P           O 4.1.38+ #201
> > [   28.219540] task: db16ed10 ti: df122000 task.ti: df122000
> > [   28.224935] NIP: 10a4e2f4 LR: 10a4e404 CTR: 10046c38
> > [   28.229896] REGS: df123f10 TRAP: 0204   Tainted: P           O     (4.1.38+)
> > [   28.236942] MSR: 0002d000 <CE,EE,PR,ME>  CR: 44002428  XER: 00000000
> > [   28.243306] DEAR: b7013000 ESR: 00000000
> > GPR00: 10a4e404 bfab2730 b7b354a0 132f9fa8 07006000 07000000 00000000 132f9fd8
> > GPR08: b6fd5000 b6fe5000 0003e000 bfab2720 24004424 11d6cf7c 00000000 00000000
> > GPR16: 10f6e29c 10f6c872 10f6db01 0000b541 0000b541 11d92fcc 00000011 00000001
> > GPR24: 01a5bd3e 132ffbf0 11d60000 00000000 07006000 00000000 132f9fa8 00000000
> > [   28.275547] NIP [10a4e2f4] 0x10a4e2f4
> > [   28.279204] LR [10a4e404] 0x10a4e404
> > [   28.282772] Call Trace:
> > [   28.285213] ---[ end trace 9f8b64ab1e83f449 ]---
> > [   28.289825]
> >
> >
> >   Jocke
> >
> > On Fri, 2017-09-01 at 13:32 +0200, Joakim Tjernlund wrote:
> >> I am trying to debug a Machine Check for a P2010 (e500v2) CPU:
> >>
> >> [   28.111816] Caused by (from MCSR=10008): Bus - Read Data Bus Error
> >> [   28.117998] Oops: Machine check, sig: 7 [#1]
> >> [   28.122263] P1010 RDB
> >> [   28.124529] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) linux_kernel_bde(PO)
> >> [   28.132718] CPU: 0 PID: 470 Comm: emxp2_hw_bl Tainted: P           O 4.1.38+ #49
> >> [   28.140376] task: db16cd10 ti: df128000 task.ti: df128000
> >> [   28.145770] NIP: 00000000 LR: 10a4e404 CTR: 10046c38
> >> [   28.150730] REGS: df129f10 TRAP: 0204   Tainted: P           O     (4.1.38+)
> >> [   28.157776] MSR: 0002d000 <CE,EE,PR,ME>  CR: 44002428  XER: 00000000
> >> [   28.164140] DEAR: b7187000 ESR: 00000000
> >> GPR00: 10a4e404 bf86ea30 b7ca94a0 132f9fa8 07006000 07000000 00000000 132f9fd8
> >> GPR08: b7149000 b7159000 0003e000 bf86ea20 24004424 11d6cf7c 00000000 00000000
> >> GPR16: 10f6e29c 10f6c872 10f6db01 0000b541 0000b541 11d92fcc 00000011 00000001
> >> GPR24: 01a4d12d 132ffbf0 11d60000 00000000 07006000 00000000 132f9fa8 00000000
> >> [   28.196375] NIP [00000000]   (null)
> >> [   28.199859] LR [10a4e404] 0x10a4e404
> >> [   28.203426] Call Trace:
> >> [   28.205866] ---[ end trace f456255ddf9bee83 ]---
> >>
> >> I cannot figure out why NIP is NULL. It looks like NIP is set to
> >> MCSRR0 early on, but maybe it is lost somehow?
> >>
> >> Anyhow, looking at entry_32.S:
> >> 	.globl	mcheck_transfer_to_handler
> >> mcheck_transfer_to_handler:
> >> 	mfspr	r0,SPRN_DSRR0
> >> 	stw	r0,_DSRR0(r11)
> >> 	mfspr	r0,SPRN_DSRR1
> >> 	stw	r0,_DSRR1(r11)
> >> 	/* fall through */
> >>
> >> 	.globl	debug_transfer_to_handler
> >> debug_transfer_to_handler:
> >> 	mfspr	r0,SPRN_CSRR0
> >> 	stw	r0,_CSRR0(r11)
> >> 	mfspr	r0,SPRN_CSRR1
> >> 	stw	r0,_CSRR1(r11)
> >> 	/* fall through */
> >>
> >> 	.globl	crit_transfer_to_handler
> >> crit_transfer_to_handler:
> >>
> >> It looks odd that DSRRx is assigned in mcheck and CSRRx in debug and
> >> crit has none. Should not this assignment be shifted down one level?
> >>
> >>    Jocke
> >