[PATCH v5 00/21] EEH reorganization

Tue Apr 17 11:57:51 EST 2012

On Tue, 2012-04-17 at 11:37 +1000, Anton Blanchard wrote:
> 
> No. I replaced that backtrace in eeh_dn_check_failure with a WARN_ON()
> because the backtrace doesn't give us enough info. I'm submitting a
> patch for that today.
> 
> Bottom line is mstmread has been causing an EEH error since at least
> 3.0, but in 3.4 we now oops instead of recovering. The signs all point
> to the EEH rework in 3.4.

More precisely, the original oops reported by Anton decodes as such:

>Oops: Kernel access of bad area, sig: 11 [#1]

This is typically a bad memory access.

>SMP NR_CPUS=1024 NUMA pSeries
>Modules linked in:
>NIP: c000000000055af8 LR: c000000000033204 CTR: 0000000000000000
>REGS: c000001f42fb7990 TRAP: 0300   Tainted: G        W     (3.4.0-rc2-00065-gf549e08-dirty)

TRAP: 300 means that it's the result of a data access interrupt, i.e. a
load or store to a bad address.
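
For reference, a minimal sketch of how the TRAP field can be read, assuming
the classic 64-bit server (Book3S) vector layout. The table is deliberately
partial and the helper name is made up for illustration:

```python
# Partial, illustrative table of classic 64-bit PowerPC exception vectors.
# Only a handful of entries are listed; describe_trap() is a made-up helper.
TRAP_NAMES = {
    0x300: "data storage (bad load/store address)",
    0x380: "data segment",
    0x400: "instruction storage (bad instruction fetch)",
    0x480: "instruction segment",
    0x600: "alignment",
    0x700: "program check",
}

def describe_trap(trap_field: str) -> str:
    """Map the hex TRAP field of an oops (e.g. '0300') to a short description."""
    return TRAP_NAMES.get(int(trap_field, 16), "not in this partial table")

print(describe_trap("0300"))
```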

>MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 24008084  XER: 00000000
>SOFTE: 1
>CFAR: 00000000000049b8
>DAR: 0000000000000070, DSISR: 40000000

Here the DAR tells us what address was accessed. 0x70 is a strong indication
that this was an access to a NULL pointer (at offset 0x70 from that pointer).

It -might- be something else (such as a NULL passed to a list head or such)
but the idea that there's a NULL floating around is a good hint.
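
To make the "NULL plus offset" reasoning concrete, here is a rough sketch.
The structure below is entirely hypothetical (it is not the real kernel
structure involved); it only shows that reading a member at offset 0x70
through a NULL base pointer touches address 0x70:

```python
import ctypes

# Hypothetical structure, made up for this sketch: 0x70 bytes of other
# members followed by the member the code wants to read.
class Example(ctypes.Structure):
    _fields_ = [
        ("other_members", ctypes.c_ubyte * 0x70),
        ("wanted", ctypes.c_uint64),
    ]

def fault_address(base: int, member_offset: int) -> int:
    """Effective address of a member access through the pointer 'base'."""
    return base + member_offset

offset = Example.wanted.offset          # offset of 'wanted' within the struct
print(hex(fault_address(0, offset)))    # access through a NULL base pointer
```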

>TASK = c000001f6c7dfc40[19010] 'eehd' THREAD: c000001f42fb4000 CPU: 6
>GPR00: 0000000000000001 c000001f42fb7c10 c000000000bd3a28 c000001f80ab0800 
>GPR04: c000001f7c57d418 0000000000000380 c000001f7c57e070 c000000000ed5360 
>GPR08: 0000000000000000 c000000000c77088 0000000000000000 0000000000000001 
>GPR12: 0000000044008088 c00000000eda1500 00000000019ffa78 0000000000a70000 
>GPR16: 00000000000000bb c000000000a9f754 c000000000963230 000000000000005e 
>GPR20: 0000000001b37e80 00000000000000bb 0000000000000000 c000000000b0ad90 
>GPR24: 0000000000000000 c000000000b10588 0000000000000001 c000001f80ab0800 
>GPR28: 0000000000000000 c000001f80ab0828 0000000000000000 c000001f7ee10000 
>NIP [c000000000055af8] .eeh_add_device_tree_late+0x58/0xf0

This is the function where it happened (eeh_add_device_tree_late).

>LR [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>Call Trace:
>[c000001f42fb7c10] [00000000fdffffff] 0xfdffffff (unreliable)
>[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
>[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
>[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
>[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
>[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70

And this is your backtrace. You can see that you got an EEH event, which
triggered an EEH reset, which triggered pcibios_add_pci_devices(), etc.
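
If it helps to see the chain in call order, a small sketch that pulls the
function names out of the trace lines quoted above and prints them from the
outermost caller inwards. The regular expression and helper name are my own,
and the "(unreliable)" entry is left out on purpose:

```python
import re

# Call-trace lines copied from the oops above, minus the "(unreliable)" entry.
TRACE = """\
[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70
"""

# [stack pointer] [return address] .symbol+offset/size
ENTRY = re.compile(r"\[[0-9a-f]+\] \[[0-9a-f]+\] \.(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def call_chain(trace: str) -> list:
    """Return function names ordered from outermost caller to innermost."""
    names = [m.group(1) for m in ENTRY.finditer(trace)]
    return list(reversed(names))

print(" -> ".join(call_chain(TRACE)))
```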

>Instruction dump:
>480000a8 60000000 ebff0000 7fbfe800 419e0098 2fbf0000 419e005c e9229eb0 
>80090008 2f800000 419e004c ebdf01d0 <e81e0070> 7fbf0000 3160ffff
>7d2b0110 
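
The word in angle brackets is the instruction at NIP. A rough sketch of
decoding it by hand, assuming it is a DS-form instruction (primary opcode
58, which covers ld/ldu/lwa); the helper name is made up for illustration:

```python
def decode_ds_form(word: int) -> dict:
    """Split a 32-bit PowerPC DS-form instruction word into its fields."""
    disp = word & 0xFFFC                # DS field with the two low bits zeroed
    if disp & 0x8000:                   # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode, 58 covers ld/ldu/lwa
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "disp": disp,                   # byte displacement
        "xo": word & 0x3,               # extended opcode, 0 selects plain ld
    }

fields = decode_ds_form(0xE81E0070)     # the word marked <...> in the dump
print(fields)

# GPR30 as shown in the register dump above
gpr30 = 0x0000000000000000
print(hex(gpr30 + fields["disp"]))      # effective address of the load
```

That decodes as ld r0,0x70(r30). GPR30 is 0 in the register dump, so the
load goes to 0 + 0x70, which matches the DAR.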

Cheers,
Ben.