Inbound PCI and Memory Corruption
Peter LaDow
pladow at gmail.com
Wed Jun 26 04:44:14 EST 2013
On Sat, Jun 22, 2013 at 5:00 PM, Benjamin Herrenschmidt
<benh at kernel.crashing.org> wrote:
> Afaik e300 is slightly out of order, maybe it's missing a memory barrier
> somewhere.... One thing to try is to add some to the dma_map/unmap ops.
I went through the driver and added a memory barrier after each
dma_map_page/dma_unmap_page and dma_alloc_coherent/dma_free_coherent
call (wmb(), which resolves to a sync instruction). I still get a
kernel panic.
I did turn on DEBUG_PAGEALLOC to try to get more information, but
it shows nothing new. With SLAB debugging enabled, however, I do
find slab corruption, e.g.:
Slab corruption: fib6_nodes start=e900c7f8, len=32
Redzone: 0x9f911029d74e35b/0x30a706a6050806.
Last user: [<06040001>](0x6040001)
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ff ff ff ff ff ff
Prev obj: start=e900c7c0, len=32
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [< (null)>](0x0)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5
Next obj: start=e900c830, len=32
Redzone: 0x30a706a6050aca/0xc8be11029d74e35b.
Last user: [< (null)>](0x0)
000: 0d aa 00 00 00 00 00 00 0a ca 0d 49 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 75 8b
This is clearly corrupted with Ethernet frame data. The only interface
connected is the e1000. Eventually this corruption leads to a kernel
panic.
I'm completely confused about how this could happen. Given the M bit is
set for all pages (see below), and with memory barriers on the DMA
map/unmap and register operations, the only thing I can think of is
something in the IO sequencer (which was suggested in the link I gave
earlier). Yet the patch mentioned is in place.
> Also audit the driver to ensure that it properly uses barriers when
> populating descriptors (and maybe compare to a more recent version of
> the driver upstream).
I've gone through the driver and didn't see anything missing. The
upstream (v3.10-rc5) driver is the same version (7.3.21-k8-NAPI). I've
also tried the latest e1000 release (8.0.35-NAPI) and get the same
problem.
On Sun, Jun 23, 2013 at 6:16 PM, Benjamin Herrenschmidt
<benh at kernel.crashing.org> wrote:
> Also dbl check that the MMU is indeed mapping all these pages with the
> "M" bit.
The DBATs have the M bit set (both have 0x12 in the DBATxL
registers)...sometimes. Usually when I halt the CPU and dump the
BATs, all the IBATs and DBATs are zero. But occasionally I see
DBAT2 and DBAT3 with values and the M bit set.
I also dumped all the TLB entries, and every one of them has the M bit
set (see below).
TLB dump:
BDI>dtlb 0 63
IDX V RC VSID VPI RPN WIMG PP
0: V 0C 000eee_e9a0000 -> 2e9a0000 --M- 00
1: V 0C 000eee_f401000 -> 2f401000 --M- 00
2: V 1C 000ccc_0502000 -> 00502000 --M- 00
3: V 0C 000eee_f403000 -> 2f403000 --M- 00
4: V 0C 000eee_c124000 -> 2c124000 --M- 00
5: V 0C 000eee_f405000 -> 2f405000 --M- 00
6: V 0C 000eee_e9e6000 -> 2e9e6000 --M- 00
7: V 0C 33afd1_0427000 -> 005f8000 --M- 10
8: V 0C 33afd1_0428000 -> 2ff63000 --M- 10
9: V 0C 000ccc_0349000 -> 00349000 --M- 00
10: V 1C 000ccc_03ca000 -> 003ca000 --M- 00
11: V 1C 000ccc_03cb000 -> 003cb000 --M- 00
12: V 0C 33afd1_040c000 -> 003b4000 --M- 11
13: V 0C 000eee_f40d000 -> 2f40d000 --M- 00
14: V 1C 000eee_fa8e000 -> 2fa8e000 --M- 00
15: V 0- 33afd1_034f000 -> 2e6b1000 --M- 11
16: V 0C 000eee_f470000 -> 2f470000 --M- 00
17: V 0C 33afd1_0411000 -> 2fe54000 --M- 10
18: V 0C 000eee_f4b2000 -> 2f4b2000 --M- 00
19: V 1C 33eb14_8073000 -> 00462000 --M- 10
20: V 0C 000ccc_02f4000 -> 002f4000 --M- 00
21: V 0C 000eee_f415000 -> 2f415000 --M- 00
22: V 1C 000ccc_03f6000 -> 003f6000 --M- 00
23: V 0C 000ccc_02f7000 -> 002f7000 --M- 00
24: V 1C 000ccc_03f8000 -> 003f8000 --M- 00
25: V 0C 000ccc_03d9000 -> 003d9000 --M- 00
26: V 1C 33b304_a31a000 -> 007f4000 --M- 10
27: V 1C 000ccc_03fb000 -> 003fb000 --M- 00
28: V 1C 000ccc_03fc000 -> 003fc000 --M- 00
29: V 0C 000eee_f41d000 -> 2f41d000 --M- 00
30: V 1C 000eee_e87e000 -> 2e87e000 --M- 00
31: V 1C 33afd1_045f000 -> 2fe52000 --M- 10
32: V 0C 000ccc_0000000 -> 00000000 --M- 00
33: V 0C 000eee_e9a1000 -> 2e9a1000 --M- 00
34: V 1C 33b304_8022000 -> 00f44000 --M- 10
35: V 0C 000ccc_0503000 -> 00503000 --M- 00
36: V 0C 33afd1_0744000 -> 2fe17000 --M- 10
37: V 0C 000eee_c125000 -> 2c125000 --M- 00
38: V 0C 33e7e1_0406000 -> 0078e000 --M- 11
39: V 0C 000eee_e987000 -> 2e987000 --M- 00
40: V 0C 000ccc_0008000 -> 00008000 --M- 00
41: V 0C 000ccc_03c9000 -> 003c9000 --M- 00
42: V 1C 33ba7b_f8ea000 -> 005f9000 --M- 10
43: V 1C 33afd1_040b000 -> 2ffe0000 --M- 11
44: V 0C 000ccc_03cc000 -> 003cc000 --M- 00
45: V 0C 000eee_b68d000 -> 2b68d000 --M- 00
46: V 1C 000eee_f40e000 -> 2f40e000 --M- 00
47: V 0C 000eee_fa8f000 -> 2fa8f000 --M- 00
48: V 0C 33afd1_0410000 -> 2fe4a000 --M- 10
49: V 0C 000eee_f471000 -> 2f471000 --M- 00
50: V 0C 000ccc_03f2000 -> 003f2000 --M- 00
51: V 1C 000eee_f473000 -> 2f473000 --M- 00
52: V 0C 000ccc_03f4000 -> 003f4000 --M- 00
53: V 0C 000ccc_03f5000 -> 003f5000 --M- 00
54: V 1C 000eee_f456000 -> 2f456000 --M- 00
55: V 0C 000eee_d2f7000 -> 2d2f7000 --M- 00
56: V 1C 000ccc_03d8000 -> 003d8000 --M- 00
57: V 0C 000eee_e879000 -> 2e879000 --M- 00
58: V 1C 000eee_f41a000 -> 2f41a000 --M- 00
59: V 1C 000ccc_03db000 -> 003db000 --M- 00
60: V 1C 000eee_f43c000 -> 2f43c000 --M- 00
61: V 0C 000ccc_04fd000 -> 004fd000 --M- 00
62: V 1C 000eee_f43e000 -> 2f43e000 --M- 00
63: V 1C 000eee_e93f000 -> 2e93f000 --M- 00