[Cbe-oss-dev] intermittent trouble on startup on a QS22 with 2.6.30
Marcus Daniels
mdaniels at lanl.gov
Fri Dec 11 11:01:27 EST 2009
Also, the newer (1.1) drivers/axon/ code from the 2.6.32 tree, built
against 2.6.30, loads OK at boot:
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
irq: irq 116 on host /axon@10000000000/interrupt-controller mapped to virtual irq 116
irq: irq 117 on host /axon@10000000000/interrupt-controller mapped to virtual irq 117
irq: irq 116 on host /axon@30000000000/interrupt-controller mapped to virtual irq 64
irq: irq 117 on host /axon@30000000000/interrupt-controller mapped to virtual irq 65
irq: irq 119 on host /axon@10000000000/interrupt-controller mapped to virtual irq 119
irq: irq 120 on host /axon@10000000000/interrupt-controller mapped to virtual irq 120
irq: irq 121 on host /axon@10000000000/interrupt-controller mapped to virtual irq 121
irq: irq 122 on host /axon@10000000000/interrupt-controller mapped to virtual irq 122
irq: irq 123 on host /axon@10000000000/interrupt-controller mapped to virtual irq 123
irq: irq 124 on host /axon@10000000000/interrupt-controller mapped to virtual irq 124
irq: irq 125 on host /axon@10000000000/interrupt-controller mapped to virtual irq 66
irq: irq 126 on host /axon@10000000000/interrupt-controller mapped to virtual irq 67
irq: irq 127 on host /axon@10000000000/interrupt-controller mapped to virtual irq 127
Probe of dmax0 on /axon@10000000000/plb5/dma-controller@4000004400001000 complete
irq: irq 119 on host /axon@30000000000/interrupt-controller mapped to virtual irq 68
irq: irq 120 on host /axon@30000000000/interrupt-controller mapped to virtual irq 70
irq: irq 121 on host /axon@30000000000/interrupt-controller mapped to virtual irq 71
irq: irq 122 on host /axon@30000000000/interrupt-controller mapped to virtual irq 72
irq: irq 123 on host /axon@30000000000/interrupt-controller mapped to virtual irq 73
irq: irq 124 on host /axon@30000000000/interrupt-controller mapped to virtual irq 74
irq: irq 125 on host /axon@30000000000/interrupt-controller mapped to virtual irq 75
irq: irq 126 on host /axon@30000000000/interrupt-controller mapped to virtual irq 76
irq: irq 127 on host /axon@30000000000/interrupt-controller mapped to virtual irq 77
Probe of dmax1 on /axon@30000000000/plb5/dma-controller@4000004400001000 complete
axon0: sdr_base 0xe, len = 0x2, mapped to d000080081800000
axon0:Set to strong ordering. Changed SDR_C3PO from 0x30000000 to 0x00000000.
Probe of axon0 on /axon@10000000000/plb5/pciep@a00000a200000000 complete
axon1: sdr_base 0xe, len = 0x2, mapped to d000080081900000
axon1:Set to strong ordering. Changed SDR_C3PO from 0x30000000 to 0x00000000.
Probe of axon1 on /axon@30000000000/plb5/pciep@a00000a200000000 complete
axon driver Version 1.1.00 (257,1) loaded.
IBM AXON PCIe Network Driver - apnet version 1.01, compatability version 2
Instantiating apnet0
apnet0: MAC address D2:F3:4E:0A:0A:51
apnet0 (): not using net_device_ops yet
apnet0: 2048 bytes mapped at 0xc000000002404100 for RX descriptors
Initialized apnet0 interface
Instantiating apnet1
apnet1: MAC address EE:44:40:4B:27:90
apnet1 (): not using net_device_ops yet
apnet1: 2048 bytes mapped at 0xc0000001fb404100 for RX descriptors
Initialized apnet1 interface
... but I'm having some reliability problems. First, I see this at the
end of boot:
apnet0: TX descriptors mapped at 0xd000080082004100
apnet0: TX stopped, remote ring not ready!
apnet1: TX descriptors mapped at 0xd000080082804100
apnet1: TX stopped, remote ring not ready!
NETDEV WATCHDOG: apnet0 (): transmit timed out
------------[ cut here ]------------
Badness at net/sched/sch_generic.c:226
NIP: c000000000482b50 LR: c000000000482b4c CTR: 0000000000000001
REGS: c00000000ffe7ad0 TRAP: 0700 Not tainted (2.6.30)
MSR: 9000000000029032 <EE,ME,CE,IR,DR> CR: 24000024 XER: 20000000
TASK = c0000000fe6b27c0[0] 'swapper' THREAD: c0000000fe6d0000 CPU: 3
GPR00: c000000000482b4c c00000000ffe7d50 c00000000093ab00 0000000000000032
GPR04: 0000000000000000 ffffffffffffffff 0000000000000004 c00000000080bf5c
GPR08: 000000000001ffff 0000000000000000 c000000000a0d63c 0000000000000001
GPR12: 0000000048000042 c0000000009e2a00 ffffffffffffffff ffffffffffffffff
GPR16: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
GPR20: 01020304cabebabe c0000000009f3c40 0000000000000001 c0000000009e2400
GPR24: c0000000009e2a00 0000000000000003 c0000000fccd8000 0000000000000003
GPR28: 0000000000000001 c0000001fe8cdf00 c0000000008cafa8 c0000000fccd8000
NIP [c000000000482b50] .dev_watchdog+0x1b0/0x2e4
LR [c000000000482b4c] .dev_watchdog+0x1ac/0x2e4
Call Trace:
[c00000000ffe7d50] [c000000000482b4c] .dev_watchdog+0x1ac/0x2e4 (unreliable)
[c00000000ffe7e30] [c0000000000a2d98] .run_timer_softirq+0x1a8/0x268
[c00000000ffe7ee0] [c00000000009c618] .__do_softirq+0x104/0x228
[c00000000ffe7f90] [c00000000002a094] .call_do_softirq+0x14/0x24
[c0000000fe6d3800] [c00000000000d5f8] .do_softirq+0x88/0xf0
[c0000000fe6d38a0] [c00000000009c828] .irq_exit+0x54/0xa8
[c0000000fe6d3920] [c0000000000273f0] .timer_interrupt+0x1b0/0x1e0
[c0000000fe6d39b0] [c000000000061d80] .cbe_system_reset_exception+0x74/0xb0
[c0000000fe6d3a30] [c0000000000284cc] .system_reset_exception+0x44/0xd8
[c0000000fe6d3ab0] [c000000000003414] system_reset_common+0x114/0x180
--- Exception: 100 at .cbe_power_save+0x98/0xb4
LR = .cpu_idle+0x10c/0x1d0
[c0000000fe6d3da0] [c0000000fe6d3e30] 0xc0000000fe6d3e30 (unreliable)
[c0000000fe6d3e30] [c00000000001384c] .cpu_idle+0x10c/0x1d0
[c0000000fe6d3ec0] [c00000000051601c] .start_secondary+0x38c/0x3d0
[c0000000fe6d3f90] [c0000000000082e0] .start_secondary_prolog+0x10/0x14
Instruction dump:
2f800000 40be003c 38810070 7fe3fb78 38a00040 4bfe292d 60000000 7fe4fb78
7c651b78 e87e8050 4bc13d89 60000000 <0fe00000> 38000001 e93e8048 90090000
and then, if I ping the Opteron host node, the system freezes after
two replies:
[mdaniels@rtd006c ~]$ ping 192.168.3.1
PING 192.168.3.1 (192.168.3.1) 56(84) bytes of data.
64 bytes from 192.168.3.1: icmp_seq=1 ttl=64 time=60.7 ms
64 bytes from 192.168.3.1: icmp_seq=2 ttl=64 time=7.83 ms
[freeze]
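
For comparing the irq mappings in the boot log between a good and a bad
boot, a minimal sketch along these lines might help. It assumes the log
was saved as plain text; the function name remapped_irqs and the regex
are illustrative only and not part of the axon or apnet drivers. It
rejoins wrapped lines, accepts both "/axon@..." and the archived
"/axon at ..." spelling, and lists the mappings where the virtual irq
differs from the hardware irq.

```python
import re

# One mapping message, after wrapped lines have been rejoined.
# Accepts "/axon@<addr>" as well as the archive-style "/axon at <addr>".
IRQ_RE = re.compile(
    r"irq: irq (\d+) on host /axon(?:@| at )([0-9a-f]+)/interrupt-controller "
    r"mapped to virtual irq (\d+)"
)


def remapped_irqs(log_text):
    """Return (host, hw_irq, virq) tuples where virq differs from hw_irq."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (host, int(hw), int(virq))
        for hw, host, virq in IRQ_RE.findall(flat)
        if hw != virq
    ]
```

Feeding it the text of a saved boot log (for example the output of
dmesg) gives one tuple per remapped interrupt, which can then be diffed
across boots.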
> disabling the PCI endpoint on 2.6.30 did work. The Axon devices
> register and apnet comes up too. However, this was Axon 1.0.18 and
> apnet 1.01.