Network TX Stall on 440EP Processor

Wed Jun 21 07:17:10 AEST 2017

I'm working on a project that is derived from the Yosemite
PPC 440EP board.  It's a legacy project that was running the
2.6.24 Kernel, and network traffic was stalling due to transmission
halting without an understandable error (in this error condition, the
various
status registers of network interface showed no issues), other
than TX stalling due to Buffer Descriptor Ring becoming full.

In order to see if the problem has been resolved, the Kernel
has been updated to 4.9.13, compiled with gcc version 5.4.0
(Buildroot 2017.02.2).  Although the frequency of the
problem is decreased, it still does show up.

The test case is the Linux Target running idle, no application
code.  From a Linux host on a directly connected network, 30
flood pings are started.  After a period of several minutes to
perhaps hours, the transmit aspect of the network controller
ceases to transmit packets (Buffer Descriptor ring becomes full).
RX still works.  In the 2.6.24 Kernel, the problem happens
within seconds, so it has improved with the new Kernel.

Below is the output from the Kernel when this happens.

Has anybody seen this problem before?  I can't find any
errata on it, nor can I find any reports of it.

The orginal problem is rooted in the Embedded Application
running, and after a period of time of heavy network
traffic, the TX side of network stalls.  The flood ping
test is used simply to force the problem to happen.

[ 3127.143572] NETDEV WATCHDOG: eth0 (emac): transmit queue 0 timed out
[ 3127.150172] ------------[ cut here ]------------
[ 3127.154778] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
dev_watchdog+0x23c/0x244
[ 3127.162965] Modules linked in:
[ 3127.166013] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.13 #9
[ 3127.171707] task: c0e67300 task.stack: c0f00000
[ 3127.176192] NIP: c068e734 LR: c068e734 CTR: c04672f4
[ 3127.181107] REGS: c0f01c90 TRAP: 0700   Not tainted  (4.9.13)
[ 3127.186793] MSR: 00029000 <CE,EE,ME>[ 3127.190241]   CR: 28122222  XER:
00000000
[ 3127.194210]
GPR00: c068e734 c0f01d40 c0e67300 00000038 d1006301 000000df c04683e4
000000df
GPR08: 000000df c0eff4b0 c0eff4b0 00000004 24122424 00b960f0 00000000
c0e80000
GPR16: 000ac8c1 c07b8618 c098bddc c0e69000 0000000a c0ee0000 c0e73f20
c0f00000
GPR24: c100e4e8 c0ee0000 c0e77d60 c3128000 c068e4f8 c0e80000 00000000
c3128000
NIP [c068e734] dev_watchdog+0x23c/0x244
[ 3127.227680] LR [c068e734] dev_watchdog+0x23c/0x244
[ 3127.232427] Call Trace:
[ 3127.234857] [c0f01d40] [c068e734] dev_watchdog+0x23c/0x244 (unreliable)
[ 3127.241447] [c0f01d60] [c00805e8] call_timer_fn+0x40/0x118
[ 3127.246889] [c0f01d80] [c00808e8] expire_timers.isra.13+0xbc/0x114
[ 3127.253032] [c0f01db0] [c0080a94] run_timer_softirq+0x90/0xf0
[ 3127.258753] [c0f01e00] [c07b31b4] __do_softirq+0x114/0x2b0
[ 3127.264202] [c0f01e60] [c002a158] irq_exit+0xe8/0xec
[ 3127.269144] [c0f01e70] [c0008c98] timer_interrupt+0x34/0x4c
[ 3127.274684] [c0f01e80] [c000ec94] ret_from_except+0x0/0x18
[ 3127.280151] --- interrupt: 901 at cpm_idle+0x3c/0x70
[ 3127.280151]     LR = arch_cpu_idle+0x30/0x68
[ 3127.289300] [c0f01f40] [c0f058e4] cpu_idle_force_poll+0x0/0x4
(unreliable)
[ 3127.296146] [c0f01f50] [c00073e4] arch_cpu_idle+0x30/0x68
[ 3127.301509] [c0f01f60] [c005bce8] cpu_startup_entry+0x184/0x1bc
[ 3127.307392] [c0f01fb0] [c0a76a1c] start_kernel+0x3d4/0x3e8
[ 3127.312843] [c0f01ff0] [c00000b4] _start+0xb4/0xf8
[ 3127.317599] Instruction dump:
[ 3127.320557] 811f0284 4bffff78 39200001 7fe3fb78 99281966 4bfd9cd5
7c651b78 3c60c0a1
[ 3127.328359] 7fc6f378 7fe4fb78 3863357c 48125319 <0fe00000> 4bffffb8
7c0802a6 90010004
[ 3127.336327] ---[ end trace c31dfe4772ff0e8f ]---
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20170620/c6eedc5b/attachment.html>