MPC5200B FEC TX packets getting stuck

Thu Feb 2 13:33:13 EST 2012

First, I think the spin_lock() calls in the irq handlers should be
spin_lock_irqsave(), because the same lock is used in multiple irq
handlers.  If we get an rx interrupt while the tx interrupt holds the
spin lock, this would seem to be a problem.  In this case maybe not,
because it is a single-processor system and spin_locks should compile
to nothing (I haven't verified this), and the rx and tx handlers don't
really touch any common data elements.  I haven't tested changing
this, because I'm currently running a long test.
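
To make the concern concrete, here is a toy, single-threaded userspace
model of the nesting case.  It is only a sketch: every name in it is
made up for illustration, none of it is kernel API, and it says nothing
about whether the real rx and tx handlers can actually nest on this
kernel and configuration.

```c
#include <stdbool.h>

/* Toy model only; these are illustrative names, not kernel APIs. */
struct toy_cpu {
    bool irqs_enabled;  /* models the CPU interrupt-enable state */
    bool lock_held;     /* models the lock shared by both handlers */
    bool rx_pending;    /* an RX interrupt was raised but has not run yet */
    int  rx_handled;    /* RX handler runs that completed */
    int  would_spin;    /* RX handler runs that found the lock already held */
};

static void toy_rx_handler(struct toy_cpu *c)
{
    if (c->lock_held) {
        /* A lock that really spins would never return from here. */
        c->would_spin++;
        return;
    }
    c->lock_held = true;
    c->rx_handled++;
    c->lock_held = false;
}

/* Run the RX handler if one is pending and interrupts are enabled. */
static void toy_deliver(struct toy_cpu *c)
{
    if (c->irqs_enabled && c->rx_pending) {
        c->rx_pending = false;
        toy_rx_handler(c);
    }
}

/* TX handler that takes the lock without masking interrupts. */
static void toy_tx_plain(struct toy_cpu *c)
{
    c->lock_held = true;
    c->rx_pending = true;    /* RX fires inside the critical section */
    toy_deliver(c);          /* interrupts are on, so RX nests here */
    c->lock_held = false;
    toy_deliver(c);
}

/* TX handler in the save/mask/restore style. */
static void toy_tx_irqsave(struct toy_cpu *c)
{
    bool saved = c->irqs_enabled;

    c->irqs_enabled = false;
    c->lock_held = true;
    c->rx_pending = true;    /* RX fires inside the critical section */
    toy_deliver(c);          /* masked, so nothing runs */
    c->lock_held = false;
    c->irqs_enabled = saved;
    toy_deliver(c);          /* RX runs only after the lock is released */
}
```

Starting from a state with interrupts enabled, toy_tx_plain() leaves
would_spin at 1, while toy_tx_irqsave() lets the RX handler complete
once the lock has been dropped.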

On another front, I put some time stamp tracing into
mpc52xx_fec_start_xmit, and verified that the delay is happening after
the packet is added to the BestComm ring buffer.  There will be 3
quick calls to the xmit, but I'll only see 2 packets at the PC, until
200 - 400 ms later, when I'll get another xmit call (for the
retransmit), and then get two duplicate packets at the PC.
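
For turning raw time stamps into delays, a minimal sketch follows.  It
assumes the stamps are 32-bit timebase ticks (the lower timebase
register) and that the timebase frequency is known; tb_hz is a
parameter here because I am not stating the actual frequency on this
board.  The helper names are made up for illustration.

```c
#include <stdint.h>

/*
 * Ticks elapsed between two 32-bit timebase samples.  Unsigned
 * subtraction is modulo 2^32, so a single wrap of the counter between
 * the two samples still gives the right answer.
 */
static uint32_t tb_delta_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/*
 * Convert ticks to microseconds.  tb_hz is the timebase frequency in
 * Hz (an assumption supplied by the caller; must be non-zero).  The
 * multiply is done in 64 bits so it cannot overflow.
 */
static uint64_t tb_ticks_to_us(uint32_t ticks, uint32_t tb_hz)
{
    return ((uint64_t)ticks * 1000000u) / tb_hz;
}
```

With these, a delay between two xmit calls is
tb_ticks_to_us(tb_delta_ticks(t0, t1), tb_hz), which stays valid even if
the counter wraps once between the two samples.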

Attempting to add time stamping to the TX irq handler has revealed
this to be a Heisenbug of sorts.  After the following changes, I
haven't seen any delays in two hours of running.  Previously they
occurred every minute or so.

I'll let it run overnight and see if I get any additional delays.
Next I'll remove the timestamp code, and attempt to capture the state
of the ring buffer and BestComm at the point the retransmit packet is
handed off to the driver.  The delayed packet has to be somewhere at
that point.  It could be in the FEC queue, as I don't think I've seen a
delayed packet larger than 1k.

@@ -382,6 +414,8 @@
      dev_kfree_skb_irq(skb);
   }
   spin_unlock(&priv->lock);
+   js_irq_timestamps[js_irq_idx] = get_tbl();
+   js_irq_idx = (js_irq_idx+1 == TS_COUNT)? 0 : js_irq_idx+1;

    netif_wake_queue(dev);

@@ -409,6 +443,7 @@


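The index handling in the patch above can be tried out in userspace.
The sketch below mirrors the wrap-around logic of js_irq_idx and adds a
helper that walks the buffer from oldest to newest looking for the
largest gap.  The struct, the function names, the "filled" counter and
the TS_COUNT value of 8 are all assumptions for illustration; the patch
does not show how the buffer is declared or read back.

```c
#include <stdint.h>

#define TS_COUNT 8   /* assumed size; the real value is not shown */

struct ts_ring {
    uint32_t stamps[TS_COUNT];
    unsigned int idx;      /* next slot to write */
    unsigned int filled;   /* number of valid entries, at most TS_COUNT */
};

/* Store one time stamp, wrapping the index the same way the patch does. */
static void ts_ring_record(struct ts_ring *r, uint32_t now)
{
    r->stamps[r->idx] = now;
    r->idx = (r->idx + 1 == TS_COUNT) ? 0 : r->idx + 1;
    if (r->filled < TS_COUNT)
        r->filled++;
}

/*
 * Largest tick gap between consecutive stamps, walking from the oldest
 * entry to the newest.  Once the ring has wrapped, the oldest entry is
 * the one idx points at (the next to be overwritten).
 */
static uint32_t ts_ring_max_gap(const struct ts_ring *r)
{
    unsigned int oldest = (r->filled == TS_COUNT) ? r->idx : 0;
    uint32_t max_gap = 0;

    for (unsigned int i = 1; i < r->filled; i++) {
        uint32_t prev = r->stamps[(oldest + i - 1) % TS_COUNT];
        uint32_t cur  = r->stamps[(oldest + i) % TS_COUNT];
        uint32_t gap  = cur - prev;   /* modulo 2^32, survives one wrap */

        if (gap > max_gap)
            max_gap = gap;
    }
    return max_gap;
}
```

A zero-initialized struct ts_ring is a valid empty ring.  Dumping the
buffer right after a retransmit and looking at the largest gap is one
way to see whether the TX interrupt itself was late.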

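For the ring buffer capture planned above, the basic question is how
many descriptors the driver still considers outstanding when the
retransmit arrives.  Here is a sketch of that bookkeeping on a generic
producer/consumer ring.  The struct and its field names are
placeholders of mine, not the actual BestComm data structures, which
would need to be checked against the driver source.

```c
#include <stdbool.h>

/* Hypothetical snapshot of a descriptor ring; not the real BestComm layout. */
struct bd_ring_snapshot {
    unsigned int num_bd;   /* total descriptors in the ring (non-zero) */
    unsigned int index;    /* next descriptor the driver will fill */
    unsigned int outdex;   /* next descriptor expected to complete */
};

/* Descriptors submitted but not yet retired, as seen by the driver. */
static unsigned int bd_ring_in_flight(const struct bd_ring_snapshot *s)
{
    return (s->index + s->num_bd - s->outdex) % s->num_bd;
}

/* True if something was still outstanding when the snapshot was taken. */
static bool bd_ring_has_outstanding(const struct bd_ring_snapshot *s)
{
    return bd_ring_in_flight(s) != 0;
}
```

If the snapshot taken at the retransmit shows outstanding descriptors,
the delayed packet has presumably not been retired from the ring yet;
if it shows none, the packet would have to be further down, for example
in the FEC queue.
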
Joey Nelson



On Fri, Jan 27, 2012 at 12:14 PM, Joey Nelson <joey at joescan.com> wrote:
>
>
> In my application, I have a PC connected through TCP to an MPC5200B-based system.  The PC sends a short request, the MPC5200B receives the request and sends the data back.  It is doing this about 300 times per second.  Normally the exchange happens in just a handful of milliseconds.  But randomly, every 2 to 15 minutes, the MPC5200B sends all but the last packet of the response, and about 200 ms later the PC sends a delayed ACK, and the MPC5200B TCP stack figures the packet was lost.  It then sends two nearly identical packets (the IP header Identification and Checksum fields are incremented).  I can also see that RetransSegs in /proc/net/snmp increments by one for each of these delays.
>
> My theory is that the packet is getting stuck somewhere in the network stack (most likely toward the bottom).  Then when another packet is sent, the stuck one gets pushed out.
>
> I've done a test where I have another task on the MPC5200B sending UDP packets to a different PC every 10ms.  This eliminated the long delays, and seems to support my stuck packet theory.
>
> I'm seeing the same issue with 2.6.23 and 3.1.6.
>
> I'm getting ready to dive into the hairy world of BestComm and FEC, but I figured I'd see if anyone else has any suggestions before I make my descent.  Has anyone seen this behavior before?  Any likely candidates for where the packet is getting stuck?  Any general advice on reference materials?  (I've started on Linux Device Drivers 3rd Ed, BestComm AN2604, and the datasheets.)
>
> Thanks in advance.
>
> Joey Nelson
> joey at joescan.com
>