BestComm/FEC Linux system crash

Sylvain Munaut tnt at 246tnt.com
Wed Apr 16 02:20:18 EST 2008


Hi
> I hereby take the liberty to contact you regarding an issue we
> experience with the MPC5200 BestComm/FEC in our system. I found that
> you are the writer of the drivers for these, so apparently with a lot
> of experience with these devices. I hope you can find the time and
> inspiration to look into our case.
Well, feel free to CC me to bring my attention to it, but such questions
should still go to the list.
It's been a while since I worked on the 5200 and some other people might
have more recent expertise than I do.

Plus, it's actually Domen Puncer who reworked a lot of the network
driver code quite recently ...

> We are running a Linux based system based on an MPC5200
I need more details:
- 5200 or 5200B?
- Which kernel version (which version, where did you get it, any external
  patches applied)?

> This process dies after several minutes due to a FEC RxFifo overflow
> interrupt. This interrupt now causes the FEC to be re-initialized, but
> for some reason the receiver channel still does not work properly,
> causing the RxFifo overflow to occur nearly immediately again, causing
> a subsequent FEC re-init again, again resulting in failing receiver
> channel, causing another RxFifo overflow interrupt etc etc etc......
Huh ... you transmit lots of data ... and it's the RX fifo that overflows ...

> In the FEC driver we stumbled upon the following code:
>
> static irqreturn_t fec_rx_interrupt(int irq, void *dev_id)
> {
>    struct net_device *dev = dev_id;
>    struct fec_priv *priv = (struct fec_priv *)dev->priv;
>
>    for (;;) {
>        struct sk_buff *skb;
>        struct sk_buff *rskb;
>        struct bcom_fec_bd *bd;
>        u32 status;
>
>        if (!bcom_buffer_done(priv->rx_dmatsk))
>            break;
>
> [...snipped...]
> Now what we see is that the statement in the FEC interrupt handler
>
>        if (!bcom_buffer_done(priv->rx_dmatsk))
>            break;
>
> is executed frequently.
>
> Can you explain why this statement is there? 
Well ... that test is inside an infinite loop ( for(;;) ... ), so yes,
hopefully it will 'break' at some point ...
What we do here is try to process as many receive buffers as
possible ... So we loop indefinitely until no more buffers are done
(i.e. filled by BestComm) ...
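
To illustrate the loop structure, here is a minimal stand-alone sketch. It is
not the driver code: the ring layout, RING_SIZE, buffer_done() and drain_rx()
are hypothetical names made up for this example, and only the READY bit value
matches BCOM_BD_READY. It just shows why hitting the 'break' is the normal way
out of the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BD_READY  0x40000000u  /* same bit value as BCOM_BD_READY */
#define RING_SIZE 4u           /* hypothetical ring size for this sketch */

/* Hypothetical, simplified stand-in for the BestComm RX descriptor ring. */
struct rx_ring {
    uint32_t status[RING_SIZE]; /* one status word per descriptor */
    unsigned int outdex;        /* oldest descriptor not yet processed */
};

/* A buffer is "done" once the READY bit has been cleared by the hardware. */
static int buffer_done(const struct rx_ring *r)
{
    return (r->status[r->outdex] & BD_READY) == 0;
}

/*
 * Process every completed buffer and hand it back to the hardware.
 * Returns the number of buffers processed in this call.
 */
static unsigned int drain_rx(struct rx_ring *r, uint32_t max_len)
{
    unsigned int n = 0;

    assert(r != NULL);
    for (;;) {
        if (!buffer_done(r))
            break;  /* normal exit: nothing left to process */

        /* ... a real driver would pass the packet up the stack here ... */

        r->status[r->outdex] = BD_READY | max_len; /* re-arm the buffer */
        r->outdex = (r->outdex + 1) % RING_SIZE;
        n++;
    }
    return n;
}
```

With two filled buffers at the head of the ring, drain_rx() handles both, hits
a descriptor that still has READY set, and breaks. Calling it again right away
processes nothing, which is the "executed frequently" case you observed.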

> During debug, after receiving the first RxFifo overflow interrupt, we
> suspended all further FEC processing and dumped various system status,
> including the BestComm receiver descriptors. Here we found that always
> all but one were initialized to 0x400005f2, but the different one to
> 0x08000040.
These are receive buffer descriptors. If the BCOM_BD_READY bit is
_set_, that means they're _not_ done (i.e. they are ready for
BestComm to fill).
If you check the definition of bcom_buffer_done, you'll see that we
check whether the bit is _cleared_.

So the situation you are describing is essentially:
 - One of the buffers is filled with a received packet (length = 0x40)
 - All the other buffers are ready for BestComm and can each hold at
   most 1522 bytes (0x5f2)

There is nothing 'wrong' about this situation.
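
As a rough way to check this reading of the dump, the two status words can be
decoded by hand. The sketch below is not taken from the driver: decode_bd()
and struct bd_info are hypothetical names, and the 11-bit length mask is an
assumption made for this example (check the BestComm FEC header for the real
field layout). Only the BCOM_BD_READY value comes from this thread. The first
word is read here as 0x400005f2, i.e. BCOM_BD_READY | 0x5f2.

```c
#include <assert.h>
#include <stdint.h>

#define BD_READY    0x40000000u /* BCOM_BD_READY, value as quoted in this thread */
#define BD_LEN_MASK 0x000007ffu /* assumption: the low 11 bits hold the length */

/* Hypothetical decoded view of one RX descriptor status word. */
struct bd_info {
    int ready;        /* 1: owned by BestComm, waiting to be filled */
    unsigned int len; /* buffer size if ready, packet length if done */
};

static struct bd_info decode_bd(uint32_t status)
{
    struct bd_info info;

    /* Sanity check: the READY bit must not overlap the length field. */
    assert((BD_READY & BD_LEN_MASK) == 0);

    info.ready = (status & BD_READY) != 0;
    info.len = (unsigned int)(status & BD_LEN_MASK);
    return info;
}
```

Under these assumptions decode_bd(0x400005f2) gives ready = 1 with a length of
1522, and decode_bd(0x08000040) gives ready = 0 with a length of 64: one
received 64-byte packet waiting to be processed, and the rest of the ring armed
for reception.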

> This all directs us somewhat to the belief that the following is
> occurring:
>
> For some reason the BestComm gets confused during FEC reception,
> causing a descriptor not to be handled properly, which causes its
> status never to be set to 'ready' (BCOM_BD_READY 0x40000000ul).
> Eventually, because all receive traffic has ceased, the RxFifo will
> overflow, causing the described interrupt and following
> re-initialization actions. But the BestComm FEC receiver channel fails
> to re-initialize (or even does not get re-initialized at all) and/or
> the BestComm FEC receiver descriptor table does not get
> re-initialized, causing the 0x08000040 status to remain in there. So
> either BestComm fails to work at all for the FEC Receiver channel
> and/or BestComm eventually stumbles upon the 'incorrect' descriptor,
> causing the FEC receiver to stall again, causing an RxFifo overflow
> again etc etc etc.
Well, given that you misunderstood the meaning of BCOM_BD_READY, this
theory doesn't make much sense, sorry ...

The re-initialization process should work, however ... so there is a bug
there.

> This all seems plausible for what we experience so far, but does not
> get confirmed by any data we can find in datasheets and
> hard-/software descriptions. The FEC receiver has the highest priority
> within BestComm and thus should always get serviced. The thing we can
> not find however is what system impact the PCI DMA by the PLX9056 is
> causing on the BestComm performance.
The only interference I see would be contention on the XLB bus ... Maybe
you can try to play with the XLB priorities and give a higher one to
BestComm or a lower one to the PCI.
Look in the platform setup code; there is some code there setting the XLB
priorities. And refer to the 'XLB arbiter' section of the manual for the
registers to tweak.

What kind of bandwidth are you using for RX/TX on Ethernet and PCI?
Does your PCI card do _very_ long bursts without releasing the bus
(locking the XLB for a long time), or _very_ short bursts causing big
overhead?

You can also try playing with the FEC RX FIFO alarm levels.

> We can imagine that it disrupts 'normal' BestComm performance i.e.
> Ethernet traffic, but then again the overflow interrupt should take
> care of a proper re-initialization of all hard- and software, allowing
> the TCP/IP stack to subsequently handle correct transfer of missing
> packets.
The overflow should still not happen ... that's a pretty serious error
imho.


Sylvain



More information about the Linuxppc-dev mailing list