Bestcomm trouble with NAPI for MPC5200 FEC

Fri Jul 10 17:37:25 EST 2009

Grant Likely wrote:
> On Thu, Jul 9, 2009 at 2:33 PM, Wolfgang Grandegger<wg at grandegger.com> wrote:
>> Hello,
>>
>> I'm currently trying to implement NAPI for the FEC on the MPC5200 to
>> solve the well-known problem that network packet storms can cause
>> interrupt flooding, which may totally block the system.
> 
> Good to hear it!  Thanks for this work.
> 
>> The NAPI
>> implementation, in principle, is straightforward and works
>> well under normal and moderate network load. It just calls disable_irq()
>> in the receive interrupt handler to defer packet processing to the NAPI
>> poll callback, which calls enable_irq() when it has processed all
>> packets. Unfortunately, under heavy network load (packet storm),
>> problems show up:
>>
>> - With DENX 2.4.25, the Bestcomm RX task stops, and stays stopped, after
>>  a while under additional system load. I have no idea how and when
>>  Bestcomm tasks are stopped. In the auto-start mode, the firmware should
>>  poll forever for the next free descriptor block.

Do you know when the Bestcomm firmware stops the task? I have the
impression that it happens when all buffer descriptors are used (RX
queue full).
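For illustration only, the scheme described above (disable_irq() in the RX handler, enable_irq() from the poll routine once all packets are processed) can be modeled in userspace together with a fixed-size RX descriptor ring. Everything here is hypothetical: the ring size, the arrival rate per tick and all names are assumptions, and none of it is FEC, NAPI or Bestcomm code. The model only shows where the ring would fill up if the poll routine falls behind; whether Bestcomm really stops the task at that point is exactly the open question.

```c
#include <stdbool.h>

/*
 * Userspace model of the NAPI scheme discussed in this thread.
 * All names are hypothetical; nothing below is a kernel, FEC driver
 * or Bestcomm API.
 */
#define RX_RING_SIZE 8 /* assumed number of RX buffer descriptors */

struct rx_model {
	int ring_used;       /* descriptors filled by the "DMA task", not yet processed */
	bool irq_enabled;    /* stands in for the disable_irq()/enable_irq() state */
	bool poll_scheduled; /* stands in for a pending NAPI poll */
	int processed;       /* frames handed to the "stack" */
	int dropped;         /* frames lost because no descriptor was free */
	int ring_full;       /* times an arriving frame found the ring full */
};

/* A frame arrives from the wire; the DMA task needs a free descriptor. */
static void frame_arrives(struct rx_model *m)
{
	if (m->ring_used == RX_RING_SIZE) {
		/* If Bestcomm stops the task when all BDs are used, it happens here. */
		m->ring_full++;
		m->dropped++;
		return;
	}
	m->ring_used++;
}

/* RX interrupt: mask the line and defer the work to the poll routine. */
static void rx_irq(struct rx_model *m)
{
	if (!m->irq_enabled || m->ring_used == 0)
		return;
	m->irq_enabled = false;   /* like disable_irq() in the handler */
	m->poll_scheduled = true; /* like scheduling the NAPI poll */
}

/* Poll: process at most 'budget' frames; unmask only once the ring is empty. */
static int rx_poll(struct rx_model *m, int budget)
{
	int done = 0;

	if (!m->poll_scheduled)
		return 0;
	while (done < budget && m->ring_used > 0) {
		m->ring_used--;
		m->processed++;
		done++;
	}
	if (m->ring_used == 0) {
		m->poll_scheduled = false;
		m->irq_enabled = true; /* like enable_irq() after the last packet */
	}
	return done;
}

/* Feed 'per_tick' frames per tick for 'ticks' ticks, polling once per tick. */
static void simulate(struct rx_model *m, int ticks, int per_tick, int budget)
{
	*m = (struct rx_model){ .irq_enabled = true };
	for (int t = 0; t < ticks; t++) {
		for (int i = 0; i < per_tick; i++)
			frame_arrives(m);
		rx_irq(m);
		rx_poll(m, budget);
	}
}
```

With 2 frames per tick and a budget of 4, the ring drains every tick and the IRQ is re-enabled each time. With 6 frames per tick over 10 ticks, the ring is full from the third tick on, 16 of the 60 frames are dropped, and the IRQ stays masked at the end.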

>> - With 2.6.31-rc2, the RFIFO error occurs quickly, which resets the
>>  FEC and Bestcomm (unfortunately, this triggers an oops because the
>>  reset is called from interrupt context, but that's another issue).
>>
>> I realize that working with Bestcomm is a pain :-(, but so far I know
>> little about the Bestcomm limitations and quirks. Any idea what might
>> go wrong, or how to implement NAPI for the FEC properly?
> 
> Yes, I have a few ideas.  First, I suspect that the FEC rx queue isn't
> big enough and I wouldn't be surprised if the RFIFO error is occurring
> because Bestcomm gets overrun.  This scenario needs to be handled more
> gracefully.

The RFIFO error does not show up with DENX 2.4.25 and therefore I'm not
sure if overruns are a real problem.

> Second, I think resetting the PHY should be removed from the reset
> path.  The PHY doesn't need to be reset at all, and removing it would
> avoid the OOPS condition.  Also, the RFIFO error path needs to be
> audited to make sure that all the good received packets are processed
> correctly before resetting the BCOM engine and to make sure that
> skbufs are not getting leaked.

Agreed, the manual says: "When this occurs, software must ensure both
the FIFO Controller and BestComm are soft-reset."
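One possible ordering for a lighter RFIFO error path, following Grant's points and the manual's requirement, is sketched below as a userspace model: first pass every completed frame up, then soft-reset the FIFO controller and the Bestcomm task, keep the buffers still attached to unfinished descriptors for reuse, and leave the PHY alone. The names (struct fec_model, rfifo_error_recover() and so on) and the ring size are hypothetical; malloc()/free() stand in for skbuff allocation, and the live_bufs counter exists only as a leak check.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical userspace model of an RFIFO error recovery path.
 * None of these names exist in the FEC driver; the ordering is the point.
 */
#define NBD 8 /* assumed number of RX buffer descriptors */

struct bd {
	void *buf;     /* stands in for the skbuff attached to the descriptor */
	bool complete; /* DMA finished writing a good frame into buf */
};

struct fec_model {
	struct bd ring[NBD];
	int live_bufs;   /* buffers currently allocated (leak detector) */
	int delivered;   /* good frames passed up before the reset */
	int fifo_resets; /* soft resets of the FIFO controller */
	int dma_resets;  /* soft resets of the Bestcomm RX task */
	int phy_resets;  /* stays 0: the PHY is left alone */
};

static void *buf_alloc(struct fec_model *f)
{
	void *p = malloc(64);

	if (p)
		f->live_bufs++;
	return p;
}

static void buf_free(struct fec_model *f, void *p)
{
	if (p) {
		free(p);
		f->live_bufs--;
	}
}

/* Fill the ring; complete[i] marks descriptors already holding a good frame. */
static void ring_init(struct fec_model *f, const bool *complete)
{
	*f = (struct fec_model){ .live_bufs = 0 };
	for (int i = 0; i < NBD; i++) {
		f->ring[i].buf = buf_alloc(f);
		f->ring[i].complete = complete[i];
	}
}

static void rfifo_error_recover(struct fec_model *f)
{
	/* 1. Hand every completed good frame to the "stack" first. */
	for (int i = 0; i < NBD; i++) {
		struct bd *bd = &f->ring[i];

		if (!bd->complete)
			continue; /* unfinished: keep its buffer for reuse, no leak */
		buf_free(f, bd->buf); /* the "stack" consumes and frees it */
		f->delivered++;
		bd->buf = buf_alloc(f); /* give the slot a fresh buffer */
		bd->complete = false;
	}
	/* 2. Soft-reset the FIFO controller and the Bestcomm task. */
	f->fifo_resets++;
	f->dma_resets++;
	/* 3. No PHY reset: phy_resets is deliberately never touched. */
}

static void ring_free(struct fec_model *f)
{
	for (int i = 0; i < NBD; i++) {
		buf_free(f, f->ring[i].buf);
		f->ring[i].buf = NULL;
	}
}
```

After recovery the ring holds exactly NBD buffers again, so nothing was leaked, and every good frame received before the error was delivered.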

> Essentially, I think that the RFIFO error condition is currently
> handled in far too heavy-handed a manner and it should not be
> expensive to recover from.

Yep, it looks like it.

Wolfgang.