[Cbe-oss-dev] [patch 07/16] Ethernet over PCI-E driver

Tue Jun 5 22:55:28 EST 2007

On Tuesday 05 June 2007, Jean-Christophe Dubois wrote:
> On Monday 04 June 2007 18:38:57 Arnd Bergmann wrote:
> > One of the main advantages of NAPI is the end-to-end flow control.
> > If no application actually processes any incoming data packets because
> > the CPU is overloaded, the network driver can simply disable interrupts
> > and stop receiving frames from the queue in order to avoid dropping
> > frames or spending additional cycles on that.
> 
> The end to end flow control will be harder to achieve when/if we are part of a 
> PCI-E switched fabric. If you consider such fabric, a CAB "Ethernet device" 
> can be targeted by several (from one to several dozens) remote devices. In 
> this case we are no more in a point to point architecture. So even if this is 
> mostly the case today in an host/CAB architecture, I'd like to keep things 
> open for later.

I'm not sure I understand what would change in a point-to-multipoint setup,
you still have the problem that you normally want to stop receiving new
frames while the kernel has not completely processed the previous ones.

> It seems to me that flow contolling the sending side is good  
> enough (which is done in the actual driver).

It's the important part, yes. If the sender can always tell whether
the receive buffer if full, you're mostly there.

> > Right, but you know what devices are potentially there and can register
> > a network device for each of them. The actual acquisition of shared
> > resources can still happen at open(), i.e. ifup, time.
> 
> It seems to me you are proposing to move the problem to the fake Ethernet 
> device creation but the exact same issue will have to be solved there (you 
> can't attach exclusively to any hardware resource).

> > That sounds like we don't treat the dma-ranges property of the device
> > node incorrectly, or that the property itself is broken.
> >
> > Can you show a hexdump of all dma-ranges properties below the axon node?
> 
> There is no "dma-ranges" property at all in our device tree (as provided by 
> IBM SLOF firmware).

Ok, that sounds like another firmware bug that should be fixed.

> > > I can't be sure if this is before or after, so I have to start again the
> > > all thing and reschedule the tasklet that will disable the interrupts an
> > > so on. I don't really care if, in the end, the tasklet will be reschedule
> > > from here or from the interrupt handler.
> >
> > Yes, but returning '1' has the same effect as scheduling the tasklet, just
> > without the overhead.
> 
> Except the hardware interrupt would not be re-disabled. I guess I could do it 
> here and then return 1. That would avoid the reschedule().

yes, that sounds good. In principle, the poll() function should return 0
if interrupts are on, and 1 otherwise (to be called again next time).
AFAICS, it's always enough to go through a logic like this:

	if (more_work_to_to(dev))
		return 1;
	enable_interrupt(dev, RECEIVE_IRQ);
	return 0;

> > So are you saying that you send an interrupt after every single frame that
> > is transferred? That sounds like you are creating much more interrupt
> > traffic than necessary.
> 
> After a frame is consumed the receiving side is sending a MBX (interrupt + 
> some info) to the sending side telling it exactly which SKB is done and can 
> be released.

Right, this sounds like it would significantly increase your interrupt load.
For every frame, you theoretically send one interrupt to each side, one
when the frame has been sent, the other when it was received. With your
existing NAPI poll() function, you can get rid of most of the first kind,
but not the second one.

I would suggest that you change the logic to go through the same poll()
function for both ways and only enable interrupts if you need to wait for
one of the events, like

void irq_handler(struct my_dev *dev)
{
	enable_interrupts(dev, 0); /* disabled both */
	netif_rx_schedule(&dev->netdev);
}

int poll(struct net_device *netdev, int *budget)
{
	int has_cleaned_tx, more_rx_pending, irq_mask = 0;

	has_cleaned_tx = try_to_clean_up(netdev);
	/* tell remote side to interrupt us if more descriptors
           need to be cleaned */
	if (has_cleaned_tx)
		irq_mask |= ENABLE_TX_IRQ;

	more_rx_pending = try_to_receive(netdev, budget, min(*budget, netdev->quota));
	/* we see no more incoming data, so need an interrupt when new data becomes
           available */
	if (!more_rx_pending) {
		/* no more incoming data */
		netif_rx_complete(netdev);
		irq_mask |= ENABLE_RX_IRQ;
	}

	enable_interrupts(dev, irq_mask);

	/* make sure to get called again if one side has more work to do */
	return irq_mask != (ENABLE_TX_IRQ | ENABLE_RX_IRQ);
}

	Arnd <><