[Cbe-oss-dev] [RFC 4/9] AXON - Ethernet over PCI-E driver

Arnd Bergmann arnd at arndb.de
Fri Dec 29 03:29:16 EST 2006


On Thursday 28 December 2006 16:11, Jean-Christophe Dubois wrote:
> On Thursday 28 December 2006 03:45, Benjamin Herrenschmidt wrote:

> > >
> > > It depends of course on whether it is a bottleneck or not. If this is
> > > used only for slow data like the occasional DNS query or a heartbeat,
> > > it's probably not worth doing something more efficient.
> 
> As of now the driver is doing (from memory) around 5 Gbit/s (600 MB/s) data 
> rate (netperf) using up to 64KB MTUs (I need to confirm this with hard 
> numbers). I guess we could hope for more (on my Opteron platform the max 
> PCI-E transfer rate with the DMAX is a little less than 1.7GB/s from host to 
> cell and 1GB/s from cell to host), and the interrupt-driven protocol might 
> not be the most efficient. However, a quick solution is to increase the MTU 
> size some more (maybe up to 1MB), which should increase the data rate 
> (improving the data/interrupt ratio).

5 Gbit/s is already much more than I would have expected.
I would rather not increase the MTU size further though, because that makes
allocation failures much more likely: the SKB data is normally allocated with
kmalloc, which needs physically contiguous memory. Going to larger sizes
probably means you have to use vmalloc, with scatter-gather DMA.

> Here the driver is in pull mode, which is somewhat more comfortable as the 
> receiving side is programming the DMA destination of the data in its own 
> memory.
> 
> We could speak about changing the driver to push mode, but we will still need 
> some messaging to know where the remote buffers/SKBs are located (what PLB 
> address to program the DMA with). Keep in mind that we have a single DMA 
> engine reading from one Linux's memory and writing to the other Linux's 
> memory. It is not the traditional Ethernet device model. The side programming 
> the DMA needs to know both source and destination PLB addresses. And this for 
> each SKB ... And SKBs have to be reallocated all the time as they are passed 
> to the Ethernet stack ...
>
> Also, as stated above, we are working with potentially big MTUs (bigger than 
> traditional Ethernet even with jumbo frames). Pre-allocating max-size SKBs 
> for the whole ring could be memory-consuming if in the end you only transfer 
> small packets.

Push vs. pull is an interesting question. In a push model, you get the
advantage that the packet is already in transfer by the time you get the
interrupt, and you have less work to do.

One model that you could use is to have a statically allocated buffer
for addresses of incoming data on each side. In pseudocode:

struct descriptor_array {
	u32 recv_index;		/* next slot the receiver will read */
	int recv_irq;		/* receiver wants a DATA_SENT mailbox message */
	int recv_ready;		/* receiver can accept more data */
	u32 send_index;		/* next slot the sender will write */
	int send_irq;		/* sender wants a DATA_RECEIVED mailbox message */
	int send_ready;		/* sender is blocked with data pending (rts) */
	struct {
		dma_addr64_t addr;
		u64 len;
	} data[NR_SKBS];
};

int send(struct mydev *dev, void *buf, size_t len)
{
	struct {
		u32 recv_index;
		int recv_irq;
		int recv_ready;
	} receiver;
	struct {
		u32 send_index;
		int send_irq;
		int send_ready;
	} sender;
	struct {
		dma_addr64_t addr;
		u64 len;
	} data;
	dma_addr64_t remote_data;
	u32 index;

	/* read status of remote queue */
	dma_read(&receiver, dev->remote_descr_array, sizeof(receiver));

	/* remote side won't accept new data yet, wait for interrupt */
	if (!receiver.recv_ready ||
	    (dev->send_index - receiver.recv_index) == NR_SKBS) {
		sender.send_index = dev->send_index;
		sender.send_irq = 1;
		sender.send_ready = 1;

		/* set rts on remote side */
		dma_write(dev->remote_descr_array + sizeof(receiver),
			&sender, sizeof(sender));
		netif_stop_queue(&dev->netdev);
		return BUSY;
	}

	/* write the pointer to our data into the next free slot */
	index = dev->send_index % NR_SKBS;
	data.addr = to_dma_addr(buf);
	data.len = len;
	remote_data = dev->remote_descr_array +
		offsetof(struct descriptor_array, data[index]);
	dma_write(remote_data, &data, sizeof(data));

	/* publish the new index, no interrupt needed */
	dev->send_index++;
	sender.send_index = dev->send_index;
	sender.send_irq = 0;
	sender.send_ready = 0;
	dma_write(dev->remote_descr_array + sizeof(receiver),
		  &sender, sizeof(sender));

	if (receiver.recv_irq)
		mailbox_write(dev->mailbox, DATA_SENT);

	return OK;
}

int receive(struct mydev *dev, int *budget)
{
	struct descriptor_array *desc = dev->my_desc;
	struct sk_buff *skb;

	int index;

	/* no data available, so back out */
	if (desc->recv_index == desc->send_index) {
		desc->recv_ready = 1;
		desc->recv_irq = 1;
		return 0;
	}

	while (desc->recv_index != desc->send_index) {
		index = desc->recv_index % NR_SKBS;
		skb = dev_alloc_skb(desc->data[index].len);
		dma_read(skb->data, desc->data[index].addr,
			 desc->data[index].len);
		skb_put(skb, desc->data[index].len);
		netif_receive_skb(skb);
		desc->recv_index++;

		if ((--(*budget)) < 0) {
			/* make sender stop for now */
			desc->recv_ready = 0;
			goto out;
		}
	}
	desc->recv_ready = 1;
out:
	if (desc->send_irq)
		mailbox_write(dev->mailbox, DATA_RECEIVED);

	return 1;
}

int cleanup_tx(struct mydev *dev)
{
	/* enable interrupts when all packets have been received */
}

int poll(struct net_device *netdev, int *budget)
{
	struct mydev *dev = netdev_priv(netdev);
	cleanup_tx(dev);
	return receive(dev, budget);
}
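
For what it's worth, cleanup_tx() could be filled in along these lines. Still
pseudocode: dev->tx_ring and dev->clean_index are names I am making up for a
local ring where the sender keeps the skbs it has announced, and dma_read()
is the same helper as above:

int cleanup_tx(struct mydev *dev)
{
	struct {
		u32 recv_index;
		int recv_irq;
		int recv_ready;
	} receiver;

	/* see how far the receiver has pulled our packets by now */
	dma_read(&receiver, dev->remote_descr_array, sizeof(receiver));

	/* everything up to recv_index has been DMAed out of our skbs */
	while (dev->clean_index != receiver.recv_index) {
		dev_kfree_skb(dev->tx_ring[dev->clean_index % NR_SKBS]);
		dev->clean_index++;
	}

	/* there is room again, let the stack resume sending */
	if (netif_queue_stopped(&dev->netdev) &&
	    (dev->send_index - receiver.recv_index) < NR_SKBS)
		netif_wake_queue(&dev->netdev);

	return 0;
}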

> > > On the other hand, you have the advantage that you can tell exactly
> > > what state the other side is in, so you can implement a much better
> > > flow control than you could over an ethernet wire. Since you can tell
> > > whether the receiver is waiting for packets or not, the sender can
> > > block in user space when the receiver is too busy to accept more
> > > data.
> > >
> > > Also, you can have a huge virtual transfer buffer when you DMA directly
> > > between the sender and the receiver SKB queue.
> 
> Yes, we can increase the MTU some more to get a better data rate. I had the 
> feeling that SKBs of less than 64KB would be fine (optimal?) as there is a 
> trend to set the default page size to 64KB for Cell (so an SKB should always 
> fit in one single page, at least that is the idea).

A default of 64k sounds fine to me, but if you can get the same performance
using smaller skbs and interrupt mitigation, that is probably less wasteful even
with 64k pages, because kmalloc can allocate in sub-page units.

My comment was headed in the same direction as Ben's. Instead of increasing the
size of one SKB, I think the better solution would be to make sure you can
read out all SKBs of the full queue from a single interrupt.
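
That is basically NAPI: take the mailbox interrupt once, mask it, and let
poll() drain the whole descriptor ring before turning it back on. A sketch
with the existing netdev poll interface; mailbox_irq_disable() is a
placeholder for whatever the AXON mailbox really offers:

static irqreturn_t mydev_mailbox_interrupt(int irq, void *data)
{
	struct net_device *netdev = data;
	struct mydev *dev = netdev_priv(netdev);

	/* no further mailbox interrupts until the ring is drained;
	 * poll() calls netif_rx_complete() and unmasks the mailbox
	 * again only once receive() finds the ring empty */
	mailbox_irq_disable(dev->mailbox);	/* placeholder helper */
	netif_rx_schedule(netdev);

	return IRQ_HANDLED;
}

With dev->weight set to something like NR_SKBS, one interrupt per burst is
enough no matter how many skbs arrived in the meantime.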

	Arnd <><


