[Cbe-oss-dev] [patch 07/16] Ethernet over PCI-E driver

Arnd Bergmann arnd at arndb.de
Fri Jun 1 00:20:24 EST 2007


On Thursday 24 May 2007, jdubois at mc.com wrote:
> This is the the Ethernet over PCI-E driver. With it you get some
> reasonable performance.

Ok, I already reviewed an IBM internal implementation of an ethernet
driver, but as I heard you are now working together with the people
doing that one. I'd say it's fair to take only one of the two
upstream, after everybody agrees on which one to use.

My overall impression is that this one is more mature and will
give much better performance, but the other one is somewhat
cleaner and simpler. Of course those points are all related ;-)

> ~ $ netperf -H 10.0.0.1 -f M -t UDP_STREAM
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 (10.0.0.1) port 0 AF_INET
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   MBytes/sec
> 
> 122880   65507   10.00       62886      0     392.81
> 129024           10.00       62871            392.72
> 
> ~ $ netperf -H 10.0.0.1 -f M -t TCP_STREAM
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 (10.0.0.1) port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    MBytes/sec
> 
>  87380  16384  16384    10.00     482.46
> ~ $

Impressive!

> +#define AXON_NIC_TX_QUEUE_LEN 	 255
> +
> +#ifdef __powerpc__
> +#define AXON_NIC_MAC_ADDR "\2AX0N0"
> +#else
> +#define AXON_NIC_MAC_ADDR "\2AX1N0"
> +#endif

Does a static mac address work if you have multiple CAB boards
in the system, or a blade with multiple axons?

> +typedef struct axon_nic_stat_t {
> +	u64 tx_max_time;
> +	u64 tx_min_time;
> +	u32 rx_cnt;
> +} axon_nic_stat_t;

no typedefs please.

Note that this data structure has different alignment constraints
on i386 and powerpc/x86_64, because of the legacy i386 ABI,
so you run into trouble if the data structure is shared with
user space or over the bus. (I haven't looked if it is used that
way, but the same problem may appear in other places).

> +#ifdef CONFIG_AXON_NAPI
> +	
> +	struct tasklet_struct reschedule_task;
> +
> +	atomic_t loop;
> +#endif

I'd suggest that for the version that gets in finally, you only support
NAPI, there is not much point in a compile-time option here, especially
if it needs to be the same on both sides.

> +static struct list_head axon_nic_list;

Should not be needed if you use the driver core correctly.

> +	if ((p_nic_skb = kzalloc(sizeof(axon_nic_skb_t), GFP_ATOMIC)) == NULL) {
> +		dbg_err("Failed to allocate SKB wrapper\n");
> +
> +		ret = -ENOMEM;
> +
> +		goto out;
> +	}
> +
> +	if ((skb = dev_alloc_skb(skb_size + NET_IP_ALIGN)) == NULL) {
> +		dbg_err("Failed to allocate SKB\n");
> +
> +		ret = -ENOMEM;
> +
> +		goto out;
> +	}

Can you turn this into a single allocation? That would reduce the
chance of failing with out-of-memory

> +	p_nic_skb->bus_addr =
> +	    dma_map_single(axon_nic->axon->get_device(axon_nic->axon),
> +			   skb_put(skb, skb_size), skb_size, DMA_FROM_DEVICE);
> +
> +	skb_dst_plb_addr =
> +	    axon_addr_xltr_to_plb(axon_nic->xltr, p_nic_skb->bus_addr);

I still don't understand how the translation works here. Normally,
dma_map_single should return the address in a way suitable for the
device you pass, so there is absolutely no need to do another
translation. What's different here?

> +#ifdef CONFIG_AXON_NAPI
> +static void axon_nic_reschedule_task_entry(unsigned long data)
> +{
> +	struct net_device *dev = (struct net_device *)data;
> +	axon_nic_t *axon_nic = netdev_priv(dev);
> +	dbg_log("in reschedule tasklet\n");
> +	if (netif_rx_schedule_prep(dev)) {
> +
> +		axon_sms_disable_hw_irqs(axon_nic->sms);
> +		__netif_rx_schedule(dev);
> +	} else {
> +		dbg_log("Hum is it already polling?\n");
> +	}
> +}

NAPI is all about doing stuff in the net rx softirq, so you 
should not need another tasklet in theory. What did you
introduce this for?

> +static irqreturn_t axon_nic_mbox_interrupt_handler(void
> +						   *data, int irq, void
> +						   *dev_id)
> +{
> +	dbg_log("in Eth interrupt\n");
> +	axon_nic_reschedule_task_entry((unsigned long)data);
> +	return IRQ_HANDLED;
> +}

here you shedule the softirq directly, which looks good.

> +#define AXON_NUMBER_OF_POOL 100
> +
> +static int axon_nic_poll(struct net_device *dev, int *budget)
> +{
> +	axon_nic_t *axon_nic = netdev_priv(dev);
> +	int to_be_received = dev->quota;
> +	int ret;
> +	dbg_log("in Eth poll\n");
> +
> +	atomic_set(&axon_nic->loop, AXON_NUMBER_OF_POOL);
> +
> +	
> +	while (((ret = axon_sms_process_next_msg(axon_nic->sms)) == 0)
> +	       || (atomic_dec_return(&axon_nic->loop))) {
> +
> +		
> +		if (ret == 0)
> +			atomic_set(&axon_nic->loop, AXON_NUMBER_OF_POOL);
> +		
> +		if (dev->quota <= 0) {
> +			goto not_done;
> +		}
> +	}
> +
> +	*budget -= (to_be_received - dev->quota);
> +	
> +	netif_rx_complete(dev);
> +	axon_sms_update_hw(axon_nic->sms);
> +	axon_sms_enable_hw_irqs(axon_nic->sms);
> +	
> +	if (axon_sms_has_message(axon_nic->sms)) {
> +		tasklet_schedule(&axon_nic->reschedule_task);
> +		dbg_log("There were MBX remaining\n");
> +	}

but this looks puzzling. There is still a message incoming, so you 
obviously have more work to do. I'd expect that you can simply
return 1 instead of scheduling ->poll to be called again
using the tasklet.

> +	dbg_log("all done\n");
> +	return 0;
> +
> +not_done:
> +	
> +	*budget -= (to_be_received - dev->quota);
> +	dbg_log("ran out of quota\n");
> +	return 1;
> +}
> +#endif

The ->poll function should also take care of cleaing up the tx queue, which
you don't do here at all. Any reason for this?

> +static int axon_nic_read_proc(char *page, char **start, off_t offset,
> +			      int count, int *eof, void *data)
> +{
> +	axon_nic_t *axon_nic = (axon_nic_t *) data;
> +
> +	int len = 0;
> +
> +	len += sprintf(page + len, "NIC %d:\n", axon_nic->id);
> +	len +=
> +	    sprintf(page + len, "\ttx_max_time = %llu\n",
> +		    (long long unsigned int)axon_nic->axon_stats.tx_max_time);
> +	len +=
> +	    sprintf(page + len, "\ttx_min_time = %llu\n",
> +		    (long long unsigned int)axon_nic->axon_stats.tx_min_time);
> +	len +=
> +	    sprintf(page + len, "\tpending tx  = %d\n",
> +		    AXON_NIC_TX_QUEUE_LEN - atomic_read(&axon_nic->tx_cnt));
> +	len +=
> +	    sprintf(page + len, "\tpending rx  = %d\n",
> +		    axon_nic->axon_stats.rx_cnt);
> +	*eof = 1;
> +	return len;
> +}

as mentioned elsewhere, no new /proc files. This could be either a debugfs
file if it's for debugging, or be put into the netdevice stats that you read
with ethtool.

	Arnd <><



More information about the cbe-oss-dev mailing list