[Cbe-oss-dev] [patch 07/16] Ethernet over PCI-E driver

Jean-Christophe Dubois jdubois at mc.com
Sun Jun 3 03:45:23 EST 2007


On Thursday 31 May 2007 16:20:24 Arnd Bergmann wrote:
> On Thursday 24 May 2007, jdubois at mc.com wrote:
> > This is the Ethernet over PCI-E driver. With it you get some
> > reasonable performance.
>
> Ok, I already reviewed an IBM internal implementation of an ethernet
> driver, but as I heard, you are now working together with the people
> doing that one.

Do I? Hmm ... I will talk with myself about this ... Who at IBM should I tell 
myself to talk to?

> I'd say it's fair to take only one of the two 
> upstream, after everybody agrees on which one to use.

sure ...

> My overall impression is that this one is more mature and will
> give much better performance, but the other one is somewhat
> cleaner and simpler. Of course those points are all related ;-)

This one doesn't seem that complex to me :-). But I guess working through all 
the review comments may help on this ... ;-)

> > ~ $ netperf -H 10.0.0.1 -f M -t UDP_STREAM
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 (10.0.0.1) port 0 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   MBytes/sec
> >
> > 122880   65507   10.00       62886      0     392.81
> > 129024           10.00       62871            392.72
> >
> > ~ $ netperf -H 10.0.0.1 -f M -t TCP_STREAM
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 (10.0.0.1) port 0 AF_INET
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    MBytes/sec
> >
> >  87380  16384  16384    10.00     482.46
> > ~ $
>
> Impressive!

This is with an HP9300 as the host (which is getting old) ... A newer PCI-E 
root complex might give better performance.

> > +#define AXON_NIC_TX_QUEUE_LEN 	 255
> > +
> > +#ifdef __powerpc__
> > +#define AXON_NIC_MAC_ADDR "\2AX0N0"
> > +#else
> > +#define AXON_NIC_MAC_ADDR "\2AX1N0"
> > +#endif
>
> Does a static mac address work if you have multiple CAB boards
> in the system, or a blade with multiple axons?

I will certainly have to rework this part when I get a system with 2 CABs (or 
better, a PCI-E switched fabric). As for the malta blade, it is not concerned 
by this Ethernet implementation so far because it has no remote Axon to 
speak to. The Tri-Blade system might change this though ... The other drivers 
(DMA, buffer) are working on malta.
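
To illustrate the direction I have in mind for that rework (a sketch only; 
deriving the last octet from axon_nic->id is an assumption, not what the 
patch does today):

static void axon_nic_set_mac(struct net_device *dev, axon_nic_t *axon_nic)
{
	/* keep the locally administered prefix but fold a per-board
	 * id into the last octet, so two CABs (or two Axons on one
	 * blade) end up with distinct MAC addresses */
	memcpy(dev->dev_addr, AXON_NIC_MAC_ADDR, ETH_ALEN);
	dev->dev_addr[ETH_ALEN - 1] = axon_nic->id;
}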

> > +typedef struct axon_nic_stat_t {
> > +	u64 tx_max_time;
> > +	u64 tx_min_time;
> > +	u32 rx_cnt;
> > +} axon_nic_stat_t;
>
> no typedefs please.

OK, I'll remove it.

> Note that this data structure has different alignment constraints
> on i386 and powerpc/x86_64, because of the legacy i386 ABI,
> so you run into trouble if the data structure is shared with
> user space or over the bus. (I haven't looked if it is used that
> way, but the same problem may appear in other places).

It is not shared ... There are no shared structures between the 2 (or N) ends 
besides the SKB data parts (which are not "structures" but just BLOBs).
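
For the record, here is what your alignment point would mean if the structure 
ever were shared over the bus (a sketch; the explicit pad field is my 
addition):

struct axon_nic_stat {
	u64 tx_max_time;
	u64 tx_min_time;
	u32 rx_cnt;
	u32 pad;	/* without explicit tail padding, sizeof() is 20
			 * on i386 (u64 is only 4-byte aligned there)
			 * but 24 on powerpc/x86_64 */
};

This version also drops the typedef, as requested above.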

> > +#ifdef CONFIG_AXON_NAPI
> > +
> > +	struct tasklet_struct reschedule_task;
> > +
> > +	atomic_t loop;
> > +#endif
>
> I'd suggest that for the version that gets in finally, you only support
> NAPI, there is not much point in a compile-time option here, especially
> if it needs to be the same on both sides.

I know everybody is very much in favor of NAPI (and it makes sense for normal 
devices) but honestly I am not sure it makes sense in this case. It just adds 
a whole lot of complexity for no real performance advantage, in my opinion 
(at least with the present design). The main problem is that the Ethernet 
driver is sharing its hardware resources with other drivers. And everything 
is done through hardware ... As I said before, there are no shared structures 
at all.

> > +static struct list_head axon_nic_list;
>
> Should not be needed if you use the driver core correctly.

Remember that I have no single hardware device to tie this driver to ... This 
driver is implemented on top of a set of services, not on top of a particular 
piece of hardware ... In addition, the hardware/services used are shared with 
other drivers, so I can't get "exclusive" usage of them ...
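
So the list is plain driver-private bookkeeping (a sketch; the exact member 
names are assumptions):

static LIST_HEAD(axon_nic_list);

/* there is no struct device of its own to hang each instance off,
 * so NIC instances are tracked manually */
static void axon_nic_add(axon_nic_t *axon_nic)
{
	list_add_tail(&axon_nic->list, &axon_nic_list);
}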

> > +	if ((p_nic_skb = kzalloc(sizeof(axon_nic_skb_t), GFP_ATOMIC)) == NULL) {
> > +		dbg_err("Failed to allocate SKB wrapper\n");
> > +
> > +		ret = -ENOMEM;
> > +
> > +		goto out;
> > +	}
> > +
> > +	if ((skb = dev_alloc_skb(skb_size + NET_IP_ALIGN)) == NULL) {
> > +		dbg_err("Failed to allocate SKB\n");
> > +
> > +		ret = -ENOMEM;
> > +
> > +		goto out;
> > +	}
>
> Can you turn this into a single allocation? That would reduce the
> chance of failing with out-of-memory

I could work on this ... It would mean allocating a larger SKB and putting my 
private structure somewhere in the data part. That seems a bit hacky to 
me ... Is it worth the benefit?

Now maybe my question is stupid, but how am I more likely to fail? By 
allocating a small piece of memory plus a large SKB, or by allocating a 
single even larger SKB? Remember I may already be allocating SKBs of up to 
64KB ...
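
One alternative that would keep a single allocation without touching the data 
part (a sketch; it assumes axon_nic_skb_t fits in the 48-byte skb->cb control 
buffer, which the driver owns while it holds the skb):

#define AXON_NIC_SKB(skb) ((axon_nic_skb_t *)(skb)->cb)

	if ((skb = dev_alloc_skb(skb_size + NET_IP_ALIGN)) == NULL) {
		ret = -ENOMEM;
		goto out;
	}

	/* the wrapper lives in skb->cb instead of a separate kzalloc() */
	p_nic_skb = AXON_NIC_SKB(skb);
	memset(p_nic_skb, 0, sizeof(*p_nic_skb));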
>
> > +	p_nic_skb->bus_addr =
> > +	    dma_map_single(axon_nic->axon->get_device(axon_nic->axon),
> > +			   skb_put(skb, skb_size), skb_size, DMA_FROM_DEVICE);
> > +
> > +	skb_dst_plb_addr =
> > +	    axon_addr_xltr_to_plb(axon_nic->xltr, p_nic_skb->bus_addr);
>
> I still don't understand how the translation works here. Normally,
> dma_map_single should return the address in a way suitable for the
> device you pass, so there is absolutely no need to do another
> translation. What's different here?

First, remember this works from both the PCI and the BEI side. On the PCI end 
you do get a PCI address, but it is not suitable for the DMAX, which uses PLB 
addresses (all of our drivers are abstracted on top of the PLB). And it turns 
out that on the BEI side, dma_map_single() on the Axon node device in fact 
returns a processor/Cell address (not a PLB address).
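
So the mapping really is a two-step affair (a sketch using the helpers from 
the patch; the u64 type for the PLB address is an assumption):

	/* step 1: bus address as seen by the local bus: a PCI address
	 * on the host side, a processor/Cell address on the BEI side */
	dma_addr_t bus_addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

	/* step 2: rebase it into the PLB address space that the DMAX
	 * engine uses, whichever side we are on */
	u64 plb_addr = axon_addr_xltr_to_plb(axon_nic->xltr, bus_addr);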

> > +#ifdef CONFIG_AXON_NAPI
> > +static void axon_nic_reschedule_task_entry(unsigned long data)
> > +{
> > +	struct net_device *dev = (struct net_device *)data;
> > +	axon_nic_t *axon_nic = netdev_priv(dev);
> > +	dbg_log("in reschedule tasklet\n");
> > +	if (netif_rx_schedule_prep(dev)) {
> > +
> > +		axon_sms_disable_hw_irqs(axon_nic->sms);
> > +		__netif_rx_schedule(dev);
> > +	} else {
> > +		dbg_log("Hum is it already polling?\n");
> > +	}
> > +}
>
> NAPI is all about doing stuff in the net rx softirq, so you
> should not need another tasklet in theory. What did you
> introduce this for?
>
> > +static irqreturn_t axon_nic_mbox_interrupt_handler(void
> > +						   *data, int irq, void
> > +						   *dev_id)
> > +{
> > +	dbg_log("in Eth interrupt\n");
> > +	axon_nic_reschedule_task_entry((unsigned long)data);
> > +	return IRQ_HANDLED;
> > +}
>
> here you schedule the softirq directly, which looks good.
>
> > +#define AXON_NUMBER_OF_POOL 100
> > +
> > +static int axon_nic_poll(struct net_device *dev, int *budget)
> > +{
> > +	axon_nic_t *axon_nic = netdev_priv(dev);
> > +	int to_be_received = dev->quota;
> > +	int ret;
> > +	dbg_log("in Eth poll\n");
> > +
> > +	atomic_set(&axon_nic->loop, AXON_NUMBER_OF_POOL);
> > +
> > +
> > +	while (((ret = axon_sms_process_next_msg(axon_nic->sms)) == 0)
> > +	       || (atomic_dec_return(&axon_nic->loop))) {
> > +
> > +
> > +		if (ret == 0)
> > +			atomic_set(&axon_nic->loop, AXON_NUMBER_OF_POOL);
> > +
> > +		if (dev->quota <= 0) {
> > +			goto not_done;
> > +		}
> > +	}
> > +
> > +	*budget -= (to_be_received - dev->quota);
> > +
> > +	netif_rx_complete(dev);
> > +	axon_sms_update_hw(axon_nic->sms);
> > +	axon_sms_enable_hw_irqs(axon_nic->sms);
> > +
> > +	if (axon_sms_has_message(axon_nic->sms)) {
> > +		tasklet_schedule(&axon_nic->reschedule_task);
> > +		dbg_log("There were MBX remaining\n");
> > +	}
>
> but this looks puzzling. There is still a message incoming, so you
> obviously have more work to do. I'd expect that you can simply
> return 1 instead of scheduling ->poll to be called again
> using the tasklet.

There was no more work to do, so the interrupt has been re-enabled. But 
before leaving, a last check shows there are more SMS waiting. These came in 
while the interrupts were in the process of being re-enabled. I can't be sure 
whether this happened before or after, so I have to start the whole thing 
again and reschedule the tasklet that will disable the interrupts and so on. 
I don't really care whether, in the end, the tasklet gets rescheduled from 
here or from the interrupt handler.
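
For comparison, the usual old-style NAPI idiom folds that recheck into ->poll 
itself instead of going through a tasklet (a sketch with the helpers from the 
quoted code; I am not claiming it fits our constraints):

	netif_rx_complete(dev);
	axon_sms_enable_hw_irqs(axon_nic->sms);

	/* close the race: a message may have slipped in while the
	 * interrupts were still masked */
	if (axon_sms_has_message(axon_nic->sms) &&
	    netif_rx_schedule_prep(dev)) {
		axon_sms_disable_hw_irqs(axon_nic->sms);
		__netif_rx_schedule(dev);
	}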

> > +	dbg_log("all done\n");
> > +	return 0;
> > +
> > +not_done:
> > +
> > +	*budget -= (to_be_received - dev->quota);
> > +	dbg_log("ran out of quota\n");
> > +	return 1;
> > +}
> > +#endif
>
> The ->poll function should also take care of cleaning up the tx queue, which
> you don't do here at all. Any reason for this?

The TX queue is cleaned up by messaging: when the remote side tells me it is 
done with a particular message, I free the corresponding entry.
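
Roughly like this, on reception of the "done" message from the remote (a 
sketch; the handler name is made up, but tx_cnt matches the counter used in 
the /proc code below):

static void axon_nic_tx_done(axon_nic_t *axon_nic, struct sk_buff *skb)
{
	/* the remote is done with this message: release the skb and
	 * give the slot back to the TX queue accounting */
	dev_kfree_skb_irq(skb);
	atomic_inc(&axon_nic->tx_cnt);
}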

> > +static int axon_nic_read_proc(char *page, char **start, off_t offset,
> > +			      int count, int *eof, void *data)
> > +{
> > +	axon_nic_t *axon_nic = (axon_nic_t *) data;
> > +
> > +	int len = 0;
> > +
> > +	len += sprintf(page + len, "NIC %d:\n", axon_nic->id);
> > +	len +=
> > +	    sprintf(page + len, "\ttx_max_time = %llu\n",
> > +		    (long long unsigned int)axon_nic->axon_stats.tx_max_time);
> > +	len +=
> > +	    sprintf(page + len, "\ttx_min_time = %llu\n",
> > +		    (long long unsigned int)axon_nic->axon_stats.tx_min_time);
> > +	len +=
> > +	    sprintf(page + len, "\tpending tx  = %d\n",
> > +		    AXON_NIC_TX_QUEUE_LEN - atomic_read(&axon_nic->tx_cnt));
> > +	len +=
> > +	    sprintf(page + len, "\tpending rx  = %d\n",
> > +		    axon_nic->axon_stats.rx_cnt);
> > +	*eof = 1;
> > +	return len;
> > +}
>
> as mentioned elsewhere, no new /proc files. This could be either a debugfs
> file if it's for debugging, or be put into the netdevice stats that you
> read with ethtool.

Yes, I'll work on debugfs for all the drivers.
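
A first cut could look like this (a sketch; it assumes a dbg_dir dentry gets 
added to axon_nic_t, and it only exports the u32 counter, since the u64 
timings would need a custom file):

#include <linux/debugfs.h>

static int axon_nic_debugfs_init(axon_nic_t *axon_nic)
{
	char name[16];

	snprintf(name, sizeof(name), "axon_nic%d", axon_nic->id);
	axon_nic->dbg_dir = debugfs_create_dir(name, NULL);
	if (!axon_nic->dbg_dir)
		return -ENODEV;

	debugfs_create_u32("rx_cnt", 0444, axon_nic->dbg_dir,
			   &axon_nic->axon_stats.rx_cnt);
	return 0;
}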

> 	Arnd <><




