[Cbe-oss-dev] [RFC 4/9] AXON - Ethernet over PCI-E driver

Jean-Christophe Dubois jdubois at mc.com
Sat Dec 23 02:17:58 EST 2006


On Thursday 21 December 2006 22:34, Arnd Bergmann wrote:
> On Wednesday 20 December 2006 12:13, jdubois at mc.com wrote:
> > From: Jean-Christophe DUBOIS <jdubois at mc.com>
> > This is an ethernet device emulation between the host and the Cell
> > attached to the Axon. It uses the MBX as a messagery and the DMA to move
> > SKBs.
>
> Since this is a network driver, it should probably go to drivers/net/,
> and you should take netdev at vger.kernel.org on Cc: when submitting it
> for review.

OK, I'll do it next time. However, you realize that this driver needs the
underlying infrastructure/services provided by the low-level Axon driver. I
mean, this driver doesn't work directly on top of hardware but on top of
another driver that abstracts the hardware to equalize it for the host and the
Cell.

> > +
> > +#if defined(AXON_DEBUG_NIC)
> > +#define dbg_nic_log printk(KERN_DEBUG "AXON_NIC:%s=>", __FUNCTION__);printk
> > +#else
> > +#define dbg_nic_log if (0) printk
> > +#endif
> > +
> > +#define dbg_nic_err printk(KERN_EMERG "AXON_NIC:%s=>", __FUNCTION__);printk
> > +#define dbg_nic_inf printk(KERN_INFO "AXON_NIC:%s=>", __FUNCTION__);printk
>
> dev_dbg/dev_info/dev_err

OK, I'll look into it.
>
> > +#ifdef __powerpc__
> > +#define AXON_NIC_MAC_ADDR "\2AX0N0"
> > +#else
> > +#define AXON_NIC_MAC_ADDR "\2AX1N0"
> > +#endif
>
> You can ask the network layer to generate a random valid mac address
> for you, instead of hardcoding these.

I'll think about this. So far I wanted to stay in control of MAC addresses so
that I can control PCI-E routing once we put several CAB boards (for
example) in a PCI-E switched fabric. In that case I need to be able to
establish a relationship between the MAC address and the Axon in the switched
fabric.
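
To illustrate the kind of MAC/Axon relationship meant here, below is a minimal userland sketch. It reuses the "\2AX<side>N0" layout of the macros quoted above (first byte 0x02, which marks a locally administered unicast address). Using the last byte as a board index is an assumption made only for this example, and axon_nic_make_mac() is a hypothetical helper, not part of the posted driver.

```c
#include <stdint.h>

#define AXON_NIC_MAC_LEN 6

/*
 * Hypothetical helper: build a MAC address that encodes which side of the
 * link and which board the interface belongs to.
 *   side  - 0 for the Cell (powerpc) side, 1 for the host, as in the macros
 *   board - 0..9, an assumed per-fabric board index
 * Returns 0 on success, -1 if an argument is out of range.
 */
static int axon_nic_make_mac(uint8_t mac[AXON_NIC_MAC_LEN],
			     unsigned int side, unsigned int board)
{
	if (side > 1 || board > 9)
		return -1;

	mac[0] = 0x02;			/* locally administered, unicast */
	mac[1] = 'A';
	mac[2] = 'X';
	mac[3] = (uint8_t)('0' + side);
	mac[4] = 'N';
	mac[5] = (uint8_t)('0' + board);	/* assumption: board index */
	return 0;
}
```

With side 0 and board 0 this yields the same six bytes as the powerpc macro, so a fixed scheme like this stays compatible with the current addresses while making the board recoverable from the MAC.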

> > +
> > +#ifndef __powerpc__
> > +#define axon_nic_virt_to_bus(x) virt_to_bus( (x) )
> > +#else
> > +#define axon_nic_virt_to_bus(x) virt_to_phys( (x) )
> > +#endif
>
> No, you need to use the pci dma mapping interface, or
> the of_device dma mapping on powerpc

Yes I know ... I need to work on this ...

> > +static struct net_device **axon_nic_devs = NULL;
>
> You shouldn't need to store a global array of these, just attach
> it to the struct device you use.

As I said above, there is no real "device" to attach to. It is very much 
virtual ...

> > +static void
> > +axon_nic_dma_skb_avail_completion_handler(struct axon_dmax_t
> > +					  *p_axon_dmax, struct axon_dma_req_t
> > +					  *p_dma_req, void *context)
> > +{
> > +	struct sk_buff *skb = context;
> > +
> > +	struct axon_nic_t *axon_nic = netdev_priv(skb->dev);
> > +
> > +#if defined(AXON_DEBUG_NIC)
> > +
> > +	dbg_nic_log("Skb 0x%p after DMA completion \n", skb);
> > +
> > +	axon_nic_skb_print(axon_nic, skb);
> > +#endif
> > +
> > +
> > +	skb->protocol = eth_type_trans(skb, skb->dev);
> > +
> > +
> > +	skb->ip_summed = CHECKSUM_UNNECESSARY;
> > +
> > +	axon_nic->stats.rx_packets++;
> > +	axon_nic->stats.rx_bytes += skb_headlen(skb);
> > +
> > +	skb->dev->last_rx = jiffies;
> > +
> > +
> > +	dbg_nic_log("Skb 0x%p is passed to the network stack\n", skb);
> > +	netif_rx(skb);
> > +}
>
> I'm not sure I understand what you do here. This looks like you get one
> interrupt callback for each incoming packet, which is rather inefficient.
> You should probably look into using a NAPI poll() function to avoid
> rx and tx interrupts whenever possible.

It is a bit more complex than that. I can explain here how it works. For each
SKB to transfer there is an MBX dialog between the 2 sides of the PCI-E link.

It goes as follows:

/*
 * Ethernet frame exchange protocol
 *
 *    Emitter                                         Receiver
 * 1 -  Linux asks to transmit an skb
 * 2 -  Emitter builds a message with the
 *      SKB PLB addr and size
 *      AXON_NIC_SMS_SKB_AVAIL  ---------->  3 - Receiver allocates an sk_buff
 *                                               of the requested size and
 *                                               creates a DMA read req to xfer
 *                                               the data, with a message
 *                                               notifying the completion of
 *                                               the xfer. It also asks to be
 *                                               notified when the transfer is
 *                                               complete.
 *
 * 4 - The Emitter frees up the <---------      AXON_NIC_SMS_SKB_XFERD
 *     sk_buff                       |
 *                                   ---->  5 - The receiver propagates the
 *                                              sk_buff up to the stack
 *
 * If something goes wrong in step 3, the receiver sends a cancel message.
 *
 */

So for each SKB the receiver gets 2 interrupts (+ payload) and the emitter gets
one. It might not sound like the most efficient protocol but we do need some
messaging to synchronize resource usage and SKB management.
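
The interrupt accounting can be checked with a small userland simulation of one exchange. This is only a sketch of the protocol as described in the comment above, not driver code: the struct, the helper functions and the AXON_NIC_SMS_SKB_CANCEL name are hypothetical, and it assumes the receiver's two interrupts are the SKB_AVAIL mailbox message and the DMA completion notification.

```c
#include <stdbool.h>

/* AVAIL and XFERD come from the protocol comment; CANCEL is an assumed name. */
enum axon_nic_msg {
	AXON_NIC_SMS_SKB_AVAIL,
	AXON_NIC_SMS_SKB_XFERD,
	AXON_NIC_SMS_SKB_CANCEL,
};

struct sim_side {
	unsigned int irqs;	/* interrupts taken by this side */
	unsigned int delivered;	/* skbs passed up the stack (receiver) */
	unsigned int freed;	/* skbs released after transfer (emitter) */
};

/* Step 3 and 5: receiver reacts to SKB_AVAIL and returns its reply. */
static enum axon_nic_msg receiver_handle_avail(struct sim_side *rx,
					       bool alloc_ok)
{
	rx->irqs++;			/* MBX interrupt for SKB_AVAIL */
	if (!alloc_ok)
		return AXON_NIC_SMS_SKB_CANCEL;	/* step 3 failed */

	rx->irqs++;			/* DMA completion notification */
	rx->delivered++;		/* step 5: skb goes up the stack */
	return AXON_NIC_SMS_SKB_XFERD;
}

/* Step 4: emitter reacts to the reply message. */
static void emitter_handle_reply(struct sim_side *tx, enum axon_nic_msg reply)
{
	tx->irqs++;			/* MBX interrupt for the reply */
	if (reply == AXON_NIC_SMS_SKB_XFERD)
		tx->freed++;		/* transfer done, skb can go */
	/* What the emitter does on cancel is not described in the text. */
}

/* Run one exchange; returns true if the skb reached the receiver's stack. */
static bool sim_xfer_one(struct sim_side *tx, struct sim_side *rx,
			 bool alloc_ok)
{
	enum axon_nic_msg reply = receiver_handle_avail(rx, alloc_ok);

	emitter_handle_reply(tx, reply);
	return reply == AXON_NIC_SMS_SKB_XFERD;
}
```

Running one successful exchange leaves the receiver with 2 interrupts and the emitter with 1, matching the counts given above.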

> > +static __init int
> > +axon_nic_module_init(void)
> > +{
> > +	int             ret = 0;
> > +
> > +	axon_nic_devs_count = axon_board_count();
> > +	dbg_nic_inf("Found %d board(s) \n", axon_nic_devs_count);
> > +
> > +
> > +	if (axon_nic_devs_count > 0) {
> > +		axon_nic_devs =
> > +		    kzalloc(sizeof(struct net_device *) *
> > +			    axon_nic_devs_count, GFP_KERNEL);
> > +
> > +		if (axon_nic_devs != NULL) {
> > +			int             i_board;
> > +
> > +			for (i_board = 0; i_board < axon_nic_devs_count;
> > +			     i_board++) {
> > +
> > +				axon_nic_devs[i_board] =
> > +				    alloc_netdev(sizeof(struct axon_nic_t),
> > +						 "axon_nic%d",
> > +						 axon_nic_init);
>
> Your initialization completely circumvents the Linux driver model.
> Normally, the module init function should just register a driver that
> is then used for each device that gets found.

2 things:

1) This driver works for the host (a PCI-E Opteron-based system, for example)
and the Cell attached to the Axon.  So on one side (the PCI) we have a single 
Axon device with a slew of resources we can use if we know about them (no OF 
tree for them) and on the other side (the Cell) we have the Axon device (not 
always [FCAB]) expressed in an OF tree.
2) We are not attaching to a dedicated Ethernet device but to a set of 
sharable resources (DMA, MBX, PIM registers for PCI-E mapping, ...) that we 
use to emulate Ethernet.

> I realize that this is not that easy on AXON, since you can use
> the same hardware for a number of different tasks and/or kernel
> modules. I don't have a good overview of how you try to solve this,
> but I think that one side (cell or host) should define how
> the AXON interface is used, and the other side should have
> a way to detect this. E.g. when you configure the network
> interface on one side, the device should pop up on the other side
> and the driver loaded automatically.

Well, when you start the service on one side the other side might not be
running Linux yet. Or Linux might be running but the Ethernet driver is not
loaded yet. Or you could be addressing a PCI-E switched fabric where some CABs
are running the driver and others are not (or not yet).

> Can you describe which 
> resources on AXON can be used for different drivers in conflicting
> ways? E.g. is everything you need for the network driver always
> available, or can all the DMA channels be already in use?

At this point there is no real "negotiation". The host is using 1 DMAX channel
for itself (and shares it with all its drivers) and the Cell is using another
DMAX channel (and shares it with all its drivers).

There is one hardware MBX on the Cell shared by all local drivers and that can 
be targeted by all remote (host or even self) drivers.

There is an emulated software MBX on the host side shared by all local drivers 
and targeted by all remote (Cell or even self) drivers.

The host will also have to deal with the various PCI mapping window registers
to be able to access all of the 1TB Axon memory space (we generally get
something like 2 x 128MB BARs on the PCI-E side, compared to the 1TB
internal memory space).
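
As a rough idea of the arithmetic involved, the sketch below splits an Axon-side address into a window base and an offset inside one 128MB BAR. The sizes come from the figures above. The helper name, the struct, and the assumption that a window must be aligned on its own size are hypothetical and only serve the example; the real mapping registers may impose different constraints.

```c
#include <stdint.h>

#define AXON_SPACE_SIZE		(1ULL << 40)	/* 1TB internal space */
#define AXON_WINDOW_SIZE	(128ULL << 20)	/* one 128MB BAR */

struct axon_window_pos {
	uint64_t base;		/* value to program as the window base */
	uint64_t offset;	/* offset to use inside the BAR */
};

/*
 * Hypothetical helper: find the size-aligned window that covers axon_addr.
 * Returns 0 on success, -1 if the address is outside the 1TB space.
 */
static int axon_window_locate(uint64_t axon_addr, struct axon_window_pos *pos)
{
	if (axon_addr >= AXON_SPACE_SIZE)
		return -1;

	pos->base = axon_addr & ~(AXON_WINDOW_SIZE - 1);
	pos->offset = axon_addr & (AXON_WINDOW_SIZE - 1);
	return 0;
}
```

With these sizes there are 8192 possible window positions per BAR, which is why the host has to reprogram the mapping registers rather than map everything once.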

> > +module_init(axon_nic_module_init);
> > +module_exit(axon_nic_module_cleanup);
>
> these should be right after the respective function.

OK

>
> 	Arnd <><


