[Cbe-oss-dev] [RFC 4/9] AXON - Ethernet over PCI-E driver

Wed Dec 27 07:42:32 EST 2006

On Friday 22 December 2006 16:17, Jean-Christophe Dubois wrote:
> On Thursday 21 December 2006 22:34, Arnd Bergmann wrote:

> > Since this is a network driver, it should probably go to drivers/net/,
> > and you should take netdev at vger.kernel.org on Cc: when submitting it
> > for review.
> 
> OK, I'll do it next time. However, you realize that this driver needs the
> underlying infrastructure/services provided by the low-level Axon driver. I
> mean, this driver doesn't work directly on top of hardware but on top of
> another driver that abstracts the hardware to equalize it for the host and the
> Cell.

We also have a few ethernet drivers outside of drivers/net (e.g.
in drivers/usb/net or drivers/s390/net), but I don't think we need
to treat this one as another of these exceptions. We definitely
need to get a good interface definition for how the high-level
network driver can communicate with the low-level drivers.

I haven't really understood how that aspect of your code works,
but my impression is that we should do it more like other
linux drivers already do.

> > > +#ifdef __powerpc__
> > > +#define AXON_NIC_MAC_ADDR "\2AX0N0"
> > > +#else
> > > +#define AXON_NIC_MAC_ADDR "\2AX1N0"
> > > +#endif
> >
> > You can ask the network layer to generate a random valid mac address
> > for you, instead of hardcoding these.
> 
> I'll think about this. So far I wanted to stay in control of MAC addresses so 
> that I can control PCI-E routing when we will put several CAB boards (for 
> example) in a PCI-E switched fabric. In this case I need to be able to 
> establish a relationship between the MAC address and the Axon in the switched 
> fabric.

Ideally, you should then request an official MAC address from IEEE or
whoever is responsible for assigning them and find a way to burn it into
the card, the same way it is done for the external card.

If that is not possible, maybe you can use a random address and encode
the PCI address of the CAB card into some of the bits. The hardcoded
address definitely becomes tricky at the point where you use virtually
switched ethernet in the host.
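
As a rough illustration of the second option, here is a user-space
sketch that mixes random bytes with the PCI location of the card. The
byte layout and the name axon_make_mac() are assumptions made up for
this example, not something taken from the driver. The only fixed rule
is in the first byte: bit 1 marks the address as locally administered
and bit 0 must stay clear so that it is a unicast address.

```c
#include <stdint.h>

/*
 * Build a 6-byte MAC address from random bytes plus a PCI location.
 * Hypothetical layout, chosen only for illustration:
 *   byte 0    : random, forced to "locally administered, unicast"
 *   bytes 1-3 : random
 *   byte 4    : PCI bus number
 *   byte 5    : PCI devfn (device << 3 | function)
 */
static void axon_make_mac(uint8_t mac[6], const uint8_t rnd[4],
			  uint8_t bus, uint8_t dev, uint8_t fn)
{
	/* set the locally-administered bit, clear the multicast bit */
	mac[0] = (uint8_t)((rnd[0] | 0x02) & 0xfe);
	mac[1] = rnd[1];
	mac[2] = rnd[2];
	mac[3] = rnd[3];
	mac[4] = bus;
	mac[5] = (uint8_t)(((dev & 0x1f) << 3) | (fn & 0x07));
}
```

In the kernel the random part could come from random_ether_addr(),
which already takes care of the two special bits in the first byte.
Note that the existing "\2AX0N0" string also starts with 0x02, so it is
already in the locally administered range.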

> > > +static struct net_device **axon_nic_devs = NULL;
> >
> > You shouldn't need to store a global array of these, just attach
> > it to the struct device you use.
> 
> As I said above, there is no real "device" to attach to. It is very much 
> virtual ...

Ok, I still need to find out just how virtual it really is, e.g. if you
can create any number of virtual devices on a given hardware or what
the limitation is.

> /*
>  * Ethernet frame exchanged Protocol
>  *
>  *    Emitter                                         Receiver
>  * 1 -  Linux ask for transmitted a skb
>  * 2 -  Emitter build a message with the
>  *      SKB PLB addr and size
>  *      AXON_NIC_SMS_SKB_AVAIL  ---------->  3 - Receiver allocate a sk_buff
>  *                                               of the requested size, create
>  *                                               a DMA read req to xfer the 
>  *                                               data with a message notifying
>  *                                               the completion of the xfer. It
>  *                                               also ask to be notified when
>  *                                               the transfer is complete.
>  *
>  * 4 - The Emitter free up the  <---------      AXON_NIC_SMS_SKB_XFERD  
>  *     sk_buff                       |
>  *                                   ---->  5 - The receiver propagate the 
>  *                                              sk_buff up to the stack 
>  *
>  * If something does wrong in step 3, the receiver send a cancel message.
>  *                                                  
>  */
> 
> So for each SKB the receiver gets 2 interrupts (+ payload) and the emitter gets
> one. It might not sound like the most efficient protocol but we do need some
> messaging to synchronize resource usage and SKB management.
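
For reference, the exchange described in the comment can be modelled
in user space roughly as follows. The two message names come from the
comment above; AXON_NIC_SMS_SKB_CANCEL, the struct layout and both
function names are hypothetical. The sketch also collapses the DMA
transfer into the reply, so it only shows who sends what, not the
timing.

```c
#include <stddef.h>
#include <stdint.h>

enum axon_nic_msg_type {
	AXON_NIC_SMS_SKB_AVAIL,		/* emitter -> receiver: skb ready */
	AXON_NIC_SMS_SKB_XFERD,		/* receiver -> emitter: DMA done */
	AXON_NIC_SMS_SKB_CANCEL		/* hypothetical name for step 3 failure */
};

/* Hypothetical message layout: type plus PLB address and size of the skb */
struct axon_nic_msg {
	enum axon_nic_msg_type type;
	uint64_t plb_addr;
	uint32_t size;
};

/*
 * Receiver, step 3: accept the frame if a buffer of that size can be
 * provided, otherwise answer with a cancel message.
 */
static struct axon_nic_msg
receiver_handle_avail(const struct axon_nic_msg *m, size_t max_frame)
{
	struct axon_nic_msg reply = *m;

	if (m->size == 0 || m->size > max_frame)
		reply.type = AXON_NIC_SMS_SKB_CANCEL;
	else
		reply.type = AXON_NIC_SMS_SKB_XFERD;
	return reply;
}

/*
 * Emitter, step 4: returns 1 if the frame reached the receiver, 0 if
 * it was cancelled. This sketch assumes the emitter can release its
 * sk_buff in both cases.
 */
static int emitter_frame_delivered(const struct axon_nic_msg *reply)
{
	return reply->type == AXON_NIC_SMS_SKB_XFERD;
}
```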

Ok, I think I got this. I have worked on other virtual ethernet drivers
before and they usually try to use significantly fewer interrupts, typically
only for signalling when the status changes from no-data-available to
data-available. Are you satisfied with the performance you get with the
current design? I think it can be done in a much smarter way, but probably
at the cost of additional complexity.
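
To make the no-data-available to data-available idea concrete, here is
a minimal user-space sketch of a descriptor ring where the producer
only asks for an interrupt when it finds the ring empty. All names are
made up for the example. A real implementation on shared memory would
also need memory barriers and has to close the race where the consumer
goes idle just as the producer queues an entry, neither of which is
shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u	/* must be a power of two */

struct ring {
	uint32_t head;			/* next slot the producer writes */
	uint32_t tail;			/* next slot the consumer reads */
	uint32_t slot[RING_SIZE];	/* descriptors, e.g. buffer handles */
};

/*
 * Queue one descriptor. Returns false if the ring is full. On success
 * *need_irq tells the caller whether the consumer has to be woken up,
 * which is only the case when the ring was empty before this push.
 */
static bool ring_push(struct ring *r, uint32_t desc, bool *need_irq)
{
	if (r->head - r->tail == RING_SIZE)
		return false;
	*need_irq = (r->head == r->tail);
	r->slot[r->head % RING_SIZE] = desc;
	r->head++;
	return true;
}

/* Take one descriptor, returns false once the ring is drained. */
static bool ring_pop(struct ring *r, uint32_t *desc)
{
	if (r->head == r->tail)
		return false;
	*desc = r->slot[r->tail % RING_SIZE];
	r->tail++;
	return true;
}
```

With this scheme a burst of five frames pushed into an empty ring asks
for one interrupt instead of ten, because the consumer keeps calling
ring_pop() until it returns false before going idle again.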

> > > +static __init int
> > > +axon_nic_module_init(void)
> > > +{
> > > +	int             ret = 0;
> > > +
> > > +	axon_nic_devs_count = axon_board_count();
> > > +	dbg_nic_inf("Found %d board(s) \n", axon_nic_devs_count);
> > > +
> > > +
> > > +	if (axon_nic_devs_count > 0) {
> > > +		axon_nic_devs =
> > > +		    kzalloc(sizeof(struct net_device *) *
> > > +			    axon_nic_devs_count, GFP_KERNEL);
> > > +
> > > +		if (axon_nic_devs != NULL) {
> > > +			int             i_board;
> > > +
> > > +			for (i_board = 0; i_board < axon_nic_devs_count;
> > > +			     i_board++) {
> > > +
> > > +				axon_nic_devs[i_board] =
> > > +				    alloc_netdev(sizeof(struct axon_nic_t),
> > > +						 "axon_nic%d",
> > > +						 axon_nic_init);
> >
> > Your initialization completely circumvents the Linux driver model.
> > Normally, the module init function should just register a driver that
> > is then used for each device that gets found.
> 
> 2 things:
> 
> 1) this driver works for the host (a PCI-E opteron based system for example) 
> and the Cell attached to the Axon.  So on one side (the PCI) we have a single 
> Axon device with a slew of resources we can use if we know about them (no OF 
> tree for them) and on the other side (the Cell) we have the Axon device (not 
> always [FCAB]) expressed in an OF tree.
> 2) We are not attaching to a dedicated Ethernet device but to a set of 
> sharable resources (DMA, MBX, PIM registers for PCI-E mapping, ...) that we 
> use to emulate Ethernet.

Ok.

> > I realize that this is not that easy on AXON, since you can use
> > the same hardware for a number of different tasks and/or kernel
> > modules. I don't have a good overview of how you try to solve this,
> > but I think that one side (cell or host) should define how
> > the AXON interface is used, and the other side should have
> > a way to detect this. E.g. when you configure the network
> > interface on one side, the device should pop up on the other side
> > and the driver loaded automatically.
> 
> Well, when you start the service on one side the other side might not be
> running Linux yet. Or Linux might be running but the Ethernet driver is not
> loaded yet. Or you could be addressing a PCI-E switched fabric where some CABs
> are running the driver and some others are not (or not yet).

Ok, I'm starting to see where the problems are coming from.

> > Can you describe which 
> > resources on AXON can be used for different drivers in conflicting
> > ways? E.g. is everything you need for the network driver always
> > available, or can all the DMA channels be already in use?
> 
> At this point there is no real "negotiation". The host is using 1 DMAX channel 
> for itself (and share it with all its drivers) and the Cell is using another 
> DMAX channel (and share it with all its drivers).
> 
> There is one hardware MBX on the Cell shared by all local drivers and that can 
> be targeted by all remote (host or even self) drivers.
> 
> There is an emulated software MBX on the host side shared by all local drivers 
> and targeted by all remote (Cell or even self) drivers.
> 
> The host will also have to deal with the various PCI mapping window register 
> to be able to access all the 1TB Axon memory space (we generally get 
> something like 2 x 128MB bars on the PCI-E side to be compared to the 1TB 
> internal memory space).
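
The window arithmetic behind that last point is simple enough to write
down. This is only a sketch using the sizes quoted above (128MB BAR,
1TB internal space); the function and struct names are made up, and
programming the actual mapping register is left out.

```c
#include <stdint.h>

#define AXON_WINDOW_SIZE	(128ull << 20)	/* one 128MB BAR */
#define AXON_SPACE_SIZE		(1ull << 40)	/* 1TB internal space */

/* Where a PLB address ends up once the window has been moved */
struct axon_window_pos {
	uint64_t base;		/* window-aligned address to map */
	uint64_t offset;	/* offset of the target inside the BAR */
};

static struct axon_window_pos axon_window_for(uint64_t plb_addr)
{
	struct axon_window_pos p;

	p.base = plb_addr & ~(AXON_WINDOW_SIZE - 1);
	p.offset = plb_addr & (AXON_WINDOW_SIZE - 1);
	return p;
}

/* Number of distinct window positions needed to cover the whole space */
static uint64_t axon_window_count(void)
{
	return AXON_SPACE_SIZE / AXON_WINDOW_SIZE;
}
```

So a single BAR has 8192 possible positions, which is why the mapping
registers have to be treated as a shared resource between the drivers
on the host.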

I wonder if on the Cell side you could use the mailbox to announce the
services that are provided by the card. Would it be ok if the CAB board has
the power to decide what drivers to use (network, user space pipe, ...) and
let the host autoload the respective drivers? I can see the advantage of
keeping both sides of the driver identical, but then again the hardware is
not symmetric at all, so if you see the board as an appliance, you would
expect it to give you a probe-able list of virtual devices attached to it.

To express that as code, you could have a 'struct bus_type axon_bus',
which has two drivers providing the low-level functionality, one
being a collection of of_devices for Cell, and another one as
a pci_device for PCIe.

For that bus_type, you can have 'struct axon_driver' drivers
subclassed from 'struct device_driver', one driver each for stuff
like network, user mailbox, user dma, etc.
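
A user-space model of that subclassing might look like the following.
struct device and struct device_driver here are tiny stand-ins for the
kernel structures, and everything else (the type string, the bound
counter, axon_bus_match(), axon_bus_probe()) is an assumption made for
the sake of the example. The point is only the embedding plus
container_of() pattern and the match-then-probe sequence.

```c
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins for the kernel structures of the same name */
struct device_driver { const char *name; };
struct device { const char *bus_id; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* A virtual device on the hypothetical axon_bus */
struct axon_device {
	const char *type;	/* "network", "mailbox", "dma", ... */
	int bound;		/* set once a driver has probed it */
	struct device dev;
};

/* "Subclass" of device_driver: the generic part is embedded */
struct axon_driver {
	const char *type;
	int (*probe)(struct axon_device *adev);
	struct device_driver driver;
};

static struct axon_device *to_axon_device(struct device *dev)
{
	return container_of(dev, struct axon_device, dev);
}

static struct axon_driver *to_axon_driver(struct device_driver *drv)
{
	return container_of(drv, struct axon_driver, driver);
}

/* Bus match callback: a driver handles devices of its own type */
static int axon_bus_match(struct device *dev, struct device_driver *drv)
{
	return strcmp(to_axon_device(dev)->type,
		      to_axon_driver(drv)->type) == 0;
}

/* What the driver core would do for each device/driver pair */
static int axon_bus_probe(struct device *dev, struct device_driver *drv)
{
	if (!axon_bus_match(dev, drv))
		return -1;
	return to_axon_driver(drv)->probe(to_axon_device(dev));
}

/* The network driver would allocate and register its net_device here */
static int axon_nic_probe(struct axon_device *adev)
{
	adev->bound = 1;
	return 0;
}

static struct axon_driver axon_nic_driver = {
	.type	= "network",
	.probe	= axon_nic_probe,
	.driver	= { .name = "axon_nic" },
};
```

With this structure the module init function of the network driver
shrinks to a single registration call for axon_nic_driver, and the
global axon_nic_devs array is no longer needed because each net_device
hangs off the axon_device passed to probe().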

One of the tricky bits obviously is how to get the devices into the
bus, when all devices are virtual. This can be done either in a
symmetric or asymmetric fashion. I would say that the pci_device
should be subscribed for mailbox messages that tell it the
availability of a new low-level device, and it should be possible
to ask what devices are already there.

On the of_device side, you can have a simple user interface to
add virtual devices. This can be done e.g. using sysfs attributes
or a misc device with ioctl methods. When you add a virtual network
axon_device through this interface, it should be announced to the
host, which then would add a corresponding axon_device to its
linux device tree (which becomes visible in sysfs). Both sides
can then autoload the network driver for the new device, and
when they attach their axon_driver to the axon_bus, the
device gets passed to the driver's probe() function.
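
The flow I have in mind can be simulated in a few lines of user-space
C. Everything here is hypothetical (message format, names, the fixed
size tables); it only demonstrates that the order of "device appears"
and "driver gets loaded" does not matter, which is what you need when
the other side is not running Linux or the driver yet.

```c
#include <stdint.h>

#define AXON_MAX_DEVS	8
#define AXON_OP_ADD	1

enum axon_dev_type { AXON_DEV_NONE = 0, AXON_DEV_NET, AXON_DEV_PIPE };

/* Hypothetical mailbox message announcing a new virtual device */
struct axon_announce {
	uint8_t op;
	uint8_t type;
	uint16_t id;
};

/* State of one side (Cell or host) of the axon bus */
struct axon_side {
	uint8_t dev_type[AXON_MAX_DEVS];	/* index = device id */
	uint8_t bound[AXON_MAX_DEVS];
	uint8_t drivers;			/* bitmask of loaded types */
};

static void axon_try_bind(struct axon_side *s, unsigned int id)
{
	uint8_t t = s->dev_type[id];

	if (t != AXON_DEV_NONE && !s->bound[id] && (s->drivers & (1u << t)))
		s->bound[id] = 1;	/* the driver's probe() runs here */
}

/* A device shows up, either created locally or announced remotely */
static int axon_add_device(struct axon_side *s, unsigned int id, uint8_t type)
{
	if (id >= AXON_MAX_DEVS || s->dev_type[id] != AXON_DEV_NONE)
		return -1;
	s->dev_type[id] = type;
	axon_try_bind(s, id);
	return 0;
}

/* A driver gets loaded: offer it every device that is already there */
static void axon_register_driver(struct axon_side *s, uint8_t type)
{
	unsigned int id;

	s->drivers |= (uint8_t)(1u << type);
	for (id = 0; id < AXON_MAX_DEVS; id++)
		axon_try_bind(s, id);
}

/* Cell side: create the device and fill in the message for the host */
static int axon_cell_create(struct axon_side *cell, unsigned int id,
			    uint8_t type, struct axon_announce *msg)
{
	if (axon_add_device(cell, id, type))
		return -1;
	msg->op = AXON_OP_ADD;
	msg->type = type;
	msg->id = (uint16_t)id;
	return 0;
}

/* Host side: mailbox handler for the announcement */
static int axon_host_receive(struct axon_side *host,
			     const struct axon_announce *msg)
{
	if (msg->op != AXON_OP_ADD)
		return -1;
	return axon_add_device(host, msg->id, msg->type);
}
```

If the host comes up late, it would ask the Cell side for the list of
existing devices and feed each entry through the same
axon_host_receive() path.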

	Arnd <><