[Cbe-oss-dev] [patch 07/16] Ethernet over PCI-E driver

Arnd Bergmann arnd at arndb.de
Tue Jun 5 02:38:57 EST 2007


On Saturday 02 June 2007, Jean-Christophe Dubois wrote:
> On Thursday 31 May 2007 16:20:24 Arnd Bergmann wrote:
> > On Thursday 24 May 2007, jdubois at mc.com wrote:
> > > This is the Ethernet over PCI-E driver. With it you get some
> > > reasonable performance.
> >
> > Ok, I already reviewed an IBM-internal implementation of an ethernet
> > driver, but as I heard, you are now working together with the people
> > doing that one.
> 
> Do I? Hmm ... I will talk with myself about this ... Who at IBM should I tell 
> myself to talk to?

Murali (on Cc: now) has the other code; he told me that he is now working
together with the Mercury developers.


> > > +#define AXON_NIC_TX_QUEUE_LEN 	 255
> > > +
> > > +#ifdef __powerpc__
> > > +#define AXON_NIC_MAC_ADDR "\2AX0N0"
> > > +#else
> > > +#define AXON_NIC_MAC_ADDR "\2AX1N0"
> > > +#endif
> >
> > Does a static mac address work if you have multiple CAB boards
> > in the system, or a blade with multiple axons?
> 
> I will certainly have to rework this part when I get a system with 2 CABs (or 
> better, a PCI-E switched fabric). As for the malta blade, it is not concerned 
> by this ethernet implementation so far, because it has no remote Axon to 
> speak to. The Tri-Blade system might change this, though... Other drivers 
> (DMA, buffer) are working on malta.

You might want to use the random_ether_addr() function. It creates a valid,
locally administered random address, which is unique for all practical
purposes, but unfortunately it also breaks things like DHCP if the address
changes on every boot.
 
Ideally there should be a way to assign an official mac address to the
board, but I'm not sure where to store it.
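For reference, the kernel helper essentially fills six random bytes and then
fixes up the first octet. Here is a self-contained user-space sketch of the
same bit manipulation (the name make_random_ether_addr is made up for
illustration; it is not the kernel function itself):

```c
#include <stdlib.h>
#include <stdint.h>

/* User-space sketch of what the kernel's random_ether_addr() does:
 * fill six bytes with random data, then clear the multicast bit and
 * set the locally-administered bit in the first octet, so the result
 * is a valid unicast MAC that cannot collide with vendor-assigned ones.
 */
static void make_random_ether_addr(uint8_t addr[6])
{
	int i;

	for (i = 0; i < 6; i++)
		addr[i] = (uint8_t)(rand() & 0xff);

	addr[0] &= 0xfe;	/* clear multicast bit */
	addr[0] |= 0x02;	/* set locally-administered bit */
}
```

The locally-administered bit is what makes the address "valid" without an
OUI assignment; the randomness is what makes DHCP leases churn across boots.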

> > Note that this data structure has different alignment constraints
> > on i386 and powerpc/x86_64, because of the legacy i386 ABI,
> > so you run into trouble if the data structure is shared with
> > user space or over the bus. (I haven't looked if it is used that
> > way, but the same problem may appear in other places).
> 
> It is not shared ... There are no shared structures between the 2 (or N) ends 
> besides the SKB data parts (which are not "structures" but just BLOBs).

ok

> >
> > I'd suggest that for the version that gets in finally, you only support
> > NAPI, there is not much point in a compile-time option here, especially
> > if it needs to be the same on both sides.
> 
> I know everybody is very much in favor of NAPI (and it makes sense for normal 
> devices), but honestly I am not sure it makes sense in this case. It just adds 
> a whole lot of complexity for no real performance advantage in my opinion 
> (at least with the present design). The main problem is that the Ethernet 
> driver is sharing its hardware resources with other drivers. And everything 
> is done through hardware ... As I said before, there is no shared structure 
> at all.

One of the main advantages of NAPI is the end-to-end flow control.
If no application actually processes any incoming data packets because
the CPU is overloaded, the network driver can simply disable interrupts
and stop receiving frames from the queue in order to avoid dropping
frames or spending additional cycles on that.
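The budget mechanism behind that flow control can be illustrated with a small
user-space sketch (all names here are hypothetical, not the actual NAPI API):
the poll routine processes at most `budget` frames per invocation; if it
drains the queue it re-enables interrupts, otherwise interrupts stay off and
the device is simply polled again, so an overloaded CPU stops taking rx
interrupts entirely.

```c
#include <stdbool.h>

/* Hypothetical device state for the sketch. */
struct fake_nic {
	int pending;		/* frames waiting in the rx queue */
	bool irq_enabled;
};

/* Process up to 'budget' frames and return the number processed.
 * Interrupts stay disabled as long as work remains, which is the
 * end-to-end flow-control property NAPI provides.
 */
static int fake_poll(struct fake_nic *nic, int budget)
{
	int done = 0;

	while (done < budget && nic->pending > 0) {
		nic->pending--;	/* "receive" one frame */
		done++;
	}

	if (nic->pending == 0)
		nic->irq_enabled = true;	/* queue drained: irqs back on */

	return done;
}
```

With 100 frames pending and a budget of 64, the first call consumes 64 and
leaves interrupts off; the second call drains the remaining 36 and re-arms
them.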

> > > +static struct list_head axon_nic_list;
> >
> > Should not be needed if you use the driver core correctly.
> 
> Remember that I have no single hardware device to tie this driver to ... This 
> driver is implemented on top of a set of services, not on top of a particular 
> piece of hardware ... In addition, the hardware/services used are shared with 
> other drivers, so I can't get "exclusive" usage of them ...

Right, but you know what devices are potentially there and can register
a network device for each of them. The actual acquisition of shared
resources can still happen at open(), i.e. ifup, time.

> > Can you turn this into a single allocation? That would reduce the
> > chance of failing with out-of-memory
> 
> I could work on this ... This would mean I should allocate a larger SKB and 
> put my private structure somewhere in the data part. This seems a bit hacky 
> to me ... Is it worth the benefit?

I think the code would get a little cleaner, but I don't know if it's really
meant to be done that way.
 
> Now maybe my question is stupid, but how am I more likely to fail? 
> Allocating a small memory piece + a large SKB, or allocating a single larger 
> SKB? Remember I may allocate SKBs that are up to 64KB already ...

You should not cross 64k with that, but rather make the maximum SKB size a
little smaller, because all allocations are power of two sized.
It only gets rid of the small allocation, but any GFP_ATOMIC kmalloc
is likely to fail at some point.

In the worst case, you have exactly one 64k page available, and doing the
small allocation first results in splitting it up into a new slab page,
meaning that the large allocation fails.
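The power-of-two point is easy to quantify: an allocator with (roughly)
power-of-two size classes rounds each request up to the next class, so a
64KiB SKB plus even a small header lands in the 128KiB bucket. A quick
sketch (the round-up helper is mine for illustration, not a kernel function):

```c
#include <stddef.h>

/* Round a request up to the next power of two, the way a power-of-two
 * sized allocator bins it.  Illustration only, not kernel code.
 */
static size_t pow2_roundup(size_t n)
{
	size_t size = 1;

	while (size < n)
		size <<= 1;
	return size;
}
```

So a request of 64KiB + 64 bytes really consumes 128KiB, which is why keeping
the maximum SKB a little under the 64KiB boundary matters.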

> > > +	p_nic_skb->bus_addr =
> > > +	    dma_map_single(axon_nic->axon->get_device(axon_nic->axon),
> > > +			   skb_put(skb, skb_size), skb_size, DMA_FROM_DEVICE);
> > > +
> > > +	skb_dst_plb_addr =
> > > +	    axon_addr_xltr_to_plb(axon_nic->xltr, p_nic_skb->bus_addr);
> >
> > I still don't understand how the translation works here. Normally,
> > dma_map_single should return the address in a way suitable for the
> > device you pass, so there is absolutely no need to do another
> > translation. What's different here?
> 
> First, remember this works from both the PCI and the BEI side. On the PCI end, 
> you do get a PCI address, but it is not suitable for the DMAX, as the DMAX uses 
> PLB addresses (all of our drivers are abstracted on top of PLB).

Hmm, I'd need to think about what to do about the PCI end.

> And it turns out that on the BEI side dma_map_single() on the Axon node
> device returns in fact a processor/Cell address (not a PLB address).

That sounds like we don't treat the dma-ranges property of the device
node correctly, or the property itself is broken.

Can you show a hexdump of all dma-ranges properties below the axon node?

> > but this looks puzzling. There is still a message incoming, so you
> > obviously have more work to do. I'd expect that you can simply
> > return 1 instead of scheduling ->poll to be called again
> > using the tasklet.
> 
> There was no more work to do, so the interrupt has been re-enabled. But before 
> leaving, a last check shows there are more SMS waiting. These came in while 
> the interrupts were in the process of being re-enabled.

Right, that was what I thought.

> I can't be sure if this is before or after, so I have to start the whole thing 
> again and reschedule the tasklet, which will disable the interrupts and so on. 
> I don't really care if, in the end, the tasklet is rescheduled from here or 
> from the interrupt handler.

Yes, but returning '1' has the same effect as scheduling the tasklet, just
without the overhead.
 
> > The ->poll function should also take care of cleaning up the tx queue, which
> > you don't do here at all. Any reason for this?
> 
> The TX queue is cleared by messaging, when I am told by the remote end that 
> it is done with a particular message.

So are you saying that you send an interrupt after every single frame that
is transferred? That sounds like you are creating much more interrupt traffic
than necessary.

	Arnd <><


