[Cbe-oss-dev] [patch 07/16] Ethernet over PCI-E driver

Jean-Christophe Dubois jdubois at mc.com
Tue Jun 5 20:34:08 EST 2007


On Monday 04 June 2007 18:38:57 Arnd Bergmann wrote:
> > > Ok, I already reviewed an IBM internal implementation of an ethernet
> > > driver, but as I heard you are now working together with the people
> > > doing that one.
> >
> > Do I ? Hum ... I will talk with myself about this ... Who at IBM should I
> > tell myself to talk to?
>
> Murali (on Cc: now) has the other code, he told me that he was now working
> together with Mercury developers

OK, I'll ask him ...

> > I will certainly have to rework this part when I get a system with 2 CABs
> > (or better, a PCI-E switched fabric). As for the malta blade it is not
> > concerned about this ethernet implementation so far because it has no
> > remote Axon to speak to. The Tri-Blade system might change this though...
> > Other drivers (DMA, buffer) are working on malta.
>
> You might want to use the random_ether_addr() function. It creates a valid
> random address, which means that it's always unique, but unfortunately
> it also breaks things like dhcp if it changes with every boot.
>
> Ideally there should be a way to assign an official mac address to the
> board, but I'm not sure where to store it.

We can generate them on the fly as long as the generation rule applies across the 
whole fabric under consideration. There is no need for an "official" MAC address 
(these MAC addresses should never be seen outside the PCI-E fabric). The current 
ones are tagged as "locally administered", which is good enough. What I have in 
mind is to allow some "routing" info to be stored in the MAC address, to allow 
easier handling of a PCI-E switched fabric later on.
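
Just to make the idea concrete, here is a rough sketch of how a locally 
administered MAC with embedded routing info could be built on top of 
random_ether_addr(). The axon_nic_set_mac() name and the node_id encoding are 
made up for illustration, this is not the actual driver code:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/string.h>

/* Illustrative only: derive a random, locally administered MAC address and
 * stash a fabric "node id" in the last two bytes, so frames could later be
 * routed on a switched PCI-E fabric. */
static void axon_nic_set_mac(struct net_device *ndev, u16 node_id)
{
	u8 addr[ETH_ALEN];

	random_ether_addr(addr);		/* unicast + locally administered */
	addr[4] = (node_id >> 8) & 0xff;	/* routing info, high byte */
	addr[5] = node_id & 0xff;		/* routing info, low byte */

	memcpy(ndev->dev_addr, addr, ETH_ALEN);
}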

> > > I'd suggest that for the version that gets in finally, you only support
> > > NAPI, there is not much point in a compile-time option here, especially
> > > if it needs to be the same on both sides.
> >
> > I know everybody is very in favor of NAPI (and it makes sense for normal
> > devices) but honestly I am not sure it makes sense in this case. It just
> > adds a whole lot of complexity for no real performance advantage in my
> > opinion (at least with the present design). The main problem is that the
> > Ethernet driver is sharing its hardware resources with other drivers. And
> > everything is done through hardware ... As I said before, there is no
> > shared structure at all.
>
> One of the main advantages of NAPI is the end-to-end flow control.
> If no application actually processes any incoming data packets because
> the CPU is overloaded, the network driver can simply disable interrupts
> and stop receiving frames from the queue in order to avoid dropping
> frames or spending additional cycles on that.

End to end flow control will be harder to achieve when/if we are part of a 
PCI-E switched fabric. In such a fabric, a CAB "Ethernet device" can be 
targeted by several (from one to several dozens of) remote devices. In that 
case we are no longer in a point to point architecture. So even if this is 
mostly the case today in a host/CAB architecture, I'd like to keep things 
open for later. It seems to me that flow controlling the sending side is good 
enough (which is what the current driver does).
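
For reference, the sender-side flow control amounts to the usual 
netif_stop_queue()/netif_wake_queue() pattern. A rough sketch (struct axon_nic 
and its fields are illustrative names, not the driver's real data structures):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct axon_nic {
	struct net_device	*ndev;
	unsigned int		tx_pending;	/* SKBs not yet acked by remote */
	unsigned int		tx_ring_size;
};

/* Sketch of the hard_start_xmit path: stop the TX queue when too many
 * frames are still in flight, so the stack gets back-pressured locally. */
static int axon_nic_xmit(struct sk_buff *skb, struct net_device *ndev)
{
	struct axon_nic *nic = netdev_priv(ndev);

	/* ... post the SKB to the remote side via DMA/MBX here ... */

	if (++nic->tx_pending >= nic->tx_ring_size - 1)
		netif_stop_queue(ndev);		/* back-pressure the local stack */

	return NETDEV_TX_OK;
}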

> > > > +static struct list_head axon_nic_list;
> > >
> > > Should not be needed if you use the driver core correctly.
> >
> > Remember that I have no single hardware device to tie this driver to ...
> > This driver is implemented on top of a set of services not on top of a
> > particular hardware ... In addition the hardware/services used are shared
> > with other drivers, so I can't get "exclusive" usage of it ...
>
> Right, but you know what devices are potentially there and can register
> a network device for each of them. The actual acquisition of shared
> resources can still happen at open(), i.e. ifup, time.

It seems to me you are proposing to move the problem to the fake Ethernet 
device creation, but the exact same issue will have to be solved there (you 
can't attach exclusively to any hardware resource).
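
If I understand the proposal correctly, it would look something like the 
sketch below, where axon_services_get()/put() are hypothetical names for 
whatever acquires and releases the shared DMA/MBX services; the exclusivity 
problem just shows up there instead of at probe time:

/* Sketch: register the net_device up front for every potential remote Axon,
 * and only try to grab the shared services at ifup time. */
static int axon_nic_open(struct net_device *ndev)
{
	struct axon_nic *nic = netdev_priv(ndev);
	int ret;

	ret = axon_services_get(nic);	/* may fail if the services are busy */
	if (ret)
		return ret;

	netif_start_queue(ndev);
	return 0;
}

static int axon_nic_stop(struct net_device *ndev)
{
	netif_stop_queue(ndev);
	axon_services_put(netdev_priv(ndev));
	return 0;
}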

> > And it turns out that on the BEI side dma_map_single() on the Axon node
> > device returns in fact a processor/Cell address (not a PLB address).
>
> That sounds like we don't treat the dma-ranges property of the device
> node correctly, or that the property itself is broken.
>
> Can you show a hexdump of all dma-ranges properties below the axon node?

There is no "dma-ranges" property at all in our device tree (as provided by 
IBM SLOF firmware).

> > I can't be sure if this is before or after, so I have to restart the
> > whole thing and reschedule the tasklet that will disable the interrupts
> > and so on. I don't really care if, in the end, the tasklet will be
> > rescheduled from here or from the interrupt handler.
>
> Yes, but returning '1' has the same effect as scheduling the tasklet, just
> without the overhead.

Except that the hardware interrupt would not be re-disabled. I guess I could 
disable it here and then return 1; that would avoid the tasklet reschedule.
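
In other words, something like this (old-style ->poll() interface; the 
axon_nic_* helpers are made-up names standing in for the existing MBX and 
interrupt plumbing):

/* Rough sketch: if more frames may still be pending, re-disable the
 * hardware interrupt and return 1 so ->poll() gets called again, instead
 * of rescheduling a separate tasklet. */
static int axon_nic_poll(struct net_device *ndev, int *budget)
{
	struct axon_nic *nic = netdev_priv(ndev);
	int done = axon_nic_rx(nic, min(*budget, ndev->quota));

	*budget -= done;
	ndev->quota -= done;

	if (axon_nic_rx_pending(nic)) {
		axon_nic_disable_irq(nic);	/* keep the interrupt masked */
		return 1;			/* more work, poll again */
	}

	netif_rx_complete(ndev);
	axon_nic_enable_irq(nic);
	return 0;
}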

> > > The ->poll function should also take care of cleaing up the tx queue,
> > > which you don't do here at all. Any reason for this?
> >
> > TX queue is cleared by messaging when I am told by the remote that it is
> > done with a particular message.
>
> So are you saying that you send an interrupt after every single frame that
> is transferred? That sounds like you are creating much more interrupt
> traffic than necessary.

After a frame is consumed, the receiving side sends an MBX message (an 
interrupt plus some info) back to the sending side, telling it exactly which 
SKB is done and can be released.
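
On the sending side that MBX ends up in something like the sketch below 
(illustrative names again, using the same hypothetical struct axon_nic as 
above); this is where the TX "queue" gets cleaned up and the netif queue is 
woken if it had been stopped:

/* Called from the MBX interrupt path when the remote side reports that a
 * given SKB has been consumed. */
static void axon_nic_tx_done(struct axon_nic *nic, struct sk_buff *skb)
{
	dev_kfree_skb_irq(skb);			/* safe from interrupt context */

	if (--nic->tx_pending < nic->tx_ring_size / 2)
		netif_wake_queue(nic->ndev);	/* resume a stopped TX queue */
}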

> 	Arnd <><
