[Cbe-oss-dev] [RFC 4/9] AXON - Ethernet over PCI-E driver

Arnd Bergmann arnd at arndb.de
Fri Dec 29 08:23:46 EST 2006


On Thursday 28 December 2006 21:44, Benjamin Herrenschmidt wrote:
> > One model that you could use is to have a statically allocated buffer
> > for addresses of incoming data on each side. In pseudocode:
>
> I'm not 100% sure but your example seems to be overlooking some ordering
> issues.

It definitely is. It was meant more as pseudocode than as something
you could plug into a driver.

> The ideal solution (almost no barriers needed) is something around the
> lines of what TG3 does.
>
> The rx and tx path are of course completely separate. Then, you have the
> descriptor array, and a pair of separate shared memory areas per ring. a
> pair because you really want to separate things written by the host from
> things written by the cell.
>
> When transferring in direction A -> B, the emitter (A) checks for room by
> checking the ring tail in the (B -> A) message area (it knows the ring
> head, as it's the sole manipulator of it; it's in the (A -> B) message
> area).
>
> It then writes packets, updates descriptors, and then (with proper
> wmb()'s to make sure things happen in order), update the ring head.
>
> The receiver checks the ring head, compares it to the ring tail and
> consume data when appropriate.

I thought that was about what my example tries to do, but you seem
to ignore the problem that you can't have real shared memory here,
only DMA transfers.
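The write-the-data / wmb() / publish-the-head ordering from the TG3-style
scheme above could be sketched roughly as follows, with C11 release/acquire
atomics standing in for the kernel's wmb()/rmb(); all names here are
invented for illustration, not taken from any actual driver:

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SIZE 16u   /* must be a power of two */

struct ring {
	int buf[RING_SIZE];
	/* head is written only by the emitter, tail only by the receiver */
	atomic_uint head;
	atomic_uint tail;
};

/* Emitter: write the payload first, then publish it by moving head.
 * The release store plays the role of wmb() in the kernel version. */
static int ring_put(struct ring *r, int val)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return -1;		/* ring full */
	r->buf[head % RING_SIZE] = val;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}

/* Receiver: observe head with acquire semantics (the rmb() side),
 * consume the data, then move tail to report the free space back. */
static int ring_get(struct ring *r, int *val)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return -1;		/* ring empty */
	*val = r->buf[tail % RING_SIZE];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 0;
}
```

Since each index has exactly one writer, no locks are needed; the pairing
of the release store with the acquire load is what keeps the head update
from becoming visible before the payload does.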

In the data structure I laid out (you'd have one per direction),
there are distinct variables that are written only by the emitter
or by the receiver and only read by the other side. Of course,
these should stay in separate cache lines in each coherency
domain.
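A layout along these lines (hypothetical names, and a cache line size
that would need adjusting per platform) keeps the emitter-written and
receiver-written variables on distinct cache lines:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 128	/* PPC64/Cell coherency granule; platform-specific */

/* One of these per transfer direction.  Each side writes only "its"
 * variable and reads the other's, so neither cache line ping-pongs
 * between the two coherency domains. */
struct axon_ring_state {
	/* written by the emitter, read only by the receiver */
	unsigned int head __attribute__((aligned(CACHE_LINE)));
	/* written by the receiver, read only by the emitter */
	unsigned int tail __attribute__((aligned(CACHE_LINE)));
};
```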

I guess one point you made that could simplify the scheme is that
the message area should not be split per direction but by who is
writing into it. If each side only does DMA reads and writes into
its own local buffers, that should further simplify the model.
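With that split, each side only ever writes its own local message area
and fetches a snapshot of the peer's area with a DMA read. A rough
simulation of the emitter's room check under that model, with memcpy
standing in for the DMA engine and all names invented:

```c
#include <assert.h>
#include <string.h>

/* Message area each side maintains in its own local memory; the peer
 * pulls a snapshot of it with a DMA read (simulated here by memcpy). */
struct msg_area {
	unsigned int head;	/* meaningful in the emitter's area */
	unsigned int tail;	/* meaningful in the receiver's area */
};

/* Stand-in for a DMA read from the remote memory domain. */
static void dma_read(struct msg_area *dst, const struct msg_area *remote)
{
	memcpy(dst, remote, sizeof(*dst));
}

/* Emitter side: check for room using a DMA-read copy of the receiver's
 * message area; it never writes into remote memory at all. */
static int emitter_has_room(const struct msg_area *local_emitter,
			    const struct msg_area *receiver_remote,
			    unsigned int ring_size)
{
	struct msg_area snapshot;

	dma_read(&snapshot, receiver_remote);
	return local_emitter->head - snapshot.tail < ring_size;
}
```

A stale snapshot is harmless here: the emitter may underestimate the free
space and wait, but it can never overrun the receiver.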

> That works fine for lock-less and almost barrier-less NAPI poll(). In
> addition, you can add a mechanism to trigger interrupts (based on
> threshold, or a "I want an IRQ" bit somewhere or whatever), in which
> case the ISR needs to perform an MMIO read on the host end to flush
> store buffers, and then schedule a NAPI poll.

Right, except that we don't have an MMIO read here at all; it's always
a DMA transfer between the two memory domains.
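The "I want an IRQ" bit could live in that same receiver-written message
area: the receiver sets it when it is about to go idle, and the emitter
raises the doorbell only when it sees the bit set, staying silent while a
NAPI poll is running. A minimal sketch, with invented names and a counter
standing in for the actual doorbell mechanism:

```c
#include <assert.h>

/* Receiver-written message area, as seen by the emitter after a DMA read. */
struct rx_msg_area {
	unsigned int tail;
	unsigned int want_irq;	/* receiver asks for an IRQ on new data */
};

static int doorbell_count;

/* Stand-in for whatever DMA/doorbell write raises the remote interrupt. */
static void raise_doorbell(void)
{
	doorbell_count++;
}

/* Emitter: after publishing new packets, interrupt the peer only if it
 * asked for it; while the receiver polls, the flag stays clear. */
static void emitter_kick(const struct rx_msg_area *rx)
{
	if (rx->want_irq)
		raise_doorbell();
}
```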

	Arnd <><


