RFC on writel and writel_relaxed

Tue Mar 27 08:09:51 AEDT 2018

On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg at ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg at ziepe.ca> wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
> 
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance
though. It is too pessimistic. What the drivers usually want is a
unwound block of 4 or 8 8-byte copies. No function calls, no
branching. Everything is already known to be aligned.

Most of the drivers have a unwound loop with writeq() or something to
do it.

> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(),
Documentation/memory-barriers.txt gives an explicit example demanding
the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it
was to give program ordering to the accesses and that _relaxed allowed
re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only
difference between the two is spinlock containment, and WillD wrote
this text, so I belive it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM
as well..

Jason