RFC on writel and writel_relaxed

Tue Mar 27 08:30:00 AEDT 2018

On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe <jgg at ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg at ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg at ziepe.ca> wrote:
>> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> >> > > This is a super performance critical operation for most drivers and
>> >> >> > > directly impacts network performance.
>> >> >>
>> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> >> any barriers at all.
>> >> >> This might mean that they are always just the memory operation,
>> >> >> but it would make it more obvious what the driver was doing.
>> >> >
>> >> > I think that is what writel_relaxed is supposed to be.
>> >> >
>> >> > The only restriction it has is that the writes to a single device
>> >> > using UC memory must be kept in program order..
>> >>
>> >> Not sure about whether we have ever defined what happens to
>> >> writel_relaxed() on WC memory though: On ARM, we disallow
>> >> the compiler to combine writes, but the CPU still might.
>> >
>> > If the driver uses WC memory then I think it should not expect
>> > anything in terms of how writes map to TLPs other than nothing
>> > combines across mmiowb() and mmiowb() is fully globally ordered when
>> > enclosed in a spinlock.
>> >
>> > The entire point of using WC memory is usually to get combining :) If
>> > the driver doesn't want that then it should map UC..
>>
>> Usually, WC memory is used with memcpy_toio() though, which
>> by definition doesn't have any barriers between accesses, and
>> is required to get the correct byte ordering on writes to memory buffers.
>
> memcpy_toio is too expensive to actually use for anything performance
> though. It is too pessimistic. What the drivers usually want is a
> unwound block of 4 or 8 8-byte copies. No function calls, no
> branching. Everything is already known to be aligned.
>
> Most of the drivers have a unwound loop with writeq() or something to
> do it.

But isn't the writeq() barrier much more expensive than anything you'd
do in function calls?

>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>>
>> I was always puzzled by this: The intention of _relaxed() on ARM
>> (where it originates) was to skip the barrier that serializes DMA
>> with MMIO, not to skip the serialization between MMIO and locks.
>
> But that was never a requirement of writel(),
> Documentation/memory-barriers.txt gives an explicit example demanding
> the wmb() before writel() for ordering system memory against writel.

Indeed, but it's in an example for when to use dma_wmb(), not wmb().
Adding Alexander Duyck to Cc, he added that section as part of
1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
dma_wmb()"). Also adding the other people that were involved with that.

> I actually have no idea why ARM had that barrier, I always assumed it
> was to give program ordering to the accesses and that _relaxed allowed
> re-ordering (the usual meaning of relaxed)..
>
> But the barrier document makes it pretty clear that the only
> difference between the two is spinlock containment, and WillD wrote
> this text, so I belive it is accurate for ARM.
>
> Very confusing.

It does mention serialization with both DMA and locks in the
section about  readX_relaxed()/writeX_relaxed(). The part
about DMA is very clear here, and I must have just forgotten
the exact semantics with regards to spinlocks. I'm still not
sure what prevents a writel() from leaking out the end of a
spinlock section that doesn't happen with writel_relaxed(), since
the barrier in writel() comes before the access, and the
spin_unlock() shouldn't affect the external buses.

      Arnd