RFC on writel and writel_relaxed

Tue Mar 27 09:00:16 AEDT 2018

On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
> > Most of the drivers have an unwound loop with writeq() or something to
> > do it.
> 
> But isn't the writeq() barrier much more expensive than anything you'd
> do in function calls?

It is for us, and will break any write combining.

> > > > The same document says that _relaxed() does not give that guarantee.
> > > > 
> > > > The LWN article on this went into some depth on the interaction with
> > > > spinlocks.
> > > > 
> > > > As far as I can see, containment in a spinlock seems to be the only
> > > > difference between writel and writel_relaxed.
> > > 
> > > I was always puzzled by this: The intention of _relaxed() on ARM
> > > (where it originates) was to skip the barrier that serializes DMA
> > > with MMIO, not to skip the serialization between MMIO and locks.
> > 
> > But that was never a requirement of writel(),
> > Documentation/memory-barriers.txt gives an explicit example demanding
> > the wmb() before writel() for ordering system memory against writel.

This is a bug in the documentation.

> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.

Linus himself made it very clear years ago: readl and writel have to
be ordered against memory accesses.

> > I actually have no idea why ARM had that barrier, I always assumed it
> > was to give program ordering to the accesses and that _relaxed allowed
> > re-ordering (the usual meaning of relaxed)..
> > 
> > But the barrier document makes it pretty clear that the only
> > difference between the two is spinlock containment, and WillD wrote
> > this text, so I believe it is accurate for ARM.
> > 
> > Very confusing.
> 
> It does mention serialization with both DMA and locks in the
> section about readX_relaxed()/writeX_relaxed(). The part
> about DMA is very clear here, and I must have just forgotten
> the exact semantics with regard to spinlocks. I'm still not
> sure what prevents a writel() from leaking out the end of a
> spinlock section that doesn't happen with writel_relaxed(), since
> the barrier in writel() comes before the access, and the
> spin_unlock() shouldn't affect the external buses.

So...

Historically, what happened is that we (meaning whoever took part in
the discussion on the list, with Linus really calling the shots)
decided that there was no sane way for drivers to cope with a world
where readl/writel didn't fully order against memory accesses (i.e.,
DMA).

So it should always be correct to do:

	- Write to some in-memory buffer
	- writel() to kick the DMA read of that buffer

without any extra barrier.
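
As a rough userspace sketch of that pattern (every name below is a
made-up stand-in, not taken from any real driver; real code would use
writel() from <linux/io.h> on an ioremap()ed register, and the fence
here only models the ordering the architecture is expected to provide):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64

/* Hypothetical stand-ins so the sketch builds outside the kernel. */
static uint8_t dma_buf[BUF_SIZE];      /* pretend DMA-visible buffer */
static uint8_t device_view[BUF_SIZE];  /* what the pretend device read */
static volatile uint32_t doorbell;     /* pretend MMIO register */

/*
 * Mock of writel(): prior memory stores are made visible before the
 * register store. The memcpy() models the device's DMA read being
 * triggered by the doorbell write.
 */
static void mock_writel(uint32_t val, volatile uint32_t *reg)
{
	assert(val <= BUF_SIZE);
	atomic_thread_fence(memory_order_seq_cst); /* stand-in for the arch barrier */
	*reg = val;
	memcpy(device_view, dma_buf, val);
}

/* The driver-side pattern: fill the buffer, then kick, no wmb() in between. */
static int queue_and_kick(const void *data, uint32_t len)
{
	if (len > BUF_SIZE)
		return -1;
	memcpy(dma_buf, data, len);   /* write to the in-memory buffer */
	mock_writel(len, &doorbell);  /* kick the DMA read of that buffer */
	return 0;
}
```

The point is only that queue_and_kick() contains no explicit barrier;
whatever ordering is needed lives inside the writel() implementation.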

The spinlock situation, however, got murky. Mostly that came up because
one architecture (I forget which, it might have been ia64) had a hard
time providing that consistency without making writel insanely expensive.

Thus they created mmiowb(), whose main purpose was precisely to order
a writel against a following spin_unlock.
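
The driver-side usage that this implies looks roughly like the
following. This is a userspace model only: the lock, the register and
the mock_* helpers are hypothetical stand-ins, and the fence merely
marks where the real barrier would sit.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical device state, not from any real driver. */
struct mock_dev {
	atomic_flag lock;           /* stand-in for a spinlock_t */
	volatile uint32_t reg;      /* stand-in for an MMIO register */
	unsigned int mmiowb_calls;  /* bookkeeping for the sketch only */
};

static void mock_dev_init(struct mock_dev *dev)
{
	atomic_flag_clear(&dev->lock);
	dev->reg = 0;
	dev->mmiowb_calls = 0;
}

static void mock_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin */
}

static void mock_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Models mmiowb(): keep the MMIO store ahead of the unlock that follows. */
static void mock_mmiowb(struct mock_dev *dev)
{
	atomic_thread_fence(memory_order_seq_cst);
	dev->mmiowb_calls++;
}

static void ring_doorbell(struct mock_dev *dev, uint32_t val)
{
	mock_spin_lock(&dev->lock);
	dev->reg = val;               /* stands in for writel(val, reg) */
	mock_mmiowb(dev);             /* the call every driver has to remember */
	mock_spin_unlock(&dev->lock);
}
```

Every locked MMIO sequence in every driver needs that extra call, which
is what makes this approach hard to get right across the tree.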

I decided not to go down that path on power because getting all drivers
"fixed" to do the right thing was going to be a losing battle, and
instead added per-cpu tracking of writel in order to "escalate" to a
heavier barrier in spin_unlock itself when necessary.
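
In sketch form, the idea is something like this. It is a userspace
model with invented names, not the actual powerpc code: a thread-local
flag stands in for the per-cpu state, and a counter stands in for the
cost of the heavier barrier.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
static _Thread_local int io_sync;       /* "this CPU did MMIO" flag */
static unsigned int heavy_barriers;     /* counts escalated unlocks */
static volatile uint32_t mock_reg;      /* pretend MMIO register */
static atomic_flag mock_lock = ATOMIC_FLAG_INIT;

/* Mock writel(): do the store and remember that MMIO happened. */
static void mock_writel(uint32_t val, volatile uint32_t *reg)
{
	atomic_thread_fence(memory_order_seq_cst); /* order vs. prior memory stores */
	*reg = val;
	io_sync = 1;
}

static void mock_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin */
}

/* Mock spin_unlock(): escalate only if a writel happened since last time. */
static void mock_spin_unlock(atomic_flag *l)
{
	if (io_sync) {
		atomic_thread_fence(memory_order_seq_cst); /* the heavier barrier */
		heavy_barriers++;
		io_sync = 0;
	}
	atomic_flag_clear_explicit(l, memory_order_release);
}
```

Critical sections that never touch MMIO pay nothing extra, and drivers
do not have to be changed at all.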

Now, all this happened more than a decade ago and it's possible that
the understanding or expectations "shifted" over time...

Cheers,
Ben.