RFC on writel and writel_relaxed

Wed Mar 28 20:56:28 AEDT 2018

On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote:
> 
> For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
> add an mb() to make it work.
> 
> Do both of these work on power? 

Yes. There's even another quirk, see further down ;-)

> If so, I guess I can make readl even more
> expensive :/ Feels a bit like the tail wagging the dog, though.

Maybe, but then readl is always horribly slow anyway so you may not
necessarily be losing that much.

> Another thing I just realised is that we restrict the barriers we use in
> readl/writel on arm64 so that they don't necessary apply to both loads and
> stores. To be specific:
> 
>    writel is ordered against prior writes to memory, but not reads

That could be tricky... You may end up with something that reads before
triggering a DMA and ends up with the post-DMA value ... ugh.

>    readl is ordered against subsequent reads of memory, but not writes (but
>    note that in example (1) above, the control dependency ensures that).
> 
> If necessary, I could move the barrier in our readl implementation to be
> before the read, then play the control-dependency + instruction-sync (ISB)
> trick that you do on power.

Yeah so that other trick I'm talking about is also used for timing
accuracy.

For example, let's say I have a device with a reset bit and the spec
says the reset bit needs to be set for at least 10us.

This is wrong:

	writel(1, RESET_REG);
	usleep(10);
	writel(0, RESET_REG);

Because of write posting, the first write might arrive to the device
right before the second one.

The typical "fix" is to turn that into:

	writel(1, RESET_REG);
	readl(RESET_REG); /* Flush posted writes */
	usleep(10);
	writel(0, RESET_REG);

*However* the issue here, at least on power, is that the CPU can issue
that readl but doesn't necessarily wait for it to complete (ie, the
data to return), before proceeding to the usleep. Now a usleep contains
a bunch of loads and stores and is probably fine, but a udelay which
just loops on the timebase may not be.

Thus we may still violate the timing requirement.

What we did inside readl, with the twi;isync sequence (which basically
means, trap on return value with "trap never" as a condition, followed
by isync that ensures all excpetion conditions are resolved), is force
the CPU to "consume" the data from the read before moving on.

This effectively makes readl fully synchronous (we would probably avoid
that if we were to implement a readl_relaxed).

Cheers,
Ben.