MMIO and gcc re-ordering issue

Nick Piggin nickpiggin at yahoo.com.au
Wed Jun 11 15:00:30 EST 2008


On Wednesday 11 June 2008 14:18, Paul Mackerras wrote:
> Nick Piggin writes:
> > OK, I'm still not quite sure where this has ended up. I guess you are
> > happy with x86 semantics as they are now. That is, all IO accesses are
> > strongly ordered WRT one another and WRT cacheable memory (which includes
> > keeping them within spinlocks),
>
> My understanding was that on x86, loads could pass stores in general,
> i.e. a later load could be performed before an earlier store.

Yes, this is the one reordering allowed by the ISA on cacheable memory.
WC memory is weaker; Linus wants to allow an exception for it because
one must explicitly ask for it. UC memory (which presumably is what
we're talking about as "IO access") is, I think, stronger in that it
does not allow any reordering between UC accesses or against any
cacheable access:

AMD says this:

c — A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load 
(wp,wt,wb,uc,wc,wc+).
f — A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+).
g — A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,uc).
i — A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc).

AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which
Intel seemingly does not. But we're already treating WC as special.

I can't actually find the definitive statement in the Intel manuals
saying UC is strongly ordered also WRT WB. Linus?


> I guess 
> that can't be true for uncached loads, but could a cacheable load be
> performed before an earlier uncached store?
> > - as strong as x86. guaranteed not to break drivers that work on x86,
> >   but slower on some archs. To me, this is most pleasing. It is much
> >   much easier to notice something is going a little slower and to work
> >   out how to use weaker ordering there, than it is to debug some
> >   once-in-a-blue-moon breakage caused by just the right architecture,
> >   driver, etc. It totally frees up the driver writer from thinking
> >   about barriers, provided they get the locking right.
>
> I just wish we had even one actual example of things going wrong with
> the current rules we have on powerpc to motivate changing to this
> model.

~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
506
How sure are you that none of those forms part of a cobbled-together
locking scheme that hopes to constrain some IO access?

~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | grep while | wc -l
29
How about those?

~/usr/src/linux-2.6> git grep mutex_lock drivers/ | wc -l
3138
How sure are you that none of them is hoping to constrain IO operations
within the lock?

Also grep for down, down_write, write_lock, and maybe some others I've
forgotten. Then forget about locking completely and imagine some of the
open-coded things you see around the place (the parts where drivers
get creative and open-code their own locking, or try lockless
schemes).


> > Now that doesn't leave weaker ordering architectures lumped with "slow old
> > x86 semantics". Think of it as giving them the benefit of sharing x86
> > development and testing :) We can then formalise the relaxed __ accessors
> > to be more complete (ie. +/- byteswapping).
>
> That leaves a gulf between the extremely strongly ordered writel
> etc. and the extremely weakly ordered __writel etc.  The current
> powerpc scheme is fine for a lot of drivers but your proposal would
> leave us no way to deliver it to them.

But surely you have to audit the drivers anyway to ensure they are OK
with the current powerpc scheme. In which case, once you have audited
them and know they are safe, you can easily convert them to the even
_faster_ __readl/__writel, and just add the appropriate barriers.




More information about the Linuxppc-dev mailing list