Avoiding the dentry d_lock on final dput(), part deux: transactional memory

Tue Oct 1 10:56:15 EST 2013

On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey at neuling.org> wrote:
>
> The scary part is that we need to make all registers volatile.  You were not
> that keen on doing this as there are a lot of places in exception
> entry/exit where we only save/restore a subset of the registers.  We'd
> need to catch all these.

Ugh. It's very possible it's not worth using for the kernel then. The
example I posted is normally fine *without* any transactional support,
since it's a very local per-dentry lock, and since we only take that
lock when the last reference drops (so it's not some common directory
dentry, it's an end-point file dentry). In fact, on ARM they just made
the cmpxchg much faster by making it entirely non-serializing (since
it only updates a reference count, there is no locking involved apart
from checking that the lock state is unlocked).
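
As a rough illustration of the fast path described above, here is a minimal
sketch in portable C11. It is not the kernel's lockref code: the struct name,
the 32/32 bit layout, and the helper name are all assumptions made for the
example. The relaxed memory ordering only mirrors the "non-serializing" remark;
the sketch does not establish that this ordering is sufficient in a real kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout (an assumption for this sketch):
 * low 32 bits  = lock word, 0 means unlocked
 * high 32 bits = reference count */
#define LOCK_MASK UINT64_C(0xffffffff)
#define COUNT_ONE (UINT64_C(1) << 32)

struct lockref_sketch {
    _Atomic uint64_t lock_count;
};

/* Try to drop one reference without taking the lock.
 * Returns true on success. Returns false when the caller has to take
 * the lock instead: either the lock is held, or this would be the
 * final reference. */
static bool ref_put_fast(struct lockref_sketch *ref)
{
    uint64_t old = atomic_load_explicit(&ref->lock_count,
                                        memory_order_relaxed);

    while ((old & LOCK_MASK) == 0) {      /* lock observed unlocked */
        if ((old >> 32) <= 1)
            return false;                 /* final put: slow path */
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak_explicit(&ref->lock_count, &old,
                                                  old - COUNT_ONE,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            return true;
    }
    return false;                         /* lock is held: slow path */
}
```

The single cmpxchg covers both words, so the count is only changed if the lock
word was still zero at the moment of the update.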

So there is basically never any contention, and the transaction needs
to be pretty much the same cost as a "cmpxchg". It's not
clear if the Intel TSX is good enough for that, and if you have to
save a lot of registers in order to use transactions on POWER8, I
doubt it's worthwhile.

We have very few - if any - locks where contention or even cache
bouncing is common or normal. Sure, we have a few particular loads
that can trigger it, but even that is becoming rare. So from a
performance standpoint, the target always needs to be "comparable to
a hot spinlock in local cache".

>> They also have interesting ordering semantics vs. locks, we need to be
>> a tad careful (as long as we don't access a lock variable
>> transactionally we should be ok. If we do, then spin_unlock needs a
>> stronger barrier).
>
> Yep.

Well, just about any kernel transaction will at least read the state
of a lock. Without that, it's generally totally useless. My dput()
example sequence very much verified that the lock was not held, for
example.

I'm not sure how that affects anything. The actual transaction had
better not be visible inside the locked region (ie as far as any lock
users go, transactions better all happen fully before or after the
lock, if they read the lock and see it being unlocked).

That said, I cannot see how POWER8 could possibly violate that rule.
The whole "transactions are atomic" is kind of the whole and only
point of a transaction. So I'm not sure what odd lock restrictions
POWER8 could have.

> FWIW eg.
>
>      tbegin
>      beq abort /* passes first time through */
>      ....
>      transactional stuff
>      ....
>      tend
>      b pass
>
> abort:
>
> pass:

That's fine, and matches the x86 semantics fairly closely, except
"xbegin" kind of "contains" that "jump to abort address". But we could
definitely use the same models. Call it
"transaction_begin/abort/end()", and it should be architecture-neutral
naming-wise.
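
To make the naming concrete, here is a minimal sketch of what such
architecture-neutral hooks might look like from the caller's side. Everything
in it is an assumption: the three hooks are stubbed out so that
transaction_begin() always reports failure (a real backend would wrap xbegin
or tbegin there), and "struct counted" with its plain int lock word is only a
single-threaded stand-in for a dentry and its spinlock.

```c
#include <stdbool.h>

/* Hypothetical arch-neutral hooks. This stub backend has no hardware
 * support, so a transaction never starts and callers always take
 * their fallback path. */
static inline bool transaction_begin(void) { return false; }
static inline void transaction_abort(void) { }
static inline void transaction_end(void)   { }

struct counted {
    int locked;   /* stand-in for a spinlock word; 0 means unlocked */
    int count;
};

/* Returns 1 if the transactional path committed,
 * 0 if the locked fallback path ran. */
static int put_ref(struct counted *c)
{
    if (transaction_begin()) {
        if (!c->locked) {         /* the transaction reads the lock state */
            c->count--;
            transaction_end();
            return 1;
        }
        transaction_abort();      /* lock is held: give up, fall back */
    }

    /* Fallback: take the (stand-in) lock and do the update under it. */
    c->locked = 1;
    c->count--;
    c->locked = 0;
    return 0;
}
```

With the stub backend every call ends up in the fallback, which is the same
shape as the "beq abort" branch in the quoted POWER8 sequence.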

Of course, if tbegin then acts basically like some crazy
assembly-level setjmp (I'm guessing it does exactly that, and presumably
needs precisely that kind of compiler support - ie a function with
"__attribute__((returns_twice))" in gcc-speak), the overhead of doing it
may kill it.
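
The setjmp comparison can be sketched in plain C. This only illustrates the
"returns twice" control flow, not a transaction: setjmp/longjmp does not roll
back memory writes, and the names fake_abort() and try_body() are made up for
the example.

```c
#include <setjmp.h>

static jmp_buf checkpoint;

/* Hypothetical stand-in for a hardware abort: unwind to the checkpoint. */
static void fake_abort(void)
{
    longjmp(checkpoint, 1);
}

/* Returns 0 if the body ran to completion, 1 if it took the abort path. */
static int try_body(int conflict)
{
    if (setjmp(checkpoint) != 0)
        return 1;        /* second return: plays the role of "beq abort" */

    if (conflict)
        fake_abort();    /* control does not come back to this point */
    return 0;            /* plays the role of "tend; b pass" */
}
```

In standard C, locals that are modified after the setjmp call have to be
declared volatile to have a defined value on the second return, which is the
same kind of cost as the register save/restore concern quoted earlier.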

            Linus