[PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

Wed Feb 27 22:11:34 EST 2013

Hi Srivatsa,

I think there is some elegance in Lai's proposal of using a local
trylock for the reader uncontended case and global rwlock to deal with
the contended case without deadlocks. He apparently didn't realize
initially that nested read locks are common, and he seems to have
confused you because of that, but I think his proposal could be
changed easily to account for that and result in short, easily
understood code. What about the following:

- local_refcnt is a local lock count; it indicates how many recursive
locks are taken using the local lglock
- lglock is used by readers for local locking; it must be acquired
before local_refcnt becomes nonzero and released after local_refcnt
goes back to zero.
- fallback_rwlock is used by readers for global locking; it is
acquired when fallback_reader_refcnt is zero and the trylock fails on
lglock

+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+       preempt_disable();
+
+       if (__this_cpu_read(*lgrw->local_refcnt) ||
+           arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
+               __this_cpu_inc(*lgrw->local_refcnt);
+
rwlock_acquire_read(&lgrw->fallback_rwlock->lock_dep_map, 0, 0,
_RET_IP_);
+       } else {
+               read_lock(&lgrw->fallback_rwlock);
+       }
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+       if (likely(__this_cpu_read(*lgrw->local_refcnt))) {
+               rwlock_release(&lgrw->fallback_rwlock->lock_dep_map,
1, _RET_IP_);
+               if (!__this_cpu_dec_return(*lgrw->local_refcnt))
+                       arch_spin_unlock(this_cpu_ptr(lgrw->lglock->lock));
+       } else {
+               read_unlock(&lgrw->fallback_rwlock);
+       }
+
+       preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+       int i;
+
+       preempt_disable();
+
+       for_each_possible_cpu(i)
+               arch_spin_lock(per_cpu_ptr(lgrw->lglock->lock, i));
+       write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+       int i;
+
+       write_unlock(&lgrw->fallback_rwlock);
+       for_each_possible_cpu(i)
+               arch_spin_unlock(per_cpu_ptr(lgrw->lglock->lock, i));
+
+       preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);

This is to me relatively easier to understand than Srivatsa's
proposal. Now I'm not sure who wins efficiency wise, but I think it
should be relatively close as readers at least don't touch shared
state in the uncontended case (even with some recursion going on).

There is an interesting case where lg_rwlock_local_read_lock could be
interrupted after getting the local lglock but before incrementing
local_refcnt to 1; if that happens any nested readers within that
interrupt will have to take the global rwlock read side. I think this
is perfectly acceptable as this should not be a common case though
(and thus the global rwlock cache line probably wouldn't even bounce
between cpus then).

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.