[patch] spin-nicer-2.6.11-rc1-A0
Ingo Molnar
mingo at elte.hu
Sun Jan 16 01:25:37 EST 2005
* Paul Mackerras <paulus at samba.org> wrote:
> This patch fixes a problem I have been seeing since all the preempt
> changes went in, which is that ppc64 SMP systems would livelock
> randomly if preempt was enabled.
>
> It turns out that what was happening was that one cpu was spinning in
> spin_lock_irq (the version at line 215 of kernel/spinlock.c) madly
> doing preempt_enable() and preempt_disable() calls. The other cpu had
> the lock and was trying to set the TIF_NEED_RESCHED flag for the task
> running on the first cpu. That is an atomic operation which has to be
> retried if another cpu writes to the same cacheline between the load
> and the store, which the first cpu was doing every time it did
> preempt_enable() or preempt_disable().
ahh ... indeed. Nice catch.
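(to spell out the failure mode: the pre-patch contended slowpath, paraphrased
from the BUILD_LOCK_OPS code touched below - not the literal source - does
roughly this:)

	for (;;) {
		if (likely(_raw_spin_trylock(lock)))
			break;
		preempt_enable();	/* writes preempt_count in thread_info */
		if (!(lock)->break_lock)
			(lock)->break_lock = 1;
		cpu_relax();
		preempt_disable();	/* ... and writes it again, every iteration */
	}

preempt_count sits in the same cacheline as thread_info->flags, so these
stores keep invalidating the line under the other cpu's atomic
TIF_NEED_RESCHED update (a load-reserve/store-conditional sequence on ppc64),
which therefore never completes.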
> I decided to move the thread_info flags field into the next cache
> line, since it is the only field that would regularly be modified by
> cpus other than the one running the task that owns the thread_info.
> (OK, possibly the `cpu' field would be modified on a rebalance; I don't know the
> rebalancing code, but that should be pretty infrequent.) Thus, moving
> the flags field seems like a good idea generally as well as solving
> the immediate problem.
>
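makes sense. something like the below, i guess? (illustration only - the
field names and ordering are made up here, this is not your actual ppc64
patch:)

	struct thread_info {
		struct task_struct	*task;		/* used only by the owning cpu */
		int			cpu;
		int			preempt_count;	/* dirtied on every preempt_enable/disable */
		/* ... */
		/*
		 * flags is the one field other cpus write to (TIF_NEED_RESCHED),
		 * so push it onto its own cacheline:
		 */
		unsigned long		flags ____cacheline_aligned;
	};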
> For the record I am pretty unhappy with the code we use for spin_lock
> et al. with preemption turned on (the BUILD_LOCK_OPS stuff in
> spinlock.c). For a start we do the atomic op (_raw_spin_trylock) each
> time around the loop. That is going to be generating a lot of
> unnecessary bus (or fabric) traffic. Instead, after we fail to get
> the lock we should poll it with simple loads until we see that it is
> clear and then retry the atomic op. Assuming a reasonable cache
> design, the loads won't generate any bus traffic until another cpu
> writes to the cacheline containing the lock.
agreed. How about the patch below? (tested on x86)
> Secondly we have lost the __spin_yield call that we had on ppc64,
> which is an important optimization when we are running under the
> hypervisor. I can't just put that in cpu_relax because I need to know
> which (virtual) cpu is holding the lock, so that I can tell the
> hypervisor which virtual cpu to give my time slice to. That
> information is stored in the lock variable, which is why __spin_yield
> needs the address of the lock.
hm, how about calling __spin_yield() from _raw_spin_trylock(), if the
locking attempt was unsuccessful? This might be slightly incorrect if
the locking attempt is not connected to an actual spin-loop, but we do
have other spin-loops with open-coded trylocks that would benefit from
this optimization too.
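i.e. something like this on the ppc64 side (untested sketch - the
__arch_spin_trylock() helper name is made up, __spin_yield() is your existing
hypervisor-yield helper):

	static inline int _raw_spin_trylock(spinlock_t *lock)
	{
		if (likely(__arch_spin_trylock(lock)))
			return 1;
		/*
		 * failed: donate our timeslice to whichever virtual cpu
		 * holds the lock - __spin_yield() digs the holder out of
		 * the lock word, which is why it needs the lock address.
		 */
		__spin_yield(lock);
		return 0;
	}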
Ingo
Signed-off-by: Ingo Molnar <mingo at elte.hu>
--- linux/kernel/spinlock.c.orig
+++ linux/kernel/spinlock.c
@@ -173,7 +173,7 @@ EXPORT_SYMBOL(_write_lock);
  * (We do this in a function because inlining it would be excessive.)
  */
 
-#define BUILD_LOCK_OPS(op, locktype) \
+#define BUILD_LOCK_OPS(op, locktype, is_locked_fn) \
 void __lockfunc _##op##_lock(locktype *lock) \
 { \
 	preempt_disable(); \
@@ -183,7 +183,8 @@ void __lockfunc _##op##_lock(locktype *l
 		preempt_enable(); \
 		if (!(lock)->break_lock) \
 			(lock)->break_lock = 1; \
-		cpu_relax(); \
+		while (is_locked_fn(lock) && (lock)->break_lock) \
+			cpu_relax(); \
 		preempt_disable(); \
 	} \
 } \
@@ -204,6 +205,7 @@ unsigned long __lockfunc _##op##_lock_ir
 		preempt_enable(); \
 		if (!(lock)->break_lock) \
 			(lock)->break_lock = 1; \
-		cpu_relax(); \
+		while (is_locked_fn(lock) && (lock)->break_lock) \
+			cpu_relax(); \
 		preempt_disable(); \
 	} \
@@ -244,9 +246,9 @@ EXPORT_SYMBOL(_##op##_lock_bh)
  * _[spin|read|write]_lock_irqsave()
  * _[spin|read|write]_lock_bh()
  */
-BUILD_LOCK_OPS(spin, spinlock_t);
-BUILD_LOCK_OPS(read, rwlock_t);
-BUILD_LOCK_OPS(write, rwlock_t);
+BUILD_LOCK_OPS(spin, spinlock_t, spin_is_locked);
+BUILD_LOCK_OPS(read, rwlock_t, rwlock_is_locked);
+BUILD_LOCK_OPS(write, rwlock_t, spin_is_locked);
 
 #endif /* CONFIG_PREEMPT */
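(for clarity: with the patch applied, the generated _spin_lock() expands to
roughly the following - the irq/irqsave/bh variants only differ in the
interrupt handling:)

	void __lockfunc _spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		for (;;) {
			if (likely(_raw_spin_trylock(lock)))
				break;
			preempt_enable();
			if (!(lock)->break_lock)
				(lock)->break_lock = 1;
			/*
			 * spin with plain loads - no bus/fabric traffic until
			 * the owner writes the lock cacheline or someone asks
			 * us to break out:
			 */
			while (spin_is_locked(lock) && (lock)->break_lock)
				cpu_relax();
			preempt_disable();
		}
	}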