[patch] spin-nicer-2.6.11-rc1-A0
Ingo Molnar
mingo at elte.hu
Sun Jan 16 01:25:37 EST 2005
* Paul Mackerras <paulus at samba.org> wrote:
> This patch fixes a problem I have been seeing since all the preempt
> changes went in, which is that ppc64 SMP systems would livelock
> randomly if preempt was enabled.
>
> It turns out that what was happening was that one cpu was spinning in
> spin_lock_irq (the version at line 215 of kernel/spinlock.c) madly
> doing preempt_enable() and preempt_disable() calls. The other cpu had
> the lock and was trying to set the TIF_NEED_RESCHED flag for the task
> running on the first cpu. That is an atomic operation which has to be
> retried if another cpu writes to the same cacheline between the load
> and the store, which the first cpu was doing every time it did
> preempt_enable() or preempt_disable().
ahh ... indeed. Nice catch.
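(to spell out the failure mode: the pre-patch contended slowpath, paraphrased
from the BUILD_LOCK_OPS code touched below - not the literal source - does
roughly this:)

	for (;;) {
		if (likely(_raw_spin_trylock(lock)))
			break;
		preempt_enable();	/* writes preempt_count in thread_info */
		if (!(lock)->break_lock)
			(lock)->break_lock = 1;
		cpu_relax();
		preempt_disable();	/* ... and writes it again, every iteration */
	}

preempt_count sits in the same cacheline as thread_info->flags, so these
stores keep invalidating the line under the other cpu's atomic
TIF_NEED_RESCHED update (a load-reserve/store-conditional sequence on ppc64),
which therefore never completes.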
> I decided to move the thread_info flags field into the next cache
> line, since it is the only field that would regularly be modified by
> cpus other than the one running the task that owns the thread_info.
> (OK, possibly the `cpu' field would be modified on a rebalance; I don't know the
> rebalancing code, but that should be pretty infrequent.) Thus, moving
> the flags field seems like a good idea generally as well as solving
> the immediate problem.
>
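makes sense. something like the below, i guess? (illustration only - the
field names and ordering are made up here, this is not your actual ppc64
patch:)

	struct thread_info {
		struct task_struct	*task;		/* used only by the owning cpu */
		int			cpu;
		int			preempt_count;	/* dirtied on every preempt_enable/disable */
		/* ... */
		/*
		 * flags is the one field other cpus write to (TIF_NEED_RESCHED),
		 * so push it onto its own cacheline:
		 */
		unsigned long		flags ____cacheline_aligned;
	};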
> For the record I am pretty unhappy with the code we use for spin_lock
> et al. with preemption turned on (the BUILD_LOCK_OPS stuff in
> spinlock.c). For a start we do the atomic op (_raw_spin_trylock) each
> time around the loop. That is going to be generating a lot of
> unnecessary bus (or fabric) traffic. Instead, after we fail to get
> the lock we should poll it with simple loads until we see that it is
> clear and then retry the atomic op. Assuming a reasonable cache
> design, the loads won't generate any bus traffic until another cpu
> writes to the cacheline containing the lock.
agreed. How about the patch below? (tested on x86)
> Secondly we have lost the __spin_yield call that we had on ppc64,
> which is an important optimization when we are running under the
> hypervisor. I can't just put that in cpu_relax because I need to know
> which (virtual) cpu is holding the lock, so that I can tell the
> hypervisor which virtual cpu to give my time slice to. That
> information is stored in the lock variable, which is why __spin_yield
> needs the address of the lock.
hm, how about calling __spin_yield() from _raw_spin_trylock(), if the
locking attempt was unsuccessful? This might be slightly incorrect if
the locking attempt is not connected to an actual spin-loop, but we do
have other spin-loops with open-coded trylocks that would benefit from
this optimization too.
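i.e. something like this on the ppc64 side (untested sketch - the
__arch_spin_trylock() helper name is made up, __spin_yield() is your existing
hypervisor-yield helper):

	static inline int _raw_spin_trylock(spinlock_t *lock)
	{
		if (likely(__arch_spin_trylock(lock)))
			return 1;
		/*
		 * failed: donate our timeslice to whichever virtual cpu
		 * holds the lock - __spin_yield() digs the holder out of
		 * the lock word, which is why it needs the lock address.
		 */
		__spin_yield(lock);
		return 0;
	}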
Ingo
Signed-off-by: Ingo Molnar <mingo at elte.hu>
--- linux/kernel/spinlock.c.orig
+++ linux/kernel/spinlock.c
@@ -173,7 +173,7 @@ EXPORT_SYMBOL(_write_lock);
  * (We do this in a function because inlining it would be excessive.)
  */
 
-#define BUILD_LOCK_OPS(op, locktype) \
+#define BUILD_LOCK_OPS(op, locktype, is_locked_fn) \
 void __lockfunc _##op##_lock(locktype *lock) \
 { \
 	preempt_disable(); \
@@ -183,7 +183,8 @@ void __lockfunc _##op##_lock(locktype *l
 		preempt_enable(); \
 		if (!(lock)->break_lock) \
 			(lock)->break_lock = 1; \
-		cpu_relax(); \
+		while (is_locked_fn(lock) && (lock)->break_lock) \
+			cpu_relax(); \
 		preempt_disable(); \
 	} \
 } \
@@ -204,6 +205,7 @@ unsigned long __lockfunc _##op##_lock_ir
 		preempt_enable(); \
 		if (!(lock)->break_lock) \
 			(lock)->break_lock = 1; \
-		cpu_relax(); \
+		while (is_locked_fn(lock) && (lock)->break_lock) \
+			cpu_relax(); \
 		preempt_disable(); \
 	} \
@@ -244,9 +246,9 @@ EXPORT_SYMBOL(_##op##_lock_bh)
  * _[spin|read|write]_lock_irqsave()
  * _[spin|read|write]_lock_bh()
  */
-BUILD_LOCK_OPS(spin, spinlock_t);
-BUILD_LOCK_OPS(read, rwlock_t);
-BUILD_LOCK_OPS(write, rwlock_t);
+BUILD_LOCK_OPS(spin, spinlock_t, spin_is_locked);
+BUILD_LOCK_OPS(read, rwlock_t, rwlock_is_locked);
+BUILD_LOCK_OPS(write, rwlock_t, spin_is_locked);
 
 #endif /* CONFIG_PREEMPT */
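(for clarity: with the patch applied, the generated _spin_lock() expands to
roughly the following - the irq/irqsave/bh variants only differ in the
interrupt handling:)

	void __lockfunc _spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		for (;;) {
			if (likely(_raw_spin_trylock(lock)))
				break;
			preempt_enable();
			if (!(lock)->break_lock)
				(lock)->break_lock = 1;
			/*
			 * spin with plain loads - no bus/fabric traffic until
			 * the owner writes the lock cacheline or someone asks
			 * us to break out:
			 */
			while (spin_is_locked(lock) && (lock)->break_lock)
				cpu_relax();
			preempt_disable();
		}
	}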