[Cbe-oss-dev] [RFC] [PATCH 6:8] SPU Gang Scheduling - base functionality for gang scheduling

Tue Mar 4 06:50:06 EST 2008

Implement spu gang scheduling.

This patch provides the base support for gang scheudling, including spu 
mgmt, runqueue management, placement, activation, deactivation, time slicing, 
yield, and preemption.  Basically, all of the core scheduling capabilities.
 
All spu contexts belong to a gang.  For standalone spu contexts, an internal
gang structure is allocated to present a uniform data abstraction, so that
the gang can be queued on the runqueue.  The priority of the gang dictates
its position on the runqueue.  All gang's have a single priority, policy, 
and NUMA attachment which is inherited from creator of the spu context. These 
values do not currently change, although there is nothing to prohibit such 
support to be added in the future.

All contexts within a gang are scheduled and unscheduled at the same time. 
spu_schedule and spu_unschedule have been changed to invoke spu_bind_context
and spu_unbind_context in a loop.  The former is more complicated in that it
must allocate enough spus in a secure manner so that it can get successfully
through its critical section without running out of spus.  For this reason,
SPUs are preallocated.  A reserved spu has the following state. 

(spu->alloc_state == SPU_FREE and spu->gang != <gang>) 

Timeslicing follows a two step algorithm. 1) all running contexts are
timesliced. The tick counter is implemented at the context level to
simplify the logic as they are all set and decremented at the same time.
When a count goes to zero, the gang is unscheduled.  This frees up space
as much space as posible before the scheduler tries to place a job that
is queued on the runqueue. This is a critical as the size of the job 
waiting to run is not known apriori. 2) Sequentially place as many gangs 
as possible. Skip over gangs as necessary across all run levels. This is
consistent with spu_yield which unloads the spu across user mode calls.

A simple hueristic has been implemented to prevent too many contexts switches
in step 1.  A limit is based on the number of runnable contexts that are 
available on the runqueue.  If the count is less than the number of physical
spus, some spus may not be time sliced.  This is not guaranteed as they
may be part of a gang that is time sliced.  A simple one pass scan is used.

A new gang nstarted counter has been added to the gang structure to create
a synchronization point for gang start.  The counter is incremented, when
a context calls spu_run().  When all of the contexts have been started, 
the gang is considered runnable. 

The start synchronization point is implemented by passing the first
N-1 contexts directly through spu_run() to the spufs_wait() as before
where they wait on a private spe event word.  As before, they update
their csa area instead of hardware registers. The Nth thread through
spu_run() either places the gang or puts it on the runqueue.

Nearly all of the spu_run() critical section is the same.  It is context
based and runs almost entirely under the context lock.  The gang lock
is only taken when the context is in the SPU_SCHED_STATE, signifying that
the context needs to be activated.  This is an important optimization
that avoids lock contention in the controlling thread.

A gang nrunnable count has been implemented that is incremented and
decremented on entry and exit of spu_run respectively. This count is 
intended to provide a measure of whether all of the contexts in the
gang are executing user mode code.  In this case, all of the spus in
the gang are stopped and this is a good point to preempt the gang.  This
is implemented by spu_yield() which triggers a call to spu_deactivate
which unloads the gang.  Importantly, in this case, the gang is not added
to the runqueue as the contexts are stopped.  This is designed to prevent
the pollution of the runqueue with stopped jobs that could only be 
lazily loaded.  In this case, it is safe to not queue it as the 
application is expected to re-drive the context via spu_run.  

Finally, this means that a gang is eligible to be run as long as 
one context in the gang is runnable.  Major page faulting is the other 
event that may cause a gang to be preempted.  It is implemented via a
nfaulting count and a call to yield.  In this case, it is put on the
runqueue as the context is in kernel mode.  It is sort of a step down
scheduling technique to give something else a chance to run.

Signed-off-by: Luke Browning <lukebrowning at us.ibm.com>

---

Index: spufs/include/asm-powerpc/spu.h
===================================================================

--- spufs.orig/include/asm-powerpc/spu.h
+++ spufs/include/asm-powerpc/spu.h
@@ -136,6 +136,7 @@ struct spu {
 	unsigned int slb_replace;
 	struct mm_struct *mm;
 	struct spu_context *ctx;
+	struct spu_gang *gang;
 	struct spu_runqueue *rq;
 	unsigned long long timestamp;
 	pid_t pid;
Index: spufs/arch/powerpc/platforms/cell/spufs/spufs.h
===================================================================
--- spufs.orig/arch/powerpc/platforms/cell/spufs/spufs.h
+++ spufs/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -40,10 +40,12 @@ enum {
 struct spu_context_ops;
 struct spu_gang;
 
-/* ctx->sched_flags */
+/* ctx->sched_flags and gang->sched_flags */
 enum {
 	SPU_SCHED_NOTIFY_ACTIVE,
 	SPU_SCHED_WAS_ACTIVE,	/* was active upon spu_acquire_saved()  */
+	SPU_SCHED_JUST_CREATED,
+	SPU_SCHED_DEACTIVATED,
 };
 
 struct spu_context {
@@ -90,7 +92,6 @@ struct spu_context {
 	pid_t tid;
 
 	/* scheduler fields */
-	struct list_head rq;
 	unsigned int time_slice;
 	unsigned long sched_flags;
 	cpumask_t cpus_allowed;
@@ -120,6 +121,9 @@ struct spu_context {
 	int aff_offset;
 };
 
+#define SPU_GANG_RUNNABLE(g) \
+	(g->contexts && (atomic_read(&g->nstarted) == g->contexts))
+
 struct spu_gang {
 	struct list_head list;
 	struct mutex mutex;
@@ -130,6 +134,10 @@ struct spu_gang {
 	cpumask_t cpus_allowed;
 	int policy;
 	int prio;
+	atomic_t nstarted;
+	atomic_t nrunnable;
+	unsigned long sched_flags;
+	struct list_head rq;
 
 	struct spu_context *aff_ref_ctx;
 	struct list_head aff_list_head;
@@ -137,6 +145,24 @@ struct spu_gang {
 	int aff_flags;
 	struct spu *aff_ref_spu;
 	atomic_t aff_sched_count;
+
+	/* statistics */
+	struct {
+		/* updates protected by ctx->state_mutex */
+		enum spu_utilization_state util_state;
+		unsigned long long tstamp;	/* time of last state switch */
+		unsigned long long times[SPU_UTIL_MAX];
+		unsigned long long vol_ctx_switch;
+		unsigned long long invol_ctx_switch;
+		unsigned long long min_flt;
+		unsigned long long maj_flt;
+		unsigned long long hash_flt;
+		unsigned long long slb_flt;
+		unsigned long long slb_flt_base; /* # at last ctx switch */
+		unsigned long long class2_intr;
+		unsigned long long class2_intr_base; /* # at last ctx switch */
+		unsigned long long libassist;
+	} stats;
 };
 
 /* Flag bits for spu_gang aff_flags */
@@ -255,9 +281,8 @@ int __must_check spu_acquire_saved(struc
 void spu_release_saved(struct spu_context *ctx);
 
 int spu_stopped(struct spu_context *ctx, u32 * stat);
-void spu_del_from_rq(struct spu_context *ctx);
 int spu_activate(struct spu_context *ctx, unsigned long flags);
-void spu_deactivate(struct spu_context *ctx);
+void spu_deactivate(struct spu_gang *gang);
 void spu_yield(struct spu_context *ctx);
 void spu_switch_notify(struct spu *spu, struct spu_context *ctx);
 void spu_set_timeslice(struct spu_context *ctx);
@@ -333,6 +358,12 @@ extern void spu_free_lscsa(struct spu_st
 extern void spuctx_switch_state(struct spu_context *ctx,
 		enum spu_utilization_state new_state);
 
+#define spu_sched_stats(name, maxcontexts, active, preempted) \
+	trace_mark(name, "%d %d %d", maxcontexts, active, preempted);
+#define spu_gang_trace(name, gang, cnt) \
+	trace_mark(name, "%p %d", gang, cnt);
+#define spu_gang_nospu_trace(name, gang, cnt, found) \
+	trace_mark(name, "%p %d %d", gang, cnt, found);
 #define spu_context_trace(name, ctx, spu) \
 	trace_mark(name, "%p %p", ctx, spu);
 #define spu_context_nospu_trace(name, ctx) \
Index: spufs/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- spufs.orig/arch/powerpc/platforms/cell/spufs/sched.c
+++ spufs/arch/powerpc/platforms/cell/spufs/sched.c
@@ -48,11 +48,23 @@
 #include <asm/spu_priv1.h>
 #include "spufs.h"
 
+struct spu_sched_stats {
+	int inode;
+	int inode_icnt;
+	int inode_pcnt;
+	int pnode;
+	int pnode_icnt;
+	int pnode_pcnt;
+	int total_icnt;
+	int total_pcnt;
+};
+
 struct spu_prio_array {
 	DECLARE_BITMAP(bitmap, MAX_PRIO);
 	struct list_head runq[MAX_PRIO];
 	spinlock_t runq_lock;
 	int nr_waiting;
+	int nr_contexts;
 };
 
 static unsigned long spu_avenrun[3];
@@ -60,6 +72,7 @@ static struct spu_prio_array *spu_prio;
 static struct task_struct *spusched_task;
 static struct timer_list spusched_timer;
 static struct timer_list spuloadavg_timer;
+static void spu_unschedule(struct spu_gang *gang);
 
 /*
  * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
@@ -85,6 +98,11 @@ static struct timer_list spuloadavg_time
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 
+#define SPU_RESERVE		0
+#define SPU_RESERVE_UNDO	1
+
+#define SPU_NOPREEMPT		-2
+
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
  * [800ms ... 100ms ... 5ms]
@@ -167,25 +185,25 @@ void spu_update_sched_info(struct spu_co
 	}
 }
 
-static int __node_allowed(struct spu_context *ctx, int node)
+static int __node_allowed(struct spu_gang *gang, int node)
 {
 	if (nr_cpus_node(node)) {
 		cpumask_t mask = node_to_cpumask(node);
 
-		if (cpus_intersects(mask, ctx->cpus_allowed))
+		if (cpus_intersects(mask, gang->cpus_allowed))
 			return 1;
 	}
 
 	return 0;
 }
 
-static int node_allowed(struct spu_context *ctx, int node)
+static int node_allowed(struct spu_gang *gang, int node)
 {
 	int rval;
 
-	spin_lock(&spu_prio->runq_lock);
-	rval = __node_allowed(ctx, node);
-	spin_unlock(&spu_prio->runq_lock);
+	spin_lock(&cbe_spu_info[node].list_lock);
+	rval = __node_allowed(gang, node);
+	spin_unlock(&cbe_spu_info[node].list_lock);
 
 	return rval;
 }
@@ -312,7 +330,7 @@ static struct spu *aff_ref_location(stru
 	node = cpu_to_node(raw_smp_processor_id());
 	for (n = 0; n < MAX_NUMNODES; n++, node++) {
 		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+		if (!node_allowed(ctx->gang, node))
 			continue;
 		spin_lock(&cbe_spu_info[node].list_lock);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
@@ -381,13 +399,14 @@ static struct spu *ctx_location(struct s
  * affinity_check is called each time a context is going to be scheduled.
  * It returns the spu ptr on which the context must run.
  */
-static int has_affinity(struct spu_context *ctx)
+static int has_affinity(struct spu_gang *gang)
 {
-	struct spu_gang *gang = ctx->gang;
-
-	if (list_empty(&ctx->aff_list))
+	if (list_empty(&gang->aff_list_head))
 		return 0;
 
+	/*
+	 * TODO: fix SPU Affinity to work with gang scheduling.
+	 */
 	if (!gang->aff_ref_spu) {
 		if (!(gang->aff_flags & AFF_MERGED))
 			aff_merge_remaining_ctxs(gang);
@@ -413,15 +432,6 @@ static void spu_unbind_context(struct sp
  	if (spu->ctx->flags & SPU_CREATE_NOSCHED)
 		atomic_dec(&cbe_spu_info[spu->node].reserved_spus);
 
-	if (ctx->gang){
-		mutex_lock(&ctx->gang->aff_mutex);
-		if (has_affinity(ctx)) {
-			if (atomic_dec_and_test(&ctx->gang->aff_sched_count))
-				ctx->gang->aff_ref_spu = NULL;
-		}
-		mutex_unlock(&ctx->gang->aff_mutex);
-	}
-
 	spu_switch_notify(spu, NULL);
 	spu_unmap_mappings(ctx);
 	spu_save(&ctx->csa, spu);
@@ -448,283 +458,462 @@ static void spu_unbind_context(struct sp
 }
 
 /**
- * spu_add_to_rq - add a context to the runqueue
- * @ctx:       context to add
+ * spu_add_to_rq - add a gang to the runqueue
+ * @gang:       gang to add
  */
-static void __spu_add_to_rq(struct spu_context *ctx)
+static void __spu_add_to_rq(struct spu_gang *gang)
 {
-	/*
-	 * Unfortunately this code path can be called from multiple threads
-	 * on behalf of a single context due to the way the problem state
-	 * mmap support works.
-	 *
-	 * Fortunately we need to wake up all these threads at the same time
-	 * and can simply skip the runqueue addition for every but the first
-	 * thread getting into this codepath.
-	 *
-	 * It's still quite hacky, and long-term we should proxy all other
-	 * threads through the owner thread so that spu_run is in control
-	 * of all the scheduling activity for a given context.
-	 */
-	if (list_empty(&ctx->rq)) {
-		list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]);
-		set_bit(ctx->prio, spu_prio->bitmap);
+	if (atomic_read(&gang->nrunnable) && list_empty(&gang->rq)) {
+		int prio = gang->prio;
+
+		list_add_tail(&gang->rq, &spu_prio->runq[prio]);
+		set_bit(prio, spu_prio->bitmap);
+		spu_prio->nr_contexts += gang->contexts;
 		if (!spu_prio->nr_waiting++)
 			__mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
 	}
 }
 
-static void spu_add_to_rq(struct spu_context *ctx)
+static void spu_add_to_rq(struct spu_gang *gang)
 {
 	spin_lock(&spu_prio->runq_lock);
-	__spu_add_to_rq(ctx);
+	__spu_add_to_rq(gang);
 	spin_unlock(&spu_prio->runq_lock);
 }
 
-static void __spu_del_from_rq(struct spu_context *ctx)
+static void __spu_del_from_rq(struct spu_gang *gang)
 {
-	int prio = ctx->prio;
+	if (!list_empty(&gang->rq)) {
+		int prio = gang->prio;
 
-	if (!list_empty(&ctx->rq)) {
+		spu_prio->nr_contexts -= gang->contexts;
 		if (!--spu_prio->nr_waiting)
 			del_timer(&spusched_timer);
-		list_del_init(&ctx->rq);
-
+		list_del_init(&gang->rq);
 		if (list_empty(&spu_prio->runq[prio]))
 			clear_bit(prio, spu_prio->bitmap);
 	}
 }
 
-void spu_del_from_rq(struct spu_context *ctx)
+static void spu_del_from_rq(struct spu_gang *gang)
 {
 	spin_lock(&spu_prio->runq_lock);
-	__spu_del_from_rq(ctx);
+	__spu_del_from_rq(gang);
 	spin_unlock(&spu_prio->runq_lock);
 }
 
-static void spu_prio_wait(struct spu_context *ctx)
+static int spu_get_idle(struct spu_gang *gang, int node)
 {
-	DEFINE_WAIT(wait);
+	struct spu *spu;
+	int n, lnode, count, mode;
+	int ret = 0, found = 0;
 
-	/*
-	 * The caller must explicitly wait for a context to be loaded
-	 * if the nosched flag is set.  If NOSCHED is not set, the caller
-	 * queues the context and waits for an spu event or error.
-	 */
-	BUG_ON(!(ctx->flags & SPU_CREATE_NOSCHED));
+	spu_context_nospu_trace(spu_get_idle__enter, gang);
 
-	spin_lock(&spu_prio->runq_lock);
-	prepare_to_wait_exclusive(&ctx->stop_wq, &wait, TASK_INTERRUPTIBLE);
-	if (!signal_pending(current)) {
-		__spu_add_to_rq(ctx);
-		spin_unlock(&spu_prio->runq_lock);
-		mutex_unlock(&ctx->state_mutex);
-		schedule();
-		mutex_lock(&ctx->state_mutex);
-		spin_lock(&spu_prio->runq_lock);
-		__spu_del_from_rq(ctx);
-	}
-	spin_unlock(&spu_prio->runq_lock);
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&ctx->stop_wq, &wait);
-}
+	/* TO DO: SPU affinity scheduling. */
 
-static struct spu *spu_get_idle(struct spu_context *ctx)
-{
-	struct spu *spu, *aff_ref_spu;
-	int node, n;
+	mode = SPU_RESERVE;
 
-	spu_context_nospu_trace(spu_get_idle__enter, ctx);
+spu_get_idle_top:
 
-	if (ctx->gang) {
-		mutex_lock(&ctx->gang->aff_mutex);
-		if (has_affinity(ctx)) {
-			aff_ref_spu = ctx->gang->aff_ref_spu;
-			atomic_inc(&ctx->gang->aff_sched_count);
-			mutex_unlock(&ctx->gang->aff_mutex);
-			node = aff_ref_spu->node;
-
-			spin_lock(&cbe_spu_info[node].list_lock);
-			spu = ctx_location(aff_ref_spu, ctx->aff_offset, node);
-			if (spu && spu->alloc_state == SPU_FREE)
-				goto found;
-			spin_unlock(&cbe_spu_info[node].list_lock);
+	if (node == -1)
+		lnode = cpu_to_node(raw_smp_processor_id());
+	else
+		lnode = node;
 
-			mutex_lock(&ctx->gang->aff_mutex);
-			if (atomic_dec_and_test(&ctx->gang->aff_sched_count))
-				ctx->gang->aff_ref_spu = NULL;
-			mutex_unlock(&ctx->gang->aff_mutex);
-			goto not_found;
-		}
-		mutex_unlock(&ctx->gang->aff_mutex);
-	}
-	node = cpu_to_node(raw_smp_processor_id());
-	for (n = 0; n < MAX_NUMNODES; n++, node++) {
-		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+	for (n = 0, count = 0;
+	     n < MAX_NUMNODES && count < gang->contexts;
+	     n++, lnode++) {
+
+		lnode = (lnode < MAX_NUMNODES) ? lnode : 0;
+		if (!node_allowed(gang, lnode))
 			continue;
 
-		spin_lock(&cbe_spu_info[node].list_lock);
-		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
-			if (spu->alloc_state == SPU_FREE)
-				goto found;
+		spin_lock(&cbe_spu_info[lnode].list_lock);
+		list_for_each_entry(spu, &cbe_spu_info[lnode].spus, cbe_list) {
+			switch (mode) {
+			case SPU_RESERVE :
+				if ((spu->alloc_state == SPU_FREE) &&
+				    (spu->gang == NULL)) {
+					spu_init_channels(spu);
+					spu->gang = gang;
+					if (++count == gang->contexts)
+						goto spu_get_idle_found;
+				}
+				break;
+			case SPU_RESERVE_UNDO :
+				if ((spu->alloc_state == SPU_FREE) &&
+				    (spu->gang == gang)) {
+					spu->gang = NULL;
+					if (++count == found)
+						goto spu_get_idle_found;
+				}
+				break;
+			}
 		}
-		spin_unlock(&cbe_spu_info[node].list_lock);
+		spin_unlock(&cbe_spu_info[lnode].list_lock);
 	}
+	BUG_ON(mode != SPU_RESERVE);
 
- not_found:
-	spu_context_nospu_trace(spu_get_idle__not_found, ctx);
-	return NULL;
+	found = count;
+	ret = -1;
+	mode = SPU_RESERVE_UNDO;
 
- found:
-	spu->alloc_state = SPU_USED;
-	spin_unlock(&cbe_spu_info[node].list_lock);
-	spu_context_trace(spu_get_idle__found, ctx, spu);
-	spu_init_channels(spu);
-	return spu;
+	if (found)
+		goto spu_get_idle_top;
+	else
+		goto spu_get_idle_out;
+
+spu_get_idle_found:
+
+	found = count;
+
+	spin_unlock(&cbe_spu_info[lnode].list_lock);
+
+spu_get_idle_out:
+
+	if (ret)
+		spu_gang_nospu_trace(spu_get_idle__not_found,
+					gang, gang->contexts, found)
+	else
+		spu_gang_trace(spu_get_idle__found, gang, gang->contexts)
+
+	return ret;
 }
 
 /**
- * find_victim - find a lower priority context to preempt
- * @ctx:	canidate context for running
+ * find_victim - find a lower priority gang to preempt
+ * @gang:	canidate gang for running
  *
- * Returns the freed physical spu to run the new context on.
+ * Returns 0 for success (preemption occurred), -1 for failure
  */
-static struct spu *find_victim(struct spu_context *ctx)
+static int find_victim(struct spu_gang *gang, int node)
 {
-	struct spu_context *victim = NULL;
+	struct spu_gang *victim;
+	struct spu_context *ctx;
 	struct spu *spu;
-	int node, n;
+	int n, retry = 3;
 
-	spu_context_nospu_trace(spu_find_vitim__enter, ctx);
+	spu_context_nospu_trace(spu_find_victim__enter, gang);
+
+	if (node == -1)
+		node = cpu_to_node(raw_smp_processor_id());
 
-	/*
-	 * Look for a possible preemption candidate on the local node first.
-	 * If there is no candidate look at the other nodes.  This isn't
-	 * exactly fair, but so far the whole spu scheduler tries to keep
-	 * a strong node affinity.  We might want to fine-tune this in
-	 * the future.
-	 */
- restart:
-	node = cpu_to_node(raw_smp_processor_id());
 	for (n = 0; n < MAX_NUMNODES; n++, node++) {
+
 		node = (node < MAX_NUMNODES) ? node : 0;
-		if (!node_allowed(ctx, node))
+		if (!node_allowed(gang, node))
 			continue;
+restart_node:
+		/*
+		 * Retry to avoid algorithm deadlock.  The act of
+		 * unscheduling takes place before scheduling, so we can't
+		 * spin indefinitely in the unschedule (ie. find_victim)
+		 * waiting for a priority change to occur.
+		 */
+		if (!--retry)
+			continue;
+
+		victim = NULL;
 
 		spin_lock(&cbe_spu_info[node].list_lock);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
-			struct spu_context *tmp = spu->ctx;
+			struct spu_gang *tmp = spu->gang;
+			struct spu_context *tmpc = spu->ctx;
 
-			if (tmp && tmp->prio > ctx->prio &&
-			    !(tmp->flags & SPU_CREATE_NOSCHED) &&
+			if ((spu->alloc_state == SPU_USED) &&
+			    (tmp != NULL) &&
+			    (tmp->prio > gang->prio) &&
+			   !(tmpc->flags & SPU_CREATE_NOSCHED) &&
 			    (!victim || tmp->prio > victim->prio))
-				victim = spu->ctx;
+				victim = tmp;
 		}
 		spin_unlock(&cbe_spu_info[node].list_lock);
 
 		if (victim) {
 			/*
-			 * This nests ctx->state_mutex, but we always lock
+			 * This nests gang->mutex, but we always lock
 			 * higher priority contexts before lower priority
 			 * ones, so this is safe until we introduce
 			 * priority inheritance schemes.
 			 *
-			 * XXX if the highest priority context is locked,
-			 * this can loop a long time.  Might be better to
-			 * look at another context or give up after X retries.
+			 * XXX If victim no longer exists, we are referencing
+			 * stray kernel memory and changing it with the mutex
+			 * lock and subsequent actions.
 			 */
-			if (!mutex_trylock(&victim->state_mutex)) {
-				victim = NULL;
-				goto restart;
-			}
+			if (test_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags))
+				goto restart_node;
+
+			if (!mutex_trylock(&victim->mutex))
+				goto restart_node;
 
-			spu = victim->spu;
-			if (!spu || victim->prio <= ctx->prio) {
+			if (!SPU_GANG_RUNNABLE(victim)) {
 				/*
 				 * This race can happen because we've dropped
 				 * the active list mutex.  Not a problem, just
 				 * restart the search.
 				 */
-				mutex_unlock(&victim->state_mutex);
-				victim = NULL;
-				goto restart;
+				mutex_unlock(&victim->mutex);
+				goto restart_node;
 			}
 
-			spu_context_trace(__spu_deactivate__unload, ctx, spu);
+			ctx = list_first_entry(&victim->list,
+					struct spu_context, gang_list);
 
-			spu_unbind_context(spu, victim);
+			if (!ctx->spu || victim->prio <= gang->prio) {
+				/*
+				 * This race can happen because we've dropped
+				 * the active list mutex.  Not a problem, just
+				 * restart the search.
+				 */
+				mutex_unlock(&victim->mutex);
+				goto restart_node;
+			}
 
-			spin_lock(&cbe_spu_info[node].list_lock);
-			spu->ctx = NULL;
-			cbe_spu_info[node].nr_active--;
-			spin_unlock(&cbe_spu_info[node].list_lock);
+			spu_gang_trace(__spu_deactivate__unload,
+						victim, victim->contexts);
 
-			victim->stats.invol_ctx_switch++;
-			spu->stats.invol_ctx_switch++;
+			spu_unschedule(victim);
 			spu_add_to_rq(victim);
 
-			mutex_unlock(&victim->state_mutex);
+			mutex_unlock(&victim->mutex);
 
-			return spu;
+			return 0;
 		}
 	}
-
-	return NULL;
+	return -1;
 }
 
-static void __spu_schedule(struct spu *spu, struct spu_context *ctx)
+static struct spu *spu_bind(struct spu_gang *gang,
+				struct spu_context *ctx, int node)
 {
-	int node = spu->node;
-	int success = 0;
+	struct spu *spu;
+	int n;
 
-	spu_set_timeslice(ctx);
+	if (node == -1)
+		node = cpu_to_node(raw_smp_processor_id());
 
-	spin_lock(&cbe_spu_info[node].list_lock);
-	if (spu->ctx == NULL) {
-		cbe_spu_info[node].nr_active++;
-		spu->alloc_state = SPU_USED;
-		spu->ctx = ctx;
-		success = 1;
+	for (n = 0; n < MAX_NUMNODES; n++, node++) {
+		node = (node < MAX_NUMNODES) ? node : 0;
+		if (!node_allowed(gang, node))
+			continue;
+
+		spin_lock(&cbe_spu_info[node].list_lock);
+		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
+			if ((spu->alloc_state == SPU_FREE) &&
+			    (spu->gang == gang))
+				goto found;
+		}
+		spin_unlock(&cbe_spu_info[node].list_lock);
 	}
+	return NULL;
+
+found:
+	cbe_spu_info[node].nr_active++;
+	spu->alloc_state = SPU_USED;
+	spu->ctx = ctx;
 	spin_unlock(&cbe_spu_info[node].list_lock);
 
-	spu_bind_context(spu, ctx);
+	return spu;
+}
+
+static void __spu_schedule(struct spu_gang *gang, int node_chosen)
+{
+	struct spu_context *ctx;
+	struct spu *spu;
 
-	if (success)
+	spu_del_from_rq(gang);
+
+	list_for_each_entry(ctx, &gang->list, gang_list) {
+		mutex_lock(&ctx->state_mutex);
+		BUG_ON(ctx->spu);
+		spu = spu_bind(gang, ctx, node_chosen);
+		BUG_ON(!spu);
+		spu_bind_context(spu, ctx);
+		spu_set_timeslice(ctx);
 		wake_up_all(&ctx->run_wq);
-	else
-		spu_add_to_rq(ctx);
+		mutex_unlock(&ctx->state_mutex);
+	}
 }
 
-static void spu_schedule(struct spu *spu, struct spu_context *ctx)
+static void spu_unschedule(struct spu_gang *gang)
 {
-	/* not a candidate for interruptible because it's called either
-	   from the scheduler thread or from spu_deactivate */
-	mutex_lock(&ctx->state_mutex);
-	__spu_schedule(spu, ctx);
-	spu_release(ctx);
+	struct spu_context *ctx;
+	struct spu *spu;
+	int node;
+
+	list_for_each_entry(ctx, &gang->list, gang_list) {
+
+		mutex_lock(&ctx->state_mutex);
+		spu = ctx->spu;
+		BUG_ON(!spu);
+		BUG_ON(spu->ctx != ctx);
+		BUG_ON(spu->gang != gang);
+		BUG_ON(spu->alloc_state != SPU_USED);
+		node = spu->node;
+		spu_unbind_context(spu, ctx);
+		spin_lock(&cbe_spu_info[node].list_lock);
+		cbe_spu_info[node].nr_active--;
+		spu->alloc_state = SPU_FREE;
+		spu->ctx = NULL;
+		spu->gang = NULL;
+		gang->stats.invol_ctx_switch++;
+		ctx->stats.invol_ctx_switch++;
+		spu->stats.invol_ctx_switch++;
+		spin_unlock(&cbe_spu_info[node].list_lock);
+		mutex_unlock(&ctx->state_mutex);
+	}
 }
 
-static void spu_unschedule(struct spu *spu, struct spu_context *ctx)
+static int spu_gang_placeable(struct spu_gang *gang,
+	struct spu_sched_stats *stats, int *node_chosen, int *node_preempt)
 {
-	int node = spu->node;
+	/*
+	 * Strategy is to minimize preemption and to place contexts on
+	 * a single node, if possible.  Affinity gangs must be scheduled
+	 * on one node.  Identify nodes to preempt if necessary.
+	 */
+	if (has_affinity(gang)) {
+		if (stats->inode_icnt + stats->inode_pcnt >= gang->contexts) {
+			*node_chosen = stats->inode;
+			if (stats->inode_icnt < gang->contexts)
+				*node_preempt = stats->inode_icnt;
+			else
+				*node_preempt = SPU_NOPREEMPT;
+			return 1;
+		}
+		if (stats->pnode_icnt + stats->pnode_pcnt >= gang->contexts) {
+			*node_chosen = stats->pnode;
+			if (stats->pnode_icnt < gang->contexts)
+				*node_preempt = stats->pnode;
+			else
+				*node_preempt = SPU_NOPREEMPT;
+			return 1;
+		}
+		return 0;
+	}
+	if (stats->inode_icnt >= gang->contexts) {
+		*node_chosen = stats->inode;	/* allocate idle node */
+		*node_preempt = SPU_NOPREEMPT;
+		return 1;
+	}
+	if (stats->total_icnt >= gang->contexts) {
+		*node_chosen = -1;		/* allocate all nodes */
+		*node_preempt = SPU_NOPREEMPT;
+		return 1;
+	}
+	if (stats->inode_icnt + stats->inode_pcnt >= gang->contexts) {
+		*node_chosen = stats->inode;	/* allocate idle node */
+		*node_preempt = stats->inode;	/* preempt on idle node */
+		return 1;
+	}
+	if (stats->pnode_icnt + stats->pnode_pcnt >= gang->contexts) {
+		*node_chosen = stats->pnode;	/* allocate other node */
+		*node_preempt = stats->pnode;	/* preempt on other node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->inode_pcnt >= gang->contexts) {
+		*node_chosen = -1;		/* allocate all nodes */
+		*node_preempt = stats->inode;	/* preempt on idle node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->pnode_pcnt >= gang->contexts) {
+		*node_chosen = -1;		/* allocate all nodes */
+		*node_preempt = stats->pnode;	/* preempt other node */
+		return 1;
+	}
+	if (stats->total_icnt + stats->total_pcnt >= gang->contexts) {
+		*node_chosen = -1;		/* allocate nodes */
+		*node_preempt = -1;		/* preempt all nodes */
+		return 1;
+	}
+	return 0;
+}
+
+static void spu_get_run_stats(struct spu_gang *gang,
+					struct spu_sched_stats *stats)
+{
+	struct spu *spu;
+	int n, node, count_idle, count_preempt;
 
-	spu_unbind_context(spu, ctx);
+	stats->inode = -1;		/* node with most idle spus */
+	stats->inode_icnt = 0; 		/* n idle on idle node */
+	stats->inode_pcnt = 0;		/* n preemptable on idle node */
+	stats->pnode = -1;		/* node with most preemptable spus */
+	stats->pnode_icnt = 0; 		/* n idle on preempt node */
+	stats->pnode_pcnt = 0; 		/* n preemptable on preempt node */
+	stats->total_icnt = 0; 		/* total idle across all nodes */
+	stats->total_pcnt = 0;		/* total preemptable across all nodes */
 
-	spin_lock(&cbe_spu_info[node].list_lock);
-	cbe_spu_info[node].nr_active--;
-	spu->alloc_state = SPU_FREE;
-	spu->ctx = NULL;
-	ctx->stats.invol_ctx_switch++;
-	spu->stats.invol_ctx_switch++;
-	spin_unlock(&cbe_spu_info[node].list_lock);
+	node = cpu_to_node(raw_smp_processor_id());
+	for (n = 0; n < MAX_NUMNODES; n++, node++) {
+
+		node = (node < MAX_NUMNODES) ? node : 0;
+		if (!node_allowed(gang, node))
+			continue;
+
+		count_idle = 0;
+		count_preempt = 0;
+
+		spin_lock(&cbe_spu_info[node].list_lock);
+		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list) {
+			struct spu_gang *tmp = spu->gang;
+			struct spu_context *tmpc = spu->ctx;
+
+			switch (spu->alloc_state) {
+			case SPU_FREE :
+				if (!tmp)
+					count_idle++;
+				break;
+			case SPU_USED :
+				if (tmpc &&
+				   !(tmpc->flags & SPU_CREATE_NOSCHED)) {
+					if (tmp == gang) /* yield/deactivate */
+						count_idle++;
+					else if (tmp->prio > gang->prio)
+						count_preempt++;
+				}
+				break;
+			}
+		}
+		spin_unlock(&cbe_spu_info[node].list_lock);
+
+		if (stats->inode == -1 || stats->inode_icnt < count_idle) {
+			stats->inode_icnt = count_idle;
+			stats->inode_pcnt = count_preempt;
+			stats->inode = node;
+		}
+
+		if (stats->pnode == -1 || stats->pnode_pcnt < count_preempt) {
+			stats->pnode_icnt = count_idle;
+			stats->pnode_pcnt = count_preempt;
+			stats->pnode = node;
+		}
+
+		stats->total_icnt += count_idle;
+		stats->total_pcnt += count_preempt;
+	}
+}
+
+static int spu_gang_schedulable(struct spu_gang *gang,
+		int *node_chosen, int *node_preempt)
+{
+	struct spu_sched_stats stats;
+	int ret;
+
+	spin_lock(&spu_prio->runq_lock);
+	spu_get_run_stats(gang, &stats);
+	spin_unlock(&spu_prio->runq_lock);
+	ret = spu_gang_placeable(gang, &stats, node_chosen, node_preempt);
+
+	return ret;
 }
 
+
 /**
  * spu_activate - find a free spu for a context and execute it
- * @ctx:	spu context to schedule
- * @flags:	flags (currently ignored)
+ * @ctx:		spu context to schedule
+ * @flags:		flags (currently ignored)
  *
  * Tries to find a free spu to run @ctx.  If no free spu is available
  * add the context to the runqueue so it gets woken up once an spu
@@ -732,107 +921,184 @@ static void spu_unschedule(struct spu *s
  */
 int spu_activate(struct spu_context *ctx, unsigned long flags)
 {
-	struct spu *spu;
+	struct spu_gang *gang = ctx->gang;
+	int ret, node_chosen, node_preempt;
 
-	/*
-	 * Activation assumes context is not on the runqueue as it is
-	 * about to be activated, but it could be on the runqueue if the
-	 * context was preempted while invoking a library routine.
-	 */
-	spu_del_from_rq(ctx);
+	if (signal_pending(current))
+		return -ERESTARTSYS;
 
-	/*
-	 * If there are multiple threads waiting for a single context
-	 * only one actually binds the context while the others will
-	 * only be able to acquire the state_mutex once the context
-	 * already is in runnable state.
-	 */
 	if (ctx->spu)
 		return 0;
 
-spu_activate_top:
-	if (signal_pending(current))
-		return -ERESTARTSYS;
+	if (!mutex_trylock(&gang->mutex)) {
+		spu_release(ctx);
+		mutex_lock(&gang->mutex);
+		mutex_lock(&ctx->state_mutex);
+	}
 
-	spu = spu_get_idle(ctx);
 	/*
-	 * If this is a realtime thread we try to get it running by
-	 * preempting a lower priority thread.
+	 * Recheck as we released lock. Context could have been activated
+	 * if it was previously started and sliced.
 	 */
-	if (!spu && rt_prio(ctx->prio))
-		spu = find_victim(ctx);
-	if (spu) {
-		unsigned long runcntl;
-
-		runcntl = ctx->ops->runcntl_read(ctx);
-		__spu_schedule(spu, ctx);
-		if (runcntl & SPU_RUNCNTL_RUNNABLE)
-			spuctx_switch_state(ctx, SPU_UTIL_USER);
-
+	if (ctx->spu) {
+		mutex_unlock(&gang->mutex);
 		return 0;
 	}
 
-	if (ctx->flags & SPU_CREATE_NOSCHED) {
-		spu_prio_wait(ctx);
-		goto spu_activate_top;
+	if (!SPU_GANG_RUNNABLE(gang)) {
+		mutex_unlock(&gang->mutex);
+		if (ctx->flags & SPU_CREATE_NOSCHED) {
+			ret = spufs_wait(ctx->run_wq,
+				ctx->state == SPU_STATE_RUNNABLE);
+			if (unlikely(ret)) {
+				spu_del_from_rq(gang);
+				mutex_lock(&ctx->state_mutex);
+			}
+			return ret;
+		}
+		return 0;
 	}
+	spu_release(ctx);
+
+	/*
+	 * Activation assumes gang is not on the runqueue as it is
+	 * about to be activated. It could be on the runqueue, if it
+	 * were time sliced while executing user mode code.
+	 */
+	spu_del_from_rq(gang);
+
+	clear_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags);
+
+spu_activate_top:
+
+	if (spu_gang_schedulable(gang, &node_chosen, &node_preempt)) {
 
-	spu_add_to_rq(ctx);
+		if (node_preempt != SPU_NOPREEMPT) {
+			ret = find_victim(gang, node_preempt);
+			if (unlikely(ret))
+				goto spu_activate_rq;
+		}
+
+		ret = spu_get_idle(gang, node_chosen);
+		if (ret)
+			goto spu_activate_top;
+		__spu_schedule(gang, node_chosen);
+	} else {
 
+spu_activate_rq:
+		spu_add_to_rq(gang);
+		if (ctx->flags & SPU_CREATE_NOSCHED) {
+			spu_acquire(ctx);
+			mutex_unlock(&gang->mutex);
+			ret = spufs_wait(ctx->run_wq,
+				ctx->state == SPU_STATE_RUNNABLE);
+			if (unlikely(ret)) {
+				spu_del_from_rq(gang);
+				mutex_lock(&ctx->state_mutex);
+			}
+			return ret;
+		}
+	}
+	mutex_lock(&ctx->state_mutex);
+	mutex_unlock(&gang->mutex);
 	return 0;
 }
 
 /**
- * grab_runnable_context - try to find a runnable context
+ * grab_runnable_gang - try to find a runnable gang
  *
- * Remove the highest priority context on the runqueue and return it
- * to the caller.  Returns %NULL if no runnable context was found.
+ * Remove the highest priority gang on the runqueue and return it
+ * to the caller.  Returns %NULL if no runnable gang was found.
  */
-static struct spu_context *grab_runnable_context(int prio, int node)
+static struct spu_gang *grab_runnable_gang(struct spu_gang *old,
+		int prio, int *node_chosen, int *node_preempt)
 {
+	struct spu_sched_stats stats, lstats;
 	struct spu_context *ctx;
-	int best;
+	struct spu_gang *gang;
+	int ret, best;
 
+	/*
+	 * If old != NULL, then caller is spu_yield or spu_deactivate.  We
+	 * can use idle spus and spus assigned to old.  Else the caller is the
+	 * scheduler thread (time slicer) and we may only use idle spus.  In
+	 * neither case are we allowed to preempt.  Preemption is only done by
+	 * spu_activation which assumes that the gang is not on the runqueue.
+	 * The fact that we are getting the gang from the runqueue implies
+	 * that it has a less favored priority than anything currently
+	 * running. spu_activate does not call this routine.
+	 */
 	spin_lock(&spu_prio->runq_lock);
+	if (old) {
+		ctx = list_first_entry(&old->list,
+				struct spu_context, gang_list);
+		BUG_ON(!ctx->spu);
+		spu_get_run_stats(old, &stats);
+	}
 	best = find_first_bit(spu_prio->bitmap, prio);
 	while (best < prio) {
 		struct list_head *rq = &spu_prio->runq[best];
 
-		list_for_each_entry(ctx, rq, rq) {
-			/* XXX(hch): check for affinity here aswell */
-			if (__node_allowed(ctx, node)) {
-				__spu_del_from_rq(ctx);
+		list_for_each_entry(gang, rq, rq) {
+			if (old) {
+				lstats = stats;
+				if (!node_allowed(gang, lstats.inode))
+					lstats.inode_icnt = 0;
+			} else {
+				spu_get_run_stats(gang, &lstats);
+			}
+			ret = spu_gang_placeable(gang, &lstats,
+						node_chosen, node_preempt);
+			if (ret && (*node_preempt == SPU_NOPREEMPT)) {
+				__spu_del_from_rq(gang);
 				goto found;
 			}
 		}
 		best++;
 	}
-	ctx = NULL;
- found:
+	gang = NULL;
+found:
 	spin_unlock(&spu_prio->runq_lock);
-	return ctx;
+	return gang;
 }
 
-static int __spu_deactivate(struct spu_context *ctx, int force, int max_prio)
+static int __spu_deactivate(struct spu_gang *gang, int force, int max_prio)
 {
-	struct spu *spu = ctx->spu;
-	struct spu_context *new = NULL;
+	struct spu_gang *new;
+	struct spu_context *ctx;
+	int ret, node_chosen, node_preempt;
 
-	if (spu) {
-		new = grab_runnable_context(max_prio, spu->node);
-		if (new || force) {
-			spu_unschedule(spu, ctx);
-			if (new) {
-				if (new->flags & SPU_CREATE_NOSCHED)
-					wake_up(&new->stop_wq);
-				else {
-					spu_release(ctx);
-					spu_schedule(spu, new);
-					/* this one can't easily be made
-					   interruptible */
-					mutex_lock(&ctx->state_mutex);
-				}
+	ctx = list_first_entry(&gang->list, struct spu_context, gang_list);
+	if (!ctx->spu)
+		return 0;
+
+	new = grab_runnable_gang(gang, max_prio, &node_chosen, &node_preempt);
+	if (new || force) {
+		spu_unschedule(gang);
+		if (new) {
+			/*
+			 * None of these locks can be easily made interruptible
+			 */
+			mutex_unlock(&gang->mutex);
+
+			/*
+			 * Schedule gang taken from runqueue.  Should fit, but
+			 * might not as we dropped the lock above.  Also, it
+			 * could have been activated by spu_run.
+			 */
+			mutex_lock(&new->mutex);
+			ctx = list_first_entry(&new->list,
+				struct spu_context, gang_list);
+			if (!ctx->spu) {
+				ret = spu_get_idle(new, node_chosen);
+				if (ret)
+					spu_add_to_rq(new);
+				else
+					__spu_schedule(new, node_chosen);
 			}
+			mutex_unlock(&new->mutex);
+
+			mutex_lock(&gang->mutex);
 		}
 	}
 
@@ -846,66 +1112,78 @@ static int __spu_deactivate(struct spu_c
  * Unbind @ctx from the physical spu it is running on and schedule
  * the highest priority context to run on the freed physical spu.
  */
-void spu_deactivate(struct spu_context *ctx)
+void spu_deactivate(struct spu_gang *gang)
 {
-	spu_context_nospu_trace(spu_deactivate__enter, ctx);
-	__spu_deactivate(ctx, 1, MAX_PRIO);
+	spu_context_nospu_trace(spu_deactivate__enter, gang);
+	set_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags);
+	__spu_deactivate(gang, 1, MAX_PRIO);
+	spu_del_from_rq(gang);
 }
 
 /**
  * spu_yield -	yield a physical spu if others are waiting
  * @ctx:	spu context to yield
  *
- * Check if there is a higher priority context waiting and if yes
- * unbind @ctx from the physical spu and schedule the highest
- * priority context to run on the freed physical spu instead.
+ * Check if there is a higher priority gang waiting and if yes
+ * unbind @ctx and any other within the same gang freeing one or
+ * more physical spus and schedule the highest priority gang
+ * to run on the freed physical spu(s) instead.  The gang is added
+ * back to the runqueue, so that it may be resumed by the scheduler
+ * as soon as there are idle spus.
  */
 void spu_yield(struct spu_context *ctx)
 {
-	spu_context_nospu_trace(spu_yield__enter, ctx);
+	struct spu_gang *gang = ctx->gang;
+	int ret;
+
+	spu_context_nospu_trace(spu_yield__enter, gang);
 	if (!(ctx->flags & SPU_CREATE_NOSCHED)) {
-		mutex_lock(&ctx->state_mutex);
-		__spu_deactivate(ctx, 0, MAX_PRIO);
-		mutex_unlock(&ctx->state_mutex);
+		mutex_lock(&gang->mutex);
+		ret = __spu_deactivate(gang, 0, MAX_PRIO);
+		if (ret)
+			spu_add_to_rq(gang);
+		mutex_unlock(&gang->mutex);
 	}
 }
 
-static noinline void spusched_tick(struct spu_context *ctx)
+static noinline int spusched_tick(struct spu_gang *gang,
+					struct spu_context *ctx,
+					int preempt)
 {
-	struct spu_context *new = NULL;
-	struct spu *spu = NULL;
+	int best;
+	int npreempted = 0;
 
-	if (spu_acquire(ctx))
-		BUG();	/* a kernel thread never has signals pending */
+	if (test_bit(SPU_SCHED_DEACTIVATED, &gang->sched_flags))
+		return 0;
 
-	if (ctx->state != SPU_STATE_RUNNABLE)
-		goto out;
-	if (ctx->flags & SPU_CREATE_NOSCHED)
-		goto out;
-	if (ctx->policy == SCHED_FIFO)
-		goto out;
+	mutex_lock(&gang->mutex);
 
-	if (--ctx->time_slice && ctx->policy != SCHED_IDLE)
+	if (!SPU_GANG_RUNNABLE(gang))
 		goto out;
 
-	spu = ctx->spu;
+	if (!ctx->spu)
+		goto out;
 
-	spu_context_trace(spusched_tick__preempt, ctx, spu);
+	if (ctx->flags & SPU_CREATE_NOSCHED)
+		goto out;
 
-	new = grab_runnable_context(ctx->prio + 1, spu->node);
-	if (new) {
-		spu_unschedule(spu, ctx);
-		if (ctx->policy != SCHED_IDLE)
-			spu_add_to_rq(ctx);
-	} else {
-		spu_context_nospu_trace(spusched_tick__newslice, ctx);
-		ctx->time_slice++;
+	if ((ctx->policy != SCHED_FIFO) && (!--ctx->time_slice)) {
+		best = find_first_bit(spu_prio->bitmap, gang->prio);
+		if (preempt && best <= gang->prio) {
+			spu_context_trace(spusched_tick__preempt,
+							ctx, ctx->spu);
+			npreempted = gang->contexts;
+			spu_unschedule(gang);
+			spu_add_to_rq(gang);
+		} else {
+			spu_context_nospu_trace(spusched_tick__newslice, ctx);
+			ctx->time_slice++;
+		}
 	}
 out:
-	spu_release(ctx);
+	mutex_unlock(&gang->mutex);
 
-	if (new)
-		spu_schedule(spu, new);
+	return npreempted;
 }
 
 /**
@@ -958,28 +1236,103 @@ static void spuloadavg_wake(unsigned lon
 
 static int spusched_thread(void *unused)
 {
+	struct spu_gang *gang;
+	struct spu_context *ctx;
 	struct spu *spu;
-	int node;
+	int node, active, pnode, preempted, maxcontexts, retry, deactivated;
 
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		schedule();
-		for (node = 0; node < MAX_NUMNODES; node++) {
+
+		/*
+		 * Try to limit the number of context switches because
+		 * we preempt all of the contexts first and then place
+		 * them.  We preempt first as the next gang to run may
+		 * not fit.  Also, it helps us place NUMA and SPU affinity
+		 * jobs as they have special requirements.  Not that we
+		 * are doing that now, but we could. The downside is
+		 * that we will have a little more idle time.  What we
+		 * want to avoid is preempting and then rescheduling
+		 * the same job as it creates hicups in the execution.
+		 */
+		maxcontexts = spu_prio->nr_contexts;
+		preempted = 0;
+
+		/*
+		 * Time slice contexts first assuming that there are
+		 * spe jobs waiting to run.  The next gang to be scheduled
+		 * is of indeterminate size, so it is not sufficient to
+		 * follow the traditional mode which assumes a 1:1
+		 * replacement (1 spu : 1 spe context).
+		 */
+		for (node = 0, active = 0; node < MAX_NUMNODES; node++) {
 			spinlock_t *l = &cbe_spu_info[node].list_lock;
 
 			spin_lock(l);
 			list_for_each_entry(spu, &cbe_spu_info[node].spus,
 					cbe_list) {
-				struct spu_context *ctx = spu->ctx;
+				if (spu->alloc_state == SPU_USED) {
+					gang = spu->gang;
+					BUG_ON(!gang);
+					ctx = spu->ctx;
 
-				if (ctx) {
 					spin_unlock(l);
-					spusched_tick(ctx);
+					preempted += spusched_tick(gang, ctx,
+						preempted < maxcontexts);
 					spin_lock(l);
+					active++;
 				}
 			}
 			spin_unlock(l);
 		}
+
+		spu_sched_stats(spusched_thread__timeslice,
+					maxcontexts, active, preempted);
+
+		/*
+		 * Place as many gangs as possible.  A gang might fail to
+		 * be placed if it has NUMA bindings, SPU Affinity, or is
+		 * to big to fit.  On failure, the gang is added to the
+		 * back of the runqueue, which effectively allows us to
+		 * skip over a job, assuming it is not the only one at a
+		 * given priority level.
+		 */
+		retry = 0;
+		active = 0;
+		while (retry < maxcontexts) {
+
+			gang = grab_runnable_gang(NULL, MAX_PRIO,
+					&node, &pnode);
+			if (!gang)
+				break;
+
+			/*
+			 * Must recheck state as we have dropped all locks.
+			 * It could be running, deactivated, or destroyed. The
+			 * latter is still a problem. See find_victim (XXX).
+			 */
+			deactivated = test_bit(SPU_SCHED_DEACTIVATED,
+						&gang->sched_flags);
+			if (!deactivated) {
+				mutex_lock(&gang->mutex);
+				ctx = list_first_entry(&gang->list,
+					struct spu_context, gang_list);
+				if (!ctx->spu) {
+					if (spu_get_idle(gang, node)) {
+						spu_add_to_rq(gang);
+						retry++;
+					} else {
+						__spu_schedule(gang, node);
+						active++;
+					}
+				}
+				mutex_unlock(&gang->mutex);
+			}
+		}
+
+		spu_sched_stats(spusched_thread__scheduled,
+				spu_prio->nr_contexts, active, retry);
 	}
 
 	return 0;
@@ -1010,6 +1363,9 @@ void spuctx_switch_state(struct spu_cont
 	 * Update the physical SPU utilization statistics.
 	 */
 	if (spu) {
+		struct spu_gang *gang = ctx->gang;
+
+		gang->stats.times[old_state] += delta;
 		ctx->stats.times[old_state] += delta;
 		spu->stats.times[old_state] += delta;
 		spu->stats.util_state = new_state;
@@ -1112,8 +1468,10 @@ void spu_sched_exit(void)
 	for (node = 0; node < MAX_NUMNODES; node++) {
 		spin_lock(&cbe_spu_info[node].list_lock);
 		list_for_each_entry(spu, &cbe_spu_info[node].spus, cbe_list)
-			if (spu->alloc_state != SPU_FREE)
+			if (spu->alloc_state != SPU_FREE) {
 				spu->alloc_state = SPU_FREE;
+				spu->gang = NULL;
+			}
 		spin_unlock(&cbe_spu_info[node].list_lock);
 	}
 	kfree(spu_prio);
Index: spufs/arch/powerpc/platforms/cell/spufs/run.c
===================================================================
--- spufs.orig/arch/powerpc/platforms/cell/spufs/run.c
+++ spufs/arch/powerpc/platforms/cell/spufs/run.c
@@ -160,11 +160,45 @@ out:
 static int spu_run_init(struct spu_context *ctx, u32 *npc)
 {
 	unsigned long runcntl = SPU_RUNCNTL_RUNNABLE;
+	struct spu_gang *gang = ctx->gang;
 	int ret;
 
 	spuctx_switch_state(ctx, SPU_UTIL_SYSTEM);
 
 	/*
+	 * Gang start.  The nstarted variable is incremented the first
+	 * time that a context is started.  When all ctxts in a gang have
+	 * been started, the gang is considered runnable.
+	 *
+	 * Gang runnable.  The nrunnable variable is the number of
+	 * contexts that are currently being processed by spufs_run_spu.
+	 * The count is incremented on entry and decremented on return
+	 * to user mode.  When a context is in user mode, the spu is
+	 * stopped.  When the count goes to zero, the gang is unloaded
+	 * if there is another gang waiting to run.  In this case, it
+	 * is unloaded, but it is not added to the runqueue as there
+	 * are no runnable contexts in the gang.  There has to be at
+	 * least one runnable context to be added to the runq.  In this
+	 * way, we prevent the runqueue from being over run with non-
+	 * runnable contexts.  The gang is guaranteed to be put back on
+	 * the runqueue, when a context is started again.  This is
+	 * driven explcitly by the program.
+	 *
+	 * Gang preemption. In addition to the case mentioned above
+	 * when all of the contexts are invoking user mode code, the
+	 * gang is also preempted when all of the contexts are either
+	 * in user mode or processing page faults.  The nfaulting variable
+	 * is the number of contexts in a gang that are currently
+	 * processing page faults.  When it equals nrunnable, the gang
+	 * is yielded and another gang runs if there is one on the
+	 * runqueue.  In this case, the gang is added to the runqueue
+	 * as the nrunnable field is still positive.
+	 */
+	if (test_and_clear_bit(SPU_SCHED_JUST_CREATED, &ctx->sched_flags))
+		atomic_inc(&gang->nstarted);
+	atomic_inc(&gang->nrunnable);
+
+	/*
 	 * NOSCHED is synchronous scheduling with respect to the caller.
 	 * The caller waits for the context to be loaded.
 	 */
@@ -228,15 +262,14 @@ static int spu_run_fini(struct spu_conte
 {
 	int ret = 0;
 
-	spu_del_from_rq(ctx);
-
 	*status = ctx->ops->status_read(ctx);
 	*npc = ctx->ops->npc_read(ctx);
 
 	spuctx_switch_state(ctx, SPU_UTIL_IDLE_LOADED);
-	ctx->policy = SCHED_IDLE;
 	spu_release(ctx);
 
+	atomic_dec(&ctx->gang->nrunnable);
+
 	if (signal_pending(current))
 		ret = -ERESTARTSYS;
 
@@ -341,6 +374,18 @@ long spufs_run_spu(struct spu_context *c
 	spu_enable_spu(ctx);
 	ctx->event_return = 0;
 
+	/*
+	 * This routine is almost entirely context based and runs mostly
+	 * under the context lock.  The locking strategy is driven by the
+	 * state of the context.  On input, the gang and context locks are
+	 * taken if the context is in the non-runnable state.  Both locks
+	 * are required to activate the context as it is a gang level
+	 * operation.  This occurs before the loop in spu_run_init. The
+	 * loop itself is processed just with the context lock and if the
+	 * system call returns and is re-entered as long as the context
+	 * remains in the runnable state, the gang lock is not required
+	 * as the context does not need to be activated.
+	 */
 	ret = spu_acquire(ctx);
 	if (ret)
 		goto out_unlock;
Index: spufs/arch/powerpc/platforms/cell/spufs/context.c
===================================================================
--- spufs.orig/arch/powerpc/platforms/cell/spufs/context.c
+++ spufs/arch/powerpc/platforms/cell/spufs/context.c
@@ -47,8 +47,11 @@ struct spu_context *alloc_spu_context(st
 			goto out_free;
 	}
 
+	/*
+	 * TODO: fix for SPU_CREATE_NOSCHED gangs and for SPU Affinity.
+	 */
 	for (n = 0, nspus = 0; n < MAX_NUMNODES; n++)
-		nspus += cbe_spu_info[n].n_spus -
+		 nspus += cbe_spu_info[n].n_spus -
 			atomic_read(&cbe_spu_info[n].reserved_spus);
 	if (nspus < gang->contexts + 1)
 		goto out_free_gang;
@@ -70,9 +73,9 @@ struct spu_context *alloc_spu_context(st
 	init_waitqueue_head(&ctx->mfc_wq);
 	init_waitqueue_head(&ctx->run_wq);
 	ctx->state = SPU_STATE_SAVED;
+	set_bit(SPU_SCHED_JUST_CREATED, &ctx->sched_flags);
 	ctx->ops = &spu_backing_ops;
 	ctx->owner = get_task_mm(current);
-	INIT_LIST_HEAD(&ctx->rq);
 	INIT_LIST_HEAD(&ctx->aff_list);
 	spu_gang_add_ctx(gang, ctx);
 
@@ -100,14 +103,18 @@ void destroy_spu_context(struct kref *kr
 	ctx = container_of(kref, struct spu_context, kref);
 	gang = ctx->gang;
 
-	mutex_lock(&ctx->state_mutex);
-	spu_deactivate(ctx);
-	mutex_unlock(&ctx->state_mutex);
+	/*
+	 * Deactivate and make it non-runnable while we work on it.
+	 */
+	mutex_lock(&gang->mutex);
+	WARN_ON(ctx->gang != gang);
+	spu_deactivate(gang);
+	mutex_unlock(&gang->mutex);
+
 	spu_fini_csa(&ctx->csa);
 	spu_gang_remove_ctx(gang, ctx);
 	if (ctx->prof_priv_kref)
 		kref_put(ctx->prof_priv_kref, ctx->prof_priv_release);
-	BUG_ON(!list_empty(&ctx->rq));
 	atomic_dec(&nr_spu_contexts);
 	kfree(ctx);
 }
@@ -127,21 +134,18 @@ int put_spu_context(struct spu_context *
 void spu_forget(struct spu_context *ctx)
 {
 	struct mm_struct *mm;
+	struct spu_gang *gang = ctx->gang;
 
 	/*
 	 * This is basically an open-coded spu_acquire_saved, except that
-	 * we don't acquire the state mutex interruptible.
+	 * we don't acquire the mutex interruptible.
 	 */
-	mutex_lock(&ctx->state_mutex);
-	if (ctx->state != SPU_STATE_SAVED) {
-		set_bit(SPU_SCHED_WAS_ACTIVE, &ctx->sched_flags);
-		spu_deactivate(ctx);
-	}
-
+	mutex_lock(&gang->mutex);
+	spu_deactivate(gang);
 	mm = ctx->owner;
 	ctx->owner = NULL;
 	mmput(mm);
-	spu_release(ctx);
+	mutex_unlock(&gang->mutex);
 }
 
 void spu_unmap_mappings(struct spu_context *ctx)
@@ -170,16 +174,19 @@ void spu_unmap_mappings(struct spu_conte
  */
 int spu_acquire_saved(struct spu_context *ctx)
 {
+	struct spu_gang *gang = ctx->gang;
 	int ret;
 
-	ret = spu_acquire(ctx);
+	ret = mutex_lock_interruptible(&gang->mutex);
 	if (ret)
 		return ret;
 
-	if (ctx->state != SPU_STATE_SAVED) {
-		set_bit(SPU_SCHED_WAS_ACTIVE, &ctx->sched_flags);
-		spu_deactivate(ctx);
-	}
+	/*
+	 * Deactivate unconditionally as we leave holding the gang lock.
+	 * If the gang is on the runqueue, the scheduler would block on it,
+	 * if it tried to dispatch it.
+	 */
+	spu_deactivate(gang);
 
 	return 0;
 }
@@ -190,11 +197,15 @@ int spu_acquire_saved(struct spu_context
  */
 void spu_release_saved(struct spu_context *ctx)
 {
+	struct spu_gang *gang = ctx->gang;
+
+	mutex_lock(&ctx->state_mutex);
+	mutex_unlock(&gang->mutex);
+
 	BUG_ON(ctx->state != SPU_STATE_SAVED);
 
-	if (test_and_clear_bit(SPU_SCHED_WAS_ACTIVE, &ctx->sched_flags))
+	if (SPU_GANG_RUNNABLE(gang))
 		spu_activate(ctx, 0);
 
-	spu_release(ctx);
+	mutex_unlock(&ctx->state_mutex);
 }
-
Index: spufs/arch/powerpc/platforms/cell/spufs/gang.c
===================================================================
--- spufs.orig/arch/powerpc/platforms/cell/spufs/gang.c
+++ spufs/arch/powerpc/platforms/cell/spufs/gang.c
@@ -38,6 +38,7 @@ struct spu_gang *alloc_spu_gang(void)
 	mutex_init(&gang->aff_mutex);
 	INIT_LIST_HEAD(&gang->list);
 	INIT_LIST_HEAD(&gang->aff_list_head);
+	INIT_LIST_HEAD(&gang->rq);
 
 	/*
 	 * Inherit scheduling parameters from the creator of the gang.
@@ -91,7 +92,16 @@ void spu_gang_remove_ctx(struct spu_gang
 	}
 	list_del_init(&ctx->gang_list);
 	gang->contexts--;
-	mutex_unlock(&gang->mutex);
+	atomic_dec(&gang->nstarted);
+	if (SPU_GANG_RUNNABLE(gang)) {
+		ctx = list_first_entry(&gang->list,
+				struct spu_context, gang_list);
+		mutex_lock(&ctx->state_mutex);
+		mutex_unlock(&gang->mutex);
+		spu_activate(ctx, 0);
+		mutex_unlock(&ctx->state_mutex);
+	} else
+		mutex_unlock(&gang->mutex);
 
 	put_spu_gang(gang);
 }