[Cbe-oss-dev] Gang scheduling [RFC] [PATCH 0:9]

Luke Browning lukebr at linux.vnet.ibm.com
Sat Mar 15 07:45:32 EST 2008


Implement spu gang scheduling.

These patches have been updated slightly from the previous set to address
the comments that I have received so far.  The patches are based on Jeremy's
public git tree.  I have done some minimal testing and things appear to be
working, so I encourage others to participate and help me shake out the bugs.
I agree with Arnd that this should be put on a separate branch until we can
stabilize it.

Here's my previous high-level description, which still holds.
 
All spu contexts belong to a gang.  For standalone spu contexts, an internal
gang structure is allocated to present a uniform data abstraction, so that
the gang can be queued on the runqueue.  The priority of the gang dictates
its position on the runqueue.  Each gang has a single priority, policy,
and NUMA attachment, which are inherited from the creator of the spu context.
These values do not currently change, although there is nothing to prohibit
such support from being added in the future.

All contexts within a gang are scheduled and unscheduled at the same time.
spu_schedule and spu_unschedule have been changed to invoke spu_bind_context
and spu_unbind_context in a loop.  The former is more complicated in that it
must allocate enough spus in a safe manner so that it can get through its
critical section without running out of spus.  For this reason, spus are
preallocated.  A reserved spu has the following state:

(spu->alloc_state == SPU_FREE and spu->gang != <gang>) 
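
Here is a minimal sketch of the reserve-then-bind idea; the helper and type
names are invented and the real spu_schedule() differs in detail.  The key
point is that a reserved spu is still SPU_FREE, so another gang has to check
the gang pointer before taking it:

enum spu_alloc_state { SPU_FREE, SPU_USED };

struct gang;				/* opaque for this sketch */

struct spu_sketch {
	enum spu_alloc_state	alloc_state;
	struct gang		*gang;	/* reserving/owning gang, or NULL */
};

/*
 * Reserve 'want' free spus for gang 'g'.  Returns nonzero on success, so
 * the caller can bind every context of the gang without risk of running
 * out of spus part way through its critical section.
 */
static int reserve_spus(struct spu_sketch *pool, int npool,
			struct gang *g, int want)
{
	int i, got = 0;

	for (i = 0; i < npool && got < want; i++) {
		if (pool[i].alloc_state != SPU_FREE)
			continue;
		if (pool[i].gang && pool[i].gang != g)
			continue;	/* free, but reserved by another gang */
		pool[i].gang = g;	/* reserved: stays SPU_FREE until bound */
		got++;
	}
	return got == want;
}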

Time slicing follows a two-step algorithm.  1) All running contexts are
time sliced.  The tick counter is implemented at the context level to
simplify the logic, as the counters are all set and decremented at the same
time.  When a count reaches zero, the gang is unscheduled.  This frees up
as much space as possible before the scheduler tries to place a job that
is queued on the runqueue, which is critical because the size of the
waiting job is not known a priori.  2) Sequentially place as many gangs
as possible, skipping over gangs as necessary across all run levels.  This
is consistent with spu_yield, which unloads the spu across user mode calls.
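
The two steps might look roughly like the sketch below; the names and
simplified types are mine, not the patch's spusched code, and locking is
omitted:

struct rq_gang { int nr_contexts; int loaded; };
struct rq_ctx  { int ticks; struct rq_gang *gang; };

static int free_spus;	/* spus currently not bound to any context */

/* Stand-ins for the real paths that unbind/bind every context of a gang. */
static void unschedule_gang(struct rq_gang *g)
{
	if (g->loaded) {
		g->loaded = 0;
		free_spus += g->nr_contexts;
	}
}

static void schedule_gang(struct rq_gang *g)
{
	g->loaded = 1;
	free_spus -= g->nr_contexts;
}

static void spusched_tick_sketch(struct rq_ctx **running, int nrunning,
				 struct rq_gang **runq, int nqueued)
{
	int i;

	/* Step 1: expire running contexts; a zero count unloads the whole
	 * gang, freeing as much space as possible up front. */
	for (i = 0; i < nrunning; i++)
		if (--running[i]->ticks <= 0)
			unschedule_gang(running[i]->gang);

	/* Step 2: place as many queued gangs as possible, in queue order,
	 * skipping any gang that needs more spus than are currently free. */
	for (i = 0; i < nqueued; i++)
		if (!runq[i]->loaded && runq[i]->nr_contexts <= free_spus)
			schedule_gang(runq[i]);
}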

A simple heuristic has been implemented to prevent too many context switches
in step 1.  The limit is based on the number of runnable contexts that are
available on the runqueue.  If that count is less than the number of physical
spus, some spus may not be time sliced.  This is not guaranteed, as they
may be part of a gang that is time sliced.  A simple one-pass scan is used.
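
In sketch form (my naming, not the patch's), the budget for step 1 might be
computed like this:

/*
 * Number of running spus that step 1 is allowed to time slice in one pass.
 * There is no point preempting more spus than there are contexts waiting to
 * use them; the bound is best effort only, since an expired gang may drag
 * additional spus with it when it is unscheduled.
 */
static int time_slice_budget(int runnable_on_runqueue, int nr_physical_spus)
{
	return runnable_on_runqueue < nr_physical_spus ?
		runnable_on_runqueue : nr_physical_spus;
}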

A new 'nstarted' counter has been added to the gang structure to create
a synchronization point for gang start.  The counter is incremented when
a context calls spu_run().  When all of the contexts have been started,
the gang is considered runnable.

The start synchronization point is implemented by passing the first N-1
contexts directly through spu_run() to spufs_wait(), where they wait on a
context-specific event word (ctx->stop_wq), as before.  Also as before,
they update their csa areas instead of hardware registers.  The Nth thread
through spu_run() either binds the contexts to spus or puts the gang on
the runqueue.
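
The start barrier can be modelled in userspace terms; this is only a model
of the behaviour described above, not spufs code, and the condition variable
stands in for the per-context stop_wq wait:

#include <pthread.h>

struct gang_start_model {
	pthread_mutex_t	lock;
	pthread_cond_t	started;	/* models waiting on ctx->stop_wq */
	int		nstarted;
	int		ncontexts;
};

/*
 * Called on each context's entry into the modelled spu_run(): the first
 * N-1 callers sleep, as the real contexts do after saving their state to
 * the csa; the Nth caller activates the gang and wakes everyone.
 */
static void gang_start_wait(struct gang_start_model *g,
			    void (*activate_gang)(void *), void *arg)
{
	pthread_mutex_lock(&g->lock);
	if (++g->nstarted == g->ncontexts) {
		activate_gang(arg);	/* bind to spus or enqueue the gang */
		pthread_cond_broadcast(&g->started);
	} else {
		while (g->nstarted < g->ncontexts)
			pthread_cond_wait(&g->started, &g->lock);
	}
	pthread_mutex_unlock(&g->lock);
}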

Nearly all of the spu_run() critical section is the same.  It is context
based and runs almost entirely under the context lock.  The gang lock is
only taken when the context is in the SPU_STATE_SAVED state, signifying
that the context needs to be activated.  This is an important optimization
that avoids lock contention in the controlling thread.
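
In outline (simplified types, userspace locks, and my own naming), the point
is that the gang lock only appears on the activation path:

#include <pthread.h>

enum ctx_state { SPU_STATE_RUNNABLE, SPU_STATE_SAVED };

struct gang_model { pthread_mutex_t lock; };

struct ctx_model {
	pthread_mutex_t		lock;	/* per-context lock */
	enum ctx_state		state;
	struct gang_model	*gang;
};

static void spu_run_critical_section_model(struct ctx_model *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	if (ctx->state == SPU_STATE_SAVED) {
		/* Slow path only: the context needs to be activated, so the
		 * gang lock is taken here and nowhere on the common path. */
		pthread_mutex_lock(&ctx->gang->lock);
		/* ... activate (bind or enqueue) the gang ... */
		pthread_mutex_unlock(&ctx->gang->lock);
	}
	/* ... remainder of the per-context critical section ... */
	pthread_mutex_unlock(&ctx->lock);
}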

A gang 'nrunnable' count has been implemented that is incremented on entry
to spu_run and decremented on exit from it.  This count is intended to
indicate whether all of the contexts in the gang are executing user mode
code.  When that is the case, all of the spus in the gang are stopped and
this is a good point to preempt the gang.  This is implemented by
spu_yield(), which triggers a call to spu_deactivate, which unloads the
gang.  Importantly, in this case the gang is not added to the runqueue, as
the contexts are stopped.  This is designed to prevent pollution of the
runqueue with stopped jobs that could only be lazily loaded.  It is safe
not to queue the gang, as the application is expected to re-drive the
context via spu_run.
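
In outline (again my naming, with locking omitted), the exit path only
unloads and never requeues once the whole gang has stopped:

struct gang_run_model {
	int	nrunnable;	/* contexts currently inside spu_run() */
	int	loaded;		/* gang currently bound to spus        */
	int	on_runqueue;
};

/*
 * Exit path of the modelled spu_run(): the last context to leave finds the
 * gang fully stopped, unloads it, and deliberately does not requeue it; the
 * application is expected to re-drive it through spu_run().
 */
static void spu_run_exit_model(struct gang_run_model *g)
{
	if (--g->nrunnable == 0 && g->loaded) {
		g->loaded = 0;		/* models spu_yield()/spu_deactivate  */
		g->on_runqueue = 0;	/* stopped jobs stay off the runqueue */
	}
}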

Finally, this means that a gang is eligible to run as long as at least
one context in the gang is runnable.  Major page faulting is the other
event that may cause a gang to be preempted.  It is implemented via an
'nfaulting' count and a call to yield.  In this case, the gang does need
to be put on the runqueue, as the faulting context is in kernel mode.  It
is a sort of step-down scheduling technique that gives something else a
chance to run.
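
By contrast, the major fault path does requeue the gang; this is a sketch
of the behaviour described above, not the patch's code:

struct gang_fault_model {
	int	nfaulting;	/* contexts blocked on major page faults */
	int	loaded;
	int	on_runqueue;
};

/*
 * A context taking a major fault steps the gang aside but leaves it queued,
 * since the faulting context is in kernel mode and will be runnable again
 * as soon as the fault is resolved.
 */
static void major_fault_yield_model(struct gang_fault_model *g)
{
	g->nfaulting++;
	if (g->loaded) {
		g->loaded = 0;		/* yield the spus to someone else */
		g->on_runqueue = 1;	/* but keep the gang eligible     */
	}
}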



