[Cbe-oss-dev] gang scheduling: new branch in spufs git

Jeremy Kerr jk at ozlabs.org
Wed Jul 9 11:19:56 EST 2008


Hi all,

I've just added a new branch to the spufs git tree on kernel.org:

http://git.kernel.org/?p=linux/kernel/git/jk/spufs.git;a=shortlog;h=gangsched

This contains Luke Browning's and André Detsch's work on gang scheduling 
for SPE contexts.

The feature is still in development and is not yet ready for upstream. 
However, we'd like as much testing as possible, and feedback is always 
welcome. I've appended the commit message for 5f3ce61b if you're 
interested in an overview.

If you want to try out the gang scheduling, just do a:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git
git checkout gangsched

Cheers,


Jeremy

---- 5f3ce61b
Author: Luke Browning <lukebrowning at us.ibm.com>
Date:   Tue Jul 1 17:20:28 2008 -0300
powerpc/spufs: Implement spu gang scheduling.

This patch provides the base support for gang scheduling, including spu
management, runqueue management, placement, activation, deactivation,
time slicing, yield, and preemption.  Basically, all of the core
scheduling capabilities.

All spu contexts belong to a gang.  For standalone spu contexts, an
internal gang structure is allocated to present a uniform data
abstraction, so that the gang can be queued on the runqueue.  The
priority of the gang dictates its position on the runqueue.  Each gang
has a single priority, policy, and NUMA attachment, which are inherited
from the creator of the spu context.  These values do not currently
change, although there is nothing to prohibit such support from being
added in the future.

All contexts within a gang are scheduled and unscheduled at the same
time.  spu_schedule and spu_unschedule have been changed to invoke
spu_bind_context and spu_unbind_context in a loop.  The former is more
complicated in that it must safely reserve enough spus up front so that
it can get through its critical section without running out of spus.
For this reason, SPUs are preallocated.  A reserved spu has the
following state:

(spu->alloc_state == SPU_FREE and spu->gang != <gang>)

Timeslicing follows a two-step algorithm.  1) All running contexts are
timesliced.  The tick counter is implemented at the context level to
simplify the logic, as the counters are all set and decremented at the
same time.  When a count goes to zero, the gang is unscheduled.  This
frees up as much space as possible before the scheduler tries to place
a job that is queued on the runqueue.  This is critical, as the size of
the job waiting to run is not known a priori.  2) Sequentially place as
many gangs as possible, skipping over gangs as necessary across all run
levels.  This is consistent with spu_yield, which unloads the spu
across user mode calls.

A simple heuristic has been implemented to prevent too many context
switches in step 1.  A limit is based on the number of runnable
contexts that are available on the runqueue.  If the count is less than
the number of physical spus, some spus may not be time sliced.  This is
not guaranteed, as they may be part of a gang that is time sliced.  A
simple one-pass scan is used.

A new gang nstarted counter has been added to the gang structure to
create a synchronization point for gang start.  The counter is
incremented when a context calls spu_run().  When all of the contexts
have been started, the gang is considered runnable.

The start synchronization point is implemented by passing the first N-1
contexts directly through spu_run() to spufs_wait(), as before, where
they wait on a private spe event word.  As before, they update their
csa area instead of hardware registers.  The Nth thread through
spu_run() either places the gang or puts it on the runqueue.

Nearly all of the spu_run() critical section is the same.  It is context
based and runs almost entirely under the context lock.  The gang lock is
only taken when the context is in the SPU_SCHED_STATE, signifying that
the context needs to be activated.  This is an important optimization
that avoids lock contention in the controlling thread.

A gang nrunnable count has been implemented that is incremented on
entry to and decremented on exit from spu_run().  This count is
intended to indicate whether all of the contexts in the gang are
executing user mode code.  When it reaches zero, all of the spus in the
gang are stopped, which makes it a good point to preempt the gang.
This is implemented by spu_yield(), which triggers a call to
spu_deactivate to unload the gang.  Importantly, in this case the gang
is not added to the runqueue, as its contexts are stopped.  This is
designed to prevent polluting the runqueue with stopped jobs that could
only be lazily loaded.  It is safe not to queue the gang, as the
application is expected to re-drive the context via spu_run().

Finally, this means that a gang is eligible to run as long as at least
one context in the gang is runnable.  Major page faulting is the other
event that may cause a gang to be preempted.  It is implemented via an
nfaulting count and a call to yield.  In this case, the gang is put on
the runqueue, as the context is in kernel mode.  This is a sort of
step-down scheduling technique to give something else a chance to run.



More information about the cbe-oss-dev mailing list