[Cbe-oss-dev] [RFC] [PATCH 0/8] SPU Gang Scheduling

Luke Browning lukebr at linux.vnet.ibm.com
Thu Mar 6 04:47:37 EST 2008


On Wed, 2008-03-05 at 07:20 +0100, Arnd Bergmann wrote:
> On Tuesday 04 March 2008, Luke Browning wrote:
> > Here's something to consider as a counter example.  If a higher priority
> > gang is started, it must preempt the lower priority gang.  The preempted
> > gang might have some contexts in user mode and some in kernel mode (ie.
> > the spus are actually running).  In this case, the preempted gang has to
> > be put on the runqueue, so that the contexts that were stopped and
> > unloaded may be loaded and started again automatically.  Otherwise, the
> > application would hang.  
> 
> It would not hang indefinitely, but only until all the contexts have
> resolved their callbacks and page faults and want to run again.
> One case where it would indeed hang is when one thread makes a callback
> to the PPU that ends up waiting for another thread in the same gang
> to perform a specific action on the SPU before returning to spu_run.
> With your current code, that is allowed, while with the model I'm
> suggesting it would be a bug in the application.
> 
> Do you see that as a significant problem? With gang scheduling, there
> are already enough ways for an application to deadlock itself, so
> I don't have a problem with another one, as long as it's reasonable
> and well-documented.

I think it is a significant problem, particularly for parallel jobs.  I
think it will be fairly common to implement spu synchronization points
on the PPU where the synchronization point is used to aggregate spu
results and dispatch new work onto the spus.
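
To make that concrete, here is the sort of PPU-side synchronization point
I have in mind.  This is only an illustrative sketch, not code from the
patch set: worker_ctx, gang_barrier and dispatch_next_chunk() are made-up
names, and each PPU thread is assumed to drive one SPU context of the gang
through libspe2.

    #include <pthread.h>
    #include <libspe2.h>

    /* Hypothetical per-thread state, one instance per SPU context. */
    struct worker_ctx {
        spe_context_ptr_t spe;
        void *argp;
    };

    /* Initialized elsewhere with one waiter per context in the gang. */
    static pthread_barrier_t gang_barrier;

    /* Hypothetical helper: hands out the next chunk of work and returns
     * zero when the job is finished. */
    extern int dispatch_next_chunk(struct worker_ctx *w);

    static void *worker(void *arg)
    {
        struct worker_ctx *w = arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        for (;;) {
            /* Run SPU code until it stops (callback, fault, ...). */
            spe_context_run(w->spe, &entry, 0, w->argp, NULL, NULL);

            /* Synchronization point: wait for every context in the
             * gang, aggregate results, hand out the next chunk.  If
             * the gang is preempted while this thread sits here, the
             * other contexts must be reloaded automatically, or they
             * never reach their own stop point and nobody ever gets
             * past the barrier. */
            pthread_barrier_wait(&gang_barrier);
            if (!dispatch_next_chunk(w))
                break;
        }
        return NULL;
    }

If the gang only becomes runnable once every thread is back inside
spu_run, the threads parked in pthread_barrier_wait() above can never make
that happen, and the job hangs.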

A key point of scheduling is that you must be able to resume a context
at whatever point you interrupt it.  Time slicing and preemption in
general must be completely transparent to the application.  You can't
mandate that user code run and perform some action to make the preempted
context runnable again.  In that case, the application would be
logically part of the scheduler and the transparency rule is broken.    

> 
> > > The reason for that is that running the gang when only one of the
> > > threads in it wants to run would be somewhat unfair to the other
> > > sleeping gangs in the system that have all contexts in a thread
> > > trying to run SPU code.
> > 
> > You could add heuristics to handle unfairness by processing N elements
> > on the runqueue instead of just getting the next one.  Maybe look at
> > three or four gangs at the same priority and choose the one with the
> > highest percentage of contexts in the spu_run.
> 
> I'd like to avoid heuristics like this and rather have the behavior of
> the kernel predictable and obvious.
> 
> One potential option I can see is to actually load a half-idle gang if 
> there is nothing else that is trying to run, i.e. when all gangs in the
> system are half-idle, but I would like to avoid that complexity.
> 

You can't really tell anything about the efficiency of a program by
looking at its preempted state.  It is just a snapshot in time.  A program
may be extremely efficient at using the spus in general and simply be
caught at a bad moment, such as the synchronization point described above.

I don't think that the execution of PPU code by itself is significant
enough to matter.  On the other hand, if the PPU thread blocks for
whatever reason, I agree it would be desirable to unschedule the gang, but
I would implement that as a follow-on patch.  It is just an optimization,
and it would require interlocking with the mainline scheduler.
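
The interlocking I have in mind could probably be built on the preempt
notifier hooks the mainline scheduler already provides under
CONFIG_PREEMPT_NOTIFIERS.  The sketch below only shows the shape of it and
is not part of the posted patches: the ppu_notifier member and
spu_gang_request_yield() helper do not exist, and a real patch would still
have to sort out registration and locking.

    #include <linux/kernel.h>
    #include <linux/preempt.h>
    #include <linux/sched.h>
    #include "spufs.h"

    /* Illustrative only: assumes struct spu_gang grew a preempt_notifier
     * member ('ppu_notifier') registered by the controlling PPU thread,
     * plus a helper that asks the spu scheduler to unload the gang later. */

    static void gang_ppu_sched_in(struct preempt_notifier *pn, int cpu)
    {
        /* Nothing to do here; the gang goes back on the runqueue when
         * the thread re-enters spu_run(). */
    }

    static void gang_ppu_sched_out(struct preempt_notifier *pn,
                                   struct task_struct *next)
    {
        struct spu_gang *gang =
            container_of(pn, struct spu_gang, ppu_notifier);

        /* Called with the runqueue lock held, so we must not sleep.
         * Only react if the thread really blocked; an ordinary
         * preemption is no reason to give up the spus. */
        if (current->state != TASK_RUNNING)
            spu_gang_request_yield(gang);    /* hypothetical */
    }

    static struct preempt_ops gang_ppu_preempt_ops = {
        .sched_in  = gang_ppu_sched_in,
        .sched_out = gang_ppu_sched_out,
    };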

Another way to deal with the fairness problem is to alter the timeslice
for gangs, so that a gang gets a few milliseconds less time than a
standalone context.  The penalty is small enough that porting an
application to gang scheduling is still worth it, as the reward is
greater, yet it acts as a disincentive to the multitude who would just
look to get something for free without any extra work.  There must be a
charge; otherwise the developer who acts in good faith and ports the
application receives no benefit, because there is a point of diminishing
returns.  Too many gangs is bad: you have to wait longer to run, and the
jobs make less efficient use of the available resources.

Gangs lower the physical spu utilization of the system because they are
harder to schedule and the scheduler is less efficient.  If you look at
the scheduler, you will notice that most of the algorithms are altered.
Things are done in a more serial fashion: in general, we unschedule N
contexts and then we schedule N contexts.  The algorithms are more
complicated and the critical sections are longer, so spus sit idle for a
slightly longer period of time while this code runs.  Consider time
slicing: we slice all spus and then we place gangs, which leaves spus idle
for a brief period, whereas the old algorithm dealt with one spu at a
time.  Gang scheduling makes inherently less efficient use of the physical
spu resources, so it really should be charged somewhere.  Otherwise, the
cost is absorbed by the standalone contexts, which make the most efficient
use of the spus.  That is not fair either.
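
To give an idea of the scale of charge I mean, here is a purely
illustrative sketch; gang_time_slice(), GANG_TIMESLICE_PENALTY and
spu_base_time_slice() are made up, the last one standing in for however
the base slice is computed today:

    #include <linux/sched.h>
    #include "spufs.h"

    /* Made-up constant: roughly 10ms worth of ticks. */
    #define GANG_TIMESLICE_PENALTY    (HZ / 100)

    static unsigned int gang_time_slice(struct spu_context *ctx)
    {
        unsigned int slice = spu_base_time_slice(ctx);    /* hypothetical */

        /* Charge multi-context gangs for the extra scheduling cost
         * they impose, instead of letting standalone contexts pay. */
        if (ctx->gang && ctx->gang->contexts > 1 &&
            slice > GANG_TIMESLICE_PENALTY)
            slice -= GANG_TIMESLICE_PENALTY;

        return slice;
    }

The exact number doesn't matter much; the point is only that the charge
shows up in the gang's own slice rather than being absorbed by the
standalone contexts.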

Sorry for the long answer, but you are asking good questions...

Luke



