[Cbe-oss-dev] [RFC] [PATCH 0:8] SPU Gang Scheduling

Arnd Bergmann arnd at arndb.de
Thu Mar 6 14:40:03 EST 2008


On Wednesday 05 March 2008, Luke Browning wrote:
> 
> On Wed, 2008-03-05 at 07:20 +0100, Arnd Bergmann wrote:
> >
> A key point of scheduling is that you must be able to resume a context
> at whatever point you interrupt it.  Time slicing and preemption in
> general must be completely transparent to the application.  You can't
> mandate that user code run and perform some action to make the preempted
> context runnable again.  In that case, the application would be logically
> part of the scheduler and the transparency rule would be broken.

Yes, good point. I haven't thought of that before, and it certainly
makes it a hard decision.

> > I'd like to avoid heuristics like this, but rather have the behavior of
> > the kernel be predictable and obvious.
> > 
> > One potential option I can see is to actually load a half-idle gang if 
> > there is nothing else that is trying to run, i.e. when all gangs in the
> > system are half-idle, but I would like to avoid that complexity.
> > 
> 
> You can't really tell anything about the efficiency of a program by
> looking at its preempted state.  It is just a snapshot in time.  It may
> be extremely efficient at using the spus in general and simply be caught
> at a bad time, such as the synchronization point described above.
> 
> I don't think that the execution of PPU code by itself is statistically
> significant.  On the other hand, if the PPU blocks for whatever reason I
> agree it would be desirable to unschedule the gang, but I would
> implement that as a follow-on patch.  It is just an optimization, but it
> would require interlocking with the mainline scheduler.

I think you are assuming more or less batch-processing jobs on the SPU,
which is probably fair for most cases where you want gang scheduling,
and it matches the majority of the workloads we have seen so far. However,
I'd want to make sure that we also deal well with interactive workloads
that spend most of their time not on the SPU but waiting in a syscall
or library call for outside triggers.

Those workloads are of course what makes the scheduler interesting, and
they are my main concern in this discussion. For non-ganged contexts, we
should always assume that a thread outside of spu_run does not want to
run but is blocked for an extended amount of time.

I'm not sure how that assumption changes with gang scheduling, but
I have a bad feeling about special-casing gangs in a way that always
assumes the user will want to run again soon.

One simple but interesting question would be: what should the
gang do if one context does a nanosleep() to wait for many seconds?
I'd say we should suspend all threads in the gang at the end of
the time slice, but I guess you disagree with that because it
disrupts the runtime behavior of the other contexts, right?

My point would be that by putting one thread to sleep, the user
explicitly asks for the gang not to run any more, and the fact that
we let the other threads run on for the rest of the time slice
(or forever, if nothing else wants to run) is just an artifact of
our optimization of not stopping the threads.
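
To make that concrete, the rule I have in mind is roughly the toy
sketch below. It is not spufs code; all of the names in it are made
up for illustration:

/*
 * Toy model only -- none of these names exist in spufs.  The policy:
 * a gang is considered runnable only while every one of its contexts
 * sits inside spu_run.  As soon as one thread leaves (nanosleep,
 * blocking syscall, ...), the whole gang becomes a candidate for
 * being unloaded at the end of the current time slice.
 */
#include <stdbool.h>
#include <stddef.h>

struct toy_context {
	bool in_spu_run;	/* thread currently inside spu_run()? */
};

struct toy_gang {
	struct toy_context *ctx;
	size_t nctx;
};

static bool gang_wants_to_run(const struct toy_gang *gang)
{
	for (size_t i = 0; i < gang->nctx; i++)
		if (!gang->ctx[i].in_spu_run)
			return false;	/* one sleeper stops the gang */
	return true;
}

The time slicer would then treat a gang for which gang_wants_to_run()
returns false like any other context that has blocked.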

> Another way to deal with the fairness problem is to alter the timeslice
> for gangs, so that a gang gets a few milliseconds less time.  The penalty
> is small enough that it is still worth porting the application to gang
> scheduling, as the reward is greater, yet it acts as a disincentive to
> the multitude who would just look to get something for free without any
> extra work.  There must be a charge, otherwise the developer who acts in
> good faith and ports the application receives no benefit, as there is a
> point of diminishing returns.  Too many gangs is bad: you have to wait
> longer to run, and the jobs make less efficient use of the available
> resources.
> 
> Gangs lower the physical spu utilization of the system because they are
> harder to schedule and the scheduler is less efficient.  If you look at
> the scheduler, you will notice that most of the algorithms are altered.
> Things are done in a more serial fashion: in general, we unschedule N
> contexts and then we schedule N contexts.  The algorithms are more
> complicated and the critical sections are longer, so spus are idle for
> a slightly longer period of time.  Consider time slicing: we slice all
> spus and then we place gangs, which leaves spus idle for a brief period,
> whereas the old algorithm dealt with one spu at a time.  Gang scheduling
> makes inherently less efficient use of the physical spu resources, so it
> really should be charged somewhere.  Otherwise, the cost is absorbed by
> the standalone contexts, which make the most efficient use of the spus.
> That is not fair either.
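
If I read the charging idea correctly, it amounts to something like
the following. The numbers and names are invented for the example,
they are not taken from your patches:

#include <stdbool.h>

/* Invented values -- only to show the shape of the charge. */
#define SPU_TIMESLICE_MS	100	/* base time slice */
#define SPU_GANG_PENALTY_MS	5	/* extra charge for ganged contexts */

/* Gangs pay a small, fixed penalty per slice to cover the idle time
 * that placing a whole gang causes on the other spus. */
static unsigned int spu_timeslice_ms(bool is_gang)
{
	return is_gang ? SPU_TIMESLICE_MS - SPU_GANG_PENALTY_MS
		       : SPU_TIMESLICE_MS;
}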

I agree with all this, but I'm also not too worried about these problems,
compared to the problem of a gang using up resources it doesn't need
because only part of it is actually trying to run.

	Arnd <><


