[Cbe-oss-dev] [RFC] [PATCH 0/8] SPU Gang Scheduling

Luke Browning lukebr at linux.vnet.ibm.com
Thu Mar 6 23:33:32 EST 2008


On Thu, 2008-03-06 at 04:40 +0100, Arnd Bergmann wrote:
> On Wednesday 05 March 2008, Luke Browning wrote:
> > 
> > On Wed, 2008-03-05 at 07:20 +0100, Arnd Bergmann wrote:
> > >
> > A key point of scheduling is that you must be able to resume a context
> > at whatever point you interrupt it.  Time slicing and preemption in
> > general must be completely transparent to the application.  You can't
> > mandate that user code run and perform some action to make the preempted
> > context runnable again.  In that case, the application would be
> > logically part of the scheduler and the transparency rule is broken.    
> 
> Yes, good point. I haven't thought of that before, and it certainly
> makes it a hard decision.
> 
> > > I'd like to avoid heuristics like this and rather have the behavior of
> > > the kernel be predictable and obvious.
> > > 
> > > One potential option I can see is to actually load a half-idle gang if 
> > > there is nothing else that is trying to run, i.e. when all gangs in the
> > > system are half-idle, but I would like to avoid that complexity.
> > > 
> > 
> > You can't really tell anything about the efficiency of a program by
> > looking at its preempted state.  It is just a snapshot in time.  It may
> > be extremely efficient at using the spus in general and be caught at a
> > bad time such as the synchronization point described above.  
> > 
> > I don't think that the execution of PPU code by itself is statistically
> > significant.  On the other hand, if the PPU blocks for whatever reason I
> > agree it would be desirable to unschedule the gang, but I would
> > implement that as a follow on patch.  It is just an optimization, but it
> > would require interlocking with the mainline scheduler.  
> 
> I think you assume more or less batch processing jobs to be on the SPU,
> which is probably fair for most cases where you want gang scheduling,
> and it's the majority of the workloads that we have seen so far, but I'd
> want to make sure that we also deal well with interactive workloads that
> actually spend most of their time not on the SPU but waiting in a syscall
> or library call for outside triggers.

I don't think it makes sense to have an interactive gang with more than
one context, as interactive means waiting for I/O, and we reserve the
right to block gangs when a controlling PPE thread in the gang blocks.
We need to document that intention here.  I don't think it is an issue,
as Cell is not designed for interactive spu processing.

But note that for existing interactive workloads, i.e. gangs of one, you
get the same behavior as before.  The nrunnable count goes to zero; it is
incremented / decremented in spu_run_init() / spu_run_fini(), and a count
of zero prevents the gang from being added to the runqueue.  The next
spu_run() drives the re-activation of the gang.

No gang can be put on the runqueue if all of its contexts are in user
mode.
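
Roughly, the accounting looks like the sketch below.  This is only an
illustration: the structure layout and the spu_gang_add_rq() /
spu_gang_del_rq() helpers are placeholder names, not the code in the
patch series.

#include <linux/mutex.h>

/* Sketch only: field and helper names are placeholders. */
struct spu_gang {
	struct mutex	mutex;		/* protects nrunnable */
	int		nrunnable;	/* contexts currently inside spu_run() */
	/* ... */
};

void spu_gang_add_rq(struct spu_gang *gang);	/* assumed: put gang on runqueue */
void spu_gang_del_rq(struct spu_gang *gang);	/* assumed: take gang off runqueue */

/* Called from spu_run_init(): the context has (re)entered spu_run(). */
static void gang_run_init(struct spu_gang *gang)
{
	mutex_lock(&gang->mutex);
	if (gang->nrunnable++ == 0)
		spu_gang_add_rq(gang);	/* first runnable context: gang may be queued */
	mutex_unlock(&gang->mutex);
}

/* Called from spu_run_fini(): the context has returned to user mode. */
static void gang_run_fini(struct spu_gang *gang)
{
	mutex_lock(&gang->mutex);
	if (--gang->nrunnable == 0)
		spu_gang_del_rq(gang);	/* all contexts in user mode: gang not queued */
	mutex_unlock(&gang->mutex);
}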
 
> 
> These are of course what make the scheduler interesting, and it's my
> main concern in this discussion. For non-ganged threads, we should
> always assume that they do not want to run when outside of spu_run
> but are actually blocked for an extended amount of time.

Yes, see above.

> 
> How that assumption changes with gang scheduling, I'm not sure,
> but I have a bad feeling with special-casing gangs in a way that
> we always assume that the user will want to run again soon.

I don't do that with gangs.

> 
> One simple but interesting question would be: what should the
> gang do if one context does a nanosleep() to wait for many seconds?
> I'd say we should suspend all threads in the gang after the end
> of the time slice, but I guess you disagree with that, because
> it disrupts the runtime behavior of the other contexts, right?

Good point!  I guess that would be a reason not to implement the
follow-on patch I suggested to interlock the mainline thread scheduler
with the spu scheduler to block the gang when a PPE thread blocks.  I
think it is OK to implement heuristics that favor well-coded gangs along
the lines of what I was talking about, such as shortening the time
quantum.  But I don't think it is OK to hang a gang indefinitely.  Even
poorly coded applications have a right to run, albeit more slowly.  This
makes it more important to implement the penalty I was talking about
earlier; see the comments below.  I think the scheduler has a
responsibility to protect and promote the wise use of system resources.
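
To make the penalty concrete, something along these lines would do.  The
constants and names below are made up for illustration; they are not
values or code from the patches:

/* Sketch only: charge gangs a small, fixed timeslice penalty. */
#define SPU_BASE_TIMESLICE	10	/* assumed base quantum, in timer ticks */
#define SPU_GANG_PENALTY	2	/* "a few milliseconds less" for gangs */

/* gang_contexts is the number of contexts in the gang. */
static unsigned int spu_gang_timeslice(int gang_contexts)
{
	if (gang_contexts > 1)
		return SPU_BASE_TIMESLICE - SPU_GANG_PENALTY;
	return SPU_BASE_TIMESLICE;
}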

> 
> My point would be that by making one thread sleep, the user
> explicitly wants the gang to not run any more, and the fact that
> we let the other threads run on for the rest of the time slice
> (or forever, if nothing else wants to run) is just an artifact
> of our optimization not to stop the threads.
> 
> > Another way to deal with the fairness problem is to alter the timeslice
> > for gangs, so that it gets a few milliseconds less time.  The penalty is
> > small enough so that it is worth porting the application to gang
> > scheduling as the reward is greater, yet it acts as a disincentive to
> > the multitude who would just look to get something for free without any
> > extra work.  There must be a charge otherwise the developer who acts in
> > good faith and ports the application receives no benefit as there is a
> > point of diminishing returns.  Too many gangs is bad.  You have to wait
> > longer to run and the jobs make less efficient use of the available
> > resources.  Gangs lower physical spu utilization of the system as they
> > are harder to schedule and the scheduler is less efficient.  If you look
> > at the scheduler, you will notice that most of the algorithms are
> > altered.  Things are done in a more serial fashion.  In general, we
> > unschedule N contexts and then we schedule them.  The algorithms are more
> > complicated and the critical sections are longer.  During this window,
> > spus are idle for a slightly longer period of time.  Consider time
> > slicing: we slice all spus and then we place gangs.  This leaves spus
> > idle for a brief period of time, whereas in the past under the old
> > algorithm we dealt with one spu at a time.  Gang scheduling
> > makes inherently less efficient use of the physical spu resources, so it
> > really should be charged somewhere.  Otherwise, it is absorbed by the
> > standalone contexts which make the most efficient use of the spus.  That
> > is not fair either.  
> 
> I agree with all this, but I'm also not too worried about these problems,
> relative to the problem of a gang using up resources that it doesn't want,
> because only part of it is actually trying to run.
> 
> 	Arnd <><



