[Cbe-oss-dev] [RFC] [PATCH 6:8] SPU Gang Scheduling - base functionality for gang scheduling

Wed Mar 5 05:41:38 EST 2008

On Tue, 2008-03-04 at 09:33 +0100, Arnd Bergmann wrote:

> Thanks a lot for sending out these patches. In general, they look quite
> good already, but I obviously haven't had the chance to look at all the
> details of the scheduling so far, especially this main patch. The comments
> in here are mostly for the stuff that I noticed while browsing through it
> quickly.

Thanks.

> 
> The two main points I have are:
> 
> * As mentioned in the other replies, I really think a gang should never
>   be loaded while not all of its contexts have a thread waiting in spu_run.
>   I hope this change will also simplify some of the logic in your
>   implementation.

We need to think about that a little.  Please take a look at my other
comments.

> 
> * We're currently trying to stabilize the code in 2.6.25 after a number
>   of scheduler changes, and there are still known bugs. IMHO it would be
>   best to continue keeping the gang scheduling on a separate branch until
>   at least after 2.6.26 so that we can stabilize the two scheduler
>   implementation in parallel and improve the test cases we already have,
>   testing them out on both of them.

Agreed.  This is essential.

> 
> > @@ -137,6 +145,24 @@ struct spu_gang {
> >  	int aff_flags;
> >  	struct spu *aff_ref_spu;
> >  	atomic_t aff_sched_count;
> > +
> > +	/* statistics */
> > +	struct {
> > +		/* updates protected by ctx->state_mutex */
> > +		enum spu_utilization_state util_state;
> > +		unsigned long long tstamp;	/* time of last state switch */
> > +		unsigned long long times[SPU_UTIL_MAX];
> > +		unsigned long long vol_ctx_switch;
> > +		unsigned long long invol_ctx_switch;
> > +		unsigned long long min_flt;
> > +		unsigned long long maj_flt;
> > +		unsigned long long hash_flt;
> > +		unsigned long long slb_flt;
> > +		unsigned long long slb_flt_base; /* # at last ctx switch */
> > +		unsigned long long class2_intr;
> > +		unsigned long long class2_intr_base; /* # at last ctx switch */
> > +		unsigned long long libassist;
> > +	} stats;
> >  };
> >  
> >  /* Flag bits for spu_gang aff_flags */
> 
> Could you try to split out the statistics interfaces into a separate
> patch? I guess that will make reviewing and bisecting the patch a little
> easier if the statistics are a significant subset of the code change.

I can if you want.  I only made one statistic update though.  I was
going to leave the rest to andre and others.  

> 
> > @@ -60,6 +72,7 @@ static struct spu_prio_array *spu_prio;
> >  static struct task_struct *spusched_task;
> >  static struct timer_list spusched_timer;
> >  static struct timer_list spuloadavg_timer;
> > +static void spu_unschedule(struct spu_gang *gang);
> >  
> >  /*
> >   * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
> 
> Please avoid forward declarations for static functions. We normally
> define all functions in call order so that reading the code becomes
> easier (you know it only gets called from below) and it's obvious
> not to have any recursions.

ok.

> 
> > @@ -85,6 +98,11 @@ static struct timer_list spuloadavg_time
> >  #define SCALE_PRIO(x, prio) \
> >  	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
> >  
> > +#define SPU_RESERVE		0
> > +#define SPU_RESERVE_UNDO	1
> > +
> > +#define SPU_NOPREEMPT		-2
> > +
> >  /*
> >   * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
> >   * [800ms ... 100ms ... 5ms]
> 
> These constants probably need some explanation, at least it wasn't
> obvious to me what they mean and why one is postive and the other
> is negative.

ok.

> It would be even better not to introduce more state variables
> because these need to be kept in sync with the actual state and
> often enough we can simply derive the state from the contents of
> some other variable.

They are not state variables.  They are mode variable controlling one
algorithm.  The algorithm perform one pass and if it fails it performs a
second pass with a different mode. The other variable, SPU_NOPREEMPT, is
not related to the first two.  It is essentially an output variable.

> 
> > +	/*
> > +	 * TODO: fix SPU Affinity to work with gang scheduling.
> > +	 */
> 
> This is really a funny aspect of this patch, as I always argued
> that spu affinity did not work at all as long as we don't have
> gang scheduling, since I consider gang scheduling as 'temporal
> affinity', which we now get right, together with the NUMA affinity
> that is still preserved I would argue that the SPU affinity
> with your patch works a lot better than before, even if it does
> not have perfect placement inside of a node just yet ;-)

True.

> 
> > -	spu = ctx->spu;
> > +	if (!ctx->spu)
> > +		goto out;
> >  
> > -	spu_context_trace(spusched_tick__preempt, ctx, spu);
> > +	if (ctx->flags & SPU_CREATE_NOSCHED)
> > +		goto out;
> >  
> 
> Seeing the check for SPU_CREATE_NOSCHED makes me wonder how
> we (should) treat them in a gang. I guess the easiest solution
> would be to forbid having NOSCHED contexts in any gang of
> more than one context. 

My intention was to allow NOSCHED gangs with N contexts.  It is not much
different, except the caller is woken when the context is loaded as
opposed to when an spu event occurs.  I put the caller to sleep on
ctx->run_wq so that it behaves like spufs_ps_nopf().  This actually
simplified the code and allowed me to remove code.

> Putting both NOSCHED and regular
> contexts into the same gang can't really work because it
> would violate the rule of loading all contexts simultaneously.
> 

Right.  This should be forbidden.

Luke