[Cbe-oss-dev] [RFC] [PATCH 6:8] SPU Gang Scheduling - base functionality for gang scheduling
Luke Browning
lukebr at linux.vnet.ibm.com
Wed Mar 5 05:41:38 EST 2008
On Tue, 2008-03-04 at 09:33 +0100, Arnd Bergmann wrote:
> Thanks a lot for sending out these patches. In general, they look quite
> good already, but I obviously haven't had the chance to look at all the
> details of the scheduling so far, especially this main patch. The comments
> in here are mostly for the stuff that I noticed while browsing through it
> quickly.
Thanks.
>
> The two main points I have are:
>
> * As mentioned in the other replies, I really think a gang should never
> be loaded while not all of its contexts have a thread waiting in spu_run.
> I hope this change will also simplify some of the logic in your
> implementation.
We need to think about that a little. Please take a look at my other
comments.
>
> * We're currently trying to stabilize the code in 2.6.25 after a number
> of scheduler changes, and there are still known bugs. IMHO it would be
> best to continue keeping the gang scheduling on a separate branch until
> at least after 2.6.26 so that we can stabilize the two scheduler
> implementation in parallel and improve the test cases we already have,
> testing them out on both of them.
Agreed. This is essential.
>
> > @@ -137,6 +145,24 @@ struct spu_gang {
> > int aff_flags;
> > struct spu *aff_ref_spu;
> > atomic_t aff_sched_count;
> > +
> > + /* statistics */
> > + struct {
> > + /* updates protected by ctx->state_mutex */
> > + enum spu_utilization_state util_state;
> > + unsigned long long tstamp; /* time of last state switch */
> > + unsigned long long times[SPU_UTIL_MAX];
> > + unsigned long long vol_ctx_switch;
> > + unsigned long long invol_ctx_switch;
> > + unsigned long long min_flt;
> > + unsigned long long maj_flt;
> > + unsigned long long hash_flt;
> > + unsigned long long slb_flt;
> > + unsigned long long slb_flt_base; /* # at last ctx switch */
> > + unsigned long long class2_intr;
> > + unsigned long long class2_intr_base; /* # at last ctx switch */
> > + unsigned long long libassist;
> > + } stats;
> > };
> >
> > /* Flag bits for spu_gang aff_flags */
>
> Could you try to split out the statistics interfaces into a separate
> patch? I guess that will make reviewing and bisecting the patch a little
> easier if the statistics are a significant subset of the code change.
I can if you want. I only made one statistic update though. I was
going to leave the rest to andre and others.
>
> > @@ -60,6 +72,7 @@ static struct spu_prio_array *spu_prio;
> > static struct task_struct *spusched_task;
> > static struct timer_list spusched_timer;
> > static struct timer_list spuloadavg_timer;
> > +static void spu_unschedule(struct spu_gang *gang);
> >
> > /*
> > * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
>
> Please avoid forward declarations for static functions. We normally
> define all functions in call order so that reading the code becomes
> easier (you know it only gets called from below) and it's obvious
> not to have any recursions.
ok.
>
> > @@ -85,6 +98,11 @@ static struct timer_list spuloadavg_time
> > #define SCALE_PRIO(x, prio) \
> > max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
> >
> > +#define SPU_RESERVE 0
> > +#define SPU_RESERVE_UNDO 1
> > +
> > +#define SPU_NOPREEMPT -2
> > +
> > /*
> > * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:
> > * [800ms ... 100ms ... 5ms]
>
> These constants probably need some explanation, at least it wasn't
> obvious to me what they mean and why one is postive and the other
> is negative.
ok.
> It would be even better not to introduce more state variables
> because these need to be kept in sync with the actual state and
> often enough we can simply derive the state from the contents of
> some other variable.
They are not state variables. They are mode variable controlling one
algorithm. The algorithm perform one pass and if it fails it performs a
second pass with a different mode. The other variable, SPU_NOPREEMPT, is
not related to the first two. It is essentially an output variable.
>
> > + /*
> > + * TODO: fix SPU Affinity to work with gang scheduling.
> > + */
>
> This is really a funny aspect of this patch, as I always argued
> that spu affinity did not work at all as long as we don't have
> gang scheduling, since I consider gang scheduling as 'temporal
> affinity', which we now get right, together with the NUMA affinity
> that is still preserved I would argue that the SPU affinity
> with your patch works a lot better than before, even if it does
> not have perfect placement inside of a node just yet ;-)
True.
>
> > - spu = ctx->spu;
> > + if (!ctx->spu)
> > + goto out;
> >
> > - spu_context_trace(spusched_tick__preempt, ctx, spu);
> > + if (ctx->flags & SPU_CREATE_NOSCHED)
> > + goto out;
> >
>
> Seeing the check for SPU_CREATE_NOSCHED makes me wonder how
> we (should) treat them in a gang. I guess the easiest solution
> would be to forbid having NOSCHED contexts in any gang of
> more than one context.
My intention was to allow NOSCHED gangs with N contexts. It is not much
different, except the caller is woken when the context is loaded as
opposed to when an spu event occurs. I put the caller to sleep on
ctx->run_wq so that it behaves like spufs_ps_nopf(). This actually
simplified the code and allowed me to remove code.
> Putting both NOSCHED and regular
> contexts into the same gang can't really work because it
> would violate the rule of loading all contexts simultaneously.
>
Right. This should be forbidden.
Luke
More information about the cbe-oss-dev
mailing list