[Cbe-oss-dev] [PATCH] spusched: fix spu utilization statistics

Christoph Hellwig hch at lst.de
Tue Jul 10 23:20:31 EST 2007


Hi,

sorry for the late reply, I've been tied to my bed for the last few days
because I've been sick (and still am, a little).


On Wed, Jul 04, 2007 at 02:00:03PM -0300, Luke Browning wrote:
> SPU contexts only accrue SPU time while they are loaded or running on an
> SPU.  Time spent on the runqueue is not an interesting physical spu
> utilization statistic. This patch fixes that bug by only measuring time
> when the context is associated with an SPU.
> 
> This patch also uses the timebase register to improve the granularity of
> time stamping and provide a more repeatable statistic. Many of the
> critical sections that are measured are at the few-microsecond level,
> e.g. context switching.

Switching away from jiffies sounds fine to me, but get_tbl isn't an
exported kernel interface we should be using.  I was thinking of using
the include/linux/ktime.h interface instead, which gives
up-to-nanosecond timekeeping.
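
Roughly what I had in mind, as a sketch only - the stats.tstamp and
stats.times[] fields here are made-up placeholders for whatever the
accounting structure ends up looking like:

#include <linux/ktime.h>

/*
 * Sketch: take a ktime_t timestamp when entering a state and charge
 * the elapsed nanoseconds to the state we are leaving.  The
 * stats.tstamp/stats.times[] fields are illustrative only.
 */
static inline void spuctx_account(struct spu_context *ctx, int old_state)
{
	ktime_t now = ktime_get();

	ctx->stats.times[old_state] +=
		ktime_to_ns(ktime_sub(now, ctx->stats.tstamp));
	ctx->stats.tstamp = now;
}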

> For SPE contexts, utilization should be interpreted as:

> The "user" category measures the time spent executing SPE application
> code.

This is how it's defined in the spec.

> The "system" utilization category is a measure of system overhead
> and includes time spent context switching and executing remote system
> calls.

Originally we had only included time spent in syscalls (or rather any
userspace callout) and not the context switching time.  Adding the
context switching time is fine with me as long as we document this.

> The "iowait" time measures time spent
> processing page faults and provides an indication of the potential
> benefit that be realized by tuning the I/O subsystem and/or the
> application.

This is as defined in the spec, but not actually what your patch
does (see comment below in the code).

> The "loaded" statistic reflects the time spent lazily
> loaded.

This is as defined in the spec.

> Statistics are also collected at the Logical SPUs level.  In general,
> the SPE context statistics are simply added to the corresponding logical
> SPE utilization bucket, except the SPE context "loaded" category is
> added to the system spu "idle" bucket.

This is where we had very differently defined states.  I'd be more than
happy to simplify this, as a lot of the complication there arose from my
trying to interpret your earlier requirements.

>  	ctx->stats.hash_flt++;
> -	if (ctx->state == SPU_STATE_RUNNABLE) {
> +	if (ctx->state == SPU_STATE_RUNNABLE)
>  		ctx->spu->stats.hash_flt++;
> -		spu_switch_state(ctx->spu, SPU_UTIL_IOWAIT);
> -	}

We're in the page fault handler here, which is the only thing defined to
count as iowait.  With this patch we will never account any time to the
iowait bucket.
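
I'd rather keep the state switch here, e.g. (sketch only, assuming the
new scheme grows a SPUCTX_UTIL_IOWAIT state):

	ctx->stats.hash_flt++;
	if (ctx->state == SPU_STATE_RUNNABLE) {
		ctx->spu->stats.hash_flt++;
		/* charge the time spent resolving the fault to iowait */
		spuctx_switch_state(ctx, SPUCTX_UTIL_IOWAIT);
	}

with a matching switch back to _SYSTEM (or _LOADED) once the fault has
been handled.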

>  static int spu_run_init(struct spu_context *ctx, u32 * npc)
>  {
> +
> +        spuctx_switch_state(ctx, SPUCTX_UTIL_SYSTEM);
> +

Whitespace damage, also in a few other places.

>  	if (ctx->flags & SPU_CREATE_ISOLATE) {
>  		unsigned long runcntl;
>  
> @@ -153,6 +156,8 @@
>  		ctx->ops->runcntl_write(ctx, SPU_RUNCNTL_RUNNABLE);
>  	}
>  
> +        spuctx_switch_state(ctx, SPUCTX_UTIL_USER);
> +
>  	return 0;

If we follow your interpretation above this is wrong, as we don't
start running when spu_run_init returns - we might still have to
wait quite a while until we actually get scheduled.

(this is spu_run_fini)
> @@ -161,8 +166,12 @@
>  {
>  	int ret = 0;
>  
> +	spuctx_switch_state(ctx, SPUCTX_UTIL_SYSTEM);
> +

As we're switching to SPUCTX_UTIL_SYSTEM at the beginning of the
spufs_run_spu main loop, this one shouldn't be needed, as all paths
leading here should be in that state already.

>  	*status = ctx->ops->status_read(ctx);
>  	*npc = ctx->ops->npc_read(ctx);
> +
> +        spuctx_switch_state(ctx, SPUCTX_UTIL_LOADED);


> @@ -404,6 +407,7 @@
>  	spu->flags = 0;
>  	ctx->spu = spu;
>  	ctx->ops = &spu_hw_ops;
> +	ctx->state = SPU_STATE_RUNNABLE;
>  	spu->pid = current->pid;
>  	spu->tgid = current->tgid;
>  	spu_associate_mm(spu, ctx->owner);
> @@ -418,8 +422,8 @@
>  	spu->timestamp = jiffies;
>  	spu_cpu_affinity_set(spu, raw_smp_processor_id());
>  	spu_switch_notify(spu, ctx);
> -	ctx->state = SPU_STATE_RUNNABLE;

I think this unrelated move of the assignment is wrong.  Before the
spu is runnable we really need a stable spu->mm value as well as the
full context restore, etc.

> +        spuctx_switch_state(ctx, SPUCTX_UTIL_LOADED);

shouldn't this be SPUCTX_UTIL_USER because it's actually in use now?
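
I.e. keep the ctx->state assignment where it was and finish
spu_bind_context with the switch to user, something like (sketch):

	spu->timestamp = jiffies;
	spu_cpu_affinity_set(spu, raw_smp_processor_id());
	spu_switch_notify(spu, ctx);
	ctx->state = SPU_STATE_RUNNABLE;
	/* the context owns a physical spu from here on */
	spuctx_switch_state(ctx, SPUCTX_UTIL_USER);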

>  /**
> @@ -432,7 +436,7 @@
>  	pr_debug("%s: unbind pid=%d SPU=%d NODE=%d\n", __FUNCTION__,
>  		 spu->pid, spu->number, spu->node);
>  
> -	spu_switch_state(spu, SPU_UTIL_IDLE);
> +        spuctx_switch_state(ctx, SPUCTX_UTIL_SYSTEM);
>  
>   	if (spu->ctx->flags & SPU_CREATE_NOSCHED)
>  		atomic_dec(&cbe_spu_info[spu->node].reserved_spus);
> @@ -443,7 +447,6 @@
>  	spu_unmap_mappings(ctx);
>  	spu_save(&ctx->csa, spu);
>  	spu->timestamp = jiffies;
> -	ctx->state = SPU_STATE_SAVED;

Moving these bits around again seems unsafe to me.

> +        spuctx_switch_state(ctx, SPUCTX_UTIL_USER);

After we've just context-switched it out it surely should not be _USER?

> --- linux-2.6.22-rc5.orig/arch/powerpc/platforms/cell/spufs/spufs.h	2007-07-04 11:57:31.000000000 -0300
> +++ linux-2.6.22-rc5/arch/powerpc/platforms/cell/spufs/spufs.h	2007-07-04 12:32:38.000000000 -0300
> @@ -49,12 +49,14 @@
>   * This is the state for spu utilization reporting to userspace.
>   * Because this state is visible to userspace it must never change and needs
>   * to be kept strictly separate from any internal state kept by the kernel.
> + *
> + * Must be kept in sync with spu_utilization_state.
>   */

We should just use one enum in that case and have two enum names
with the same value for idle/loaded.  That way they'll never get out
of sync.
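
Something along these lines (sketch, enumerator names made up):

/*
 * Sketch: one enum shared by the context and the physical spu stats,
 * with the context "loaded" state aliased to the spu "idle" bucket so
 * the two can never drift apart.
 */
enum spu_utilization_state {
	SPU_UTIL_USER,
	SPU_UTIL_SYSTEM,
	SPU_UTIL_IOWAIT,
	SPU_UTIL_IDLE,
	SPUCTX_UTIL_LOADED = SPU_UTIL_IDLE,
	SPU_UTIL_MAX		/* for sizing the stats arrays */
};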


p.s.  I suspect the patch might be more readable if it were against
a tree without any spu statistics, as the whole approach is quite
different.

p.p.s. please add -p to your diff options - that way diff writes out the
name of the function each hunk is in, which makes review by email a lot
easier.


