[Cbe-oss-dev] spufs: problem in scheduler

Sat Feb 25 10:53:00 EST 2006

Arnd,

This is the spufs problem I mentioned.  Could you
let me know your comments?

-Geoff

-------- Original Message --------
Subject: Re: spufs patch
Date: Mon, 30 Jan 2006 05:59:25 -0800
From: Masato Noguchi <Masato_Noguchi at hq.scei.sony.co.jp>
To: Levand, Geoff <Geoffrey.Levand at am.sony.com>

Geoff-san,

I'm hitting a bug that I think is in the spu preemptive
scheduler.  It was found by stress tests:

The spu works hard for anywhere from several minutes to
several hours, then the kernel dies with "Oops: Kernel access
of bad area, sig: 11 [#1]".

I analyzed it and found that it is caused by the work_struct
queued by the spu preemptive scheduler.  I wonder if the
following scenario triggers it:

1) A running spu was scheduled for preemption by schedule_spu_reaper()
       * ctx->flags[SPU_CONTEXT_PREEMPT] was set.
       * ctx->reap_work was initialized and scheduled.

2)  But this work was delayed for some reason, and the
following occurred in the meantime.

  2a) unbind_context() (spu_deactivate()) called.
         * ctx->flags was cleared.
       ( * ctx->reap_work was left untouched. )

  2b) bind_context() (spu_activate()) called.

  2c) schedule_spu_reaper() called again.
    => ctx->reap_work was re-initialized,
        although it was still in use by the kernel work queue.

I'm not sure if this is the cause or not.
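
The scenario can be replayed in a small userspace model.  This is
only a sketch, not the spufs code: struct node, struct work,
work_init(), work_queue() and replay_scenario() are hypothetical
stand-ins, and the model assumes that initializing a work item
resets both its list linkage and its pending flag, which is how
INIT_WORK is understood to behave.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *next, *prev; };

struct work {
    struct node entry;  /* link in the work queue */
    bool pending;       /* set while the item is queued */
};

static void node_init(struct node *n) { n->next = n; n->prev = n; }

/* Assumed to mirror INIT_WORK: linkage and pending flag are reset. */
static void work_init(struct work *w)
{
    node_init(&w->entry);
    w->pending = false;
}

/* Append to the tail of queue q; refuse if already pending. */
static bool work_queue(struct node *q, struct work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->entry.prev = q->prev;
    w->entry.next = q;
    q->prev->next = &w->entry;
    q->prev = &w->entry;
    return true;
}

/* A list is consistent if every node's neighbours point back at it. */
static bool queue_consistent(const struct node *q)
{
    const struct node *n = q;
    int guard = 0;

    do {
        if (n->next->prev != n || n->prev->next != n)
            return false;
        n = n->next;
    } while (n != q && ++guard < 1000);
    return n == q;
}

/* Replays steps 1) to 2c); returns true if the queue ends up corrupted. */
static bool replay_scenario(bool reinit_while_queued)
{
    struct node q;
    struct work reap_work;
    bool queued;

    node_init(&q);
    work_init(&reap_work);              /* 1) first schedule_spu_reaper() */
    queued = work_queue(&q, &reap_work);
    assert(queued);
    (void)queued;

    /* 2) the work is delayed; 2a) and 2b) never touch reap_work */

    if (reinit_while_queued)
        work_init(&reap_work);          /* 2c) re-init while still queued */
    work_queue(&q, &reap_work);         /* 2c) schedule again */

    return !queue_consistent(&q);
}
```

In this model replay_scenario(true) reports a corrupted queue,
because the queue head still points at an entry whose own links
were reset, while replay_scenario(false) leaves the queue intact
since the second queue attempt is refused as already pending.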

I thought ctx->reap_work should be flushed from the kernel's
work queue at the time of unbinding, but at that point the
kernel holds the write lock of the spu context (ctx->state_sem),
and the spu reaper's work job needs that lock too, so flushing
there would deadlock.  I think fixing it may need fundamental
restructuring, but I have no good idea of how to do it.
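
The locking problem can be shown with a second small model.  Again
this is a sketch under assumptions: struct ctx_model and all the
functions below are hypothetical, the lock is reduced to a single
flag, and "flush" is modeled as "the pending work must be able to
run to completion".

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>

struct ctx_model {
    bool state_sem_held;  /* models ctx->state_sem held for write */
    bool reap_pending;    /* models ctx->reap_work sitting in the queue */
};

/* Models the reaper work: it can only finish if it gets state_sem. */
static bool reaper_try_run(struct ctx_model *c)
{
    if (c->state_sem_held)
        return false;           /* the real work would sleep here */
    c->state_sem_held = true;   /* take the lock */
    c->reap_pending = false;    /* do the reaping */
    c->state_sem_held = false;  /* release the lock */
    return true;
}

/* Models a flush: it completes only if the pending work can finish. */
static bool flush_would_complete(struct ctx_model *c)
{
    return !c->reap_pending || reaper_try_run(c);
}

/* Flush while the unbind path still holds the write lock. */
static bool unbind_flush_under_lock(struct ctx_model *c)
{
    bool ok;

    c->state_sem_held = true;
    ok = flush_would_complete(c);  /* waits on work that needs the lock */
    c->state_sem_held = false;
    return ok;
}

/* Flush only after the unbind path has dropped the write lock. */
static bool unbind_flush_after_unlock(struct ctx_model *c)
{
    c->state_sem_held = true;
    /* ... unbind work done under the lock ... */
    c->state_sem_held = false;
    return flush_would_complete(c);
}
```

With a pending reaper, unbind_flush_under_lock() returns false
(the real code would hang), while unbind_flush_after_unlock()
returns true.  The model does not show that flushing after the
unlock is a safe fix: the context can change between the unlock
and the flush, which may be why restructuring is needed.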

I don't understand the spu scheduler very well, but it seems
the fix may require editing many lines of the scheduler's code.
It may be good to ask Mark-san to fix it, since I heard he wrote it.