[Cbe-oss-dev] spufs: problem in scheduler
Geoff Levand
geoffrey.levand at am.sony.com
Sat Feb 25 10:53:00 EST 2006
Arnd,
This is the spufs problem I mentioned. Could you
let me know what you think?
-Geoff
-------- Original Message --------
Subject: Re: spufs patch
Date: Mon, 30 Jan 2006 05:59:25 -0800
From: Masato Noguchi <Masato_Noguchi at hq.scei.sony.co.jp>
To: Levand, Geoff <Geoffrey.Levand at am.sony.com>
Geoff-san,
I'm hitting a bug that I think is in the spu preemptive
scheduler. It was found by stress tests:
The spu runs under heavy load for anywhere from several
minutes to several hours, then the kernel dies with
"Oops: Kernel access of bad area, sig: 11 [#1]".
I analyzed it and found that it is caused by the work_struct
queued by the spu preemptive scheduler. I wonder if this
scenario triggers it:
1) A running spu is scheduled for preemption by schedule_spu_reaper().
* ctx->flags[SPU_CONTEXT_PREEMPT] is set.
* ctx->reap_work is initialized and queued.
2) But this work is delayed for some reason, and the
following happens in the meantime.
2a) unbind_context() (spu_deactivate()) is called.
* ctx->flags is cleared.
( * ctx->reap_work is left untouched, still queued. )
2b) bind_context() (spu_activate()) is called.
2c) schedule_spu_reaper() is called again.
=> ctx->reap_work is re-initialized,
although it is still in use by the kernel work queue.
I'm not sure whether this is the actual cause.
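The scenario above can be replayed in a small userspace sketch. This is not kernel code: struct node, struct toy_work, toy_init_work() and toy_queue_work() are hypothetical stand-ins, loosely modeled on the assumption that initializing a work item resets its list link and pending flag unconditionally, while queueing only checks the pending flag. Under that assumption, re-initializing an item that is still linked leaves the queue inconsistent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy doubly linked list node (stand-in for a kernel list link). */
struct node { struct node *next, *prev; };

/* Toy work item (hypothetical, not the real work_struct). */
struct toy_work {
    struct node entry;  /* link in the work queue */
    bool pending;       /* set while the item is queued */
};

static void node_init(struct node *n) { n->next = n; n->prev = n; }

static void queue_add_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Assumed behaviour: init resets the link even if the item is queued. */
static void toy_init_work(struct toy_work *w)
{
    node_init(&w->entry);
    w->pending = false;
}

/* Assumed behaviour: queueing is refused only if 'pending' is set. */
static bool toy_queue_work(struct node *head, struct toy_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    queue_add_tail(head, &w->entry);
    return true;
}

/* Walk the queue and check that every prev pointer matches its predecessor. */
static bool queue_is_consistent(const struct node *head)
{
    const struct node *prev = head;
    const struct node *cur = head->next;
    int steps = 0;

    while (cur != head) {
        if (cur->prev != prev || ++steps > 1000)
            return false;   /* broken link or endless loop */
        prev = cur;
        cur = cur->next;
    }
    return head->prev == prev;
}

/* Replays steps 1) to 2c); returns true if the queue ends up corrupted. */
static bool replay_scenario(void)
{
    struct node queue;
    struct toy_work reap_work;

    node_init(&queue);

    /* 1) first schedule_spu_reaper(): init and queue the work */
    toy_init_work(&reap_work);
    toy_queue_work(&queue, &reap_work);
    assert(queue_is_consistent(&queue));

    /* 2a), 2b) unbind and bind: flags are cleared, the work stays queued */

    /* 2c) second schedule_spu_reaper(): re-init while still linked */
    toy_init_work(&reap_work);
    toy_queue_work(&queue, &reap_work);

    return !queue_is_consistent(&queue);
}
```

In this model the second init clears the pending flag, so the queueing check no longer protects the item, and the item ends up linked to itself while the queue head still points at it. Whether the real oops follows this exact path is not confirmed here.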
I thought ctx->reap_work should be flushed from the kernel's
work queue at unbind time, but at that point the kernel holds
the write lock on the spu context (ctx->state_sem), and the
work function for the spu reaper needs that lock too, so
flushing there would deadlock. I think fixing this may need
fundamental restructuring, but I have no good idea how to
do it.
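The lock dependency described above can be illustrated with a toy wait-for model. Everything here is hypothetical (struct toy_ctx, toy_run_reaper(), toy_flush()); it uses no threads and no kernel API, and a bounded retry loop stands in for a flush that would otherwise wait forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy context (hypothetical): just the two facts that matter here. */
struct toy_ctx {
    bool state_sem_held;  /* someone holds ctx->state_sem for writing */
    bool reap_pending;    /* reap_work is still on the work queue */
};

/* One attempt by the worker to run the reaper, which needs state_sem. */
static bool toy_run_reaper(struct toy_ctx *ctx)
{
    if (!ctx->reap_pending)
        return true;            /* nothing left to run */
    if (ctx->state_sem_held)
        return false;           /* the worker would block here */
    ctx->state_sem_held = true; /* take the lock, do the work, release */
    ctx->reap_pending = false;
    ctx->state_sem_held = false;
    return true;
}

/* Bounded stand-in for a flush: a real flush would wait without limit. */
static bool toy_flush(struct toy_ctx *ctx, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++)
        if (toy_run_reaper(ctx))
            return true;
    return false;               /* never completed: models the deadlock */
}

/* Flush from the unbind path, with state_sem still held by the caller. */
static bool flush_while_holding_sem(void)
{
    struct toy_ctx ctx = { .state_sem_held = true, .reap_pending = true };
    return toy_flush(&ctx, 100);
}

/* Same flush when nobody holds state_sem. */
static bool flush_without_sem(void)
{
    struct toy_ctx ctx = { .state_sem_held = false, .reap_pending = true };
    return toy_flush(&ctx, 100);
}
```

The first case never completes and the second does. This only shows where the dependency comes from; whether state_sem can safely be dropped around a flush in the real unbind path is a separate question that the sketch does not answer.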
I don't understand the spu scheduler very well, but it looks
like many lines of the scheduler's code may need to change.
It may be best to ask Mark-san to fix it, since I heard he
wrote it.