[Cbe-oss-dev] PATCH [4/7] decouple spu scheduler from spufs_spu_run [asynchronous scheduling]

Sat Nov 24 07:51:14 EST 2007

On Fri, 2007-11-23 at 19:46 +0100, Christoph Hellwig wrote:
> On Fri, Nov 23, 2007 at 04:35:58PM -0200, Luke Browning wrote:
> > On Fri, 2007-11-23 at 19:08 +0100, Christoph Hellwig wrote:
> > > On Fri, Nov 23, 2007 at 04:07:33PM -0200, Luke Browning wrote:
> > > > I am a little puzzled by this failure.  The only difference that I could
> > > > see in my version of the code is that the return code would be different
> > > > if a signal was pending and you had a concurrent event like an spe error
> > > > or library call.
> > > 
> > > Yes, that's what the testcase is checking for and fails on.
> > > 
> > 
> > Isn't that a false assumption, i.e. an invalid testcase?
> 
> It seems reasonable to expect an EINTR when a signal was forced due to
> dma misalignments.

Makes sense, but I wonder if it wouldn't be better to use the return
code provided by spu_handle_class1() and spu_handle_class0_events() to
generate an errno that is specific to the failure instead of mapping
all failures to EINTR.  As it is currently coded, the controlling thread
cannot distinguish between recoverable dma errors, unrecoverable dma
errors, and stray signals like SIGUSR1.

What is it supposed to do?  If it knew it was a recoverable dma error,
it could query a spufs file to get information to handle the fault.  If
it was a harmless error, it could just restart the spe.  But it can't
tell these cases apart if all errors are mapped to EINTR.

As it is currently coded, some information about the error is sent to a
signal handler, but is it enough?  What can the signal handler do?  Have
the appropriate routines in libspe been made signal safe?  How do we
expect signal handlers and controlling threads to communicate?  Are we
telling people to use sigwait()?  How does the signal waiter thread know
which spe thread caused the fault?

Also, if the program is ignoring SIGBUS or SIGILL, then the system call
fails with EFAULT which is really odd as the failure is not related to
any copyin/copyout operations associated with system call parameters.  

It seems more intuitive to me to map return codes from the kernel
handlers for class 0 and 1 to specific errnos and return those values so
that the controlling thread knows what happened.  There is no standard
here that says we have to return EINTR, although I would return that for
asynchronous signals.  At least, we could explain the immediate
condition if not how to handle it.
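
Something along these lines is what I have in mind.  This is a sketch
only: the enum and the function name are hypothetical and do not come
from the spufs source, and the errno chosen for each case is a
placeholder to show the shape of the mapping, not a proposal for the
final values.

```c
#include <errno.h>

/* Hypothetical classification of what interrupted the spu run.
 * These names are made up for illustration. */
enum spu_event_kind {
    SPU_EVENT_NONE = 0,           /* nothing happened */
    SPU_EVENT_ASYNC_SIGNAL,       /* stray asynchronous signal */
    SPU_EVENT_DMA_RECOVERABLE,    /* dma error the caller can handle */
    SPU_EVENT_DMA_UNRECOVERABLE,  /* dma error the caller cannot handle */
    SPU_EVENT_SPE_ERROR           /* other spe error */
};

/* Map each event to a distinct negative errno, kernel style, so the
 * controlling thread can tell the cases apart.  Only asynchronous
 * signals keep EINTR; the other values are placeholders. */
int spu_event_to_errno(enum spu_event_kind kind)
{
    switch (kind) {
    case SPU_EVENT_NONE:
        return 0;
    case SPU_EVENT_ASYNC_SIGNAL:
        return -EINTR;
    case SPU_EVENT_DMA_RECOVERABLE:
        return -EAGAIN;
    case SPU_EVENT_DMA_UNRECOVERABLE:
        return -EIO;
    case SPU_EVENT_SPE_ERROR:
        return -EINVAL;
    }
    return -EINVAL; /* unknown kind: treat as an error */
}
```

With a mapping like this the controlling thread could restart on one
value, query spufs on another, and give up on the rest, instead of
guessing what an EINTR meant.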

Does anybody know how this is supposed to work?  Is it documented? 

Luke