[Cbe-oss-dev] oprofiled crashing on cell?

Michael Ellerman michael at ellerman.id.au
Tue Jan 8 09:31:57 EST 2008


On Mon, 2008-01-07 at 13:30 -0600, Bob Nelson wrote:
> On Monday 07 January 2008 09:13:13 am Maynard Johnson wrote:
> > Michael Ellerman wrote:
> > > Hi all,
> > >
> > > Running oprofile (0.9.3) on a cell machine (2.6.24-rc7 kernel) I see the
> > > oprofiled intermittently crashing. It only seems to happen when I run an
> > > SPU program.
> > >
> > > When it crashes I see this in the log:
> > >
> > > oprofiled started Mon Jan  7 18:23:21 2008
> > > kernel pointer size: 8
> > > Read buffer of 98307 entries.
> > > No anon map for pc 0, app anonymous.
> > >   
> > Well, that's definitely badness, but this, in itself, would not cause 
> > oprofiled to crash.  Is this the last thing you see in the log?  Does 
> > the daemon fail both with and without the --verbose option?
> > > Compared to a working run:
> > >
> > > oprofiled started Mon Jan  7 18:21:12 2008
> > > kernel pointer size: 8
> > > Read buffer of 11 entries.
> > > Dangling ESCAPE_CODE.
> > > <snip>
> > >   
> > A dangling ESCAPE code is badness, too.  For Cell, a buffer with 11 
> > entries could mean 3 entries for profiling start header info + 8 entries 
> > for SPU context info.  The 11th entry would be the offset of the SPU ELF 
> > data, if embedded; otherwise 0.  According to the above log snippet, the 
> > 11th entry is an ESCAPE_CODE.  This implies to me that another event 
> > record may be getting intermingled in the buffer.  There were locks and 
> > memory barriers in place to prevent this from happening.  Has there been 
> > a change in the Cell-oprofile kernel code recently that might be causing 
> > this?  Did you see this problem on earlier kernels?  Are there any more 
> > details you can provide to reproduce the problem?
> 
> Actually I think the dangling escape code message is a bug I ran into a
> little while back but I haven't put out a patch for it yet.  I only saw it
> in one weird case IIRC.  I think it was when the only or last thing in the
> buffer was a context switch.  You indicate this was the 'working' run but
> it doesn't look like you are getting any data collected in this case.
> If you are compiling OProfile from source it is a one-line change.
> 
> In the module oprofile-0.9.3/daemon/opd_spu.c in the following line the 7
> should be changed to a 6.
> 
>       if (!enough_remaining(trans, 7)) {

OK I can't reproduce it now so perhaps it is the same bug you saw once.
If I can build oprofile from source I'll try your patch.
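
For reference, here is a minimal sketch of why an off-by-one in that check
would show up only when a context switch is the last thing in the buffer.
Only the enough_remaining(trans, N) call and the 7 -> 6 change come from
this thread. Everything else is an assumption for illustration and is not
the real opd_spu.c: the struct layout, pop_buffer_value(), the helper
names, and in particular the guess that a context switch record is 6
entries long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the daemon's parser state. */
struct transient {
	const unsigned long *buffer;
	size_t remaining;	/* entries not yet consumed */
};

/* Assumed semantics: true if at least 'size' entries are left. */
static int enough_remaining(const struct transient *trans, size_t size)
{
	return trans->remaining >= size;
}

static unsigned long pop_buffer_value(struct transient *trans)
{
	assert(trans->remaining > 0);
	trans->remaining--;
	return *trans->buffer++;
}

/* ASSUMPTION: entries in one context switch record, after the code. */
#define CTX_SWITCH_ENTRIES 6

/*
 * Parse one record. 'required' is the value passed to the bounds check
 * (7 before the suggested change, 6 after). Returns 1 if the record was
 * consumed, 0 if it was rejected as dangling.
 */
static int parse_ctx_switch(struct transient *trans, size_t required)
{
	size_t i;

	if (!enough_remaining(trans, required)) {
		trans->remaining = 0;	/* give up on the rest of the buffer */
		return 0;
	}
	for (i = 0; i < CTX_SWITCH_ENTRIES; i++)
		(void)pop_buffer_value(trans);
	return 1;
}

/* Helper for trying the check with 'entries_left' (at most 16). */
static int ctx_switch_accepted(size_t entries_left, size_t required)
{
	static const unsigned long buf[16];	/* zero-filled dummy data */
	struct transient trans = { buf, entries_left };

	assert(entries_left <= 16);
	return parse_ctx_switch(&trans, required);
}
```

Under those assumptions, a record that sits exactly at the end of the
buffer (6 entries left) is rejected when the check asks for 7 and accepted
when it asks for 6. A record followed by more data passes either way,
which would fit the problem only showing up in one weird case.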

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person