[Cbe-oss-dev] oprofiled crashing on cell?
Michael Ellerman
michael at ellerman.id.au
Tue Jan 8 09:31:57 EST 2008
On Mon, 2008-01-07 at 13:30 -0600, Bob Nelson wrote:
> On Monday 07 January 2008 09:13:13 am Maynard Johnson wrote:
> > Michael Ellerman wrote:
> > > Hi all,
> > >
> > > Running oprofile (0.9.3) on a cell machine (2.6.24-rc7 kernel) I see the
> > > oprofiled intermittently crashing. It only seems to happen when I run an
> > > SPU program.
> > >
> > > When it crashes I see this in the log:
> > >
> > > oprofiled started Mon Jan 7 18:23:21 2008
> > > kernel pointer size: 8
> > > Read buffer of 98307 entries.
> > > No anon map for pc 0, app anonymous.
> > >
> > Well, that's definitely badness, but this, in itself, would not cause
> > oprofiled to crash. Is this the last thing you see in the log? Does
> > the daemon fail both with and without the --verbose option?
> > > Compared to a working run:
> > >
> > > oprofiled started Mon Jan 7 18:21:12 2008
> > > kernel pointer size: 8
> > > Read buffer of 11 entries.
> > > Dangling ESCAPE_CODE.
> > > <snip>
> > >
> > A dangling ESCAPE code is badness, too. For Cell, a buffer with 11
> > entries could mean 3 entries for profiling start header info + 8 entries
> > for SPU context info. The 11th entry would be the offset of the SPU ELF
> > data, if embedded; otherwise 0. According to the above log snippet, the
> > 11th entry is an ESCAPE_CODE. This implies to me that another event
> > record may be getting intermingled in the buffer. There were locks and
> > memory barriers in place to prevent this from happening. Has there been
> > a change in the Cell-oprofile kernel code recently that might be causing
> > this? Did you see this problem on earlier kernels? Are there any more
> > details you can provide to reproduce the problem?
>
> Actually I think the dangling escape code message is a bug I ran into a
> little while back but I haven't put out a patch for it yet. I only saw it
> in one weird case IIRC. I think it was when the only or last thing in the
> buffer was a context switch. You indicate this was the 'working' run but
> it doesn't look like you are getting any data collected in this case.
> If you are compiling OProfile from source it is a one-line change.
>
> In the module oprofile-0.9.3/daemon/opd_spu.c in the following line the 7
> should be changed to a 6.
>
> if (!enough_remaining(trans, 7)) {
OK, I can't reproduce it now, so perhaps it is the same bug you saw once.
If I can build oprofile from source, I'll try your patch.
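For anyone following along, here is a minimal sketch of why a 7-versus-6
mismatch in that check could bite only when a context switch is the last
thing in the buffer. It is not the oprofile source: the struct, the helper
names and the assumption that the record occupies exactly 6 entries are
all hypothetical, inferred only from the suggested change above.

```c
#include <stddef.h>

/* ASSUMPTION: number of buffer entries in one SPU context-switch record.
 * Inferred from the suggested 7 -> 6 change, not taken from oprofile. */
#define CTX_FIELDS 6

/* Hypothetical stand-in for the daemon's buffer-walking state. */
struct transient_sketch {
    const unsigned long *buffer; /* next unread entry */
    size_t remaining;            /* unread entries left in the buffer */
};

/* Hypothetical equivalent of a "do we have n entries left?" check. */
static int enough_remaining_sketch(const struct transient_sketch *t, size_t n)
{
    return t->remaining >= n;
}

/* Consume one context-switch record.
 * 'required' is the entry count the check demands; callers must pass a
 * value >= CTX_FIELDS. Returns 1 if the record was consumed, 0 if it was
 * rejected as truncated (state is left untouched in that case). */
static int consume_ctx_switch(struct transient_sketch *t, size_t required)
{
    if (!enough_remaining_sketch(t, required))
        return 0;

    t->buffer += CTX_FIELDS;
    t->remaining -= CTX_FIELDS;
    return 1;
}
```

With exactly 6 entries left, consume_ctx_switch(&t, 7) rejects a complete
record, while consume_ctx_switch(&t, 6) accepts it. If the record is
followed by more data, both checks pass, which would fit the problem
showing up only intermittently.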
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person