[Cbe-oss-dev] oprofiled crashing on cell?

Michael Ellerman michael at ellerman.id.au
Fri Jan 11 12:06:08 EST 2008


On Wed, 2008-01-09 at 13:57 -0600, Maynard Johnson wrote:
> Michael Ellerman wrote:
> > On Mon, 2008-01-07 at 09:13 -0600, Maynard Johnson wrote:
> >   
> >> Michael Ellerman wrote:
> >>     
> >>> Hi all,
> >>>
> >>> Running oprofile (0.9.3) on a cell machine (2.6.24-rc7 kernel) I see the
> >>> oprofiled intermittently crashing. It only seems to happen when I run an
> >>> SPU program.
> >>>
> >>> When it crashes I see this in the log:
> >>>
> >>> oprofiled started Mon Jan  7 18:23:21 2008
> >>> kernel pointer size: 8
> >>> Read buffer of 98307 entries.
> >>> No anon map for pc 0, app anonymous.
> >>>       
> >
> >   
> >> Well, that's definitely badness, but this, in itself, would not cause 
> >> oprofiled to crash.  Is this the last thing you see in the log?  Does 
> >> the daemon fail both with and without the --verbose option?
> >>     
> >
> > This is the entire log and that's running with --verbose, which has no
> > effect on whether it crashes or not. Here's another run I just did:
> >
> > oprofiled started Tue Jan  8 09:23:20 2008
> > kernel pointer size: 8
> > Read buffer of 76017 entries.
> > No anon map for pc 0, app anonymous.
> >
> > So the buffer size is changing but nothing else.
> >
> > Any hint as to what the message means? "no anon map?"
> >   
> Michael, I've had some chats with Bob Nelson and Philippe Elie (one of 
> the oprofile maintainers) about this.  Phil brought to my attention that 
> there's no protection to prevent the oprofile daemon from reading the 
> event buffer while the cell oprofile kernel driver is writing to it.  
> It's possible the daemon may process a partial SPU context switch (which 
> is 8 records written to the event buffer) -- i.e., reads the first part 
> of the SPU context switch until no more data is available, then comes back 
> and reads the rest.  There is no mechanism for the daemon to "remember" 
> what it read previously, so the second read picks up in the middle of 
> the SPU context switch, and since it's not an ESCAPE_CODE, it's 
> interpreted as a PC sample.  If the value read is '0' (which can be the 
> case for either the SPU number record or the SPU offset record), then I 
> believe you'd end up with this "No anon map for pc 0" message you're 
> getting.  The daemon may end up getting so confused that it crashes, 
> although I don't see exactly what might lead to that.
> 
> Nonetheless, we have a working theory as to what might be causing your 
> problem.  As this problem had not been seen previously, I believe some 
> new development feature you're working on or running with might be 
> exposing this hole.  Are you by chance running multiple SPU applications 
> when this problem occurs?

Hi Maynard,

Thanks for running with this. Unfortunately I'm not working on any new
features, I'm just debugging userspace. The kernel I'm using is
2.6.24-rc7 ish, and I'm just trying to make sure I've got the latest SDK
packages (takes ages to sync over to AU). Next week I can test with the
SDK kernel and see if I can reproduce with it.

I'm also only running a single SPU application, at least while I'm doing
oprofile. Could it be that there's something left over from a previous
multi-SPU run?

FWIW, yesterday I got about 10 runs in a row without a crash, but then I
did hit another crash. So it certainly looks like a race or something.
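
To make the partial-read theory above concrete, here is a toy model of it.
This is only a sketch and not the real daemon code: the ESCAPE_CODE value,
the SPU_CTX_SWITCH code and the payload fields are made up for
illustration. The only detail taken from the thread is that a context
switch is 8 records.

```python
# Toy model of a stateless event-buffer parser. All constants and the
# record layout are assumptions for illustration, not oprofile's own.
ESCAPE_CODE = 0xFFFFFFFFFFFFFFFF  # assumed sentinel that starts a control record
SPU_CTX_SWITCH = 1                # hypothetical code for an SPU context switch
CTX_SWITCH_RECORDS = 8            # per the thread: 8 records per context switch


def parse_stateless(chunk):
    """Parse one read's worth of records, remembering nothing between reads."""
    pc_samples = []
    i = 0
    while i < len(chunk):
        if chunk[i] == ESCAPE_CODE:
            # Skip the context switch, or whatever part of it has arrived.
            i += CTX_SWITCH_RECORDS
        else:
            # Anything that is not an escape is taken to be a PC sample.
            pc_samples.append(chunk[i])
            i += 1
    return pc_samples


def make_stream():
    """One context switch (8 records) followed by two genuine PC samples."""
    # Hypothetical payload; the two zeros stand in for the SPU number and
    # SPU offset records, which the thread says can both be 0.
    ctx_switch = [ESCAPE_CODE, SPU_CTX_SWITCH, 0x1234, 0x5678, 0x9ABC, 0, 0, 0x4000]
    return ctx_switch + [0x1000, 0x1004]


if __name__ == "__main__":
    stream = make_stream()
    # Case 1: the daemon sees the whole context switch in one read.
    print("one read: ", [hex(pc) for pc in parse_stateless(stream)])
    # Case 2: the daemon reads while the driver is mid-write, so the
    # context switch is split across two reads.
    split = parse_stateless(stream[:5]) + parse_stateless(stream[5:])
    print("two reads:", [hex(pc) for pc in split])
```

In the split case the second read starts in the middle of the context
switch, so the leftover records, including the zeros, come out as PC
samples. That would fit the "No anon map for pc 0" message.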

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person