[Cbe-oss-dev] oprofiled crashing on cell?

Thu Jan 10 06:57:00 EST 2008

Michael Ellerman wrote:
> On Mon, 2008-01-07 at 09:13 -0600, Maynard Johnson wrote:
>   
>> Michael Ellerman wrote:
>>     
>>> Hi all,
>>>
>>> Running oprofile (0.9.3) on a cell machine (2.6.24-rc7 kernel) I see the
>>> oprofiled intermittently crashing. It only seems to happen when I run an
>>> SPU program.
>>>
>>> When it crashes I see this in the log:
>>>
>>> oprofiled started Mon Jan  7 18:23:21 2008
>>> kernel pointer size: 8
>>> Read buffer of 98307 entries.
>>> No anon map for pc 0, app anonymous.
>>>       
>
>   
>> Well, that's definitely badness, but this, in itself, would not cause 
>> oprofiled to crash.  Is this the last thing you see in the log?  Does 
>> the daemon fail both with and without the --verbose option?
>>     
>
> This is the entire log and that's running with --verbose, which has no
> effect on whether it crashes or not. Here's another run I just  did:
>
> oprofiled started Tue Jan  8 09:23:20 2008
> kernel pointer size: 8
> Read buffer of 76017 entries.
> No anon map for pc 0, app anonymous.
>
> So the buffer size is changing but nothing else.
>
> Any hint as to what the message means? "no anon map?"
>   
Michael, I've had some chats with Bob Nelson and Philippe Elie (one of 
the oprofile maintainers) about this.  Phil brought to my attention that 
there's no protection to prevent the oprofile daemon from reading the 
event buffer while the cell oprofile kernel driver is writing to it.  
It's possible the daemon may process a partial SPU context switch (which 
is 8 records written to the event buffer) -- i.e., reads the first part 
of the SPU context switch until no more data avaialable, then comes back 
and reads the rest.  There is no mechanism for the daemon to "remember" 
what it read previously, so the second read picks up in the middle of 
the SPU context switch, and since it's not an ESCAPE_CODE, it's 
interpreted as a PC sample.  If the value read is '0' (which can be the 
case for either the SPU number record or the SPU offset record), then I 
believe you'd end up with this "No anon map for pc 0" message you're 
getting.  The daemon may end up getting so confused that it crashes, 
although I don't see exactly what might lead to that.

Nonetheless, we have a working theory as to what might be causing your 
problem.  As this problem had not been seen previously, I believe some 
new development feature you're working on or running with might be 
exposing this hole.  Are you by chance running multiple SPU applications 
when this problem occurs?

-Maynard
>   
>>> Compared to a working run:
>>>
>>> oprofiled started Mon Jan  7 18:21:12 2008
>>> kernel pointer size: 8
>>> Read buffer of 11 entries.
>>> Dangling ESCAPE_CODE.
>>> <snip>
>>>       
>
>   
>> A dangling ESCAPE code is badness, too.  For Cell, a buffer with 11 
>> entries could mean 3 entries for profiling start header info + 8 entries 
>> for SPU context info.  The 11th entry would be the offset of the SPU ELF 
>> data, if embedded; otherwise 0.  According to the above log snippet, the 
>> 11th entry is an ESCAPE_CODE.  This implies to me that another event 
>> record may be getting intermingled in the buffer.  There were locks and 
>> memory barriers in place to prevent this from happening. Has there been 
>> a change in the Cell-oprofile kernel code recently that might be causing 
>> this?  Did you see this problem on earlier kernels? 
>>     
>
> OK. I haven't been able to reproduce the ESCAPE_CODE message this
> morning, but I definitely saw it multiple times yesterday. That may be a
> red herring.
>
>   
>> Are there any more details you can provide to reproduce the problem?
>>     
>
> Not really, sorry. I haven't been able to deduce any pattern to the
> crashes or anything interesting. I'll have a look at building oprofile
> from source today and see if I can turn on debugging or something.
>
> cheers
>
>