How to debug a hung multi-core system....

Thu May 21 23:52:59 EST 2009

<see middle post>

>> -----Original Message-----

>> From: Kumar Gala [mailto:galak at kernel.crashing.org]

>> Sent: Thursday, May 21, 2009 9:13 AM

>> To: Morrison, Tom

>> Cc: linuxppc-dev at ozlabs.org; Young, Andrew; Brown, Jeff

>> Subject: Re: How to debug a hung multi-core system....

>> 

>> 

>> On May 20, 2009, at 6:17 PM, Morrison, Tom wrote:

[Morrison, Tom] 

<snip some verbose explanations>

>> >

>> >   Core 1 seems to be in the idle loop - happily doing nothing

>> >        (and not servicing TCP and/or the console)...

>> >

>> >   Core 0 seems to be 'stuck' at the "InstructionStorage"

>> >        Exception. And it seems to be going 'nowhere' fast

>> >

>> > SRR0 seems to point to this same spot (0xc00006C0)

>> > SRR1 value is 0x00021200
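
SRR1 holds the MSR as it was saved when the interrupt was taken, so the value above can be decoded bit by bit. A minimal sketch, assuming the 8572's e500 cores use the standard Book E MSR layout; only the bits of interest here are listed, and `decode_msr` and `booke_msr_bits` are names made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>

struct msr_bit { uint32_t mask; const char *name; };

/* Subset of the Book E MSR bits (assumed layout for e500-class cores). */
static const struct msr_bit booke_msr_bits[] = {
    { 0x00020000u, "CE" },  /* critical interrupt enable */
    { 0x00008000u, "EE" },  /* external interrupt enable */
    { 0x00004000u, "PR" },  /* problem state: 1 = user mode */
    { 0x00001000u, "ME" },  /* machine check enable */
    { 0x00000200u, "DE" },  /* debug interrupt enable */
    { 0x00000020u, "IS" },  /* instruction address space */
    { 0x00000010u, "DS" },  /* data address space */
};

/*
 * Write the names of all listed bits that are set in msr into buf,
 * separated by spaces. Returns the bits that are set but not listed.
 */
uint32_t decode_msr(uint32_t msr, char *buf, size_t len)
{
    size_t used = 0;
    uint32_t rest = msr;

    if (len > 0)
        buf[0] = '\0';

    for (size_t i = 0; i < sizeof booke_msr_bits / sizeof booke_msr_bits[0]; i++) {
        if (!(msr & booke_msr_bits[i].mask))
            continue;
        rest &= ~booke_msr_bits[i].mask;
        if (used < len) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? " " : "", booke_msr_bits[i].name);
            if (n > 0)
                used += (size_t)n;
        }
    }
    return rest;
}
```

Calling `decode_msr(0x00021200, buf, sizeof buf)` leaves "CE ME DE" in `buf` and returns 0. Under the assumed layout, PR clear means the interrupt was taken from supervisor mode, EE clear means external interrupts were masked at that point, and IS clear means the fetch was in address space 0.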

>> >

>> > I am at a loss to see how the kernel (and/or our kernel BSP) could

>> > cause this exception, and I am even more at a loss to figure

>> > out how an application could cause this exception...

>> 

>> This is a bit odd as we shouldn't see an ISI from 0xc00006C0.

>> 

>> Are you able to single step Core0?  Can you dump the contents of the

>> TLBs on Core0?

[Morrison, Tom] 

<snip some of the verbose explanation>

Yes, very odd...

And I am able to get the TLB entries from the core that is stuck in the

Instruction Storage exception. Here is the dump I made:

>BKM>tat

Entry EPN      RPN      TID TMASK WIMGE TSIZ U0:3 X0:1 PID TS PROT SHEN UR UW UX SR SW SX TIDZ VAL
IT0   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT1   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT2   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT3   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT0   0011C000 00000000 00  000   06    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT1   D435C000 20000000 00  000   1E    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT2   0011C000 00000000 00  000   06    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT3   D435C000 20000000 00  000   1E    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
LT0   C0000000 00000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT1   D0000000 01000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT2   E0000000 02000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT3   39A40000 027FF700 0D  000   06    E    A    3    0   1  U    S    D  D  D  E  E  D  D    I
LT4   F924E000 7C054500 BA  000   0B    E    0    3    0   0  P    S    E  E  D  E  E  D  D    V
LT5   82A9F000 46664C00 FB  000   1A    F    4    2    0   0  U    S    E  E  D  D  E  D  D    I
LT6   80000000 1F000000 F2  0FF   1D    9    B    3    0   0  U    S    D  E  D  E  E  E  D    V
LT7   64000000 1F000000 B3  07F   02    8    B    0    0   1  U    S    D  E  D  D  E  E  D    V
LT8   E5BF1000 995EA900 96  000   0C    D    8    0    0   1  U    S    D  E  E  E  E  D  D    V
LT9   7F3BF000 C6DF7300 DF  000   15    1    2    3    0   1  U    S    E  D  D  E  E  E  D    I
LT10  917C7000 EEA67F00 7F  000   17    C    5    3    0   1  P    S    E  E  E  E  E  E  D    I
LT11  6B000000 F5700000 BC  03F   04    7    D    0    0   1  P    S    E  E  E  E  E  E  D    V
LT12  712DB000 F1B59100 2A  000   19    C    F    1    0   1  P    S    E  E  E  E  D  E  D    V
LT13  00000000 F0000000 7F  0FF   07    B    0    0    0   1  P    S    D  D  E  E  E  E  D    V
LT14  A3000000 FDD00000 C5  03F   16    7    E    3    0   1  P    S    E  E  E  D  D  E  D    V
LT15  F7F00000 B0B80000 82  00F   1F    5    F    0    0   1  P    P    E  E  D  D  D  D  D    V

To answer your 2nd question - we have about 10 processes, and

about 60-70 threads total (30+ for the main processing process)...

>> > Anybody have any ideas - and/or ways to re-configure our

>> > setup to obtain more data? Or does this sound familiar to

>> > a bug somebody has already found in the kernel?

>> >

>> > We are even having trouble defining a test program that can

>> > cause (on purpose) the 'InstructionStorage' Exception (does

>> > anybody have a simple C (or PPC assembly) program that

>> > causes this exception (so we can run in user application land

>> > and see if the symptoms are similar))?

>> >

>> > Thank you in advance for any / all help you can provide....

>> > because I am completely stumped on even how to proceed!

>> 

>> 

>> Is your application generating a lot of processes, or does it have a lot of

>> concurrent processes on the 8572?

>> 

>> - k
