How to debug a hung multi-core system....
Morrison, Tom
tmorrison at empirix.com
Thu May 21 23:52:59 EST 2009
<see middle post>
>> -----Original Message-----
>> From: Kumar Gala [mailto:galak at kernel.crashing.org]
>> Sent: Thursday, May 21, 2009 9:13 AM
>> To: Morrison, Tom
>> Cc: linuxppc-dev at ozlabs.org; Young, Andrew; Brown, Jeff
>> Subject: Re: How to debug a hung multi-core system....
>>
>>
>> On May 20, 2009, at 6:17 PM, Morrison, Tom wrote:
[Morrison, Tom]
<snip some verbose explanations>
>> >
>> > Core 1 seems to be in the idle loop - happily doing nothing
>> > (and not servicing TCP and/or the console)...
>> >
>> > Core 0 seems to be 'stuck' at the "InstructionStorage"
>> > exception, and it seems to be going 'nowhere' fast.
>> >
>> > SRR0 seems to point to this same spot (0xc00006C0)
>> > SRR1 value is 0x00021200
>> >
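For reference, SRR1 holds the MSR as it was when the exception was taken, so the 0x00021200 value can be decoded bit by bit. Below is a minimal sketch, assuming the Book E (e500) MSR bit layout listed in the table; the helper name decode_msr and the table itself are illustrative and not from this thread, so the masks should be checked against the core manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed Book E (e500) MSR bit masks; verify against the core manual. */
struct msr_bit {
    uint32_t mask;
    const char *name;
};

static const struct msr_bit msr_bits[] = {
    { 0x00020000u, "CE" },  /* critical interrupt enable */
    { 0x00008000u, "EE" },  /* external interrupt enable */
    { 0x00004000u, "PR" },  /* problem state (user mode) */
    { 0x00002000u, "FP" },  /* floating point available */
    { 0x00001000u, "ME" },  /* machine check enable */
    { 0x00000200u, "DE" },  /* debug interrupt enable */
    { 0x00000020u, "IS" },  /* instruction address space */
    { 0x00000010u, "DS" },  /* data address space */
};

/*
 * Hypothetical helper: writes the names of the set bits into out,
 * separated by single spaces, and returns out.
 */
static const char *decode_msr(uint32_t msr, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof msr_bits / sizeof msr_bits[0]; i++) {
        if (msr & msr_bits[i].mask) {
            if (out[0] != '\0')
                strncat(out, " ", cap - strlen(out) - 1);
            strncat(out, msr_bits[i].name, cap - strlen(out) - 1);
        }
    }
    return out;
}
```

Under that assumed layout, decode_msr(0x00021200, buf, sizeof buf) gives "CE ME DE": PR is clear (supervisor mode), EE is clear (external interrupts disabled) and IS/DS are clear (address space 0).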
>> > I am at a loss to see how the kernel (and/or our kernel BSP) could
>> > cause this exception, and I am even more at a loss to figure out
>> > how an application could cause this exception...
>>
>> This is a bit odd as we shouldn't see an ISI from 0xc00006C0.
>>
>> Are you able to single step Core0? Can you dump the contents of the
>> TLBs on Core0?
[Morrison, Tom]
<snip some of verbose explanation>
Yes, very odd...
And I am able to get the TLB entries from the core that is stuck in the
Instruction Storage exception. I made
>BKM>tat
Entry EPN      RPN      TID TMASK WIMGE TSIZ U0:3 X0:1 PID TS PROT SHEN UR UW UX SR SW SX TIDZ VAL
IT0   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT1   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT2   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
IT3   0000C000 00000000 00  000   0A    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT0   0011C000 00000000 00  000   06    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT1   D435C000 20000000 00  000   1E    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT2   0011C000 00000000 00  000   06    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
DT3   D435C000 20000000 00  000   1E    0    0    0    0   0  U    P    D  D  D  D  D  D  D    I
LT0   C0000000 00000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT1   D0000000 01000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT2   E0000000 02000000 00  0FF   04    9    0    0    0   0  P    P    E  E  D  E  E  D  D    V
LT3   39A40000 027FF700 0D  000   06    E    A    3    0   1  U    S    D  D  D  E  E  D  D    I
LT4   F924E000 7C054500 BA  000   0B    E    0    3    0   0  P    S    E  E  D  E  E  D  D    V
LT5   82A9F000 46664C00 FB  000   1A    F    4    2    0   0  U    S    E  E  D  D  E  D  D    I
LT6   80000000 1F000000 F2  0FF   1D    9    B    3    0   0  U    S    D  E  D  E  E  E  D    V
LT7   64000000 1F000000 B3  07F   02    8    B    0    0   1  U    S    D  E  D  D  E  E  D    V
LT8   E5BF1000 995EA900 96  000   0C    D    8    0    0   1  U    S    D  E  E  E  E  D  D    V
LT9   7F3BF000 C6DF7300 DF  000   15    1    2    3    0   1  U    S    E  D  D  E  E  E  D    I
LT10  917C7000 EEA67F00 7F  000   17    C    5    3    0   1  P    S    E  E  E  E  E  E  D    I
LT11  6B000000 F5700000 BC  03F   04    7    D    0    0   1  P    S    E  E  E  E  E  E  D    V
LT12  712DB000 F1B59100 2A  000   19    C    F    1    0   1  P    S    E  E  E  E  D  E  D    V
LT13  00000000 F0000000 7F  0FF   07    B    0    0    0   1  P    S    D  D  E  E  E  E  D    V
LT14  A3000000 FDD00000 C5  03F   16    7    E    3    0   1  P    S    E  E  E  D  D  E  D    V
LT15  F7F00000 B0B80000 82  00F   1F    5    F    0    0   1  P    P    E  E  D  D  D  D  D    V
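To see which entry, if any, should translate the faulting fetch at 0xc00006C0, the dump can be checked mechanically. The following is only a sketch under stated assumptions: that the TSIZ column uses the e500 MAS1[TSIZE] encoding (page size = 4^TSIZ KB), that TID 0 matches any PID, and that a fetch matches only entries with VAL=V and TS equal to MSR[IS]. The struct, the function names and the choice of transcribed rows are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the dump, reduced to the fields used for address matching. */
struct tlb_entry {
    const char *name;
    uint32_t epn;   /* EPN column, taken as the base effective address */
    unsigned tid;   /* TID column; 0 is assumed to match any PID */
    unsigned tsiz;  /* TSIZ column; assumed size = 4^tsiz KB */
    unsigned ts;    /* TS column, compared with MSR[IS] for fetches */
    int valid;      /* VAL column: V = 1, I = 0 */
};

/* A few rows transcribed from the dump (not the full table). */
static const struct tlb_entry dump_rows[] = {
    { "LT0",  0xC0000000u, 0x00, 0x9, 0, 1 },
    { "LT1",  0xD0000000u, 0x00, 0x9, 0, 1 },
    { "LT2",  0xE0000000u, 0x00, 0x9, 0, 1 },
    { "LT13", 0x00000000u, 0x7F, 0xB, 1, 1 },
};

/* Page size in bytes for a TSIZ value: 4^tsiz KB = 1024 << (2 * tsiz). */
static uint64_t tlb_page_bytes(unsigned tsiz)
{
    return 1024ull << (2 * tsiz);
}

/* Returns the first row that would match ea in address space as, or NULL. */
static const struct tlb_entry *tlb_lookup(const struct tlb_entry *rows, size_t n,
                                          uint32_t ea, unsigned as, unsigned pid)
{
    assert(rows != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct tlb_entry *e = &rows[i];
        if (!e->valid || e->ts != as)
            continue;               /* invalid, or wrong address space */
        if (e->tid != 0 && e->tid != pid)
            continue;               /* entry is private to another PID */
        if (ea >= e->epn && (uint64_t)ea - e->epn < tlb_page_bytes(e->tsiz))
            return e;               /* ea lies inside this page */
    }
    return NULL;
}
```

Under these assumptions, tlb_lookup(dump_rows, 4, 0xC00006C0, 0, 0) returns LT0 (base 0xC0000000, 256 MB), so a translation for the fetch address appears to exist. The permission columns of that entry are then the next thing to compare against a supervisor instruction fetch.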
To answer your 2nd question - we have about 10 processes, and
about 60-70 threads total (30+ for the main processing process)...
>> > Anybody have any ideas - and/or ways to re-configure our
>> > setup to obtain more data? Or does this sound familiar to
>> > a bug somebody has already found in the kernel?
>> >
>> > We are even having trouble defining a test program that can
>> > cause (on purpose) the 'InstructionStorage' exception. Does
>> > anybody have a simple C (or PPC assembly) program that
>> > causes this exception, so we can run it in user application land
>> > and see if the symptoms are similar?
>> >
>> > Thank you in advance for any / all help you can provide....
>> > because I am completely stumped on even how to proceed!
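On the request for a user-space test: one way to provoke an instruction fetch fault from an application is to branch into a page that is mapped without execute permission. Here is a minimal POSIX sketch, assuming the kernel and MMU enforce no-execute on user pages (on Book E cores that would be via the UX bit). In user mode the kernel is expected to turn the fault into a signal, so this exercises the exception path but should not by itself hang a core. The function names are illustrative.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf fault_env;
static volatile sig_atomic_t fault_sig;

/* Records which signal arrived and unwinds back to provoke_fetch_fault(). */
static void on_fault(int sig)
{
    fault_sig = sig;
    siglongjmp(fault_env, 1);
}

/*
 * Jumps into a readable/writable but non-executable page.
 * Returns the signal number that was raised, 0 if the call returned
 * without trapping, or -1 if the page could not be mapped.
 */
static int provoke_fetch_fault(void)
{
    const size_t len = 4096;
    const int sigs[] = { SIGSEGV, SIGBUS, SIGILL };
    struct sigaction sa, old[3];
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    for (int i = 0; i < 3; i++)
        sigaction(sigs[i], &sa, &old[i]);

    fault_sig = 0;
    if (sigsetjmp(fault_env, 1) == 0) {
        /* Data-to-function pointer cast: fine on POSIX, not strict ISO C. */
        void (*fn)(void) = (void (*)(void))page;
        fn();   /* the instruction fetch from this page should fault */
    }

    /* Restore the previous handlers and release the page. */
    for (int i = 0; i < 3; i++)
        sigaction(sigs[i], &old[i], NULL);
    munmap(page, len);
    return fault_sig;
}
```

Calling provoke_fetch_fault() from main() and printing the result shows which signal the kernel delivered. If no-execute is not enforced on the target, the zero-filled page is executed instead, which may surface as SIGILL rather than SIGSEGV.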
>>
>>
>> Is your application generating a lot of processes, or does it have a
>> lot of concurrent processes on the 8572?
>>
>> - k