How to debug a hung multi-core system....

Morrison, Tom tmorrison at empirix.com
Thu May 21 09:17:00 EST 2009


All,

First off, we turned SPE off completely in our build - so we 
could debug a much deeper problem that seems to be occurring 
in our application (before we try to find a potential test 
case for corruption of GPR registers).

We have had this problem for 3 weeks, and only recently have 
narrowed it down to a single test case that makes it fail 
(although it is an extremely complicated test case)...

Setup:   
   Master Blade (8548E) with Linux 2.6.23 (and custom BSP)
   Slave Blade (8572E) with Linux 2.6.23 (and similar custom BSP).

The Master Blade works flawlessly (and also works flawlessly 
in a slave capacity). The single 'slave' 8572E blade 
communicates with the 'master' blade over TCP/IP & PCI Express
(and is running a similar application)...

Running single-core on the slave 8572E (nosmp option on the 
kernel command line), the application works in all conditions 
(from modestly loaded to well oversubscribed/pegged CPU).

In multi-core mode, the application also works flawlessly under 
modest load. The problem comes when we oversubscribe our 
application and push this 'slave' blade to the extreme edge of 
processing (falling behind in our processing...etc). 

Eventually, somewhere between 5 and 15 minutes into the run, 
this board hangs (the console becomes completely unresponsive 
and you cannot 'ping' the box).

I have a JTAG WindRiver ICE and connect to this blade after it 
is hung, and it appears that both cores are running to some 
extent:

   Core 1 seems to be in the idle loop - happily doing nothing 
		(and not servicing TCP and/or the console)...

   Core 0 seems to be 'stuck' at the "InstructionStorage" 
		exception, and it seems to be going 'nowhere' fast.

	SRR0 seems to point to this same spot (0xc00006C0)
	SRR1 value is 0x00021200 

I am at a loss to see how the kernel (and/or our kernel BSP) 
could cause this exception, and I am even more at a loss to 
figure out how an application could cause it...

Anybody have any ideas - and/or ways to re-configure our 
setup to obtain more data? Or does this sound like a bug 
somebody has already found in the kernel?

We are even having trouble writing a test program that can
cause the 'InstructionStorage' exception on purpose. Does 
anybody have a simple C (or ppc assembly) program that 
causes this exception, so we can run it in user application 
land and see if the symptoms are similar?

Thank you in advance for any / all help you can provide....
because I am completely stumped on even how to proceed!

Sincerely,

Tom Morrison
Principal Software Engineer


EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: tmorrison at empirix.com 
www.empirix.com





