How to debug a hung multi-core system....
Morrison, Tom
tmorrison at empirix.com
Thu May 21 09:17:00 EST 2009
All,
First off, we turned SPE off completely in our build - so we
could debug a much deeper problem that seems to be occurring
in our application (before we try to find a potential test
case for corruption of GPR registers).
We have had this problem for 3 weeks, and have only recently
narrowed it down to a single (albeit extremely complicated)
test case that makes it fail.
Setup:
Master Blade (8548E) with Linux 2.6.23 (and custom BSP)
Slave Blade (8572E) with Linux 2.6.23 (and similar custom BSP).
The Master Blade works flawlessly (and also works flawlessly
in a slave capacity). The single 'slave' 8572E blade
communicates with the 'master' blade over TCP/IP & PCI Express
(and is running a similar application)...
Running single-core on the slave 8572E (nosmp option on the
command line), the application works in all conditions (from
modestly loaded to well oversubscribed/pegged CPU).
In multi-core mode, the application also works flawlessly as
long as it is not overloaded. The problem comes when we
oversubscribe our application and push this 'slave' blade to
the extreme edge of its processing capacity (falling behind
in our processing, etc.).
Eventually, somewhere between 5 and 15 minutes in, the board
hangs (the console becomes completely unresponsive and
you cannot 'ping' the box).
I have a WindRiver JTAG ICE, and when I connect to the blade
after it has hung, both cores appear to be running to some
extent:
Core 1 seems to be in the idle loop - happily doing nothing
(and not servicing TCP and/or the console)...
Core 0 seems to be 'stuck' at the "InstructionStorage"
exception, and it seems to be going 'nowhere' fast:
SRR0 seems to point to this same spot (0xc00006C0)
SRR1 value is 0x00021200
I am at a loss to see how the kernel (and/or our kernel BSP)
could cause this exception, and I am even more at a loss as to
how an application could cause it...
Does anybody have any ideas, and/or ways to re-configure our
setup to obtain more data? Or does this sound like a bug
somebody has already found in the kernel?
We are even having trouble writing a test program that can
cause the 'InstructionStorage' exception on purpose. Does
anybody have a simple C (or ppc assembly) program that
causes this exception, so we can run it from user space
and see if the symptoms are similar?
Thank you in advance for any / all help you can provide....
because I am completely stumped on even how to proceed!
Sincerely,
Tom Morrison
Principal Software Engineer
EMPIRIX
20 Crosby Drive - Bedford, MA 01730
p: 781.266.3567 f: 781.266.3670
email: tmorrison at empirix.com
www.empirix.com