Exception in kernel mode

Sat Mar 17 02:33:19 EST 2007

On Mar 16, 2007, at 9:45 AM, Charles Krinke wrote:

> Is this a system you are just bringing up or one that's been running
> for a while?  It really seems like memory corruption of some form.
> I'd suggest checking memory controller settings.
>
> Also, what happens if you disassemble the kernel image and look at
> the addresses pointed to by NIP:
> C00DEE18 & C002CE68.
>
> - k
> Dear Kumar:
>
> We have two systems. One based on an 8241, and one based on an  
> 8541. The 8241 has been running for some time with Linux 2.4 and  
> the 8541 is coming up. Both are using the 2.6.17.11 kernel from  
> kernel.org with modifications for our hardware.
>
> In the case of the 8241, I started out with the 2.4 modifications,  
> which were originally based on the 8260 and ported them to 2.6. In  
> the case of the 8541, I started out with the Embedded Planet 8555EP
> 2.6 kernel source and added that to the 2.6.
>
> I don't see this exception in the 8541, although extensive testing
> has not yet been completed. The 8241 exhibits this exception on
> three different 8241 boards, so I don't suspect the hardware.
>
> We are using the MontaVista toolchain and their root filesystem
> including 'tar' and 'cp' which are the programs that currently  
> exhibit the fault.
>
> Yesterday, when I saw an NIP at 0x900, I was ready to jump on the
> interrupts not being set up correctly, but after a few hours of
> going through that, I am now convinced the interrupts are set up
> correctly, so it is something more subtle.
>
> Certainly, memory corruption is the next thing to be concerned with.
>
> One thing that has concerned me a bit is that we have no swap space  
> available at all. This is an embedded system with 64MByte of RAM  
> and JFFS2 NAND flash with no swap partitions.
>
> I suspect auditing the MMU setup differences between the original  
> 2.4 kernel and the new 2.6 kernel for the 8241 board is the next step.
>
> The three exceptions I saw yesterday were 1)0x900 in the  
> timer_interrupt, 2) C00DEE18 (inside the tar program) and 3)  
> C002CE68 (in one of the kernel routines).

#2 is inside the kernel as well.  Look at the System.map or
objdump -d vmlinux to see what exactly is at those instructions.
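
A rough sketch of the System.map lookup, in Python, assuming the usual
"address type name" layout of that file. The function names are made up
for illustration and are not part of any kernel tool. It reports the
last symbol at or below an address, which is the routine the NIP falls in:

```python
import bisect

def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (addr, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue  # first field was not a hex address
        symbols.append((addr, parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # addr is below the first symbol in the map
    base, name = symbols[i]
    return name, addr - base
```

To use it, pass the open System.map file to load_system_map() and then
call resolve(symbols, 0xC00DEE18) and resolve(symbols, 0xC002CE68).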

> I suspect the actual addresses are red herrings and this exception
> can occur at any address. This certainly would tend to indicate  
> some sort of memory setup issue.

I think it's useful to know if the instructions at the two offsets  
C00DEE18 & C002CE68 are similar in some way before jumping to that  
conclusion.
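
One way to put the two sites side by side is to disassemble a small
window around each NIP. The following is only a sketch: the helper
names and window sizes are invented here, and on a cross-build host the
objdump argument would have to name the cross toolchain's objdump
rather than the native one. It relies on objdump's --start-address and
--stop-address options and on PowerPC instructions being 4 bytes each:

```python
import subprocess

INSN_BYTES = 4  # PowerPC instructions are a fixed 4 bytes wide

def objdump_window_cmd(vmlinux, addr, before=4, after=4, objdump="objdump"):
    """Build an objdump command covering a few instructions around addr."""
    start = addr - before * INSN_BYTES
    # Stop just past the last instruction we want to see.
    stop = addr + (after + 1) * INSN_BYTES
    return [objdump, "-d",
            "--start-address=0x%x" % start,
            "--stop-address=0x%x" % stop,
            vmlinux]

def disassemble_window(vmlinux, addr, **kwargs):
    """Run objdump and return its text output for the window around addr."""
    cmd = objdump_window_cmd(vmlinux, addr, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling disassemble_window("vmlinux", 0xC00DEE18) and
disassemble_window("vmlinux", 0xC002CE68) and reading the two listings
together should show whether both faults land on the same kind of
instruction.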

> Changing the Oops logic to print out the NextInstruction as well as
> the NIP might be helpful so I could discern the difference between  
> what the program is trying to do and what it is really doing.
>
> Are there any other thoughts you might have on diagnosis techniques  
> at this point?

Try turning on KALLSYMS (CONFIG_KALLSYMS); this should provide more
info in the oops as well.

- k