kernel access of bad area, sig: 11 (mpc852t)

Wed Apr 19 22:58:35 EST 2006

>>> Hi,
>>> I'm having a problem porting Linux kernel 2.4.21 to our MPC852T custom
>>> board. The kernel panics randomly with sig 11.
>>> The board boots up fine and we also get to the prompt. When we open 3-4
>>> telnet sessions and try to run some command, the kernel panics. This is
>>> completely random. Sometimes it even panics before we open the telnet
>>> session.
>>>

>>> <oops dump snipped>
>>>
>>You almost certainly have SDRAM problems.  If you have thoroughly checked
>>out the complete address range statically, remember that burst accesses
>>will not occur until the cache is turned on, so your problem may be with
>>bursting.  But you can also have severe problems like a missing address
>>line and linux still run for a few seconds.
>>
>>Mark Chambers
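
The "missing address line" case Mark mentions is usually checked with a
walking address-bit test rather than a plain pattern fill. The following is
a minimal sketch of that idea, not code from this thread: the function name,
the patterns, and the requirement that the region be a power-of-two number
of words are all assumptions made for illustration. To exercise the bus
rather than the cache, it would have to run on an uncached mapping or with
the data cache disabled.

```c
#include <stddef.h>
#include <stdint.h>

#define ADDR_PATTERN     0xAAAAAAAAu
#define ADDR_ANTIPATTERN 0x55555555u

/*
 * Hypothetical helper: walk a single set bit through the word offset to
 * find address lines that are stuck or shorted together.
 * nwords is assumed to be a power of two.
 * Returns 0 if no aliasing is seen, -1 otherwise; on failure *bad_offset
 * holds the power-of-two word offset being checked when the mismatch
 * was observed.
 */
int address_line_test(volatile uint32_t *base, size_t nwords,
                      size_t *bad_offset)
{
    size_t off, test_off;

    /* Put the default pattern at every power-of-two offset. */
    for (off = 1; off < nwords; off <<= 1)
        base[off] = ADDR_PATTERN;

    /* A write to offset 0 must not disturb any of them (bit stuck high). */
    base[0] = ADDR_ANTIPATTERN;
    for (off = 1; off < nwords; off <<= 1) {
        if (base[off] != ADDR_PATTERN) {
            *bad_offset = off;
            return -1;
        }
    }
    base[0] = ADDR_PATTERN;

    /* A write to one offset must not show up at offset 0 or at any
     * other power-of-two offset (bit stuck low, or two bits shorted). */
    for (test_off = 1; test_off < nwords; test_off <<= 1) {
        base[test_off] = ADDR_ANTIPATTERN;
        if (base[0] != ADDR_PATTERN) {
            *bad_offset = test_off;
            return -1;
        }
        for (off = 1; off < nwords; off <<= 1) {
            if (off != test_off && base[off] != ADDR_PATTERN) {
                *bad_offset = test_off;
                return -1;
            }
        }
        base[test_off] = ADDR_PATTERN;
    }
    return 0;
}
```

Because it only touches power-of-two offsets, a pass like this is quick, but
it says nothing about data lines, refresh, or burst timing.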

>We've checked the SDRAM. The timings (UPM) look fine. The problem,
>however, is that Linux does not hang until after a few processes are
>started.
>If we boot to Linux and leave it alone, everything is fine and the
>board keeps working. However, each time a few processes (e.g. 4-5 telnet
>sessions) are started, the system either panics or hangs (goes
>dead).

>Thanks in advance,
>Akshay

We have been experiencing this same issue on a small fraction of our
production boards. The exact same version of software will run for months
on other instances of the exact same board design, but a few percent get
'random' trap 300s. They occur only after Linux has booted and address
translation and caching are turned on. Examining the oopses and memory
shows that some location in SDRAM holds a bogus value, but I don't have
the tools to trace back how it got that way.

I have ported a rigorous moving-inversions memory test into our
firmware, and have run it extensively across the entire SDRAM address
space (the test code executes from flash). I have let this test run
continuously for many hours, but it has never found a memory problem.
Unfortunately, I do not have test software that enables MMU address
translation or caching, so, as Mark said, I can't test memory with
burst accesses. Our hardware engineers have reviewed the designs very
carefully and are quite confident that there is plenty of margin in the
memory timing. Signal quality has also been carefully checked.
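
For readers unfamiliar with the technique, one pass of a moving-inversions
test can be sketched roughly as below. This is an illustrative sketch only,
not the firmware test described above: the function name and signature are
made up for the example, and a real test would repeat the pass with several
patterns. Run with the cache off, as in the firmware case, every access is
a single-beat transfer, which is exactly why it cannot expose a burst
problem.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical single pass of a moving-inversions test.
 * 1. Fill the region with `pattern`.
 * 2. Ascending: verify `pattern`, then write its complement.
 * 3. Descending: verify the complement, then write `pattern` back.
 * Returns 0 on success, -1 on the first mismatch, in which case
 * *bad_offset holds the failing word offset.
 */
int moving_inversions_pass(volatile uint32_t *base, size_t nwords,
                           uint32_t pattern, size_t *bad_offset)
{
    const uint32_t inverse = (uint32_t)~pattern;
    size_t i;

    for (i = 0; i < nwords; i++)
        base[i] = pattern;

    /* Sweep upward, inverting each word after it has been checked. */
    for (i = 0; i < nwords; i++) {
        if (base[i] != pattern) {
            *bad_offset = i;
            return -1;
        }
        base[i] = inverse;
    }

    /* Sweep downward, restoring the original pattern. */
    for (i = nwords; i-- > 0; ) {
        if (base[i] != inverse) {
            *bad_offset = i;
            return -1;
        }
        base[i] = pattern;
    }
    return 0;
}
```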

Our manufacturing people have replaced the CPU on some of these boards,
and the problem went away.

If anyone else on the mailing list has experienced this issue, or has
developed a virtual address memory test, please let us know.
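
One low-effort approximation of a virtual address memory test is to run the
pattern check as an ordinary Linux user process, since user code already
executes with address translation and the data cache enabled. The sketch
below assumes a buffer several times larger than the data cache, so that
each sweep forces cache line fills and write-backs, which go out to SDRAM as
bursts. The function names and the address-derived pattern are assumptions
for illustration. It can only cover the pages the kernel hands to the
process, not the kernel's own memory, so a clean run proves little while a
failure is informative.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern: mixes the word index into the seed so that
 * neighbouring words, and words one cache line apart, all differ. */
static uint32_t expected_word(size_t index, uint32_t seed)
{
    return seed ^ ((uint32_t)index * 0x9E3779B1u);
}

/*
 * Fill the whole buffer, then read it all back. When the buffer is much
 * larger than the data cache, the fill evicts earlier lines (burst
 * writes) and the verify sweep reloads them (burst reads).
 * Returns the number of mismatching words.
 */
size_t cached_pattern_pass(volatile uint32_t *buf, size_t nwords,
                           uint32_t seed)
{
    size_t i, errors = 0;

    for (i = 0; i < nwords; i++)
        buf[i] = expected_word(i, seed);

    for (i = 0; i < nwords; i++) {
        if (buf[i] != expected_word(i, seed))
            errors++;
    }
    return errors;
}
```

In practice this would be called in a loop from a small program that
allocates the buffer with malloc(), changes the seed on every pass, and
runs alongside the telnet sessions that trigger the failure.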

Ken Poole
