Parsing a bus fault message?

Wed Sep 29 18:49:39 EST 2010

Scott Wood wrote:
> On Tue, 28 Sep 2010 08:31:54 -0700
> "Ira W. Snyder" <iws at ovro.caltech.edu> wrote:
> 
>> On Tue, Sep 28, 2010 at 09:26:51AM -0500, david.hagood at gmail.com wrote:
>>> Alternatively, can somebody see a hint in the message that I don't know
>>> enough to pick out? At this point, my code is trying to memcpy() from the
>>> PCIe bus (mapped via the outbound ATMU) to local memory, so the fault is
>>> either a) the ATMU is not accessible b) the ATMU is accessible but not
>>> mapped (which I would have thought the ioremap call I made would have
>>> handled) or c) the chip is not able to bus master on the PCI bus.
> 
> Check the LAWs, the outbound ATMU, and the PCI device's BAR.  Make sure

I also get a machine check exception if I configure the LAW improperly for PCI
(e.g. a mismatched PCIe controller ID).

From your log, it looks like 0xexxxxxxx should be your PCI space. So you can
check whether that falls into the appropriate LAW configuration. Maybe you can
post your boot log and error log here.
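
To make the LAW check concrete, here is a rough sketch of the address test. The
helper names are made up for this mail, and the size encoding (window length =
2^(SIZE + 1) bytes) is only my recollection of the LAWAR size field, so please
verify it against the reference manual for your part. It does not check the
target interface ID, which also has to match your PCIe controller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: window length in bytes for a LAW size field.
 * ASSUMPTION: length = 2^(size_field + 1); check your reference manual. */
static uint64_t law_window_bytes(unsigned size_field)
{
    if (size_field > 62)        /* avoid shifting by 64 or more */
        return 0;
    return (uint64_t)1 << (size_field + 1);
}

/* Hypothetical helper: does addr fall inside [base, base + length)? */
static bool law_covers(uint64_t base, unsigned size_field, uint64_t addr)
{
    uint64_t len = law_window_bytes(size_field);

    /* Compare the offset, not base + len, so the sum cannot overflow. */
    return addr >= base && (addr - base) < len;
}
```

For example, with an illustrative window at base 0xe0000000 and size field 0x1c
(512 MB under the assumption above), law_covers(0xe0000000, 0x1c, 0xe1000000)
is true. Substitute the values you actually read back from your LAW registers.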

> the address goes where you're expecting at each level.
> 
>>> Machine check in kernel mode.
>>> Caused by (from SRR1=149030): Transfer error ack signal
>> ^^^ this is the line that contains some critical info
>>
>> In the 86xx CPU manual, you should be able to find information about the
>> SRR1 register. Decoding the hex SRR1=0x149030 may help.

Actually, 'Transfer error ack signal' is already what the kernel prints after
it decodes SRR1/MSSSR0.
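
For reference, a minimal user-space sketch of that decode. It is written from
my memory of the 6xx/74xx machine check path in arch/powerpc/kernel/traps.c, so
treat the mask 0x601F0000 and the case values as assumptions and compare them
with your own kernel tree. The name srr1_cause() is made up for this mail.

```c
#include <stdint.h>

/* Hypothetical helper: map SRR1 to the cause string the kernel prints.
 * ASSUMPTION: mask and case values recalled from traps.c, not verified
 * against any particular kernel version. */
static const char *srr1_cause(uint32_t srr1)
{
    switch (srr1 & 0x601F0000u) {
    case 0x00080000u:
        return "Machine check signal";
    case 0x00000000u:           /* 601 */
    case 0x00040000u:
    case 0x00140000u:           /* 7450 MSS error and TEA */
        return "Transfer error ack signal";
    case 0x00020000u:
        return "Data parity error signal";
    case 0x00010000u:
        return "Address parity error signal";
    case 0x20000000u:
        return "L1 Data Cache error";
    case 0x40000000u:
        return "L1 Instruction Cache error";
    case 0x00100000u:
        return "L2 data cache parity error";
    default:
        return "Unknown values in msr";
    }
}
```

With SRR1=0x149030 the masked value is 0x140000, so srr1_cause(0x149030) gives
"Transfer error ack signal", which matches the line in the log. Decoding SRR1
by hand will therefore not tell you more than the kernel already printed.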

>>
>> The kernel is telling you this is a TEA (transfer error acknowledge)
>> error. I've only seen this when I get an unhandled timeout on the local
>> bus. For example, a FPGA that has died in the middle of a request.
> 

I have met this only once, when the kernel accessed the USB host controller
registers on an mpc837x. But the same kernel ran fine on another target of the
same version. So I think sometimes you have to check the hardware.

> I've seen it when you access a physical address that has no device
> backing it up.
>

Yes. This should be the most common reason for a machine check exception when
we access an address that is mapped cache-inhibited.

>> On the PCI bus, I haven't seen this error. The 83xx PCI controller is
>> smart enough to return 0xffffffff when reading a non-existent device.
> 
> I believe that behavior is configurable.

I know that some PCI controllers return 0xffffffff when they access a
non-existent device. Because the controller gets no response from the
non-existent device, it aborts the read and drives the bus to a known state,
0xffffffff. But I have to admit I am really not sure whether this is
configurable. I would expect this behavior to be a fixed feature of the given
PCI controller.

Tiejun

> 
> -Scott