SATA hang on 8315E triggered by heavy flash write?
Scott Wood
scottwood at freescale.com
Wed May 22 07:44:03 EST 2013
On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
> At this point, /dev/sda is pretty much unusable, and I have to do at
> least a reboot to recover. (I don't recall if I had to do a power
> cycle at this point, though.)
>
> I suspect that it is related to errata eLBC-A001 (from MPC8315E Chip
> Errata, Rev. 3, 09/2011):
>
> eLBC-A001:
>
> Simultaneous FCM and GPCM or UPM operation may erroneously trigger
> bus monitor timeout
>
> Description: Devices: MPC8315E, MPC8314E
> When the FCM is in the middle of a long transaction, such as NAND
> erase or write, another transaction on the GPCM or UPM triggers the
> bus monitor to start immediately for the GPCM or UPM, even though
> the GPCM or UPM is still waiting for the FCM to finish and has not
> yet started its transaction. If the bus monitor timeout value is not
> programmed for a sufficiently large value, the local bus monitor may
> time out. This timeout corrupts the current NAND Flash operation and
> terminates the GPCM or UPM operation.
>
> Impact: Local bus monitor may time out unexpectedly and corrupt the
> NAND transaction.
>
> Workaround: Set the local bus monitor timeout value to the maximum
> by setting LBCR[BMT] = 0 and LBCR[BMTPS] = 0xF.
>
> Fix plan: No plans to fix
>
> But it seems that erratum is already fixed:
>
> http://patchwork.ozlabs.org/patch/96339/
> (git patch d08e44570e)
>
> Am I reading that correctly?
Yes, that erratum has been worked around.
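For reference, the workaround quoted above amounts to a read-modify-write
of LBCR. The following is only a minimal sketch, not the code from the
patch itself: the function name is made up for illustration, and the two
field masks are assumptions that should be checked against the reference
manual for your part.

```c
#include <stdint.h>

/* Assumed field masks within the 32-bit LBCR register (verify for your SoC). */
#define LBCR_BMT   0x0000FF00u  /* bus monitor timing field */
#define LBCR_BMTPS 0x0000000Fu  /* bus monitor timer prescale field */

/*
 * Hypothetical helper: given the current LBCR value, return the value
 * with the bus monitor timeout set to its maximum, i.e. BMT = 0 and
 * BMTPS = 0xF. All other bits are left untouched.
 */
static uint32_t lbcr_max_bus_monitor_timeout(uint32_t lbcr)
{
    lbcr &= ~LBCR_BMT;   /* clear every BMT bit, so BMT = 0 */
    lbcr |= LBCR_BMTPS;  /* set every BMTPS bit, so BMTPS = 0xF */
    return lbcr;
}
```

On real hardware the result would be written back to the memory-mapped
register using the platform's big-endian I/O accessors rather than a
plain store.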
> (I'm already writing only one flash
> sector at a time, but it might be that even a single 0x10000-byte
> sector takes long enough to trigger the issue.)
I don't think this erratum is relevant. Unlike NAND, NOR flash does
not involve holding the localbus for extended periods of time. I also
don't see how it would interact with SATA, which is separate from the
localbus. Are you seeing any errors on the localbus, or just on SATA?
> I also verified that
> I have the relevant property in my device tree:
>
> localbus@e0005000 {
> ...
> compatible = "fsl,mpc8315-elbc", "fsl,elbc", "simple-bus";
>
> So, my questions are:
>
> 1. Is anyone else seeing something like this?
>
> 2. Is there an obvious way for our code to detect that we're in the
> middle of error recovery, so we can hold off writing to the disk
> until recovery is complete?
>
> 3. Is there any chance that the 1.5Gbps limiting code might have
> exacerbated the problems?
>
> 4. Should I open a support request with Freescale, or if someone from
> Freescale is already reading this, could you look to see if anyone
> else has reported it?
Hopefully Shaohui (our SATA person) can answer these. If you don't get
an answer, go ahead and open an official support request.
-Scott