SATA hang on 8315E triggered by heavy flash write?

Wed May 22 07:44:03 EST 2013

On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
> At this point, /dev/sda is pretty much unusable, and I have to do at
> least a reboot to recover.  (I don't recall if I had to do a power
> cycle at this point, though.)
> 
> I suspect that it is related to errata eLBC-A001 (from MPC8315E Chip
> Errata, Rev. 3, 09/2011):
> 
>   eLBC-A001:
> 
>   Simultaneous FCM and GPCM or UPM operation may erroneously trigger
>   bus monitor timeout
> 
>   Description: Devices: MPC8315E, MPC8314E
>   When the FCM is in the middle of a long transaction, such as NAND
>   erase or write, another transaction on the GPCM or UPM triggers the
>   bus monitor to start immediately for the GPCM or UPM, even though
>   the GPCM or UPM is still waiting for the FCM to finish and has not
>   yet started its transaction. If the bus monitor timeout value is not
>   programmed for a sufficiently large value, the local bus monitor may
>   time out. This timeout corrupts the current NAND Flash operation and
>   terminates the GPCM or UPM operation.
> 
>   Impact: Local bus monitor may time out unexpectedly and corrupt the
>   NAND transaction.
> 
>   Workaround: Set the local bus monitor timeout value to the maximum
>   by setting LBCR[BMT] = 0 and LBCR[BMTPS] = 0xF.
> 
>   Fix plan: No plans to fix
> 
> But it seems that erratum is already fixed:
> 
>   http://patchwork.ozlabs.org/patch/96339/
>   (git patch d08e44570e)
> 
> Am I reading that correctly?

Yes, that erratum has been worked around.
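
For reference, the workaround quoted above amounts to a read-modify-write
of LBCR: clear the BMT field and set BMTPS to 0xF. A minimal sketch of that
bit manipulation follows. The field masks are an assumption modeled on the
kernel's fsl_lbc.h and should be checked against the reference manual, and
both function names are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed LBCR field masks (modeled on the Linux kernel's fsl_lbc.h).
 * Verify against the MPC8315E reference manual before relying on them. */
#define LBCR_BMT   0x0000FF00u  /* bus monitor timing field */
#define LBCR_BMTPS 0x0000000Fu  /* bus monitor timer prescale field */

/* Hypothetical check: nonzero if BMT == 0 and BMTPS == 0xF,
 * i.e. the maximum timeout called for by the eLBC-A001 workaround. */
int elbc_a001_applied(uint32_t lbcr)
{
    return (lbcr & LBCR_BMT) == 0 && (lbcr & LBCR_BMTPS) == LBCR_BMTPS;
}

/* Hypothetical helper: return the given LBCR value with BMT cleared and
 * BMTPS set to 0xF, leaving every other bit untouched. */
uint32_t elbc_a001_lbcr(uint32_t lbcr)
{
    lbcr &= ~LBCR_BMT;   /* LBCR[BMT] = 0 */
    lbcr |= LBCR_BMTPS;  /* LBCR[BMTPS] = 0xF */
    assert(elbc_a001_applied(lbcr));
    return lbcr;
}
```

On real hardware the value would be read from and written back to the
memory-mapped LBCR with the platform's register accessors; this sketch only
covers the mask arithmetic.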

> (I'm already writing only one flash
> sector at a time, but it might be that even a single 0x10000-byte
> sector takes long enough to trigger the issue.)

I don't think this erratum is relevant.  Unlike NAND, NOR flash does  
not involve holding the localbus for extended periods of time.  I also  
don't see how it would interact with SATA, which is separate from the  
localbus.  Are you seeing any errors on the localbus, or just on SATA?

> I also verified that
> I have the relevant property in my device tree:
> 
>   localbus@e0005000 {
>     ...
>     compatible = "fsl,mpc8315-elbc", "fsl,elbc", "simple-bus";
> 
> So, my questions are:
> 
> 1. Is anyone else seeing something like this?
> 
> 2. Is there an obvious way for our code to detect that we're in the
>    middle of error recovery, so we can avoid writing to the disk until
>    recovery is complete?
> 
> 3. Is there any chance that the 1.5Gbps limiting code might have
>    exacerbated the problems?
> 
> 4. Should I open a support request with Freescale, or if someone from
>    Freescale is already reading this, could you look to see if anyone
>    else has reported it?

Hopefully Shaohui (our SATA person) can answer these.  If you don't get  
an answer, go ahead and open an official support request.

-Scott