SATA hang on 8315E triggered by heavy flash write?

Xie Shaohui-B21989 B21989 at freescale.com
Wed May 22 16:15:14 EST 2013


Hi, Anthony Foiani,

Please confirm which operation is needed to reproduce the error:
1. Only update NOR for a sufficiently long time (e.g. tens of seconds), and see whether the error happens.
2. Only read/write the SSD, with no NOR operations, and see whether the error happens.
3. Start reading/writing the SSD and keep it running, then start reading NOR. If no error occurs for a long time, start writing NOR and see how long it takes for the error to happen.
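
The three tests above could be driven by a small harness along these lines. This is only a rough sketch, not a tested tool: it assumes Python and a `dmesg` binary are available on the board, the names `count_sata_errors`, `read_dmesg` and `run_phase` are made up for illustration, and the load commands themselves (your NOR update and SSD read/write programs) have to be supplied by you. It also assumes the kernel log buffer does not wrap during a phase.

```python
import subprocess
import time

# Strings taken from the console log quoted below; either one marks the failure.
ATA_ERROR_MARKERS = ("exception Emask", "hard resetting link")


def count_sata_errors(kernel_log):
    """Count kernel log lines that contain one of the SATA error markers."""
    return sum(
        1
        for line in kernel_log.splitlines()
        if any(marker in line for marker in ATA_ERROR_MARKERS)
    )


def read_dmesg():
    """Return the current kernel log as text (assumes a `dmesg` binary)."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def run_phase(commands, duration_s, read_log=read_dmesg, poll_s=1.0):
    """Run the load commands (argv lists) for up to duration_s seconds.

    Returns the seconds elapsed when a new SATA error first appears in the
    kernel log, or None if the phase finishes without one.
    """
    baseline = count_sata_errors(read_log())
    procs = [subprocess.Popen(cmd) for cmd in commands]
    start = time.monotonic()
    try:
        while time.monotonic() - start < duration_s:
            if count_sata_errors(read_log()) > baseline:
                return time.monotonic() - start
            time.sleep(poll_s)
        return None
    finally:
        # Stop whatever load is still running before the next phase.
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
```

For test 1, pass only the NOR update command; for test 2, only the SSD read/write command; for test 3, call `run_phase` with the SSD command plus a NOR read command, and if that returns None, call it again with the SSD command plus the NOR write command. The returned value is the time until the error appeared.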

Best Regards, 
Shaohui Xie


> -----Original Message-----
> From: Anthony Foiani [mailto:tkil at scrye.com]
> Sent: Wednesday, May 22, 2013 12:17 PM
> To: Wood Scott-B07421
> Cc: linuxppc-dev at lists.ozlabs.org; Xie Shaohui-B21989
> Subject: Re: SATA hang on 8315E triggered by heavy flash write?
> 
> 
> Scott --
> 
> Scott Wood <scottwood at freescale.com> writes:
> 
> > On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
> >> At this point, /dev/sda is pretty much unusable, and I have to do at
> >> least a reboot to recover.  (I don't recall if I had to do a power
> >> cycle at this point, though.)
> 
> For whatever it's worth, a hard boot (full power cycle) is indeed
> necessary at this point.
> 
> >> I suspect that it is related to errata eLBC-A001 (from MPC8315E Chip
> >> Errata, Rev. 3, 09/2011):
> >> ...
> >> But it seems that erratum is already fixed:
> >>
> >>   http://patchwork.ozlabs.org/patch/96339/
> >>   (git patch d08e44570e)
> >>
> >> Am I reading that correctly?
> >
> > Yes, that erratum has been worked around.
> 
> Ok, thanks for the confirmation.
> 
> >> (I'm already writing only one flash sector at a time, but it might be
> >> that even a single 0x10000-byte sector takes long enough to trigger
> >> the issue.)
> >
> > I don't think this erratum is relevant.  Unlike NAND, NOR flash does
> > not involve holding the localbus for extended periods of time.
> 
> I wasn't sure about the mechanism of the erratum, and it seemed awfully
> close, so I thought I'd go fishing.  Guess I missed.  :(
> 
> It is NOR writes, btw; I do both in my application, but the initial error
> always seems to occur during a NOR write.  (In this device, kernel +
> devtree go into NOR flash, ramdisk goes into NAND flash, and data goes to
> SSD... stop laughing.)
> 
> Here's the most recent hang.  First, to compare the application log
> timestamps with the kernel log timestamps:
> 
>   # mix of kernel and application log, note that kernel is about +12s.
>   +0.537506 main.0 [0]: rc: fork took 9.376ms
>   [   12.892323] PHY: mdio at e0024520:01 - Link is Up - 100/Full
>   +1.603034 main.0 [0]: schs: ctor: done
> 
> The console output is:
> 
>   # console log
>   [318334.294126] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
>   [318334.301515] ata2.00: PHY RDY changed
>   [318334.305301] ata2.00: failed command: WRITE DMA
>   [318334.309991] ata2.00: cmd ca/00:08:b0:00:18/00:00:00:00:00/e1 tag 0 dma 4096 out
>   [318334.310015]          res 50/00:00:08:61:25/00:00:00:00:00/e1 Emask 0x10 (ATA bus error)
>   [318334.325689] ata2.00: status: { DRDY }
>   [318334.329717] ata2: hard resetting link
>   [318334.836038] ata2: Hardreset failed, not off-lined 0
>   [318334.848407] ata2: setting speed (in hard reset)
>   [318344.456050] ata2: No Signature Update
>   [318344.631916] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   [318344.638354] ata2.00: link online but device misclassified
>   [318349.643897] ata2.00: qc timeout (cmd 0xec)
>   [318349.648268] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>   [318349.654562] ata2.00: revalidation failed (errno=-5)
>   [318349.659667] ata2: hard resetting link
>   [318350.163864] ata2: Hardreset failed, not off-lined 0
>   [318350.175869] ata2: setting speed (in hard reset)
>   [318359.771956] ata2: No Signature Update
>   [318359.947901] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   [318359.954342] ata2.00: link online but device misclassified
>   [318369.959921] ata2.00: qc timeout (cmd 0xec)
>   [318369.964279] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>   [318369.970567] ata2.00: revalidation failed (errno=-5)
>   [318369.975658] ata2: hard resetting link
>   [318370.479933] ata2: Hardreset failed, not off-lined 0
>   [318370.491880] ata2: setting speed (in hard reset)
>   [318380.083892] ata2: No Signature Update
> 
> And my application log:
> 
>   # application log
>   +318320.957019 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x180000 from buf[0x80000]; attempt 1/3
>   +318322.498346 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x190000 from buf[0x90000]; attempt 1/3
>   +318323.849995 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1a0000 from buf[0xa0000]; attempt 1/3
>   +318325.262559 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1b0000 from buf[0xb0000]; attempt 1/3
>   +318326.703213 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1c0000 from buf[0xc0000]; attempt 1/3
> 
> > I also don't see how it would interact with SATA, which is separate
> > from the localbus.
> 
> No idea.  Is there some other shared resource that might be taxed by this
> type of load?
> 
> I do get a few other errors, usually just once or twice per boot:
> 
>   [ 4231.619368] NOHZ: local_softirq_pending 100
>   [ 4232.249935] NOHZ: local_softirq_pending 100
>   [ 4232.312241] NOHZ: local_softirq_pending 100
>   [ 4232.424523] NOHZ: local_softirq_pending 100
>   [ 4233.139146] NOHZ: local_softirq_pending 100
>   [ 4233.328540] NOHZ: local_softirq_pending 100
>   [ 4233.655909] NOHZ: local_softirq_pending 100
>   [ 4234.106578] NOHZ: local_softirq_pending 100
>   [ 4234.853966] NOHZ: local_softirq_pending 100
>   [ 4235.375208] NOHZ: local_softirq_pending 100
>   [11072.027818] hrtimer: interrupt took 126210 ns
> 
> They seem harmless, though, and (as the timestamps indicate) the machine
> happily ran for 3-4 days after those issues.
> 
> > Are you seeing any errors on the localbus, or just on SATA?
> 
> I'm not seeing any errors in the console log -- but I'm not using the LBC
> for anything other than flash writes, SFAIK.  (Unless I2C is handled
> through the LBC, in which case, I have frequent (~50-100/s) small
> transactions all the time -- but the hangs always coincide with flash
> writes, and not with the I2C traffic that is going on all the
> time...)
> 
> > Hopefully Shaohui (our SATA person) can answer these.  If you don't
> > get an answer, go ahead and open an official support request.
> 
> I have a (lousy) workaround in hand: don't touch the disk during flash
> updates.  (The flash writes are software updates, which will hopefully be
> fairly rare once I'm done developing this thing.  Until then, though, I'm
> updating it multiple times a day, and have hit this quite a few times by
> now.)
> 
> So there's no great hurry.  If Shaohui can find something in the next
> week or so, that'd be fantastic; otherwise, I'll open a request.
> 
> Thanks again!
> 
> Best regards,
> Anthony Foiani



