OOPS on MPC8548 board when writing RAID5 array

hank peng pengxihan at gmail.com
Fri Nov 13 13:45:25 EST 2009


2009/11/13 Dan Williams <dan.j.williams at intel.com>:
> Hi Hank,
>
> Thanks for testing.
>
> On Tue, Nov 10, 2009 at 4:44 AM, hank peng <pengxihan at gmail.com> wrote:
>> The CPU is an MPC8548, the kernel version is 2.6.31.5, and the
>> CONFIG_FSL_DMA and CONFIG_ASYNC_TX_DMA options are both enabled.
>> #mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
>> #dd if=/dev/zero of=/dev/md0 bs=1M count=1000
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> MPC85xx CDS
>> Modules linked in:
>> NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
>> REGS: c2dd5c80 TRAP: 0700   Not tainted  (2.6.31.5)
>> MSR: 00029000 <EE,ME,CE>  CR: 22004028  XER: 00000000
>> TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
>> GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
>> GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
>> GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
>> GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
>> NIP [c01c45d8] async_tx_quiesce+0x28/0x74
> [..]
>> I checked the kernel source code and found that this OOPS was caused
>> by the following BUG_ON in crypto/async_tx/async_tx.c:
>> void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
>> {
>>        if (*tx) {
>>                /* if ack is already set then we cannot be sure
>>                 * we are referring to the correct operation
>>                 */
>>                BUG_ON(async_tx_test_ack(*tx));
>>   /* OOPS occurred here */
>
> Yes, this looks like a manifestation of the issue I brought up in my
> review of the driver [1].  The talitos_prep_dma_xor routine is always
> acknowledging its descriptors, which it should not because that is the
> responsibility of the client of the api.  When the raid code tries to
> attach a memcpy that depends on the xor it sees that it needs to
> switch from talitos to fsldma (or software if fsldma is turned
> off).  Since talitos does not have the DMA_INTERRUPT capability to
> trigger the channel switch we need to perform a synchronous wait for
> the xor to complete before submitting the memcpy.  When the ack bit is
> already set the xor descriptor might be recycled by the dma device driver
> while we are waiting for it, hence the BUG_ON().
>
Thanks for the reply, Dan.
I forgot to mention that when this OOPS happened, I had not applied the
talitos XOR patch. I had only enabled the async_xx API and FSL_DMA, so I
think XOR was done by the CPU and memcpy was done by DMA through the
async_xx API.
Another interesting thing is that I have tried the latest stable kernel,
2.6.31.6, and the problem did not occur there. After I applied the
talitos XOR patch, it was still OK. I checked the related source code
and it seems there were no relevant changes, which leaves me very
confused.
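
For anyone following the thread, here is a minimal user-space sketch of
the ack check in async_tx_quiesce() discussed above. It is only a model,
not the kernel code: struct model_desc, model_ack() and model_quiesce()
are made-up names, the wait for completion is reduced to setting a flag,
and the kernel's BUG_ON() is replaced by a -1 return so the condition can
be observed without crashing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DMA descriptor; only two flags are modeled. */
struct model_desc {
	bool acked;    /* client has released it; the driver may recycle it */
	bool complete; /* the operation has finished */
};

/* Client-side acknowledgement: after this the driver may reuse the descriptor. */
void model_ack(struct model_desc *d)
{
	d->acked = true;
}

/*
 * Model of the synchronous wait.
 * Returns 0 if the wait was safe, or -1 in the case where the kernel
 * would hit BUG_ON(): the descriptor was already acked, so it may have
 * been recycled and we cannot be sure which operation we are waiting on.
 */
int model_quiesce(struct model_desc **tx)
{
	if (*tx) {
		if ((*tx)->acked)
			return -1;

		/* Assumption: the real code blocks here until the hardware is done. */
		(*tx)->complete = true;

		/* Only now does the client give the descriptor back. */
		model_ack(*tx);
		*tx = NULL;
	}
	return 0;
}
```

Calling model_quiesce() on a descriptor whose acked flag is already true
returns -1, which corresponds to the BUG_ON() in the trace. On an un-acked
descriptor it returns 0, marks the descriptor acked, and clears the
caller's pointer.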

I have been testing the XOR patch on the latest series of kernels on an
MPC8548 board, and I hope the Freescale guys can also help.

> --
> Dan
>
> See the final comment:
> [1]: http://marc.info/?l=linux-raid&m=125685641412112&w=2
>



-- 
The simplest is not all best but the best is surely the simplest!

