[RFC PATCH] fsldma: Add DMA_SLAVE support

Ira Snyder iws at ovro.caltech.edu
Fri Jun 19 06:50:24 EST 2009


On Thu, Jun 18, 2009 at 11:16:03AM -0700, Dan Williams wrote:
> Ira Snyder wrote:
>> I can do something similar by calling device_prep_dma_memcpy() lots of
>> times. Once for each page in the scatterlist.
>>
>> This has the advantage of not changing the DMAEngine API.
>>
>> This has the disadvantage of not being a single transaction. The DMA
>> controller will get an interrupt after each memcpy() transaction, clean
>> up descriptors, etc.
>>
>
> Why would it need an interrupt per memcpy?  There is a  
> DMA_PREP_INTERRUPT flag to gate interrupt generation to the last  
> descriptor.  This is how async_tx delays callbacks until the last  
> operation in a chain has completed.  Also, I am not currently seeing how  
> your implementation achieves a single hardware transaction.  It still  
> calls fsl_dma_alloc_descriptor() per address pair?
>

Ok, there are a few things here:

1) an interrupt per memcpy

The *software* using DMAEngine doesn't need one interrupt per memcpy.
That is controlled by the DMA_PREP_INTERRUPT flag, exactly as you
describe.

The Freescale DMA driver DOES use one interrupt per async_tx descriptor.
See drivers/dma/fsldma.c dma_init() and append_ld_queue().

The FSL_DMA_MR_EOTIE flag in dma_init() tells the controller to generate
an interrupt at the end of each transfer. A transfer is a (potentially
long) list of address pairs / hardware descriptors.

The FSL_DMA_MR_EOSIE flag in append_ld_queue() tells the controller to
generate an interrupt at the end of transferring this hardware
descriptor (AKA segment). The driver combines multiple memcpy operations
into a single transfer. When the driver combines operations, it adds the
EOSIE flag to the descriptor that would have been the end of a transfer.
It uses this interrupt to update the DMA cookie, presumably to speed up
users of dma_sync_wait() when there are lots of users sharing the DMA
controller.

Let me try to explain what will happen with some code:

=== Case 1: Two separate transfers ===

dma_cookie_t t1, t2;
t1 = dma_async_memcpy_buf_to_buf(...);
dma_async_memcpy_issue_pending();

/*
 * some time goes by, the DMA transfer completes,
 * and the controller gets the end-of-transfer interrupt
 */

t2 = dma_async_memcpy_buf_to_buf(...);
dma_async_memcpy_issue_pending();

/*
 * some time goes by, the DMA transfer completes,
 * and the controller gets the end-of-transfer interrupt
 */

This is exactly what I would expect from the API. There are two separate
transfers, and there are two hardware interrupts. Here is a crude
timeline.

----|----------|----------------------------|----------|-------
    |          |                            |          |
    t1 start   t1 end, EOT interrupt        t2 start   t2 end, EOT interrupt

=== Case 2: Two separate transfers, merged by the driver ===

t1 = dma_async_memcpy_buf_to_buf(...);
dma_async_memcpy_issue_pending();

/*
 * the controller starts executing the transfer, but has not
 * finished yet
 */
t2 = dma_async_memcpy_buf_to_buf(...);

/*
 * append_ld_queue() sets the EOSIE flag on the last hardware
 * descriptor in t1, then sets the next link descriptor to the
 * first descriptor in t2
 */

Now there are two transfers, and again two hardware interrupts. Again,
not really a problem. Every part of the API still works as expected.

----|-----------|---------------|---------------------------------|--------
    |           |               |                                 |
    t1 start    t2 tx_submit()  t1 end, EOS interrupt, t2 start   t2 end, EOT interrupt

=== Case 3: Two transfers, merged by the driver ===

t1 = dma_async_memcpy_buf_to_buf(...);
t2 = dma_async_memcpy_buf_to_buf(...);
dma_async_memcpy_issue_pending();

After a very careful reading of the driver, I just noticed that if there
is no transfer in progress (as would be expected on a private channel)
then the EOSIE flag never gets set, and the requests are simply
merged together. This would lead to a timeline like this:

----|----------------|----------------------|--------------------------
    |                |                      |
    t1 start         t1 end, t2 start       t2 end, EOT interrupt

This is perfectly fine as well. It is the behavior I want.

Some more study of the code shows that the Freescale DMA driver will not
halt the channel while it is busy, meaning that it will not clear the
external start bits during a transfer.

So, in conclusion, I can call memcpy multiple times and have it all
merged into a single transfer by the driver, with a single hardware
interrupt (at the end-of-transfer) and have everything complete without
halting the DMA controller.

2) Single transaction

I think we're using different terms here. I define a single transaction
as the span during which the DMA controller is busy transferring data.

In case #1 above, there are two transfers. In case #2 above, one
transfer, and two interrupts. In case #3 above, one transfer, one
interrupt.

3) Hardware descriptor per address pair

Yes, there can be many hardware descriptors per address pair. And
therefore many hardware descriptors per transaction, with my definition.

> The api currently allows a call to ->prep_* to generate multiple  
> internal descriptors for a single input address (i.e. memcpy in the case  
> where the transfer length exceeds the hardware maximum).  The slave api  
> allows for transfers from a scatterlist to a slave context.  I think  
> what is bothering me is that the fsldma slave implementation is passing  
> a list of sideband addresses rather than a constant address context like  
> the existing dw_dmac, so it is different.  However, I can now see that  
> trying to enforce uniformity in this area is counterproductive.  The  
> DMA_SLAVE interface will always have irreconcilable differences across  
> architectures.
>

Yep, we're in agreement here.

In another driver, I used this DMA_SLAVE API to transfer from hardware
addresses to a scatterlist. I have ~50 blocks, all at different
non-contiguous addresses that I want transferred into a single
scatterlist.

It was awfully convenient to have this happen in a single call, rather
than 50 separate calls to dma_async_memcpy_buf_to_buf().

>> It also has the disadvantage of not being able to change the
>> controller-specific features I'm using: external start. This feature
>> lets the "3rd device" control the DMA controller. It looks like the
>> atmel-mci driver has a similar feature. The DMAEngine API has no way to
>> expose this type of feature except through DMA_SLAVE.
>
> Yeah, an example of an architecture specific quirk that has no chance of  
> being uniformly handled in a generic api.
>

Again, we're in agreement.

>> If it is just the 3 helper routines that are the issue with this patch,
>> I have no problem dropping them and letting each user re-create them
>> themselves.
>
> I think this makes the most sense at this point.  We can reserve  
> judgement on the approach until the next fsldma-slave user arrives and  
> tries to use this implementation for their device.  Can we move the  
> header file under arch/powerpc/include?
>

Sure, that would be fine.

> [..]
>> A single-transaction scatterlist-to-scatterlist copy would be nice.
>>
>> One of the major advantages of working with the DMA controller is that
>> it automatically handles scatter/gather. Almost all DMA controllers have
>> the feature, but it is totally missing from the high-level API.
>
> The engines I am familiar with (ioatdma and iop-adma) still need a  
> hardware descriptor per address pair I do not see how fsldma does this  
> any more automatically than those engines?  I could see having a helper  
> routine that does the mapping and iterating, but in the end the driver  
> still sees multiple calls to ->prep (unless of course you are doing this  
> in a DMA_SLAVE context, then anything goes).
>

Yep. The Freescale DMA controller behaves exactly as you describe the
ioatdma and iop-adma engines. A helper routine that does the mapping and
iterating would be nice, in my opinion. It took me a while to convince
myself that the nested loops in device_prep_slave_sg() were correct.

Following on the earlier observations in this email, it would be
possible to emulate the slave behavior using just
dma_async_memcpy_buf_to_buf(). With the exception of the external start
bit, of course.

Now, the only thing I would use device_prep_slave_sg() for would be to
set the external start bit. No actual data transfer needs to happen, no
descriptors need to be allocated. Just the "Enable extra controller
features" block from my patch. This seems like an abuse of the DMA_SLAVE
API. What would be returned in that case?  An async_tx descriptor is
wrong, because the controller won't be able to free() it...

So, I see a couple of ways of moving forward:
1) Keep my implementation, moving the includes to arch/powerpc/include.
   Do we drop the allocation functions?

2) Drop the data transfer part of device_prep_slave_sg(), and just use
   it for setting device-specific bits.

Thanks for all the input Dan. I finally feel like we're getting
somewhere :)

Ira
