[PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

Shi Xuelin-B29237 B29237 at freescale.com
Tue Nov 29 14:41:11 EST 2011


Hi Ira, 

see my comments below.

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:iws at ovro.caltech.edu] 
Sent: 2011年11月29日 0:38
To: Shi Xuelin-B29237
Cc: vinod.koul at intel.com; dan.j.williams at intel.com; linuxppc-dev at lists.ozlabs.org; Li Yang-R58472; linux-kernel at vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> Hi Ira,
> 
> Thanks for your review.
> 
> After second thought, I think your scenario may not occur.
> Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. 
> We never query a cookie not returned by fsl_dma_tx_submit(...).
> 

I agree about this part.

> When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19.
> 

This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible.

 
[Shi Xuelin-B29237] If query is used without rules, your scenario may happen. But in DMA usage here, the query is used something like sequentially. Only after chan->common.cookie is updated in fsl_dma_tx_submit(...) and returned, then it could be queried. So you scenario will not happen.

What happens if CPU1's write of chan->common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19.
 
[Shi Xuelin-B29237] As you see, chan->common.cookie is updated in fsl_dma_tx_submit(...) and enclosed by spinlock. The spinlock implementation in PPC will guarantee the cache coherence among cores, something like you called implicit smp_mb.

I don't think you should see any performance impact from the smp_mb() operation.

[Shi Xuelin-B29237] Only add smp_mb() doesn't work. It only sync on this step, but next step is the same as just getting into this function without smp_mb operation.

Thanks,
Ira

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws at ovro.caltech.edu]
> Sent: 2011年11月23日 2:59
> To: Shi Xuelin-B29237
> Cc: dan.j.williams at intel.com; Li Yang-R58472; zw at zh-kernel.org; 
> vinod.koul at intel.com; linuxppc-dev at lists.ozlabs.org; 
> linux-kernel at vger.kernel.org
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
> 
> On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237 at freescale.com wrote:
> > From: Forrest Shi <b29237 at freescale.com>
> > 
> >     dma status check function fsl_tx_status is heavily called in
> >     a tight loop and the desc lock in fsl_tx_status contended by
> >     the dma status update function. this caused the dma performance
> >     degrades much.
> > 
> >     this patch releases the lock in the fsl_tx_status function.
> >     I believe it has no neglect impact on the following call of
> >     dma_async_is_complete(...).
> > 
> >     we can see below three conditions will be identified as success
> >     a)  x < complete < use
> >     b)  x < complete+N < use+N
> >     c)  x < complete < use+N
> >     here complete is the completed_cookie, use is the last_used
> >     cookie, x is the querying cookie, N is MAX cookie
> > 
> >     when chan->completed_cookie is being read, the last_used may
> >     be incresed. Anyway it has no neglect impact on the dma status
> >     decision.
> > 
> >     Signed-off-by: Forrest Shi <xuelin.shi at freescale.com>
> > ---
> >  drivers/dma/fsldma.c |    5 -----
> >  1 files changed, 0 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 
> > 8a78154..1dca56f 100644
> > --- a/drivers/dma/fsldma.c
> > +++ b/drivers/dma/fsldma.c
> > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
> >  	struct fsldma_chan *chan = to_fsl_chan(dchan);
> >  	dma_cookie_t last_complete;
> >  	dma_cookie_t last_used;
> > -	unsigned long flags;
> > -
> > -	spin_lock_irqsave(&chan->desc_lock, flags);
> >  
> 
> This will cause a bug. See below for a detailed explanation. You need this instead:
> 
> 	/*
> 	 * On an SMP system, we must ensure that this CPU has seen the
> 	 * memory accesses performed by another CPU under the
> 	 * chan->desc_lock spinlock.
> 	 */
> 	smp_mb();
> >  	last_complete = chan->completed_cookie;
> >  	last_used = dchan->cookie;
> >  
> > -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > -
> >  	dma_set_tx_state(txstate, last_complete, last_used, 0);
> >  	return dma_async_is_complete(cookie, last_complete, last_used);  }
> 
> Facts:
> - dchan->cookie is the same member as chan->common.cookie (same memory 
> location)
> - chan->common.cookie is the "last allocated cookie for a pending transaction"
> - chan->completed_cookie is the "last completed transaction"
> 
> I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure.
> 
> Variable usage before your change. Everything is used locked.
> - RW chan->common.cookie		(fsl_dma_tx_submit)
> - R  chan->common.cookie		(fsl_tx_status)
> - R  chan->completed_cookie		(fsl_tx_status)
> - W  chan->completed_cookie		(dma_do_tasklet)
> 
> Variable usage after your change:
> - RW chan->common.cookie		LOCKED
> - R  chan->common.cookie		NO LOCK
> - R  chan->completed_cookie		NO LOCK
> - W  chan->completed_cookie             LOCKED
> 
> What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is:
> 
> === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 
> spin_lock_irqsave
> descriptor->cookie = 20		(x in your example)
> chan->common.cookie = 20	(used in your example)
> spin_unlock_irqrestore
> 
> === CPU2 - immediately calls fsl_tx_status() ===
> chan->common.cookie == 19
> chan->completed_cookie == 19
> descriptor->cookie == 20
> 
> Since we don't have locks anymore, CPU2 may not have seen the write to
> chan->common.cookie yet.
> 
> Also assume that the DMA hardware has not started processing the 
> transaction yet. Therefore dma_do_tasklet() has not been called, and
> chan->completed_cookie has not been updated.
> 
> In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet!
> 
> The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers.
> 
> Therefore, the above example becomes:
> smp_mb();
> chan->common.cookie == 20
> chan->completed_cookie == 19
> descriptor->cookie == 20
> 
> Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.
> 
> Thanks,
> Ira
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev at lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev



More information about the Linuxppc-dev mailing list