[PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

Fri Aug 14 17:13:09 AEST 2015

On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
> On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > It makes no sense to put the instructions for calculating the lock
> > value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> > loop. And when the lock is acquired by the other thread, the current
> > lock value has no chance to equal with the lock value used by current
> > cpu. So we can skip the comparing for these two lock values in the
> > lbz/bne loop.
> > 
> > Signed-off-by: Kevin Hao <haokexin at gmail.com>
> > ---
> >  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> > index 765b419883f2..e4185581c5a7 100644
> > --- a/arch/powerpc/mm/tlb_low_64e.S
> > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION               /* CPU_FTR_SMT */
> >        *
> >        * MAS6:IND should be already set based on MAS4
> >        */
> > -1:   lbarx   r15,0,r11
> >       lhz     r10,PACAPACAINDEX(r13)
> > -     cmpdi   r15,0
> > -     cmpdi   cr1,r15,1       /* set cr1.eq = 0 for non-recursive */
> >       addi    r10,r10,1
> > +     crclr   cr1*4+eq        /* set cr1.eq = 0 for non-recursive */
> > +1:   lbarx   r15,0,r11
> > +     cmpdi   r15,0
> >       bne     2f
> 
> You're optimizing the contended case at the expense of introducing stalls in 
> the uncontended case.

Before the patch, the uncontended case code sequence are:
1:	lbarx	r15,0,r11
	lhz	r10,PACAPACAINDEX(r13)
	cmpdi	r15,0
	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
	addi	r10,r10,1
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

After the patch:
	lhz	r10,PACAPACAINDEX(r13)
	addi	r10,r10,1
	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
1:	lbarx	r15,0,r11
	cmpdi	r15,0
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

As you know, the lbarx is a Presync instruction and stbcx is a Presync and
Postsync instruction. Putting the unnecessary instructions in the lbarx/stbcx
loop also serialize these instructions execution. The execution latency of
lbarx is only 3 cycles and there are still two instructions after it.
Considering the out of order execution optimization after this patch, do you
really think that new uncontended path will become slower due to this
additional stall?

>  Does it really matter if there are more instructions 
> in the loop?

I really think we should minimize the window of lbarx/stbcx, for following two
reasons:
  - The bigger of this window, the more possible conflicts between the two
     threads run into this loop simultaneously.
  - The reservation used by lbarx may be cleared by another thread due to
     store to the same reservation granule. The smaller the window of
     lbarx/stbcx, the less possibility to be affected by this.

>  This change just means that you'll spin in the loop for more 
> iterations (if it even does that -- I think the cycles per loop iteration 
> might be the same before and after, due to load latency and pairing) while 
> waiting for the other thread to release the lock.

Besides the optimization for the contended case, it also make the code more
readable with these changes:
  - It always seem a bit weird to calculate the lock value for the current
    cpu in the lbarx/stbcx loop.
  - The "cmpdi   cr1,r15,1" seems pretty confusion. It doesn't always do what
    the comment said (set cr1.eq = 0). In some cases, it does set the
    crq.eq = 1, such as when running on cpu 1 with lock is acquired by cpu0.
    All we need to do just clear the cr1.eq unconditionally.

> 
> Do you have any benchmark results for this patch?

I doubt it will get any visible difference. Anyway I will gave it a try.

Thanks,
Kevin
> 
> -Scott
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 473 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20150814/d1ce2bc2/attachment.sig>