READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

Peter Zijlstra <peterz at infradead.org>
Thu Dec 12 19:01:05 AEDT 2019


On Thu, Dec 12, 2019 at 04:42:13PM +1100, Michael Ellerman wrote:
> [ trimmed CC a bit ]
> 
> Peter Zijlstra <peterz at infradead.org> writes:
> > On Fri, Dec 06, 2019 at 11:46:11PM +1100, Michael Ellerman wrote:
> ...
> > you write:
> >
> >   "Currently bitops-instrumented.h assumes that the architecture provides
> > atomic, non-atomic and locking bitops (e.g. both set_bit and __set_bit).
> > This is true on x86 and s390, but is not always true: there is a
> > generic bitops/non-atomic.h header that provides generic non-atomic
> > operations, and also a generic bitops/lock.h for locking operations."
> >
> > Is there any actual benefit for PPC to using their own atomic bitops
> > over bitops/lock.h ? I'm thinking that the generic code is fairly
> > optimal for most LL/SC architectures.
> 
> Yes and no :)
> 
> Some of the generic versions don't generate good code compared to our
> versions, but that's because READ_ONCE() is triggering stack protector
> to be enabled.

Bah, there's never anything simple, is there :/
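
For reference, READ_ONCE() around this point looks roughly like the
sketch below (simplified: the KASAN checked/nocheck split and
smp_read_barrier_depends() are left out, so treat it as an illustration
rather than the exact source). The load gets bounced through a local
union containing a char array, and it is that on-stack char array that
-fstack-protector-strong keys off to decide the function needs a
canary, even though nothing interesting ever happens to its address.

static __always_inline void __read_once_size(const volatile void *p,
					     void *res, int size)
{
	switch (size) {
	case 1: *(u8  *)res = *(volatile u8  *)p; break;
	case 2: *(u16 *)res = *(volatile u16 *)p; break;
	case 4: *(u32 *)res = *(volatile u32 *)p; break;
	case 8: *(u64 *)res = *(volatile u64 *)p; break;
	default:
		/* Odd sizes: fall back to a memcpy between barriers. */
		barrier();
		__builtin_memcpy((void *)res, (const void *)p, size);
		barrier();
	}
}

#define READ_ONCE(x)						\
({								\
	/* this on-stack char array is what trips the protector */ \
	union { typeof(x) __val; char __c[1]; } __u;		\
	__read_once_size(&(x), __u.__c, sizeof(x));		\
	__u.__val;						\
})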

> For example, comparing an out-of-line copy of the generic and ppc
> versions of test_and_set_bit_lock():
> 
>    1 <generic_test_and_set_bit_lock>:           1 <ppc_test_and_set_bit_lock>:
>    2         addis   r2,r12,361
>    3         addi    r2,r2,-4240
>    4         stdu    r1,-48(r1)
>    5         rlwinm  r8,r3,29,3,28
>    6         clrlwi  r10,r3,26                   2         rldicl  r10,r3,58,6
>    7         ld      r9,3320(r13)
>    8         std     r9,40(r1)
>    9         li      r9,0
>   10         li      r9,1                        3         li      r9,1
>                                                  4         clrlwi  r3,r3,26
>                                                  5         rldicr  r10,r10,3,60
>   11         sld     r9,r9,r10                   6         sld     r3,r9,r3
>   12         add     r10,r4,r8                   7         add     r4,r4,r10
>   13         ldx     r8,r4,r8
>   14         and.    r8,r9,r8
>   15         bne     34f
>   16         ldarx   r7,0,r10                    8         ldarx   r9,0,r4,1
>   17         or      r8,r9,r7                    9         or      r10,r9,r3
>   18         stdcx.  r8,0,r10                   10         stdcx.  r10,0,r4
>   19         bne-    16b                        11         bne-    8b
>   20         isync                              12         isync
>   21         and     r9,r7,r9                   13         and     r3,r3,r9
>   22         addic   r7,r9,-1                   14         addic   r9,r3,-1
>   23         subfe   r7,r7,r9                   15         subfe   r3,r9,r3
>   24         ld      r9,40(r1)
>   25         ld      r10,3320(r13)
>   26         xor.    r9,r9,r10
>   27         li      r10,0
>   28         mr      r3,r7
>   29         bne     36f
>   30         addi    r1,r1,48
>   31         blr                                16         blr
>   32         nop
>   33         nop
>   34         li      r7,1
>   35         b       24b
>   36         mflr    r0
>   37         std     r0,64(r1)
>   38         bl      <__stack_chk_fail+0x8>
> 
> 
> If you squint, the generated code for the actual logic is pretty similar, but
> the stack protector gunk makes a big mess. It's particularly bad here
> because the ppc version doesn't even need a stack frame.
> 
> I've also confirmed that even when test_and_set_bit_lock() is inlined
> into an actual call site the stack protector logic still triggers.

> If I change the READ_ONCE() in test_and_set_bit_lock():
> 
> 	if (READ_ONCE(*p) & mask)
> 		return 1;
> 
> to a regular pointer access:
> 
> 	if (*p & mask)
> 		return 1;
> 
> Then the generated code looks more or less the same, except for the extra early
> return in the generic version of test_and_set_bit_lock(), and different handling
> of the return code by the compiler.

So given that the function signature is:

static inline int test_and_set_bit_lock(unsigned int nr,
					volatile unsigned long *p)

@p already carries the required volatile qualifier, so READ_ONCE() does
not add anything here (except for easier-to-read code and poor code
generation).
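
To make that concrete, a minimal illustration (not kernel code): with
the volatile-qualified pointer a plain dereference already compiles to
exactly one load, which is all READ_ONCE() buys us here for a naturally
aligned long, and without any of the union/char-array machinery above.

static inline int bit_is_set(unsigned int nr, volatile unsigned long *p)
{
	unsigned long mask = 1UL << (nr % (8 * sizeof(unsigned long)));

	p += nr / (8 * sizeof(unsigned long));
	return (*p & mask) != 0;	/* single volatile load, no canary */
}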

So your proposed change _should_ be fine. Will, I'm assuming you never
saw this on your ARGH64 builds when you did this code?

---
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index dd90c9792909..10264e8808f8 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -35,7 +35,7 @@ static inline int test_and_set_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (READ_ONCE(*p) & mask)
+	if (*p & mask)
 		return 1;
 
 	old = atomic_long_fetch_or(mask, (atomic_long_t *)p);
@@ -48,7 +48,7 @@ static inline int test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (!(READ_ONCE(*p) & mask))
+	if (!(*p & mask))
 		return 0;
 
 	old = atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
index 3ae021368f48..9baf0a0055b8 100644
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -22,7 +22,7 @@ static inline int test_and_set_bit_lock(unsigned int nr,
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (READ_ONCE(*p) & mask)
+	if (*p & mask)
 		return 1;
 
 	old = atomic_long_fetch_or_acquire(mask, (atomic_long_t *)p);
@@ -60,7 +60,7 @@ static inline void __clear_bit_unlock(unsigned int nr,
 	unsigned long old;
 
 	p += BIT_WORD(nr);
-	old = READ_ONCE(*p);
+	old = *p;
 	old &= ~BIT_MASK(nr);
 	atomic_long_set_release((atomic_long_t *)p, old);
 }
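
(For reference, the BIT_MASK()/BIT_WORD() helpers used in the hunks
above expand, give or take a UL() annotation, to the usual:

#define BIT_MASK(nr)	(1UL << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)	((nr) / BITS_PER_LONG)

so the (*p & mask) test really is just a word load plus an AND.)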

