powerpc Linux scv support and scv system call ABI proposal

Adhemerval Zanella adhemerval.zanella at linaro.org
Wed Jan 29 04:26:04 AEDT 2020



On 28/01/2020 11:05, Nicholas Piggin wrote:
> Florian Weimer's on January 28, 2020 11:09 pm:
>> * Nicholas Piggin:
>>
>>> * Proposal is for PPC_FEATURE2_SCV to indicate 'scv 0' support, all other
>>>   vectors will return -ENOSYS, and the decision for how to add support for
>>>   a new vector deferred until we see the next user.
>>
>> Seems reasonable.  We don't have to decide this today.
>>
>>> * Proposal is for scv 0 to provide the standard Linux system call ABI with some
>>>   differences:
>>>
>>> - LR is volatile across scv calls. This is necessary for support because the
>>>   scv instruction clobbers LR.
>>
>> I think we can express this in the glibc system call assembler wrapper
>> generators.  The mcount profiling wrappers already have this property.
>>
>> But I don't think we are so lucky for the inline system calls.  GCC
>> recognizes an "lr" clobber with inline asm (even though it is not
>> documented), but it generates rather strange assembler output as a
>> result:
>>
>> long
>> f (long x)
>> {
>>   long y;
>>   asm ("#" : "=r" (y) : "r" (x) : "lr");
>>   return y;
>> }
>>
>> 	.abiversion 2
>> 	.section	".text"
>> 	.align 2
>> 	.p2align 4,,15
>> 	.globl f
>> 	.type	f, @function
>> f:
>> .LFB0:
>> 	.cfi_startproc
>> 	mflr 0
>> 	.cfi_register 65, 0
>> #APP
>>  # 5 "t.c" 1
>> 	#
>>  # 0 "" 2
>> #NO_APP
>> 	std 0,16(1)
>> 	.cfi_offset 65, 16
>> 	ori 2,2,0
>> 	ld 0,16(1)
>> 	mtlr 0
>> 	.cfi_restore 65
>> 	blr
>> 	.long 0
>> 	.byte 0,0,0,1,0,0,0,0
>> 	.cfi_endproc
>> .LFE0:
>> 	.size	f,.-f
>>
>>
>> That's with GCC 8.3 at -O2.  I don't understand what the ori is about.
> 
> ori 2,2,0 is the group terminating nop hint for POWER8 type cores
> which had dispatch grouping rules.

It is worth noting that it also aims to mitigate a load-hit-store cpu stall
on some powerpc chips.

> 
>>
>> I don't think we can save LR in a regular register around the system
>> call, explicitly in the inline asm statement, because we still have to
>> generate proper unwinding information using CFI directives, something
>> that you cannot do from within the asm statement.
>>
>> Supporting this in GCC should not be impossible, but someone who
>> actually knows this stuff needs to look at it.
> 
> The generated assembler actually seems okay to me. If we compile
> something like a syscall and with -mcpu=power9:
> 
> long
> f (long _r3, long _r4, long _r5, long _r6, long _r7, long _r8, long _r0)
> {
>   register long r0 asm ("r0") = _r0;
>   register long r3 asm ("r3") = _r3;
>   register long r4 asm ("r4") = _r4;
>   register long r5 asm ("r5") = _r5;
>   register long r6 asm ("r6") = _r6;
>   register long r7 asm ("r7") = _r7;
>   register long r8 asm ("r8") = _r8;
> 
>   asm ("# scv" : "=r"(r3) : "r"(r0), "r"(r4), "r"(r5), "r"(r6), "r"(r7), "r"(r8) : "lr", "ctr", "cc", "xer");
> 
>   return r3;
> }
> 
> 
> f:
> .LFB0:
>         .cfi_startproc
>         mflr 0
>         std 0,16(1)
>         .cfi_offset 65, 16
>         mr 0,9
> #APP
>  # 12 "a.c" 1
>         # scv
>  # 0 "" 2
> #NO_APP
>         ld 0,16(1)
>         mtlr 0
>         .cfi_restore 65
>         blr
>         .long 0
>         .byte 0,0,0,1,0,0,0,0
>         .cfi_endproc
> 
> That gets the LR save/restore right when we're also using r0.
> 
>>
>>> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>>>   system call exit to avoid restoring the CR register.
>>
>> This sounds reasonable, but I don't know what kind of knock-on effects
>> this has.  The inline system call wrappers can handle this with minor
>> tweaks.
> 
> Okay, good. In the end we would have to check code trace through the
> kernel and libc of course, but I think there's little to no opportunity
> to take advantage of current extra non-volatile cr regs.
> 
> mtcr has to write 8 independently renamed registers so it's cracked into
> 2 insns on POWER9 (and likely to always be a bit troublesome). It's not
> much in the scheme of a system call, but while we can tweak the ABI...

We don't really need an mfcr/mfocr to implement the Linux syscall ABI on
powerpc; we can use a 'bns+' plus a 'neg' instead:

--
#define internal_syscall6(name, err, nr, arg1, arg2, arg3, arg4, arg5,  \
                          arg6)                                         \
  ({                                                                    \
    register long int r0  __asm__ ("r0") = (long int) (name);           \
    register long int r3  __asm__ ("r3") = (long int) (arg1);           \
    register long int r4  __asm__ ("r4") = (long int) (arg2);           \
    register long int r5  __asm__ ("r5") = (long int) (arg3);           \
    register long int r6  __asm__ ("r6") = (long int) (arg4);           \
    register long int r7  __asm__ ("r7") = (long int) (arg5);           \
    register long int r8  __asm__ ("r8") = (long int) (arg6);           \
    __asm__ __volatile__                                                \
      ("sc\n\t"                                                         \
       "bns+ 1f\n\t"                                                    \
       "neg %1, %1\n\t"                                                 \
       "1:\n\t"                                                         \
       : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5), "+r" (r6),         \
         "+r" (r7), "+r" (r8)                                           \
       :                                                                \
       : "r9", "r10", "r11", "r12",                                     \
         "cr0", "memory");                                              \
    r3;                                                                 \
  })
--

And change INTERNAL_SYSCALL_ERROR_P to check for the expected invalid
range (((unsigned long) (val) >= (unsigned long) -4095)) and
INTERNAL_SYSCALL_ERRNO to negate the value back into a positive errno
(since the value has already been negated by the 'neg' in the sequence
above).
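A minimal sketch of what the adjusted macros could look like, assuming the
'sc' + 'bns+'/'neg' sequence above leaves -errno in r3 on failure (the macro
signatures follow the existing glibc sysdep.h style, but this is illustrative
only):

```c
/* Sketch only: assumes a failing syscall yields a value in the range
   [-4095, -1], as left by the 'neg' in the asm sequence above.  */
#define INTERNAL_SYSCALL_ERROR_P(val, err) \
  ((unsigned long) (val) >= (unsigned long) -4095L)

/* Recover the positive errno value from the negated return.  */
#define INTERNAL_SYSCALL_ERRNO(val, err) (-(val))
```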

The powerpc kernel ABI of using a different constraint to signal error
also requires glibc to reimplement the vDSO symbol call as arch-specific
code instead of a straight function call (since a vDSO call might fall
back to a syscall).

Even for a POWER-specific system call that uses all result bits, either
it should not fail or it would require an arch-specific implementation
to set up the expected error value (since the information would need to
come from another source or a pre-defined value).

In fact, I think we could make the assumption that INTERNAL_SYSCALL
returns a negative errno value in case of an error and make all the
syscall-failure checking and errno setting generic. This would require
changing ia64, mips, nios2, and sparc though.
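Sketched in C, that generic handling could look like the following;
syscall_result_to_errno is a made-up helper name for illustration, not an
existing glibc function:

```c
/* Hypothetical helper: map a raw negative-errno kernel return into the
   POSIX convention (-1 plus a positive errno), generically for all
   syscalls.  */
static long
syscall_result_to_errno (long ret, int *err)
{
  if ((unsigned long) ret >= (unsigned long) -4095L)
    {
      *err = (int) -ret;   /* e.g. a -38 return becomes errno 38 */
      return -1;
    }
  return ret;              /* success: pass the result through */
}
```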

> 
>>
>>> - Error handling: use of CR0[SO] to indicate error requires a mtcr / mtocr
>>>   instruction on the kernel side, and it is currently not implemented well
>>>   in glibc, requiring a mfcr (mfocr should be possible and asm goto support
>>>   would allow a better implementation). Is it worth continuing this style of
>>>   error handling? Or just move to -ve return means error? Using a different
>>>   bit would allow the kernel to piggy back the CR return code setting with
>>>   a test for the error case exit.
>>
>> GCC does not model the condition registers, so for inline system calls,
>> we have to produce a value anyway that the subsequence C code can check.
>> The assembler syscall wrappers do not need to do this, of course, but
>> I'm not sure which category of interfaces is more important.
> 
> Right. asm goto can improve this kind of pattern if it's inlined
> into the C code which tests the result, it can branch using the flags
> to the C error handling label, rather than move flags into GPR, test
> GPR, branch. However...
> 
>> But the kernel uses the -errno convention internally, so I think it
>> would make sense to pass this to userspace and not convert back and
>> forth.  This would align with what most of the architectures do, and
>> also avoids the GCC oddity.
> 
> Yes I would be interested in opinions for this option. It seems like
> matching other architectures is a good idea. Maybe there are some
> reasons not to.
> 
>>> - Should this be for 64-bit only? 'scv 1' could be reserved for 32-bit
>>>   calls if there was interest in developing an ABI for 32-bit programs.
>>>   Marginal benefit in avoiding compat syscall selection.
>>
>> We don't have an ELFv2 ABI for 32-bit.  I doubt it makes sense to
>> provide an ELFv1 port for this given that it's POWER9-specific.
> 
> Okay. There's no reason not to enable this for BE, at least for the
> kernel it's no additional work so it probably remains enabled (unless
> there is something really good we could do with the ABI if we exclude
> ELFv1 but I don't see anything).
> 
> But if glibc only builds for ELFv2 support that's probably reasonable.
> 
>>
>> From the glibc perspective, the major question is how we handle run-time
>> selection of the system call instruction sequence.  On i386, we use a
>> function pointer in the TCB to call an instruction sequence in the vDSO.
>> That's problematic from a security perspective.  I expect that on
>> POWER9, using a pointer in read-only memory would be equally
>> non-attractive due to a similar lack of PC-relative addressing.  We
>> could use the HWCAP bit in the TCB, but that would add another (easy to
>> predict) conditional branch to every system call.
> 
> I would have to defer to glibc devs on this. Conditional branch
> should be acceptable I think, scv improves speed as much as several
> mispredicted branches (about 90 cycles).
> 
>> I don't think it matters whether both system call variants use the same
>> error convention because we could have different error code extraction
>> code on the two branches.
> 
> That's one less difficulty.

We already had to push a similar hack where glibc used to abort transactions
prior to syscalls to avoid some side-effects in the kernel (commit
56cf2763819d2f). It was eventually removed from syscall handling by
f0458cf4f9ff3d870, where we only enable TLE if the kernel supports
PPC_FEATURE2_HTM_NOSC.

The transaction syscall abort used to read a variable directly from the TCB,
so this could be an option. I would also expect that, if glibc is built
against a recent kernel and the compiler targets an ISA 3.0+ cpu, we could
remove the 'sc' code path entirely.
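A rough sketch of how such a runtime selection might look, with the HWCAP2
bits cached once at startup. syscall_scv and syscall_sc are hypothetical
stand-ins for the real inline-asm sequences, and the feature-bit value is
only illustrative:

```c
/* Assumed bit value for illustration only; the real PPC_FEATURE2_SCV
   definition would come from the kernel headers via AT_HWCAP2.  */
#define PPC_FEATURE2_SCV 0x00100000

static unsigned long hwcap2;  /* cached once at startup, e.g. via getauxval */

/* Stand-ins for the real 'scv 0' and 'sc' inline-asm sequences.  */
static long syscall_scv (long nr) { return 1000 + nr; }
static long syscall_sc  (long nr) { return 2000 + nr; }

static long
do_syscall (long nr)
{
  /* One (easily predicted) conditional branch per syscall.  */
  if (hwcap2 & PPC_FEATURE2_SCV)
    return syscall_scv (nr);
  return syscall_sc (nr);
}
```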


More information about the Linuxppc-dev mailing list