[musl] Powerpc Linux 'scv' system call ABI proposal take 2

Wed Apr 22 16:18:36 AEST 2020

Excerpts from Rich Felker's message of April 21, 2020 3:27 am:
> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
>> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> >> > Note that because lr is clobbered we need at least once normally
>> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> >> > Otherwise stack frame setup is required to spill it.
>> >> >> 
>> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> >> registers, but we have some delay establishing the stack (depends on a
>> >> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> >> updates), and the system call is a long delay during which time the 
>> >> >> store queue has significant time to drain.
>> >> >> 
>> >> >> My feeling is it would be better for kernel to have these scratch 
>> >> >> registers.
>> >> > 
>> >> > If your new kernel syscall mechanism requires the caller to make a
>> >> > whole stack frame it otherwise doesn't need and spill registers to it,
>> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> >> > immediately lost on the userspace side, plus you either waste icache
>> >> > at the call point or require the syscall to go through a
>> >> > userspace-side helper function that performs the spill and restore.
>> >> 
>> >> You would be surprised how few cycles that takes on a high end CPU. Some 
>> >> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> >> being flippant about it. If we can come up with something faster I'd be 
>> >> up for it.
>> > 
>> > If the cycle count is trivial then just do it on the kernel side.
>> 
>> The cycle count for user is, because you have r1 ready. Kernel does not 
>> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
>> save into.
>> 
>> Which is also wasted work for a userspace.
>> 
>> Now that I think about it, no stack frame is even required! lr is saved 
>> into the caller's stack when its clobbered with an asm, just as when 
>> it's used for a function call.
> 
> No. If there is a non-clobbered register, lr can be moved to the
> non-clobbered register rather than saved to the stack. However it
> looks like (1) gcc doesn't take advantage of that possibility, but (2)
> the caller already arranged for there to be space on the stack to save
> lr, so the cost is only one store and one load, not any stack
> adjustment or other frame setup. So it's probably not a really big
> deal. However, just adding "lr" clobber to existing syscall in musl
> increased the size of a simple syscall function (getuid) from 20 bytes
> to 36 bytes.

Yeah I had a bit of a play around with musl (which is very nice code I
must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
Fortunately adding it doesn't change code generation for me, but it 
should be fixed. glibc had the same bug at one point I think (probably 
due to syscall ABI documentation not existing -- something now lives in 
linux/Documentation/powerpc/syscall64-abi.rst).

Yes lr needs to be saved, I didn't see any new requirement for stack
frames, and it was often already saved, but it does hurt the small 
wrapper functions.

I did look at entirely replacing sc with scv though, just as an 
experiment. One day you might make sc optional! Text size impoves by 
about 3kB with the proposed ABI.

Mostly seems to be the bns+ ; neg sequence. __syscall1/2/3 get
out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter 
says:

add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function                                     old     new   delta
fcntl                                        400     424     +24
popen                                        600     620     +20
times                                         32      40      +8
[...]
alloc_rev                                    816     784     -32
alloc_fwd                                    812     780     -32
__syscall1.constprop                          32       -     -32
__fdopen                                     504     472     -32
__expand_heap                                628     592     -36
__syscall2                                    40       -     -40
__syscall3                                    44       -     -44
fchmodat                                     372     324     -48
__wake.constprop                             408     360     -48
child                                       1116    1064     -52
checker                                      220     156     -64
__bin_chunk                                 1576    1512     -64
malloc                                      1940    1860     -80
__syscall3.constprop                          96       -     -96
__syscall1                                   108       -    -108
Total: Before=613379, After=610171, chg -0.52%

Now if we go a step further we could preserve r0,r4-r8. That gives the 
kernel r9-r12 as scratch while leaving userspace with some spare 
volatile GPRs except in the uncommon syscall6 case.

  static inline long __syscall0(long n)
  {
      register long r0 __asm__("r0") = n;
      register long r3 __asm__("r3");
      __asm__ __volatile__("scv 0"
      : "=r"(r3)
      : "r"(r0)
      : "memory", "cr0", "cr1", "cr5", "cr6", "cr7", "lr", "ctr", "r9", "r10", "r11", "r12"
      return r3;
  }

That saves another ~400 bytes, reducing some of the register shuffling 
for futex loops etc:

[...]
__pthread_cond_timedwait                     964     944     -20
__expand_heap                                592     572     -20
socketpair                                   292     268     -24
__wake.constprop                             360     336     -24
malloc                                      1860    1828     -32
__bin_chunk                                 1512    1472     -40
fcntl                                        424     376     -48
Total: Before=610171, After=609723, chg -0.07%

As you say, the compiler doesn't do a good job of saving lr in a spare 
GPR unfortunately. Saving it ourselves to eliminate the lr clobber is no 
good because it's almost always already saved. At least having 
non-clobbered volatile GPRs could let a future smarter compiler take 
advantage.

If we go further and try to preserve r3 as well by putting the return 
value in r9 or r0, we go backwards about 300 bytes. It's good for the 
lock loops and complex functions, but hurts a lot of simpler functions 
that have to add 'mr r3,r9' etc.  

Most of the time there are saved non-volatile GPRs around anyway though, 
so not sure which way to go on this. Text size savings can't be ignored
and it's pretty easy for the kernel to do (we already save r3-r8 and
zero them on exit, so we could load them instead from cache line that's
should be hot).

So I may be inclined to go this way, even if we won't see benefit now.

Thanks,
Nick