[musl] Powerpc Linux 'scv' system call ABI proposal take 2
npiggin at gmail.com
Wed Apr 22 16:18:36 AEST 2020
Excerpts from Rich Felker's message of April 21, 2020 3:27 am:
> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
>> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> >> > Note that because lr is clobbered we need at least one normally
>> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> >> > Otherwise stack frame setup is required to spill it.
>> >> >>
>> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer
>> >> >> registers, but we have some delay establishing the stack (depends on a
>> >> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> >> heavy whereas on the caller side you have r1 set up (modulo stack
>> >> >> updates), and the system call is a long delay during which time the
>> >> >> store queue has significant time to drain.
>> >> >>
>> >> >> My feeling is it would be better for kernel to have these scratch
>> >> >> registers.
>> >> >
>> >> > If your new kernel syscall mechanism requires the caller to make a
>> >> > whole stack frame it otherwise doesn't need and spill registers to it,
>> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> >> > immediately lost on the userspace side, plus you either waste icache
>> >> > at the call point or require the syscall to go through a
>> >> > userspace-side helper function that performs the spill and restore.
>> >> You would be surprised how few cycles that takes on a high end CPU. Some
>> >> might be a couple of %. I am one for counting cycles mind you, I'm not
>> >> being flippant about it. If we can come up with something faster I'd be
>> >> up for it.
>> > If the cycle count is trivial then just do it on the kernel side.
>> The cycle count for userspace is trivial, because you have r1 ready.
>> The kernel does not have its stack ready; it has to do mfspr rX ; ld
>> rY,N(rX); to get a stack to save into, which is also wasted work as
>> far as userspace is concerned.
>> Now that I think about it, no stack frame is even required! lr is saved
>> into the caller's stack when it's clobbered by an asm, just as when
>> it's used for a function call.
> No. If there is a non-clobbered register, lr can be moved to the
> non-clobbered register rather than saved to the stack. However it
> looks like (1) gcc doesn't take advantage of that possibility, but (2)
> the caller already arranged for there to be space on the stack to save
> lr, so the cost is only one store and one load, not any stack
> adjustment or other frame setup. So it's probably not a really big
> deal. However, just adding "lr" clobber to existing syscall in musl
> increased the size of a simple syscall function (getuid) from 20 bytes
> to 36 bytes.
Yeah, I had a bit of a play around with musl (which is very nice code I
must say). The powerpc64 syscall asm is missing a ctr clobber, by the way.
Fortunately adding it doesn't change code generation for me, but it
should be fixed. glibc had the same bug at one point I think (probably
due to syscall ABI documentation not existing -- something now lives in
the kernel Documentation tree).
Yes, lr needs to be saved. I didn't see any new requirement for stack
frames, and lr was often already saved, but the clobber does hurt the
small functions.
I did look at entirely replacing sc with scv, though, just as an
experiment. One day you might make sc optional! Text size improves by
about 3kB with the proposed ABI.
Most of the saving seems to be the bns+ ; neg sequence. __syscall1/2/3 get
out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter says:
add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function old new delta
fcntl 400 424 +24
popen 600 620 +20
times 32 40 +8
alloc_rev 816 784 -32
alloc_fwd 812 780 -32
__syscall1.constprop 32 - -32
__fdopen 504 472 -32
__expand_heap 628 592 -36
__syscall2 40 - -40
__syscall3 44 - -44
fchmodat 372 324 -48
__wake.constprop 408 360 -48
child 1116 1064 -52
checker 220 156 -64
__bin_chunk 1576 1512 -64
malloc 1940 1860 -80
__syscall3.constprop 96 - -96
__syscall1 108 - -108
Total: Before=613379, After=610171, chg -0.52%
Now if we go a step further we could preserve r0,r4-r8. That gives the
kernel r9-r12 as scratch while leaving userspace with some spare
volatile GPRs except in the uncommon syscall6 case.
static inline long __syscall0(long n)
{
	register long r0 __asm__("r0") = n;
	register long r3 __asm__("r3");

	__asm__ __volatile__("scv 0"
			     : "=r"(r3)
			     : "r"(r0)
			     : "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
			       "lr", "ctr", "r9", "r10", "r11", "r12");

	return r3;
}
That saves another ~400 bytes, reducing some of the register shuffling
for futex loops etc:
__pthread_cond_timedwait 964 944 -20
__expand_heap 592 572 -20
socketpair 292 268 -24
__wake.constprop 360 336 -24
malloc 1860 1828 -32
__bin_chunk 1512 1472 -40
fcntl 424 376 -48
Total: Before=610171, After=609723, chg -0.07%
As you say, the compiler doesn't do a good job of saving lr in a spare
GPR, unfortunately. Saving it ourselves to eliminate the lr clobber is no
good because it's almost always already saved anyway. At least having
non-clobbered volatile GPRs could let a future, smarter compiler take
advantage of them.
If we go further and try to preserve r3 as well by putting the return
value in r9 or r0, we go backwards about 300 bytes. It's good for the
lock loops and complex functions, but hurts a lot of simpler functions
that have to add 'mr r3,r9' etc.
Most of the time there are saved non-volatile GPRs around anyway though,
so not sure which way to go on this. Text size savings can't be ignored
and it's pretty easy for the kernel to do (we already save r3-r8 and
zero them on exit, so we could instead load them back from a cache line
that should be hot).
So I may be inclined to go this way, even if we won't see benefit now.