[musl] Powerpc Linux 'scv' system call ABI proposal take 2

Nicholas Piggin npiggin at gmail.com
Mon Apr 20 14:31:58 AEST 2020


Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least one normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >> 
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> updates), and the system call is a long delay during which time the 
>> >> store queue has significant time to drain.
>> >> 
>> >> My feeling is it would be better for kernel to have these scratch 
>> >> registers.
>> > 
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>> 
>> You would be surprised how few cycles that takes on a high end CPU. Some 
>> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> being flippant about it. If we can come up with something faster I'd be 
>> up for it.
> 
> If the cycle count is trivial then just do it on the kernel side.

The cycle count is trivial for userspace, because you have r1 ready. The
kernel does not have its stack ready; it has to do mfspr rX ; ld rY,N(rX);
to get a stack to save into.

Which is also wasted work for userspace.

Now that I think about it, no stack frame is even required! lr is saved 
into the caller's stack when it's clobbered with an asm, just as when 
it's used for a function call.
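
A minimal sketch of the lr point above, not code from this thread. The
helper name sketch_syscall0, the SKETCH_USE_SCV switch and the full clobber
list are assumptions made for illustration, since the ABI is still under
discussion here; only the lr clobber and the kernel's use of r9-r12 come
from the discussion. Listing "lr" as a clobber is what lets the compiler
save and restore it around the asm as it would around a function call. The
scv path is opt-in because it needs a kernel that implements scv; otherwise
the sketch falls back to libc's syscall().

```c
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue a system call that takes no arguments.
 * On the scv path the raw r3 value is returned (assumed to be a negative
 * errno on failure); the fallback returns -1 and sets errno instead.
 */
static inline long sketch_syscall0(long nr)
{
#if defined(__powerpc64__) && defined(SKETCH_USE_SCV)
    register long r0 __asm__("r0") = nr;  /* system call number */
    register long r3 __asm__("r3");       /* return value */

    __asm__ volatile(
        "scv 0"
        : "+r"(r0), "=r"(r3)
        :
        /* Assumed clobber set. "lr" tells the compiler that the link
         * register does not survive the asm, so it preserves lr itself
         * wherever it still needs the caller's return address. */
        : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
          "cr0", "cr1", "cr5", "cr6", "cr7", "lr", "ctr", "xer", "memory");
    return r3;
#else
    /* Portable fallback so the sketch builds and runs anywhere. */
    return syscall(nr);
#endif
}
```

For example, sketch_syscall0(SYS_getpid) should return the same value as
getpid(). Comparing the generated code with and without "lr" in the clobber
list is one way to see what the caller actually pays for it.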

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>> 
>> Again, the problem is the kernel doesn't have its dependencies 
>> immediately ready to spill, and spilling (may be) more costly 
>> immediately after the call because we're doing a lot of stores.
>> 
>> I could try to measure this. Unfortunately our pipeline simulator tool 
>> doesn't model system calls properly so it's hard to see what's happening 
>> across the user/kernel horizon, I might check if that can be improved
>> or I can hack it by putting some isync in there or something.
> 
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
> 
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>> 
>> Powerpc system calls never did. We could have scv preserve them, but 
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
> 
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
if we take those clobbers away?

Thanks,
Nick

