[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Tue Nov 21 03:45:43 AEDT 2023

----- Original Message -----
> From: "Michael Ellerman" <mpe at ellerman.id.au>
> To: "Timothy Pearson" <tpearson at raptorengineering.com>, "Jens Axboe" <axboe at kernel.dk>, "regressions"
> <regressions at lists.linux.dev>, "npiggin" <npiggin at gmail.com>, "christophe leroy" <christophe.leroy at csgroup.eu>,
> "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Monday, November 20, 2023 1:10:06 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Hi Timothy,
> 
> Great work debugging this. I think your fix is good, but I want to understand it
> a bit more to make sure I can explain why we haven't seen it outside of
> io-uring.
> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
> 
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> for the task.
> 
> There is another case though, which is the path via:
>  copy_process()
>    dup_task_struct()
>      arch_dup_task_struct()
>        flush_all_to_thread()
>          save_all()
> 
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
> 
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
> 
> Even if we handle a signal on the return from the fork() syscall, the worst that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
> 
> That path is something like:
> 
> system_call_vectored_common()
>  system_call_exception()
>    sys_fork()
>      kernel_clone()
>        copy_process()
>          dup_task_struct()
>            arch_dup_task_struct()
>              flush_all_to_thread()
>                save_all()
>                  if (tsk->thread.regs->msr & MSR_FP)
>                    save_fpu()
>                    # does not clear MSR_FP from regs->msr
>  syscall_exit_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>        handle_rt_signal64()
>          prepare_setup_sigcontext()
>            flush_fp_to_thread()
>              if (tsk->thread.regs->msr & MSR_FP)
>                giveup_fpu()
>                  __giveup_fpu
>                    save_fpu()
>                    # clobbered fr0 is saved, but task considers it volatile
>                    # across syscall anyway
> 
> 
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
> 
> Which is:
> 
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>          task_work_run()
>            create_worker_cb()
>              create_io_worker()
>                copy_process()
>                  dup_task_struct()
>                    arch_dup_task_struct()
>                      flush_all_to_thread()
>                        save_all()
>                          if (tsk->thread.regs->msr & MSR_FP)
>                            save_fpu()
>                            # fr0 is clobbered and potentially live in userspace
> 
> 
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.

I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging.  Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below).

Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause.

I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:

interrupt_return_srr_user()
 interrupt_exit_user_prepare()
  interrupt_exit_user_prepare_main()
   schedule()
    __schedule()
     __switch_to()
      giveup_all()
       # tsk->thread.regs->msr MSR_FP is still set here
       __giveup_fpu()
        save_fpu()
        # fr0 is clobbered and potentially live in userspace

This indicates that the pending signal is not actually required, it simply makes triggering much more likely.  Both the pending signal and the _TIF_NEED_RESCHED paths give up control and end up in the same overall position of trigging down-call routines that assume the FPU state is valid.

perl and bash seem to be affected to some degree, though current builds don't use enough VSX instructions rapidly enough to cause crashes with any significant probability.  That said, over many years of running POWER at datacenter scale I have seen enough random bash/perl crashes in the logs to recognize the pattern; I think this has been a low-grade issue for a long time, but with an infantismally small chance of happening it was seen as random noise / hardware issues / other rare bugs in various programs.