[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Nicholas Piggin npiggin at gmail.com
Tue Nov 21 01:32:41 AEDT 2023


Yeah, awesome.

On Mon Nov 20, 2023 at 5:10 PM AEST, Michael Ellerman wrote:
> Hi Timothy,
>
> Great work debugging this. I think your fix is good, but I want to understand it
> a bit more to make sure I can explain why we haven't seen it outside of io-uring.

Analysis seems right to me.

Probably the best minimal fix. But I wonder if we should just use one
path for saving/flushing/giving up, i.e., use giveup instead of save?

KVM looks odd too, and actually gets this wrong. In a way that's not
fixed by Timothy's patch, because it's just not restoring userspace
registers at all. Fortunately QEMU isn't in the habit of using
non-volatile FP/VEC registers over a VCPU ioctl, but there's no reason
it couldn't, since GCC/LLVM can easily use them. KVM really wants to be
using giveup.

Thanks,
Nick

> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
>
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> for the task.
>
> There is another case though, which is the path via:
>   copy_process()
>     dup_task_struct()
>       arch_dup_task_struct()
>         flush_all_to_thread()
>           save_all()
>
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
>
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
>
> Even if we handle a signal on the return from the fork() syscall, the worst that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
>
> That path is something like:
>
> system_call_vectored_common()
>   system_call_exception()
>     sys_fork()
>       kernel_clone()
>         copy_process()
>           dup_task_struct()
>             arch_dup_task_struct()
>               flush_all_to_thread()
>                 save_all()
>                   if (tsk->thread.regs->msr & MSR_FP)
>                     save_fpu()
>                     # does not clear MSR_FP from regs->msr
>   syscall_exit_prepare()
>     interrupt_exit_user_prepare_main()
>       do_notify_resume()
>         get_signal()
>         handle_rt_signal64()
>           prepare_setup_sigcontext()
>             flush_fp_to_thread()
>               if (tsk->thread.regs->msr & MSR_FP)
>                 giveup_fpu()
> >                   __giveup_fpu()
>                     save_fpu()
>                     # clobbered fr0 is saved, but task considers it volatile
>                     # across syscall anyway
>
>
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
>
> Which is:
>
> interrupt_return_srr_user()
>   interrupt_exit_user_prepare()
>     interrupt_exit_user_prepare_main()
>       do_notify_resume()
>         get_signal()
>           task_work_run()
>             create_worker_cb()
>               create_io_worker()
>                 copy_process()
>                   dup_task_struct()
>                     arch_dup_task_struct()
>                       flush_all_to_thread()
>                         save_all()
>                           if (tsk->thread.regs->msr & MSR_FP)
>                             save_fpu()
>                             # fr0 is clobbered and potentially live in userspace
>
>
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.
>
> Gotta run home, will have a closer look at the actual patch later on.
>
> cheers
>
>
> Timothy Pearson <tpearson at raptorengineering.com> writes:
> > During floating point and vector save to thread data fr0/vs0 are clobbered
> > by the FPSCR/VSCR store routine.  This leads to userspace register corruption
> > and application data corruption / crash under the following rare condition:
> >
> >  * A userspace thread is executing with VSX/FP mode enabled
> >  * The userspace thread is making active use of fr0 and/or vs0
> >  * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
> >  * The userspace thread is interrupted by the IPI before accessing data it
> >    previously stored in fr0/vs0
> >  * The thread being switched in by the IPI has a pending signal
> >
> > If these exact criteria are met, then the following sequence happens:
> >
> >  * The existing thread FP storage is still valid before the IPI, due to a
> >    prior call to save_fpu() or store_fp_state().  Note that the current
> >    fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
> >    is now invalid pending a call to restore_fp()/restore_altivec().
> >  * IPI -- FP/VSX register state remains invalid
> >  * interrupt_exit_user_prepare_main() calls do_notify_resume(),
> >    due to the pending signal
> >  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
> >    merrily reads and saves the invalid FP/VSX state to thread local storage.
> >  * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
> >    FP/VSX state back to registers.
> >  * Execution is released to userspace, and the application crashes or corrupts
> >    data.
> >
> > Without the pending signal, do_notify_resume() is never called, therefore the
> > invalid register state doesn't matter as it is overwritten nearly immediately
> > by interrupt_exit_user_prepare_main() calling restore_math() before return
> > to userspace.
> >
> > Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
> > altivec register save paths.
> >
> > Tested under QEMU in kvm mode, running on a Talos II workstation with dual
> > POWER9 DD2.2 CPUs.
> >
> > Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
> > Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
> > Tested-by: Timothy Pearson <tpearson at raptorengineering.com>
> > Tested-by: Jens Axboe <axboe at kernel.dk>
> > Signed-off-by: Timothy Pearson <tpearson at raptorengineering.com>
> > ---
> >  arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
> >  arch/powerpc/kernel/vector.S |  2 ++
> >  2 files changed, 15 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> > index 6a9acfb690c9..2f8f3f93cbb6 100644
> > --- a/arch/powerpc/kernel/fpu.S
> > +++ b/arch/powerpc/kernel/fpu.S
> > @@ -23,6 +23,15 @@
> >  #include <asm/feature-fixups.h>
> >  
> >  #ifdef CONFIG_VSX
> > +#define __REST_1FPVSR(n,c,base)						\
> > +BEGIN_FTR_SECTION							\
> > +	b	2f;							\
> > +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> > +	REST_FPR(n,base);						\
> > +	b	3f;							\
> > +2:	REST_VSR(n,c,base);						\
> > +3:
> > +
> >  #define __REST_32FPVSRS(n,c,base)					\
> >  BEGIN_FTR_SECTION							\
> >  	b	2f;							\
> > @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> >  2:	SAVE_32VSRS(n,c,base);						\
> >  3:
> >  #else
> > +#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
> >  #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
> >  #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
> >  #endif
> > +#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
> >  #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
> >  #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
> >  
> > @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
> >  	SAVE_32FPVSRS(0, R4, R3)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r3)
> > +	REST_1FPVSR(0, R4, R3)
> >  	blr
> >  EXPORT_SYMBOL(store_fp_state)
> >  
> > @@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
> >  2:	SAVE_32FPVSRS(0, R4, R6)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r6)
> > +	REST_1FPVSR(0, R4, R6)
> >  	blr
> > diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
> > index 4094e4c4c77a..80b3f6e476b6 100644
> > --- a/arch/powerpc/kernel/vector.S
> > +++ b/arch/powerpc/kernel/vector.S
> > @@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
> >  	mfvscr	v0
> >  	li	r4, VRSTATE_VSCR
> >  	stvx	v0, r4, r3
> > +	lvx	v0, 0, r3
> >  	blr
> >  EXPORT_SYMBOL(store_vr_state)
> >  
> > @@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
> >  	mfvscr	v0
> >  	li	r4,VRSTATE_VSCR
> >  	stvx	v0,r4,r7
> > +	lvx	v0,0,r7
> >  	blr
> >  
> >  #ifdef CONFIG_VSX
> > -- 
> > 2.39.2
