[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Tue Nov 21 12:23:57 AEDT 2023

----- Original Message -----
> From: "Michael Ellerman" <mpe at ellerman.id.au>
> To: "Timothy Pearson" <tpearson at raptorengineering.com>
> Cc: "Jens Axboe" <axboe at kernel.dk>, "regressions" <regressions at lists.linux.dev>, "npiggin" <npiggin at gmail.com>,
> "christophe leroy" <christophe.leroy at csgroup.eu>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Monday, November 20, 2023 5:39:52 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Timothy Pearson <tpearson at raptorengineering.com> writes:
>> ----- Original Message -----
>>> From: "Michael Ellerman" <mpe at ellerman.id.au>
> ...
>>> 
>>> But we now have a new path, because io-uring can call copy_process() via
>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>> handled as we return from a syscall, but it's not OK if the signal is handled
>>> due to some other interrupt.
>>> 
>>> Which is:
>>> 
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>    interrupt_exit_user_prepare_main()
>>>      do_notify_resume()
>>>        get_signal()
>>>          task_work_run()
>>>            create_worker_cb()
>>>              create_io_worker()
>>>                copy_process()
>>>                  dup_task_struct()
>>>                    arch_dup_task_struct()
>>>                      flush_all_to_thread()
>>>                        save_all()
>>>                          if (tsk->thread.regs->msr & MSR_FP)
>>>                            save_fpu()
>>>                            # fr0 is clobbered and potentially live in userspace
>>> 
>>> 
>>> So tldr I think the corruption is only an issue since io-uring started doing
>>> the clone via signal, which I think matches the observed timeline of this bug
>>> appearing.
>>
>> I agree the corruption really only started showing up in earnest on
>> io_uring clone-via-signal, as this was confirmed several times in the
>> course of debugging.
> 
> Thanks.
> 
>> Note as well that I may very well have a wrong call order in the
>> commit message, since I was relying on a couple of WARN_ON() macros I
>> inserted to check for a similar (but not identical) condition and
>> didn't spend much time getting new traces after identifying the root
>> cause.
> 
> Yep no worries. I'll reword it to incorporate the full path from my mail.
> 
>> I went back and grabbed some real world system-wide stack traces, since I now
>> know what to trigger on.  A typical example is:
>>
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>   interrupt_exit_user_prepare_main()
>>    schedule()
>>     __schedule()
>>      __switch_to()
>>       giveup_all()
>>        # tsk->thread.regs->msr MSR_FP is still set here
>>        __giveup_fpu()
>>         save_fpu()
>>         # fr0 is clobbered and potentially live in userspace
> 
> fr0 is not live there.
> 
> __giveup_fpu() does roughly:
> 
>	msr = tsk->thread.regs->msr;
>	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
>        msr &= ~MSR_VSX;
>	tsk->thread.regs = msr;
> 
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.
> 
> Also on that path we're switching to another task, so we'll be reloading
> the other task's FP state before returning to userspace.
> 
> So I don't see any bug there.

Yeah, you're right.  I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :)  It's not going to be super easy to get a real trace (I was triggering the WARN_ON() from of fr0 getting set to to FPSCR), so let's just assume it's mainly the path you manually found above and update the commit message accordingly.

> There's only two places that call save_fpu() and skip the giveup logic,
> which is save_all() and kvmppc_save_user_regs().

Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload.  Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...