[PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint
Breno Leitao
leitao at debian.org
Thu Nov 22 05:32:31 AEDT 2018
Hi Michael,
On 11/20/18 8:34 AM, Michael Ellerman wrote:
> Hi Breno,
>
> Thanks for chasing this one down.
>
> Breno Leitao <leitao at debian.org> writes:
>
>> On a signal handler return, the user could set a context with MSR[TS] bits
>> set, and these bits would be copied to task regs->msr.
>>
>> At restore_tm_sigcontexts(), after the current task's regs->msr[TS] bits
>> are set, several __get_user() calls are made and then a recheckpoint is
>> executed.
>>
>> This is a problem, since a page fault (in kernel space) could happen when
>> calling __get_user(). If that happens, the process's MSR[TS] bits are
>> already set, but the recheckpoint was not executed, and the SPRs are
>> still invalid.
>>
>> The page fault can cause the current process to be de-scheduled, with
>> MSR[TS] active and without tm_recheckpoint() being called. More
>> importantly, without the TEXASR[FS] bit set either.
>>
>> Since TEXASR might not have the FS bit set, when the process is scheduled
>> back it will try to reclaim, which will be aborted because the CPU is not
>> in the suspended state, and then recheckpoint. This recheckpoint will
>> restore thread->texasr into the TEXASR SPR, which might be zero, hitting
>> a BUG_ON().
>>
>> [ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446!
>
> As Mikey said, would be good to have at least the stack trace & NIP
> here, if not the full oops.
Ack!
>> This patch simply delays setting MSR[TS], so that, if a page fault happens
>> in the __get_user() section, regs->msr[TS] is not yet set, since the TM
>> structures are still invalid, thus avoiding TM operations for in-kernel
>> exceptions and a possible process reschedule.
>>
>> With this patch, MSR[TS] is only set just before recheckpointing and
>> setting TEXASR[FS] = 1, thus avoiding an interrupt with the TM registers
>> in an invalid state.
>
> To make this safe when PREEMPT is enabled we need to preempt_disable() /
> enable() around the setting of regs->msr and the recheckpoint.
>
> That could also serve as nice documentation.
>
> I guess the other question is whether it should be the job of
> tm_recheckpoint() to set regs->msr, given that it already hard disables
> interrupts.
>
> eg. we'd set the TM flags in a local msr variable and pass that to
> tm_recheckpoint(), it would then assign that to regs->msr in the IRQ
> disabled section. Though there's many callers of tm_recheckpoint() that
> don't need that behaviour, so it would probably need to be factored out.
Right, that might be doable, but I wonder whether it wouldn't be better to
create a new function that does it, as below:
void tm_set_msr_and_recheckpoint(struct pt_regs *regs, u64 msr)
{
	preempt_disable();
	regs->msr = msr;
	tm_recheckpoint(&current->thread);
	preempt_enable();
}
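As a quick userspace model of the ordering this helper is meant to
guarantee (all names below are hypothetical mocks, not the kernel API),
the point is simply that the recheckpoint step always observes the new
MSR value, and both happen inside the same non-preemptible region:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model, NOT kernel code: we record whether the
 * "recheckpoint" ran while the new MSR value was already visible and
 * preemption was disabled. */
static uint64_t msr_live;
static int preempt_count;
static int recheckpoint_saw_msr_set;

#define MSR_TS 0x2ULL

static void mock_preempt_disable(void) { preempt_count++; }
static void mock_preempt_enable(void)  { preempt_count--; }

static void mock_tm_recheckpoint(void)
{
	/* Record the invariant the helper is supposed to provide. */
	recheckpoint_saw_msr_set =
		((msr_live & MSR_TS) != 0) && preempt_count > 0;
}

static void tm_set_msr_and_recheckpoint_model(uint64_t msr)
{
	mock_preempt_disable();
	msr_live = msr;            /* MSR set first...            */
	mock_tm_recheckpoint();    /* ...then recheckpoint, both  */
	mock_preempt_enable();     /* in one non-preemptible span */
}
```

With the two steps fused like this, no scheduler entry can observe
MSR[TS] set without the recheckpoint having run.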
I understand that preempt_disable() is a better fit here than disabling
IRQs. It also avoids changing the tm_recheckpoint() API just for this very
specific signal case.
>> It is not possible to move tm_recheckpoint to happen earlier, because it is
>> required to get the checkpointed registers from userspace, with
>> __get_user(), thus, the only way to avoid this undesired behavior is
>> delaying the MSR[TS] set, as done in this patch.
>
> I think the root cause here is that we're copying into the live regs of
> current. That has obviously worked in the past, because the register
> state wasn't used until we returned back to userspace. But that's no
> longer true with TM. And even so it's quite subtle. I also suspect some
> of our FP/VEC handling may not work correctly if we're scheduled part
> way through restoring the regs.
I get your point, and I think we risk corrupting the regs if MSR[TS] is set
and a page fault triggers a reclaim/recheckpoint in the middle of the
__get_user() chunks.
> What might work better is if we copy all the regs into temporary space
> and then with interrupts disabled we copy them into the task. That way
> we should never be scheduled with a half-populated set of regs.
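A minimal userspace sketch of that strategy (regs_model and
restore_context_model are illustrative names, not the kernel code): the
faultable copies land in a stack temporary, and the live task state is
only updated in a single commit step:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "copy to temporary, then commit": if we are
 * rescheduled during phase 1, 'live' is still fully consistent. */
struct regs_model { uint64_t gpr[4]; uint64_t msr; };

static struct regs_model live;  /* stands in for current's pt_regs */

static void restore_context_model(const struct regs_model *user_ctx)
{
	struct regs_model tmp;

	/* Phase 1: faultable copies (the __get_user() equivalents) fill
	 * the temporary, never the live regs. */
	memcpy(&tmp, user_ctx, sizeof(tmp));

	/* Phase 2: commit in one step; in the kernel this would run with
	 * interrupts/preemption disabled. */
	live = tmp;
}
```

The live state is therefore never observable half-populated, which is the
property the current in-place copies lack.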
Yes, that seems to be the best strategy, and I might be interested in working
on it.
> That's obviously a much bigger patch though and something we'll have to
> do later.
>
>> Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
>> Cc: stable at vger.kernel.org (v3.9+)
>> Signed-off-by: Breno Leitao <leitao at debian.org>
>> ---
>> arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++--------------
>> 1 file changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>> index 83d51bf586c7..15b153bdd826 100644
>> --- a/arch/powerpc/kernel/signal_64.c
>> +++ b/arch/powerpc/kernel/signal_64.c
>> @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>> if (MSR_TM_RESV(msr))
>> return -EINVAL;
>>
>> - /* pull in MSR TS bits from user context */
>> - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> -
>> - /*
>> - * Ensure that TM is enabled in regs->msr before we leave the signal
>> - * handler. It could be the case that (a) user disabled the TM bit
>> - * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> - * TM bit was disabled because a sufficient number of context switches
>> - * happened whilst in the signal handler and load_tm overflowed,
>> - * disabling the TM bit. In either case we can end up with an illegal
>> - * TM state leading to a TM Bad Thing when we return to userspace.
>> - */
>> - regs->msr |= MSR_TM;
>> -
>> /* pull in MSR LE from user context */
>> regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
>>
>> @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>> tm_enable();
>> /* Make sure the transaction is marked as failed */
>> tsk->thread.tm_texasr |= TEXASR_FS;
>> +
>
> preempt_disable();
>
>> + /* pull in MSR TS bits from user context */
>> + regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> +
>> + /*
>> + * Ensure that TM is enabled in regs->msr before we leave the signal
>> + * handler. It could be the case that (a) user disabled the TM bit
>> + * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> + * TM bit was disabled because a sufficient number of context switches
>> + * happened whilst in the signal handler and load_tm overflowed,
>> + * disabling the TM bit. In either case we can end up with an illegal
>> + * TM state leading to a TM Bad Thing when we return to userspace.
>> + */
>> + regs->msr |= MSR_TM;
>> +
>> /* This loads the checkpointed FP/VEC state, if used */
>> tm_recheckpoint(&tsk->thread);
>>
> preempt_enable();
>
> Although looking at the code that follows, it probably won't cope with
> being preempted either. So the preempt_enable() should probably go at
> the end of the function.
Why? I confess I do not understand the preempt mechanism very well. Does
preemption save/restore the FP and vector state when a task is preempted?
What is the code/IRQ path executed on a preemption? I am wondering if a
preempt happens because the Decrementer (0x900) IRQ fires while a syscall
is being executed, thus saving the whole context and later restoring it.
If my guess is correct, that IRQ path might not save the FP/VEC state,
hence your concern about having load_fp_state() after a preemption.
Let me paste the "patched" code here to help with the discussion:
	regs->msr |= MSR_TM;

	/* This loads the checkpointed FP/VEC state, if used */
	tm_recheckpoint(&tsk->thread);

	msr_check_and_set(msr & (MSR_FP | MSR_VEC));
	if (msr & MSR_FP) {
		load_fp_state(&tsk->thread.fp_state);
		regs->msr |= (MSR_FP | tsk->thread.fpexc_mode);
	}
	if (msr & MSR_VEC) {
		load_vr_state(&tsk->thread.vr_state);
		regs->msr |= MSR_VEC;
	}
Thanks
Breno