[PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint

Thu Nov 22 05:32:31 AEDT 2018

hi Michael,

On 11/20/18 8:34 AM, Michael Ellerman wrote:
> Hi Breno,
> 
> Thanks for chasing this one down.
> 
> Breno Leitao <leitao at debian.org> writes:
> 
>> On a signal handler return, the user could set a context with MSR[TS] bits
>> set, and these bits would be copied to task regs->msr.
>>
>> At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
>> several __get_user() are called and then a recheckpoint is executed.
>>
>> This is a problem since a page fault (in kernel space) could happen when
>> calling __get_user(). If it happens, the process MSR[TS] bits were
>> already set, but recheckpoint was not executed, and SPRs are still invalid.
>>
>> The page fault can cause the current process to be de-scheduled, with
>> MSR[TS] active and without tm_recheckpoint() being called.  More
>> importantly, without TEXAR[FS] bit set also.
>>
>> Since TEXASR might not have the FS bit set, and when the process is
>> scheduled back, it will try to reclaim, which will be aborted because of
>> the CPU is not in the suspended state, and, then, recheckpoint. This
>> recheckpoint will restore thread->texasr into TEXASR SPR, which might be
>> zero, hitting a BUG_ON().
>>
>> 	[ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446!
> 
> As Mikey said, would be good to have at least the stack trace & NIP
> here, if not the full oops.

Ack!

>> This patch simply delays the MSR[TS] set, so, if there is any page fault in
>> the __get_user() section, it does not have regs->msr[TS] set, since the TM
>> structures are still invalid, thus avoiding doing TM operations for
>> in-kernel exceptions and possible process reschedule.
>>
>> With this patch, the MSR[TS] will only be set just before recheckpointing
>> and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
>> invalid state.
> 
> To make this safe when PREEMPT is enabled we need to preempt_disable() /
> enable() around the setting of regs->msr and the recheckpoint.
> 
> That could also serve as nice documentation.
> 
> I guess the other question is whether it should be the job of
> tm_recheckpoint() to set regs->msr, given that it already hard disables
> interrupts.
> 
> eg. we'd set the TM flags in a local msr variable and pass the to
> tm_recheckpoint(), it would then assign that to regs->msr in the IRQ
> disabled section. Though there's many callers of tm_recheckpoint() that
> don't need that behaviour, so it would probably need to be factored out.

Right, that might be doable, but I am wondering if it isn't better to create
a new function that does it as below:

void tm_set_msr_and_recheckpoint(struct pt_regs *regs, u64 msr)
{
       preempt_disable();
       regs->msr = msr;
       tm_recheckpoint();
       preempt_enable();
}

I understand that preempt_disable() does a better work disabling preemption
compared to disabling IRQ. Also, it does not change the API just for this
very specific signal case.

>> It is not possible to move tm_recheckpoint to happen earlier, because it is
>> required to get the checkpointed registers from userspace, with
>> __get_user(), thus, the only way to avoid this undesired behavior is
>> delaying the MSR[TS] set, as done in this patch.
> 
> I think the root cause here is that we're copying into the live regs of
> current. That has obviously worked in the past, because the register
> state wasn't used until we returned back to userspace. But that's no
> longer true with TM. And even so it's quite subtle. I also suspect some
> of our FP/VEC handling may not work correctly if we're scheduled part
> way through restoring the regs.

I got your point and I think we have a risk of corrupting the regs if MSR[TS]
is set and we do page_fault->recheckpoint/recheclaim in the middle of
__get_user() chunks.

> What might work better is if we copy all the regs into temporary space
> and then with interrupts disabled we copy them into the task. That way
> we should never be scheduled with a half-populated set of regs.

Yes, that seems to be the best strategy, and I might be interested in working
on it.

> That's obviously a much bigger patch though and something we'll have to
> do later.
> 
>> Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
>> Cc: stable at vger.kernel.org (v3.9+)
>> Signed-off-by: Breno Leitao <leitao at debian.org>
>> ---
>>  arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++--------------
>>  1 file changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>> index 83d51bf586c7..15b153bdd826 100644
>> --- a/arch/powerpc/kernel/signal_64.c
>> +++ b/arch/powerpc/kernel/signal_64.c
>> @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>>  	if (MSR_TM_RESV(msr))
>>  		return -EINVAL;
>>  
>> -	/* pull in MSR TS bits from user context */
>> -	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> -
>> -	/*
>> -	 * Ensure that TM is enabled in regs->msr before we leave the signal
>> -	 * handler. It could be the case that (a) user disabled the TM bit
>> -	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> -	 * TM bit was disabled because a sufficient number of context switches
>> -	 * happened whilst in the signal handler and load_tm overflowed,
>> -	 * disabling the TM bit. In either case we can end up with an illegal
>> -	 * TM state leading to a TM Bad Thing when we return to userspace.
>> -	 */
>> -	regs->msr |= MSR_TM;
>> -
>>  	/* pull in MSR LE from user context */
>>  	regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
>>  
>> @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>>  	tm_enable();
>>  	/* Make sure the transaction is marked as failed */
>>  	tsk->thread.tm_texasr |= TEXASR_FS;
>> +
> 
> 	preempt_disable();
> 
>> +	/* pull in MSR TS bits from user context */
>> +	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
>> +
>> +	/*
>> +	 * Ensure that TM is enabled in regs->msr before we leave the signal
>> +	 * handler. It could be the case that (a) user disabled the TM bit
>> +	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
>> +	 * TM bit was disabled because a sufficient number of context switches
>> +	 * happened whilst in the signal handler and load_tm overflowed,
>> +	 * disabling the TM bit. In either case we can end up with an illegal
>> +	 * TM state leading to a TM Bad Thing when we return to userspace.
>> +	 */
>> +	regs->msr |= MSR_TM;
>> +
>>  	/* This loads the checkpointed FP/VEC state, if used */
>>  	tm_recheckpoint(&tsk->thread);
>>
> 	preempt_enable();
> 
> Although looking at the code that follows, it probably won't cope with
> being preempted either. So the preempt_enable() should probably go at
> the end of the function.

Why? I confess I do not understand the preempt mechanism a lot. Does a
preemption save/restore FP and vector states when it is preempted?

What is the code/IRQ that is executed when there is a preemption? I am
wondering if a preempt happens because the  Decrementer (0x900) IRQ  happens
and a syscall is being executed, thus, saving the whole context and then
restoring it. If my guess is correct, the IRQ might not save the FP/VEC
states, thus, your concern about ahving load_fp_state() after a preemption.

Let me paste the "patched" code here to help with the discussion:

        regs->msr |= MSR_TM;

        /* This loads the checkpointed FP/VEC state, if used */
        tm_recheckpoint(&tsk->thread);

        msr_check_and_set(msr & (MSR_FP | MSR_VEC));
        if (msr & MSR_FP) {
                load_fp_state(&tsk->thread.fp_state);
                regs->msr |= (MSR_FP | tsk->thread.fpexc_mode);
        }
        if (msr & MSR_VEC) {
                load_vr_state(&tsk->thread.vr_state);
                regs->msr |= MSR_VEC;
        }

Thanks
Breno