[PATCH 1/2] powerpc/mce: Avoid using irq_work_queue() in realmode

Thu Nov 18 20:44:58 AEDT 2021

On 11/12/21 12:42, Daniel Axtens wrote:

>> In realmode mce handler we use irq_work_queue() to defer
>> the processing of mce events, irq_work_queue() can only
>> be called when translation is enabled because it touches
>> memory outside RMA, hence we enable translation before
>> calling irq_work_queue and disable on return, though it
>> is not safe to do in realmode.
>>
>> To avoid this, program the decrementer and call the event
>> processing functions from timer handler.
>>
> This is an interesting approach, and it would indeed be nice to clear up
> the MCE handling a bit.
>
> I have a few questions:
>
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index 331d944280b8..187810f13669 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -235,8 +235,10 @@ extern void machine_check_print_event_info(struct machine_check_event *evt,
>>   unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>>   extern void mce_common_process_ue(struct pt_regs *regs,
>>   				  struct mce_error_info *mce_err);
>> +extern void machine_check_raise_dec_intr(void);
> Does this need an extern? I think that's the default...?

Not required, I will remove it.

>>   int mce_register_notifier(struct notifier_block *nb);
>>   int mce_unregister_notifier(struct notifier_block *nb);
>> +void mce_run_late_handlers(void);
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   void flush_and_reload_slb(void);
>>   void flush_erat(void);
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index dc05a862e72a..f49180f8c9be 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -280,6 +280,7 @@ struct paca_struct {
>>   #endif
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   	struct mce_info *mce_info;
>> +	atomic_t mces_to_process;
> This is in the PACA, which is supposed to be a per-cpu structure: hey
> does it need to be atomic_t? Isn't there only one CPU accessing it?

Yes it need not be atomic, got confused with some scenario and
made it atomic.

> Does this variable provide anything new compared to mce_nest_count or
> mce_queue_count + mce_ue_count? It looks like
> machine_check_process_queued_event will clear a queue based on value of
> mce_queue_count and machine_check_process_ue_event will clear a queue
> based on mce_ue_count...
>
> I think (absent nested interrupts which I talk about below) it should be
> the case that mces_to_process == mce_queue_count + mce_ue_count but I
> might be wrong?

If we hit exception in process context, we may not increment mce_queue_count,
so the equation need not be true all time.

>>   #endif /* CONFIG_PPC_BOOK3S_64 */
>>   } ____cacheline_aligned;
>    
>>   /*
>>    * Decode and save high level MCE information into per cpu buffer which
>>    * is an array of machine_check_event structure.
>> @@ -135,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>   	if (mce->error_type == MCE_ERROR_TYPE_UE)
>>   		mce->u.ue_error.ignore_event = mce_err->ignore_event;
>>   
>> +	atomic_inc(&local_paca->mces_to_process);
>> +
> Is there any chance the decrementer will fire between when you do this
> atomic_inc() and when you finish adding all the information to the mce
> data structure in the remainder of save_mce_event? (e.g. filling in the
> tlb_errror.effective_address field)?
>
> (Or does save_mce_event get called with interrupts masked? I find it
> very hard to remember what parts of the MCE code path happen under what
> circumstances!)

Yes, Interrupts will be disabled, I mean MSR[EE]=0 when mce is being handled.

>>   	if (!addr)
>>   		return;
>>   
>
>> @@ -338,7 +322,7 @@ static void machine_process_ue_event(struct work_struct *work)
>>    * process pending MCE event from the mce event queue. This function will be
>>    * called during syscall exit.
> Is this comment still accurate if this patch is applied?

No, mpe has also pointed this out, we will clean it in a different patch.

>
>>    */
>> -static void machine_check_process_queued_event(struct irq_work *work)
>> +static void machine_check_process_queued_event(void)
>>   {
>>   	int index;
>>   	struct machine_check_event *evt;
>> @@ -363,6 +347,17 @@ static void machine_check_process_queued_event(struct irq_work *work)
>>   	}
>>   }
>>   
>> +void mce_run_late_handlers(void)
>> +{
>> +	if (unlikely(atomic_read(&local_paca->mces_to_process))) {
>> +		if (ppc_md.machine_check_log_err)
>> +			ppc_md.machine_check_log_err();
>> +		machine_check_process_queued_event();
>> +		machine_check_ue_work();
>> +		atomic_dec(&local_paca->mces_to_process);
>> +	}
>> +}
> What happens if you get a nested MCE between log_err() and
> process_queued_event()? If my very foggy memory of the MCE handling is
> correct, we enable nested MCEs very early in the process because the
> alternative is that a nested MCE will checkstop the box. So I think this
> might be possible, albeit probably unlikely.
>
> It looks like process_queued_event clears the entire MCE queue as
> determined by the mce_queue_count. So, consider the following sequence
> of events:
>
> 1. Take MCE 1. Save to queue, increment mce_queue_count, increment
>     mces_to_process, set decrementer to fire.
>
> 2. Decrementer fires. mce_run_late_handlers is called.
>
> 3. mces_to_process = 1, so we call machine_check_log_err(), which prints
>     (on pseries) the info for MCE 1.
>
> 4. Take MCE 2. This is saved to the queue, mce_queue_count is
>     incremented, mces_to_process is incremented, and the decrementer is
>     armed again.
>
> 5. We then leave the MCE interrupt context and return to the decrementer
>     handling context. The next thing we do is we call
>     m_c_e_process_queued_event(), which clears the entire queue (that is,
>     MCEs 1 and 2):
>
> 	while (local_paca->mce_info->mce_queue_count > 0) {
> 		index = local_paca->mce_info->mce_queue_count - 1;
> 		evt = &local_paca->mce_info->mce_event_queue[index];
>
> 		if (evt->error_type == MCE_ERROR_TYPE_UE &&
> 		    evt->u.ue_error.ignore_event) {
> 			local_paca->mce_info->mce_queue_count--;
> 			continue;
> 		}
> 		machine_check_print_event_info(evt, false, false);
> 		local_paca->mce_info->mce_queue_count--;
> 	}
>
>   6. We finish mce_run_late_handlers() and decrement mces_to_process,
>      so it's now 1.
>
>   7. The decrementer fires again, mces_to_process is 1, so we start
>      processing again.
>
>   8. We call machine_check_log_err again, it will now call the FWNMI code
>      again and possibly print error 2.
>
>   9. process_queued_event will be called again but mce_queue_count will
>      be 0 so it it will bail out early.
>
> I _think_ the worst that can happen - at least so long as pseries is the
> only implementaion of machine_check_log_err - is that we will handle
> MCE 2 before we query the firmware about it. That's probably benign, but
> I am still concerned with the overall interaction around nested
> interrupts.

The only problem we have here is overwriting mce_data_buf in case of nested
mce, and about "handle MCE 2 before we query the firmware about it" It is not
possible, isn't it?

Assume we take MCE 2 while we are in the middle of mce_run_late_handlers(),
before the MCE handler relinquishes the CPU to timer handler, we will have
everything in place, right? or am I missing something obvious.

>>   void machine_check_print_event_info(struct machine_check_event *evt,
>>   				    bool user_mode, bool in_guest)
>>   {
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 934d8ae66cc6..2dc09d75d77c 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -597,6 +597,9 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
>>   		irq_work_run();
>>   	}
>>   
>> +#ifdef CONFIG_PPC_BOOK3S_64
>> +	mce_run_late_handlers();
>> +#endif
>>
> So we're now branching to a function in a different file and doing an
> atomic read in every timer interrupt. Is this a hot path? Is there any
> speed implication to doing this?

Nick has suggested me to use test_irq_work_pending() and I will remove the
atomic read, with v2 we may not have any serious time implications.

>>   	now = get_tb();
>>   	if (now >= *next_tb) {
>>   		*next_tb = ~(u64)0;
>
>> @@ -729,40 +724,16 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>>   	error_type = mce_log->error_type;
>>   
>>   	disposition = mce_handle_err_realmode(disposition, error_type);
>> -
>> -	/*
>> -	 * Enable translation as we will be accessing per-cpu variables
>> -	 * in save_mce_event() which may fall outside RMO region, also
>> -	 * leave it enabled because subsequently we will be queuing work
>> -	 * to workqueues where again per-cpu variables accessed, besides
>> -	 * fwnmi_release_errinfo() crashes when called in realmode on
>> -	 * pseries.
>> -	 * Note: All the realmode handling like flushing SLB entries for
>> -	 *       SLB multihit is done by now.
>> -	 */
>>   out:
>> -	msr = mfmsr();
>> -	mtmsr(msr | MSR_IR | MSR_DR);
>> -
>>   	disposition = mce_handle_err_virtmode(regs, errp, mce_log,
>>   					      disposition);
> Now you are not in virtual mode/translations on when you are calling
> mce_handle_err_virtmode(). From the name, I thought that
> mce_handle_err_virtmode() would assume that you are in virtual mode?
> Does the function assume that? If so is it safe to call it in real mode?
> If not, should we rename it as part of this patch?

patch 2/2, refactors this.

>> -
>> -	/*
>> -	 * Queue irq work to log this rtas event later.
>> -	 * irq_work_queue uses per-cpu variables, so do this in virt
>> -	 * mode as well.
>> -	 */
>> -	irq_work_queue(&mce_errlog_process_work);
>> -
>> -	mtmsr(msr);
>> -
>>   	return disposition;
>>   }

Thanks for the review :) .
Ganesh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20211118/b785ae81/attachment-0001.htm>