<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">
<pre class="moz-quote-pre" wrap="">On 11/12/21 12:42, Daniel Axtens wrote:
</pre>
</div>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">In realmode mce handler we use irq_work_queue() to defer
the processing of mce events, irq_work_queue() can only
be called when translation is enabled because it touches
memory outside RMA, hence we enable translation before
calling irq_work_queue and disable on return, though it
is not safe to do in realmode.
To avoid this, program the decrementer and call the event
processing functions from timer handler.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
This is an interesting approach, and it would indeed be nice to clear up
the MCE handling a bit.
I have a few questions:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 331d944280b8..187810f13669 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -235,8 +235,10 @@ extern void machine_check_print_event_info(struct machine_check_event *evt,
unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
extern void mce_common_process_ue(struct pt_regs *regs,
struct mce_error_info *mce_err);
+extern void machine_check_raise_dec_intr(void);
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Does this need an extern? I think that's the default...?</pre>
</blockquote>
<pre>Not required, I will remove it.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> int mce_register_notifier(struct notifier_block *nb);
int mce_unregister_notifier(struct notifier_block *nb);
+void mce_run_late_handlers(void);
#ifdef CONFIG_PPC_BOOK3S_64
void flush_and_reload_slb(void);
void flush_erat(void);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index dc05a862e72a..f49180f8c9be 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -280,6 +280,7 @@ struct paca_struct {
#endif
#ifdef CONFIG_PPC_BOOK3S_64
struct mce_info *mce_info;
+ atomic_t mces_to_process;
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
This is in the PACA, which is supposed to be a per-cpu structure: hey
does it need to be atomic_t? Isn't there only one CPU accessing it?</pre>
</blockquote>
<pre>Yes it need not be atomic, got confused with some scenario and
made it atomic.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">Does this variable provide anything new compared to mce_nest_count or
mce_queue_count + mce_ue_count? It looks like
machine_check_process_queued_event will clear a queue based on value of
mce_queue_count and machine_check_process_ue_event will clear a queue
based on mce_ue_count...
I think (absent nested interrupts which I talk about below) it should be
the case that mces_to_process == mce_queue_count + mce_ue_count but I
might be wrong?</pre>
</blockquote>
<pre>If we hit exception in process context, we may not increment mce_queue_count,
so the equation need not be true all time.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> #endif /* CONFIG_PPC_BOOK3S_64 */
} ____cacheline_aligned;
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> /*
* Decode and save high level MCE information into per cpu buffer which
* is an array of machine_check_event structure.
@@ -135,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (mce->error_type == MCE_ERROR_TYPE_UE)
mce->u.ue_error.ignore_event = mce_err->ignore_event;
+ atomic_inc(&local_paca->mces_to_process);
+
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Is there any chance the decrementer will fire between when you do this
atomic_inc() and when you finish adding all the information to the mce
data structure in the remainder of save_mce_event? (e.g. filling in the
tlb_errror.effective_address field)?
(Or does save_mce_event get called with interrupts masked? I find it
very hard to remember what parts of the MCE code path happen under what
circumstances!)</pre>
</blockquote>
<pre>Yes, Interrupts will be disabled, I mean MSR[EE]=0 when mce is being handled. </pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> if (!addr)
return;
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">@@ -338,7 +322,7 @@ static void machine_process_ue_event(struct work_struct *work)
* process pending MCE event from the mce event queue. This function will be
* called during syscall exit.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Is this comment still accurate if this patch is applied?</pre>
</blockquote>
<pre>No, mpe has also pointed this out, we will clean it in a different patch.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> */
-static void machine_check_process_queued_event(struct irq_work *work)
+static void machine_check_process_queued_event(void)
{
int index;
struct machine_check_event *evt;
@@ -363,6 +347,17 @@ static void machine_check_process_queued_event(struct irq_work *work)
}
}
+void mce_run_late_handlers(void)
+{
+ if (unlikely(atomic_read(&local_paca->mces_to_process))) {
+ if (ppc_md.machine_check_log_err)
+ ppc_md.machine_check_log_err();
+ machine_check_process_queued_event();
+ machine_check_ue_work();
+ atomic_dec(&local_paca->mces_to_process);
+ }
+}
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
What happens if you get a nested MCE between log_err() and
process_queued_event()? If my very foggy memory of the MCE handling is
correct, we enable nested MCEs very early in the process because the
alternative is that a nested MCE will checkstop the box. So I think this
might be possible, albeit probably unlikely.
It looks like process_queued_event clears the entire MCE queue as
determined by the mce_queue_count. So, consider the following sequence
of events:
1. Take MCE 1. Save to queue, increment mce_queue_count, increment
mces_to_process, set decrementer to fire.
2. Decrementer fires. mce_run_late_handlers is called.
3. mces_to_process = 1, so we call machine_check_log_err(), which prints
(on pseries) the info for MCE 1.
4. Take MCE 2. This is saved to the queue, mce_queue_count is
incremented, mces_to_process is incremented, and the decrementer is
armed again.
5. We then leave the MCE interrupt context and return to the decrementer
handling context. The next thing we do is we call
m_c_e_process_queued_event(), which clears the entire queue (that is,
MCEs 1 and 2):
while (local_paca->mce_info->mce_queue_count > 0) {
index = local_paca->mce_info->mce_queue_count - 1;
evt = &local_paca->mce_info->mce_event_queue[index];
if (evt->error_type == MCE_ERROR_TYPE_UE &&
evt->u.ue_error.ignore_event) {
local_paca->mce_info->mce_queue_count--;
continue;
}
machine_check_print_event_info(evt, false, false);
local_paca->mce_info->mce_queue_count--;
}
6. We finish mce_run_late_handlers() and decrement mces_to_process,
so it's now 1.
7. The decrementer fires again, mces_to_process is 1, so we start
processing again.
8. We call machine_check_log_err again, it will now call the FWNMI code
again and possibly print error 2.
9. process_queued_event will be called again but mce_queue_count will
be 0 so it it will bail out early.
I _think_ the worst that can happen - at least so long as pseries is the
only implementaion of machine_check_log_err - is that we will handle
MCE 2 before we query the firmware about it. That's probably benign, but
I am still concerned with the overall interaction around nested
interrupts.</pre>
</blockquote>
<pre>The only problem we have here is overwriting mce_data_buf in case of nested
mce, and about "handle MCE 2 before we query the firmware about it" It is not
possible, isn't it?
Assume we take MCE 2 while we are in the middle of mce_run_late_handlers(),
before the MCE handler relinquishes the CPU to timer handler, we will have
everything in place, right? or am I missing something obvious.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> void machine_check_print_event_info(struct machine_check_event *evt,
bool user_mode, bool in_guest)
{
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 934d8ae66cc6..2dc09d75d77c 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -597,6 +597,9 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
irq_work_run();
}
+#ifdef CONFIG_PPC_BOOK3S_64
+ mce_run_late_handlers();
+#endif
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">So we're now branching to a function in a different file and doing an
atomic read in every timer interrupt. Is this a hot path? Is there any
speed implication to doing this?</pre>
</blockquote>
<pre>Nick has suggested me to use test_irq_work_pending() and I will remove the
atomic read, with v2 we may not have any serious time implications.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> now = get_tb();
if (now >= *next_tb) {
*next_tb = ~(u64)0;
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">@@ -729,40 +724,16 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
error_type = mce_log->error_type;
disposition = mce_handle_err_realmode(disposition, error_type);
-
- /*
- * Enable translation as we will be accessing per-cpu variables
- * in save_mce_event() which may fall outside RMO region, also
- * leave it enabled because subsequently we will be queuing work
- * to workqueues where again per-cpu variables accessed, besides
- * fwnmi_release_errinfo() crashes when called in realmode on
- * pseries.
- * Note: All the realmode handling like flushing SLB entries for
- * SLB multihit is done by now.
- */
out:
- msr = mfmsr();
- mtmsr(msr | MSR_IR | MSR_DR);
-
disposition = mce_handle_err_virtmode(regs, errp, mce_log,
disposition);
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Now you are not in virtual mode/translations on when you are calling
mce_handle_err_virtmode(). From the name, I thought that
mce_handle_err_virtmode() would assume that you are in virtual mode?
Does the function assume that? If so is it safe to call it in real mode?
If not, should we rename it as part of this patch?</pre>
</blockquote>
<pre>patch 2/2, refactors this.
</pre>
<blockquote type="cite"
cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">-
- /*
- * Queue irq work to log this rtas event later.
- * irq_work_queue uses per-cpu variables, so do this in virt
- * mode as well.
- */
- irq_work_queue(&mce_errlog_process_work);
-
- mtmsr(msr);
-
return disposition;
}
</pre>
</blockquote>
</blockquote>
<pre>Thanks for the review :) .
Ganesh
</pre>
</body>
</html>