<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">

      <pre class="moz-quote-pre" wrap="">On 11/12/21 12:42, Daniel Axtens wrote:

</pre>

    </div>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">In realmode mce handler we use irq_work_queue() to defer

the processing of mce events, irq_work_queue() can only

be called when translation is enabled because it touches

memory outside RMA, hence we enable translation before

calling irq_work_queue and disable on return, though it

is not safe to do in realmode.

To avoid this, program the decrementer and call the event

processing functions from timer handler.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

This is an interesting approach, and it would indeed be nice to clear up

the MCE handling a bit.

I have a few questions:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h

index 331d944280b8..187810f13669 100644

--- a/arch/powerpc/include/asm/mce.h

+++ b/arch/powerpc/include/asm/mce.h

@@ -235,8 +235,10 @@ extern void machine_check_print_event_info(struct machine_check_event *evt,

 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);

 extern void mce_common_process_ue(struct pt_regs *regs,

                                  struct mce_error_info *mce_err);

+extern void machine_check_raise_dec_intr(void);

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Does this need an extern? I think that's the default...?</pre>

    </blockquote>

    <pre>Not required, I will remove it.

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap=""> int mce_register_notifier(struct notifier_block *nb);

 int mce_unregister_notifier(struct notifier_block *nb);

+void mce_run_late_handlers(void);

 #ifdef CONFIG_PPC_BOOK3S_64

 void flush_and_reload_slb(void);

 void flush_erat(void);

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h

index dc05a862e72a..f49180f8c9be 100644

--- a/arch/powerpc/include/asm/paca.h

+++ b/arch/powerpc/include/asm/paca.h

@@ -280,6 +280,7 @@ struct paca_struct {

 #endif

 #ifdef CONFIG_PPC_BOOK3S_64

        struct mce_info *mce_info;

+       atomic_t mces_to_process;

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

This is in the PACA, which is supposed to be a per-cpu structure: hey

does it need to be atomic_t? Isn't there only one CPU accessing it?</pre>

    </blockquote>

    <pre>Yes it need not be atomic, got confused with some scenario and

made it atomic.

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">Does this variable provide anything new compared to mce_nest_count or

mce_queue_count + mce_ue_count? It looks like

machine_check_process_queued_event will clear a queue based on value of

mce_queue_count and machine_check_process_ue_event will clear a queue

based on mce_ue_count...

I think (absent nested interrupts which I talk about below) it should be

the case that mces_to_process == mce_queue_count + mce_ue_count but I

might be wrong?</pre>

    </blockquote>

    <pre>If we hit exception in process context, we may not increment mce_queue_count,

so the equation need not be true all time. 

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap=""> #endif /* CONFIG_PPC_BOOK3S_64 */

 } ____cacheline_aligned;

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap=""> /*

  * Decode and save high level MCE information into per cpu buffer which

  * is an array of machine_check_event structure.

@@ -135,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,

        if (mce->error_type == MCE_ERROR_TYPE_UE)

                mce->u.ue_error.ignore_event = mce_err->ignore_event;

+       atomic_inc(&local_paca->mces_to_process);

+

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Is there any chance the decrementer will fire between when you do this

atomic_inc() and when you finish adding all the information to the mce

data structure in the remainder of save_mce_event? (e.g. filling in the

tlb_errror.effective_address field)?

(Or does save_mce_event get called with interrupts masked? I find it

very hard to remember what parts of the MCE code path happen under what

circumstances!)</pre>

    </blockquote>

    <pre>Yes, Interrupts will be disabled, I mean MSR[EE]=0 when mce is being handled. </pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">   if (!addr)

                return;

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">@@ -338,7 +322,7 @@ static void machine_process_ue_event(struct work_struct *work)

  * process pending MCE event from the mce event queue. This function will be

  * called during syscall exit.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Is this comment still accurate if this patch is applied?</pre>

    </blockquote>

    <pre>No, mpe has also pointed this out, we will clean it in a different patch.

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">  */

-static void machine_check_process_queued_event(struct irq_work *work)

+static void machine_check_process_queued_event(void)

 {

        int index;

        struct machine_check_event *evt;

@@ -363,6 +347,17 @@ static void machine_check_process_queued_event(struct irq_work *work)

        }

 }

+void mce_run_late_handlers(void)

+{

+       if (unlikely(atomic_read(&local_paca->mces_to_process))) {

+               if (ppc_md.machine_check_log_err)

+                       ppc_md.machine_check_log_err();

+               machine_check_process_queued_event();

+               machine_check_ue_work();

+               atomic_dec(&local_paca->mces_to_process);

+       }

+}

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

What happens if you get a nested MCE between log_err() and

process_queued_event()? If my very foggy memory of the MCE handling is

correct, we enable nested MCEs very early in the process because the

alternative is that a nested MCE will checkstop the box. So I think this

might be possible, albeit probably unlikely.

It looks like process_queued_event clears the entire MCE queue as

determined by the mce_queue_count. So, consider the following sequence

of events:

1. Take MCE 1. Save to queue, increment mce_queue_count, increment

   mces_to_process, set decrementer to fire.

2. Decrementer fires. mce_run_late_handlers is called.

3. mces_to_process = 1, so we call machine_check_log_err(), which prints

   (on pseries) the info for MCE 1.

4. Take MCE 2. This is saved to the queue, mce_queue_count is

   incremented, mces_to_process is incremented, and the decrementer is

   armed again.

5. We then leave the MCE interrupt context and return to the decrementer

   handling context. The next thing we do is we call

   m_c_e_process_queued_event(), which clears the entire queue (that is,

   MCEs 1 and 2):

        while (local_paca->mce_info->mce_queue_count > 0) {

                index = local_paca->mce_info->mce_queue_count - 1;

                evt = &local_paca->mce_info->mce_event_queue[index];

                if (evt->error_type == MCE_ERROR_TYPE_UE &&

                    evt->u.ue_error.ignore_event) {

                        local_paca->mce_info->mce_queue_count--;

                        continue;

                }

                machine_check_print_event_info(evt, false, false);

                local_paca->mce_info->mce_queue_count--;

        }

 6. We finish mce_run_late_handlers() and decrement mces_to_process,

    so it's now 1.

 7. The decrementer fires again, mces_to_process is 1, so we start

    processing again.

 8. We call machine_check_log_err again, it will now call the FWNMI code

    again and possibly print error 2.

 9. process_queued_event will be called again but mce_queue_count will

    be 0 so it it will bail out early.

I _think_ the worst that can happen - at least so long as pseries is the

only implementaion of machine_check_log_err - is that we will handle

MCE 2 before we query the firmware about it. That's probably benign, but

I am still concerned with the overall interaction around nested

interrupts.</pre>

    </blockquote>

    <pre>The only problem we have here is overwriting mce_data_buf in case of nested

mce, and about "handle MCE 2 before we query the firmware about it" It is not

possible, isn't it?

Assume we take MCE 2 while we are in the middle of mce_run_late_handlers(),

before the MCE handler relinquishes the CPU to timer handler, we will have

everything in place, right? or am I missing something obvious.    

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap=""> void machine_check_print_event_info(struct machine_check_event *evt,

                                    bool user_mode, bool in_guest)

 {

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c

index 934d8ae66cc6..2dc09d75d77c 100644

--- a/arch/powerpc/kernel/time.c

+++ b/arch/powerpc/kernel/time.c

@@ -597,6 +597,9 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)

                irq_work_run();

        }

+#ifdef CONFIG_PPC_BOOK3S_64

+       mce_run_late_handlers();

+#endif

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">So we're now branching to a function in a different file and doing an

atomic read in every timer interrupt. Is this a hot path? Is there any

speed implication to doing this?</pre>

    </blockquote>

    <pre>Nick has suggested me to use test_irq_work_pending() and I will remove the

atomic read, with v2 we may not have any serious time implications. 

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">   now = get_tb();

        if (now >= *next_tb) {

                *next_tb = ~(u64)0;

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">@@ -729,40 +724,16 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)

        error_type = mce_log->error_type;

        disposition = mce_handle_err_realmode(disposition, error_type);

-

-       /*

-        * Enable translation as we will be accessing per-cpu variables

-        * in save_mce_event() which may fall outside RMO region, also

-        * leave it enabled because subsequently we will be queuing work

-        * to workqueues where again per-cpu variables accessed, besides

-        * fwnmi_release_errinfo() crashes when called in realmode on

-        * pseries.

-        * Note: All the realmode handling like flushing SLB entries for

-        *       SLB multihit is done by now.

-        */

 out:

-       msr = mfmsr();

-       mtmsr(msr | MSR_IR | MSR_DR);

-

        disposition = mce_handle_err_virtmode(regs, errp, mce_log,

                                              disposition);

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Now you are not in virtual mode/translations on when you are calling

mce_handle_err_virtmode(). From the name, I thought that

mce_handle_err_virtmode() would assume that you are in virtual mode?

Does the function assume that? If so is it safe to call it in real mode?

If not, should we rename it as part of this patch?</pre>

    </blockquote>

    <pre>patch 2/2, refactors this.

</pre>

    <blockquote type="cite"

      cite="mid:87lf1t3l0q.fsf@linkitivity.dja.id.au">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">-

-       /*

-        * Queue irq work to log this rtas event later.

-        * irq_work_queue uses per-cpu variables, so do this in virt

-        * mode as well.

-        */

-       irq_work_queue(&mce_errlog_process_work);

-

-       mtmsr(msr);

-

        return disposition;

 }

</pre>

      </blockquote>

    </blockquote>

    <pre>Thanks for the review :) .

Ganesh

</pre>

  </body>

</html>