[PATCH 2/2] powerpc: Handle MCE on POWER9 with only DSISR bit 33 set

Balbir Singh bsingharora at gmail.com
Tue Sep 19 20:13:47 AEST 2017


On Fri, Sep 15, 2017 at 3:25 PM, Michael Neuling <mikey at neuling.org> wrote:
> On POWER9 DD2.1 and below, it's possible to get a Machine Check
> Exception (MCE) where only DSISR bit 33 is set. The Linux MCE handler
> then sees an unknown event, which causes Linux to crash.
>
> We change this by detecting unknown events in the MCE handler and
> marking them as handled so that we no longer crash. We do this only on
> chip revisions known to have this problem.
>
> Signed-off-by: Michael Neuling <mikey at neuling.org>
> ---
>  arch/powerpc/kernel/mce_power.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index b76ca198e0..72ec667136 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -595,6 +595,7 @@ static long mce_handle_error(struct pt_regs *regs,
>         uint64_t addr;
>         uint64_t srr1 = regs->msr;
>         long handled;
> +       unsigned long pvr;
>
>         if (SRR1_MC_LOADSTORE(srr1))
>                 handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
> @@ -604,6 +605,20 @@ static long mce_handle_error(struct pt_regs *regs,
>         if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
>                 handled = mce_handle_ue_error(regs);
>
> +       /*
> +        * On POWER9 DD2.1 and below, it's possible to get machine
> +        * check where only DSISR bit 33 is set. This will result in
> +        * the MCE handler seeing an unknown event and us crashing.
> +        * Change this to mark as handled on these revisions.
> +        */
> +       pvr = mfspr(SPRN_PVR);
> +       if (((PVR_VER(pvr) == PVR_POWER9) &&
> +            (PVR_CFG(pvr) == 2) &&
> +            (PVR_MIN(pvr) <= 1)) || cpu_has_feature(CPU_FTR_POWER9_DD1))
> +               /* DD2.1 and below */
> +               if (mce_err.error_type == MCE_ERROR_TYPE_UNKNOWN)
> +                   handled = 1;
> +

What does this mean in terms of handling? Since we do not know what the
event is, how do we recover from it or handle it? The system will crash
anyway in the not-recovered cases. Does this prevent us from seeing the
same exception several times?

Balbir Singh.

