[PATCH] powerpc/64s: Report SLB multi-hit rather than parity error
Nicholas Piggin
npiggin at gmail.com
Sun Jun 17 22:07:57 AEST 2018
On Fri, 15 Jun 2018 21:37:15 +1000
Michael Ellerman <mpe at ellerman.id.au> wrote:
> Nicholas Piggin <npiggin at gmail.com> writes:
> > On Wed, 13 Jun 2018 23:24:14 +1000
> > Michael Ellerman <mpe at ellerman.id.au> wrote:
> >
> >> When we take an SLB multi-hit on bare metal, we see both the multi-hit
> >> and parity error bits set in DSISR. The user manuals indicates this is
> >> expected to always happen on Power8, whereas on Power9 it says a
> >> multi-hit will "usually" also cause a parity error.
> >>
> >> We decide what to do based on the various error tables in mce_power.c,
> >> and because we process them in order and only report the first, we
> >> currently always report a parity error but not the multi-hit, eg:
> >>
> >> Severe Machine check interrupt [Recovered]
> >> Initiator: CPU
> >> Error type: SLB [Parity]
> >> Effective address: c000000ffffd4300
> >>
> >> Although this is correct, it leaves the user wondering why they got a
> >> parity error. It would be clearer instead if we reported the
> >> multi-hit because that is more likely to be simply a software bug,
> >> whereas a true parity error is possibly an indication of a bad core.
> >>
> >> We can do that simply by reordering the error tables so that multi-hit
> >> appears before parity. That doesn't affect the error recovery at all,
> >> because we flush the SLB either way.
> >
> > Yeah this is a good idea. I wonder if there are any other conditions
> > like this that should be reordered.
>
> Yeah good point, this one just caught my eye because I was testing it.
> Ideally it wouldn't matter and we could actually report multiple, but
> that would be a bit of a bigger change.
Yep this patch looks fine for a minimal fix.
>
> > I think the i-side should not have to be changed here because it
> > matches the value not bits, so that shouldn't matter.
>
> Ah OK, will check.
>
> > A bit of a shame we don't report i/d side, and ideally we'd be able
> > to report multiple conditions. The reporting APIs really want to be
> > massaged a bit, but for now this is a good step.
>
> Ah snap, yep, more detail & multiple conditions would be nice.
>
> I don't really understand the way we do the reporting now. The
> struct machine_check_event is all carefully laid out with reserved
> fields and a version number and everything as if it's an ABI. But AFAICS
> it's purely internal to the kernel.
>
> And then we have struct mce_error_info, but that's a separate thing and
> struct machine_check_event doesn't contain one of them?
Yeah I noticed that too a while back, was it an old OPAL API or maybe a
proposed new API that was never implemented? I would like to end up
doing most MCE decoding in firmware at some point, but I don't think
it's worth keeping this existing ABI thing around for it.
Thanks,
Nick
More information about the Linuxppc-dev
mailing list