Printing of machine check severity
Michael Ellerman
mpe at ellerman.id.au
Tue Mar 5 17:57:47 AEDT 2019
Hi all,
RE: https://github.com/linuxppc/issues/issues/230
> Host dmesg throws lot of below SLB [Multihit] HMI's
>
> [295216.837358] Severe Machine check interrupt [Recovered]
> [295216.837365] Harmless Hypervisor Maintenance interrupt [Recovered]
> [295216.837374] Guest NIP: c00000000024a7dc
> [295216.837378] Error detail: Processor Recovery done
> [295216.837381] HMER: 2040000000000000
> [295216.837388] Initiator: CPU
> [295216.837406] Error type: SLB [Multihit]
> [295216.837415] Effective address: d00000000316c400
Paul points out that these aren't severe errors from the host's point of
view, and possibly not even for the guest.
I think the key problem here is that we print "Severe" for most types of
MCEs, even though some really aren't.
That comes from the severity being set to `MCE_SEV_ERROR_SYNC` in the
i/derror tables.
All the enum values are prefixed with `MCE_SEV_`, so the value is really
just `ERROR_SYNC`, which I think means "synchronous error". That is correct.
But I don't think it's correct that all synchronous errors are "severe".
We also have some errors in `mce_ierror_table` that are marked
`MCE_SEV_FATAL` and then have a comment saying `/* ASYNC is fatal */`.
So I feel like we have severity and sync/async conflated in the severity
value, i.e. we should split out sync/async and then have a separate
severity field.
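A rough sketch of what that split could look like, in standalone C rather than actual kernel code. Every name here (`enum mce_sev_sketch`, `struct mce_class_sketch`, `mce_is_sync_sketch()` and so on) is hypothetical and made up for illustration, and so are the severity strings other than "Harmless" and "Severe", which appear in the log above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical: severity only, with no sync/async meaning folded in. */
enum mce_sev_sketch {
	SEV_SKETCH_NO_ERROR,
	SEV_SKETCH_WARNING,
	SEV_SKETCH_SEVERE,
	SEV_SKETCH_FATAL,
};

/* Hypothetical table entry: severity and synchronicity are separate fields. */
struct mce_class_sketch {
	const char *name;
	enum mce_sev_sketch severity;
	bool sync;	/* true if the error is synchronous */
};

/* With the split, an SLB multi-hit can be synchronous and still only a warning. */
static const struct mce_class_sketch slb_multihit_sketch = {
	.name = "SLB [Multihit]",
	.severity = SEV_SKETCH_WARNING,
	.sync = true,
};

/* Callers that care about synchronicity test the flag, not the severity. */
static bool mce_is_sync_sketch(const struct mce_class_sketch *c)
{
	return c->sync;
}

/* The printed string depends on severity alone. */
static const char *mce_sev_sketch_str(enum mce_sev_sketch sev)
{
	switch (sev) {
	case SEV_SKETCH_NO_ERROR:
		return "Harmless";
	case SEV_SKETCH_WARNING:
		return "Warning";
	case SEV_SKETCH_SEVERE:
		return "Severe";
	case SEV_SKETCH_FATAL:
		return "Fatal";
	}
	return "Unknown";
}
```

The point of the separate `sync` flag is that code which currently compares against `MCE_SEV_ERROR_SYNC` would have something else to test, so changing an entry's severity would not change its behaviour.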
We need to be careful because a few places check for `MCE_SEV_ERROR_SYNC`;
it's not *only* used for the severity string.
We could then mark e.g. SLB multi-hits as warning rather than severe.
Additionally we probably want to use the `in_guest` flag to modulate the
severity or the message, or both.
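One possible shape for the `in_guest` idea, again as a standalone C sketch and not a proposal for the real code. The policy shown (a non-fatal error taken in a guest is logged by the host as a warning, and the message gets a "Guest: " prefix) is only an assumption to make the idea concrete, and `host_log_severity()`, `format_headline()` and the enum are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical severity levels, independent of sync/async. */
enum sev_sketch { SEV_HARMLESS, SEV_WARNING, SEV_SEVERE, SEV_FATAL };

/*
 * Assumed policy: from the host's point of view, a "severe" error that
 * was taken in a guest is downgraded to a warning. Fatal stays fatal.
 */
static enum sev_sketch host_log_severity(enum sev_sketch sev, bool in_guest)
{
	if (in_guest && sev == SEV_SEVERE)
		return SEV_WARNING;
	return sev;
}

/*
 * Build the headline line, modulating both the severity and the message.
 * Returns the snprintf() result.
 */
static int format_headline(char *buf, size_t len, enum sev_sketch sev,
			   bool in_guest, bool recovered)
{
	static const char *const names[] = {
		"Harmless", "Warning", "Severe", "Fatal",
	};

	return snprintf(buf, len, "%s%s Machine check interrupt [%s]",
			in_guest ? "Guest: " : "",
			names[host_log_severity(sev, in_guest)],
			recovered ? "Recovered" : "Not recovered");
}
```

With this policy the headline from the log above would come out as "Guest: Warning Machine check interrupt [Recovered]" on the host, while the same error taken in the host itself would still say "Severe".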
cheers