[RFC PATCH 3/3] powernv/mce: print additional information about mce error.
mpe at ellerman.id.au
Fri Mar 29 12:31:57 AEDT 2019
Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> Print more information about an mce error, indicating whether it is a
> hardware or software error.
> Some mce errors can be easily categorized as hardware or software
> errors, e.g. UEs are due to a hardware error, whereas an error triggered by
> invalid usage of tlbie is a pure software bug. But not all mce errors
> can be easily categorized as either software or hardware. There are errors
> like multihit errors, which are usually the result of a software bug, but in
> some rare cases a hardware failure can cause a multihit error. In the past, we
> have seen a case where multihit errors stopped occurring after a faulty chip
> was replaced. The same goes for parity errors, which are usually due to faulty
> hardware, but there is a chance that a multihit can also cause a parity error.
> For such errors it is difficult to determine what really caused them. Hence
> this patch classifies mce errors into the following four categories:
> 1. Hardware error:
> UE and Link timeout failure errors.
> 2. Hardware error, small probability of software cause:
> SLB/ERAT/TLB Parity errors.
> 3. Software error:
> Invalid tlbie form.
> 4. Software error, small probability of hardware failure:
> SLB/ERAT/TLB Multihit errors.
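For illustration only, the four categories above could be expressed as a lookup from error type to message. This is a minimal sketch, not the code from the patch: the `sketch_mce_type` enum, its members, and `sketch_mce_cause()` are hypothetical names invented here, and the "Unknown" fallback is an assumption. The message strings are the category labels from the description above.

```c
/*
 * Hypothetical error types, invented for this sketch. They are not the
 * kernel's actual machine check enums.
 */
enum sketch_mce_type {
	SKETCH_MCE_UE,
	SKETCH_MCE_LINK_TIMEOUT,
	SKETCH_MCE_SLB_PARITY,
	SKETCH_MCE_ERAT_PARITY,
	SKETCH_MCE_TLB_PARITY,
	SKETCH_MCE_INVALID_TLBIE,
	SKETCH_MCE_SLB_MULTIHIT,
	SKETCH_MCE_ERAT_MULTIHIT,
	SKETCH_MCE_TLB_MULTIHIT,
};

/* Map an error type to one of the four category strings. */
static const char *sketch_mce_cause(enum sketch_mce_type type)
{
	switch (type) {
	case SKETCH_MCE_UE:
	case SKETCH_MCE_LINK_TIMEOUT:
		/* Category 1 */
		return "Hardware error";
	case SKETCH_MCE_SLB_PARITY:
	case SKETCH_MCE_ERAT_PARITY:
	case SKETCH_MCE_TLB_PARITY:
		/* Category 2 */
		return "Hardware error, small probability of software cause";
	case SKETCH_MCE_INVALID_TLBIE:
		/* Category 3 */
		return "Software error";
	case SKETCH_MCE_SLB_MULTIHIT:
	case SKETCH_MCE_ERAT_MULTIHIT:
	case SKETCH_MCE_TLB_MULTIHIT:
		/* Category 4 */
		return "Software error, small probability of hardware failure";
	}
	/* Assumption: anything unclassified gets a neutral label. */
	return "Unknown";
}
```

Keeping the strings in one function like this would mean any rewording of the messages touches a single place.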
I like the idea, but I think the phrasing is a little confusing.
> Sample output:
> [ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 00007fff9a59dc60 DAR: 000001003d740320 [Recovered]
> [ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
> [ 1259.331345] MCE: CPU40: Software error, small probability of hardware failure
"Software error, small probability of hardware failure"
That can be read as "there has been a software error, *and now* there is
a small probability of a hardware failure".
I also think "probability" suggests we actually know the mathematical
probability of it being a hardware failure, which is not true.
Instead, maybe we could use:
"Probable hardware error (some chance of software cause)",
"Probable software error (some chance of hardware cause)".