[RFC PATCH 3/3] powernv/mce: print additional information about mce error.

Michael Ellerman mpe at ellerman.id.au
Fri Mar 29 12:31:57 AEDT 2019


Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> Print more information about an mce error, indicating whether it is a
> hardware or software error.
>
> Some of the mce errors can be easily categorized as hardware or software
> errors, e.g. UEs are due to hardware error, whereas an error triggered by
> invalid usage of tlbie is a pure software bug. But not all the mce errors
> can be easily categorized as either software or hardware. There are errors
> like multihit errors which are usually the result of a software bug, but in
> some rare cases a hardware failure can cause a multihit error. In the past,
> we have seen a case where multihit errors stopped occurring after a faulty
> chip was replaced. The same goes for parity errors, which are usually due to
> faulty hardware, but there is a chance that a multihit can also cause a
> parity error. For such errors it is difficult to determine what really
> caused them. Hence this patch classifies mce errors into the following four
> categories:
> 	1. Hardware error:
> 		UE and Link timeout failure errors.
> 	2. Hardware error, small probability of software cause:
> 		SLB/ERAT/TLB Parity errors.
> 	3. Software error:
> 		Invalid tlbie form.
> 	4. Software error, small probability of hardware failure:
> 		SLB/ERAT/TLB Multihit errors.

I like the idea, but I think the phrasing is a little confusing.

> Sample o/p:
>
> [ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 00007fff9a59dc60 DAR: 000001003d740320 [Recovered]
> [ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
> [ 1259.331345] MCE: CPU40: Software error, small probability of hardware failure

"Software error, small probability of hardware failure"

That can be read as "there has been a software error, *and now* there is
a small probability of a hardware failure".

I also think "probability" suggests we actually know the mathematical
probability of it being a hardware failure, which is not true.

Instead, maybe we could use:

	"Hardware error",
	"Probable hardware error (some chance of software cause)",
	"Software error",
	"Probable software error (some chance of hardware cause)",

??
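
As a rough sketch of how those four strings could be wired up, something
like the following might work. The enum and helper names here are made up
for illustration and are not taken from the patch; the real code would use
whatever classification the series already defines and would print via
printk rather than snprintf.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error classes, one per suggested message. */
enum mce_error_class {
	MCE_ECLASS_HARDWARE,
	MCE_ECLASS_HARD_INDETERMINATE,
	MCE_ECLASS_SOFTWARE,
	MCE_ECLASS_SOFT_INDETERMINATE,
	MCE_ECLASS_COUNT
};

/* Message table indexed by class, using the wording proposed above. */
static const char *const mce_error_class_str[MCE_ECLASS_COUNT] = {
	[MCE_ECLASS_HARDWARE] =
		"Hardware error",
	[MCE_ECLASS_HARD_INDETERMINATE] =
		"Probable hardware error (some chance of software cause)",
	[MCE_ECLASS_SOFTWARE] =
		"Software error",
	[MCE_ECLASS_SOFT_INDETERMINATE] =
		"Probable software error (some chance of hardware cause)",
};

/* Return the message for a class, or "Unknown" if it is out of range. */
static const char *mce_error_class_name(enum mce_error_class cls)
{
	if ((unsigned int)cls >= MCE_ECLASS_COUNT)
		return "Unknown";
	return mce_error_class_str[cls];
}

/*
 * Format the classification line in the same "MCE: CPU<n>: ..." style as
 * the sample output. Userspace stand-in for the kernel's printk path.
 */
static int mce_format_class_line(char *buf, size_t len, int cpu,
				 enum mce_error_class cls)
{
	return snprintf(buf, len, "MCE: CPU%d: %s", cpu,
			mce_error_class_name(cls));
}
```

Keeping the strings in one table indexed by class means the wording can be
changed in a single place if we settle on something different.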

cheers