[PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

Tue Jul 22 01:43:24 AEST 2025

Hello Borislav,

On Mon, Jul 21, 2025 at 03:57:18PM +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> > 
> > This patch adds centralized logging for three common sources of
> 
> "Add centralized... "

Ack!

> > recoverable hardware errors:
> > 
> >   - PCIe AER Correctable errors
> >   - x86 Machine Check Exceptions (MCE)
> >   - APEI/CPER GHES corrected or recoverable errors
> > 
> > hwerror_tracking is write-only at kernel runtime, and it is meant to be
> > read from vmcore using tools like crash/drgn. For example, this is how
> > it looks like when opening the crashdump from drgn.
> > 
> > 	>>> prog['hwerror_tracking']
> > 	(struct hwerror_tracking_info [3]){
> > 		{
> > 			.count = (int)844,
> > 			.timestamp = (time64_t)1752852018,
> > 		},
> > 		...
> > 
> 
> I'm still missing the justification why rasdaemon can't be used here.
> You did explain it already in past emails.

Sorry, I will add that justification to the commit message.

> > +enum hwerror_tracking_source {
> > +	HWE_RECOV_AER,
> > +	HWE_RECOV_MCE,
> > +	HWE_RECOV_GHES,
> > +	HWE_RECOV_MAX,
> > +};
> 
> Are we confident this separation will serve all cloud dudes?

I am not, but I've added them to the CC list of this patch, so they are
more than welcome to chime in.

> > +void hwerror_tracking_log(enum hwerror_tracking_source src)
> 
> A function should have a verb in its name explaining what it does:
> 
> hwerr_log_error_type()
> 
> or so.

Ack!
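
For reference, here is a minimal userspace sketch of what such a helper
could look like under the suggested name. The enum and the field names
come from the quoted patch and drgn output above. The function body, the
bounds check, and the use of time() in place of a kernel time source are
assumptions for illustration, not the actual patch code.

```c
#include <assert.h>
#include <time.h>

/* Error sources, as in the enum quoted above. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/*
 * Field names follow the drgn output quoted above. The kernel side uses
 * time64_t; time_t stands in for it in this userspace sketch.
 */
struct hwerror_tracking_info {
	int count;
	time_t timestamp;
};

/* One slot per source, matching the [3] array shown by drgn. */
static_assert(HWE_RECOV_MAX == 3, "one tracking slot per error source");

static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Name taken from the review suggestion; the body is an assumption.
 * Bumps the per-source counter and records the time of the last event.
 */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Ignore out-of-range sources instead of writing past the array. */
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = time(NULL);
}
```

Since the array is only written at runtime and read later from the
vmcore, the sketch does not provide any reader interface.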

I will wait a bit more and send an updated version.

Thanks for the review
--breno