[PATCH v2] vmcoreinfo: Track and log recoverable hardware errors
Breno Leitao
leitao at debian.org
Tue Jul 22 01:43:24 AEST 2025
Hello Borislav,
On Mon, Jul 21, 2025 at 03:57:18PM +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> >
> > This patch adds centralized logging for three common sources of
>
> "Add centralized... "
Ack!
> > recoverable hardware errors:
> >
> > - PCIe AER Correctable errors
> > - x86 Machine Check Exceptions (MCE)
> > - APEI/CPER GHES corrected or recoverable errors
> >
> > hwerror_tracking is write-only at kernel runtime, and it is meant to be
> > read from vmcore using tools like crash/drgn. For example, this is how
> > it looks like when opening the crashdump from drgn.
> >
> > >>> prog['hwerror_tracking']
> > (struct hwerror_tracking_info [3]){
> > {
> > .count = (int)844,
> > .timestamp = (time64_t)1752852018,
> > },
> > ...
> >
>
> I'm still missing the justification why rasdaemon can't be used here.
> You did explain it already in past emails.
Sorry, I will update it.
> > +enum hwerror_tracking_source {
> > + HWE_RECOV_AER,
> > + HWE_RECOV_MCE,
> > + HWE_RECOV_GHES,
> > + HWE_RECOV_MAX,
> > +};
>
> Are we confident this separation will serve all cloud dudes?
I am not, but I've added them to the CC list of this patch, so they are
more than welcome to chime in.
> > +void hwerror_tracking_log(enum hwerror_tracking_source src)
>
> A function should have a verb in its name explaining what it does:
>
> hwerr_log_error_type()
>
> or so.
Ack!
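To make the discussion concrete, here is a minimal userspace sketch of
the kind of bookkeeping described above: one counter and one timestamp
per error source. The enum and the struct field names come from the
quoted patch and drgn output; the function name is the one suggested in
the review. The function body, the use of time() and int64_t in place of
kernel timekeeping and time64_t, and the assert() range check are
assumptions for illustration, not the patch's actual code. Concurrency
is ignored here.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Sources named in the patch; HWE_RECOV_MAX sizes the array. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Field names follow the drgn output quoted earlier in the thread. */
struct hwerror_tracking_info {
	int count;
	int64_t timestamp;	/* userspace stand-in for time64_t */
};

/* One slot per source, written at runtime and read from the dump. */
static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Hypothetical body: bump the per-source counter and remember when the
 * last error of that type was seen.
 */
static void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Userspace stand-in for a kernel-side range check. */
	assert((unsigned int)src < HWE_RECOV_MAX);

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = (int64_t)time(NULL);
}
```

Each error handler (AER, MCE, GHES) would call hwerr_log_error_type()
with its own source, and a tool reading the dump would index the array
by the same enum values.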
I will wait a bit longer for more feedback and then send an updated version.
Thanks for the review.
--breno