[PATCH v2] vmcoreinfo: Track and log recoverable hardware errors
Borislav Petkov
bp at alien8.de
Mon Jul 21 23:57:18 AEST 2025
On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
>
> This patch adds centralized logging for three common sources of
Use imperative mood: "Add centralized ..."
> recoverable hardware errors:
>
> - PCIe AER Correctable errors
> - x86 Machine Check Exceptions (MCE)
> - APEI/CPER GHES corrected or recoverable errors
>
> hwerror_tracking is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
>
> >>> prog['hwerror_tracking']
> (struct hwerror_tracking_info [3]){
> {
> .count = (int)844,
> .timestamp = (time64_t)1752852018,
> },
> ...
>
I'm still missing the justification why rasdaemon can't be used here.
You did explain it already in past emails.
> +enum hwerror_tracking_source {
> + HWE_RECOV_AER,
> + HWE_RECOV_MCE,
> + HWE_RECOV_GHES,
> + HWE_RECOV_MAX,
> +};
Are we confident this separation will serve all cloud dudes?
> +
> +#ifdef CONFIG_VMCORE_INFO
> +void hwerror_tracking_log(enum hwerror_tracking_source src);
> +#else
> +void hwerror_tracking_log(enum hwerror_tracking_source src) {};
> +#endif
> +
> #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..23d7ddcd55cdd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
> /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
> static unsigned char *vmcoreinfo_data_safecopy;
>
> +struct hwerror_tracking_info {
> + int __data_racy count;
> + time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
> +
> Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len)
> {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
> }
> EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
> +void hwerror_tracking_log(enum hwerror_tracking_source src)
A function should have a verb in its name explaining what it does:
hwerr_log_error_type()
or so.
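
To make the naming suggestion concrete, here is a minimal userspace sketch of
what such a helper could look like. The function body is not quoted in this
mail, so the logic below (bump the per-source counter, record the time of the
last occurrence) is only an assumption based on the commit message. The name
hwerr_log_error_type() is just the example name from above, and int64_t with
time() stands in for the kernel's time64_t and its timekeeping helpers.

```c
#include <stdint.h>
#include <time.h>

/* Mirrors the enum quoted from the patch. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Userspace stand-in for the quoted struct; int64_t replaces time64_t. */
struct hwerror_tracking_info {
	int count;
	int64_t timestamp;
};

static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Hypothetical helper: the body is an assumption, not the patch's code.
 * It counts one recoverable error for @src and remembers when it happened.
 */
static void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Ignore out-of-range sources instead of writing past the array. */
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = (int64_t)time(NULL);
}
```

A caller in an error handler would then do hwerr_log_error_type(HWE_RECOV_MCE)
and nothing else, and the array is read later from the vmcore. The sketch is
single-threaded and makes no attempt to handle concurrent updates.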
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette