[PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

Mon Jul 21 23:57:18 AEST 2025

On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
> 
> This patch adds centralized logging for three common sources of

"Add centralized... "

> recoverable hardware errors:
> 
>   - PCIe AER Correctable errors
>   - x86 Machine Check Exceptions (MCE)
>   - APEI/CPER GHES corrected or recoverable errors
> 
> hwerror_tracking is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
> 
> 	>>> prog['hwerror_tracking']
> 	(struct hwerror_tracking_info [3]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
> 

I'm still missing the justification for why rasdaemon can't be used here.
You did explain it already in past emails.

> +enum hwerror_tracking_source {
> +	HWE_RECOV_AER,
> +	HWE_RECOV_MCE,
> +	HWE_RECOV_GHES,
> +	HWE_RECOV_MAX,
> +};

Are we confident this separation will serve all cloud dudes?

> +
> +#ifdef CONFIG_VMCORE_INFO
> +void hwerror_tracking_log(enum hwerror_tracking_source src);
> +#else
> +void hwerror_tracking_log(enum hwerror_tracking_source src) {};
> +#endif
> +
>  #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..23d7ddcd55cdd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>  /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>  static unsigned char *vmcoreinfo_data_safecopy;
>  
> +struct hwerror_tracking_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
> +
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len)
>  {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>  
> +void hwerror_tracking_log(enum hwerror_tracking_source src)

A function should have a verb in its name explaining what it does:

hwerr_log_error_type()

or so.
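
To illustrate the naming only, here is a rough userspace sketch. The enum and
struct names are taken from the quoted hunks; the function body is not shown
in the quote above, so the counter bump and timestamp update are assumptions,
as are the bounds check and the use of time() as a stand-in for whatever
kernel time source the patch actually uses.

```c
#include <time.h>

/* Same sources as in the quoted patch. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Userspace stand-in: plain int/time_t instead of the kernel types. */
struct hwerror_tracking_info {
	int count;
	time_t timestamp;
};

static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Verb-in-the-name variant: "log" says what the function does.
 * Assumed behaviour: bump the per-source counter and remember when
 * the last error of that type was seen.
 */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Hypothetical guard, not taken from the patch. */
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = time(NULL);
}
```

A caller in, say, the AER path would then read as
hwerr_log_error_type(HWE_RECOV_AER), which makes the action obvious at the
call site.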

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette