[PATCH RESEND v5] vmcoreinfo: Track and log recoverable hardware errors

Hanjun Guo guohanjun at huawei.com
Fri Nov 21 13:47:32 AEDT 2025


On 2025/10/10 18:36, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that are visible to the OS but do not cause a panic)
> and record them for vmcore consumption. This aids post-mortem crash
> analysis tools by preserving a count and a timestamp of the last
> occurrence of such errors. Correctable errors, on the other hand, which
> the OS typically remains unaware of because the underlying hardware
> handles them transparently, are less relevant to crash dump analysis
> and are therefore NOT tracked in this infrastructure.
> 
> Add centralized logging for recoverable hardware errors, categorized
> by the subsystem through which each error was reported.
> 
> hwerror_data is write-only at kernel runtime, and it is meant to be read
> from the vmcore using tools like crash/drgn. For example, this is what it
> looks like when opening the crash dump in drgn:
> 
> 	>>> prog['hwerror_data']
> 	(struct hwerror_info[1]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
> 
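
The count-and-timestamp tracking described in the quoted text can be sketched in user space as follows. This is only an illustration under stated assumptions, not the code from the patch: the enum and its members are hypothetical, C11 atomics stand in for the kernel's atomic helpers, and time() stands in for the kernel clock. Only the names hwerror_data, hwerror_info, hwerr_log_error_type, count and timestamp come from the text above.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical error sources; the categories in the actual patch may differ. */
enum hwerr_source {
	HWERR_SRC_CPU,
	HWERR_SRC_MEMORY,
	HWERR_SRC_PCI,
	HWERR_SRC_OTHERS,
	HWERR_SRC_MAX,
};

/* One slot per source: how many errors were seen, and when the last one hit. */
struct hwerror_info {
	atomic_int count;
	_Atomic int64_t timestamp;	/* seconds since the epoch */
};

/* Written at runtime, meant to be read only from a dump afterwards. */
struct hwerror_info hwerror_data[HWERR_SRC_MAX];

/* Record one recoverable error for the given source; ignore bad indices. */
void hwerr_log_error_type(enum hwerr_source src)
{
	if ((int)src < 0 || src >= HWERR_SRC_MAX)
		return;

	atomic_fetch_add(&hwerror_data[src].count, 1);
	atomic_store(&hwerror_data[src].timestamp, (int64_t)time(NULL));
}
```

A reporting path would call hwerr_log_error_type() once per recoverable event. Nothing reads the array at runtime, so a plain increment and store per slot is enough for this sketch.
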
> This helps fleet operators quickly triage whether a crash may have been
> influenced by recoverable hardware errors (which exercise uncommon
> code paths in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool").
> 
> This is not intended to replace full hardware diagnostics, but it provides
> a fast way to correlate hardware events with kernel panics.
> 
> Rare machine check exceptions—like those indicated by mce_flags.p5 or
> mce_flags.winchip—are not accounted for in this method, as they fall
> outside the intended usage scope for this feature’s user base.
> 
> Suggested-by: Tony Luck <tony.luck at intel.com>
> Suggested-by: Shuai Xue <xueshuai at linux.alibaba.com>
> Signed-off-by: Breno Leitao <leitao at debian.org>
> Reviewed-by: Shuai Xue <xueshuai at linux.alibaba.com>
> ---
> Changes in v5:
> - Move the headers to uapi file (Dave Hansen)
> - Use atomic operations in the tracking struct (Dave Hansen)
> - Drop the MCE enum type, and track MCE errors as "others"
> - Document this feature better
> - Link to v4: https://lore.kernel.org/r/20250801-vmcore_hw_error-v4-1-fa1fe65edb83@debian.org
> 
> Changes in v4:
> - Split the error by hardware subsystem instead of kernel
>    subsystem/driver (Shuai)
> - Do not count the corrected errors, only focusing on recoverable errors (Shuai)
> - Link to v3: https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17@debian.org
> 
> Changes in v3:
> - Add more information about this feature in the commit message
>    (Borislav Petkov)
> - Rename the function to hwerr_log_error_type() and use hwerr as
>    prefix (Borislav Petkov)
> - Make the empty function static inline (kernel test robot)
> - Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org
> 
> Changes in v2:
> - Split the counter by recoverable error (Tony Luck)
> - Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
> ---
>   Documentation/driver-api/hw-recoverable-errors.rst | 60 ++++++++++++++++++++++
>   arch/x86/kernel/cpu/mce/core.c                     |  4 ++
>   drivers/acpi/apei/ghes.c                           | 36 +++++++++++++

For the APEI part,

Reviewed-by: Hanjun Guo <guohanjun at huawei.com>

Thanks
Hanjun

