[PATCH v4] vmcoreinfo: Track and log recoverable hardware errors

Dave Hansen dave.hansen at intel.com
Sat Aug 2 02:24:43 AEST 2025


On 8/1/25 08:13, Breno Leitao wrote:
> Hello Dave,
> 
> On Fri, Aug 01, 2025 at 07:52:17AM -0700, Dave Hansen wrote:
>> On 8/1/25 05:31, Breno Leitao wrote:
>>> Introduce a generic infrastructure for tracking recoverable hardware
>>> errors (HW errors that are visible to the OS but do not cause a panic)
>>> and record them for vmcore consumption.
>> ...
>>
>> Are there patches for the consumer side of this, too? Or do humans
>> looking at crash dumps have to know what to go digging for?
>>
>> In either case, don't we need documentation for this new ABI?
> 
> I have considered this, but the documentation for vmcoreinfo
> (admin-guide/kdump/vmcoreinfo.rst) solely documents what is explicitly
> exposed by vmcore, which differs from the nature of these counters.
> 
> Where would be a good place to document it?

I'm not picky. But you also didn't quite answer the question I was asking.

Is this new data for humans or machines to read?

>>> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>>>  	}
>>>  
>>>  out:
>>> +	/* Given it didn't panic, mark it as recoverable */
>>> +	hwerr_log_error_type(HWERR_RECOV_MCE);
>>> +
>>
>> Does "MCE" mean anything outside of x86?
> 
> AFAIK this is a MCE concept.

I'm not really sure what that response means.

There are two problems here. The first is that HWERR_RECOV_MCE is defined in
arch-generic code, but it may never get used by anything other than x86
with CONFIG_X86_MCE=y.

The second is that it completely wastes space in your data structure when
CONFIG_X86_MCE=n. Not a huge deal as-is, but it's still a bit sloppy
and wasteful.
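
To make that concrete, here is a rough userspace sketch of one way to avoid
the unused slot. It is not the code from the patch. The enum name, the
struct layout, HWERR_RECOV_OTHERS and HWERR_RECOV_MAX are assumptions made
up for illustration, the #define stands in for the real Kconfig symbol, and
time() stands in for ktime_get_real_seconds():

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical stand-in for the Kconfig symbol. Remove this line to
 * model a build with CONFIG_X86_MCE=n.
 */
#define CONFIG_X86_MCE 1

/* Assumed enum: the arch-specific entry only exists when it can be used. */
enum hwerr_error_type {
	HWERR_RECOV_OTHERS,	/* hypothetical generic slot */
#ifdef CONFIG_X86_MCE
	HWERR_RECOV_MCE,	/* only present when MCE support is built */
#endif
	HWERR_RECOV_MAX,	/* number of slots, sizes the array below */
};

/* Assumed per-source record: an event count and the last-seen time. */
struct hwerr_info {
	int count;
	time_t timestamp;
};

/* The array shrinks automatically when an entry is compiled out. */
static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

static void hwerr_log_error_type(enum hwerr_error_type src)
{
	/* Reject anything that is not a valid slot index. */
	assert(src < HWERR_RECOV_MAX);

	hwerr_data[src].count++;
	/* Userspace substitute for ktime_get_real_seconds(). */
	hwerr_data[src].timestamp = time(NULL);
}
```

With the #ifdef in place, a CONFIG_X86_MCE=n build has no HWERR_RECOV_MCE
slot at all, so callers outside x86 MCE code fail to compile instead of
silently carrying dead space around.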

...
>>> +	hwerr_data[src].count++;
>>> +	hwerr_data[src].timestamp = ktime_get_real_seconds();
>>> +}
>>> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
>>
>> I'd also love to hear more about _actual_ users of this. Surely, someone
>> hit a real world problem and thought this would be a nifty solution. Who
>> was that? What problem did they hit? How does this help them?
> 
> Yes, this has been extensively discussed in the very first version of
> the patch. Borislav raised the same question, which was discussed in the
> following link:
> 
> https://lore.kernel.org/all/20250715125327.GGaHZPRz9QLNNO-7q8@fat_crate.local/

When someone raises a concern, we usually try to alleviate the concern
in a way that is self-contained in the next posting. A cover letter with
a full explanation would be one place to put the reasoning, for example.

But expecting future reviewers to plod through all the old threads isn't
really feasible.

