Questions: Should kernel panic when PCIe fatal error occurs?

Wed Sep 27 13:01:19 AEST 2023

On 2023/9/27 07:02, Bjorn Helgaas wrote:
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
>> ...
> 
>> Actually, this is a question from my colleague from firmware team.
>> The original question is that:
>>
>>     "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
>>     PCIe fatal error is detected? If set, kernel will always panic.
>>     Otherwise, kernel will always not panic."
>>
>> So I pull a question about desired behavior of Linux kernel first :)
>> From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
>> Status Block is not reasonable. The kernel will attempt to recover
>> Fatal errors, although recovery may fail.
> 
> I don't know the semantics of CPER_SEV_FATAL or why it's there.
> With CPER, we have *two* error severities: a "native" one defined by
> the PCIe spec and another defined by the platform via CPER.
> 
> I speculate that the reason for the CPER severity could be to provide
> a severity for error sources that don't have a "native" severity like
> AER does, or for the vendor to force the OS to restart (for
> CPER_SEV_FATAL, anyway) in cases where it might not otherwise.

Agreed, that is the key point.

Per ACPI 6.5 18.1 Hardware Errors and Error Sources[1]:

	- An uncorrected error is a hardware error condition that cannot be
	corrected by the hardware or by the firmware. Uncorrected errors
	are either fatal or non-fatal.

	- A fatal hardware error is an uncorrected or uncontained error
	condition that is determined to be unrecoverable by the hardware.
	When a fatal uncorrected error occurs, the system is restarted to
	prevent propagation of the error.

	- A non-fatal hardware error is an uncorrected error condition from
	which OSPM can attempt recovery by trying to correct the error.
	These are also referred to as correctable or recoverable errors.

Based on our discussion and the PCIe and ACPI Specs:

- A native AER fatal error as defined by PCIe does not indicate that there
is uncontained data corruption.
- The kernel is capable of handling native AER fatal and non-fatal errors.
- When a CPER_SEV_FATAL error is notified by firmware, it indicates that the
platform wants to force the OS to restart, and the APEI/GHES driver follows
the Spec now.

(Please correct me if I have misunderstood anything.)
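
To make sure we mean the same thing, here is how I would sketch the summary
above in C. The CPER_SEV_* values follow the CPER severity encoding; every
sketch_* / SKETCH_* name is made up for illustration, and this is a
simplified model of the behavior we discussed, not the actual GHES or AER
code:

```c
#include <stdbool.h>

/* CPER severity encoding, as used for the Generic Error Status Block. */
enum cper_sev {
	CPER_SEV_RECOVERABLE   = 0,
	CPER_SEV_FATAL         = 1,
	CPER_SEV_CORRECTED     = 2,
	CPER_SEV_INFORMATIONAL = 3,
};

/* Native PCIe AER severity. Hypothetical names, for this sketch only. */
enum sketch_aer_sev {
	SKETCH_AER_CORRECTABLE,
	SKETCH_AER_NONFATAL,
	SKETCH_AER_FATAL,
};

/*
 * Hypothetical helper: would the OS panic for this PCIe error?
 *
 * - Firmware-first (CPER present): only the platform-assigned block
 *   severity decides. CPER_SEV_FATAL means "force a restart".
 * - Native AER (no CPER at all): the PCIe severity alone never forces a
 *   panic in this model; the kernel attempts recovery even for a fatal
 *   error, although that recovery may fail.
 */
static bool sketch_should_panic(bool firmware_first,
				enum cper_sev block_sev,
				enum sketch_aer_sev aer_sev)
{
	if (firmware_first)
		return block_sev == CPER_SEV_FATAL;

	(void)aer_sev;	/* the native severity picks the recovery path only */
	return false;
}
```

So a firmware-first PCIe fatal error reported with CPER_SEV_RECOVERABLE would
go through recovery, while the same error reported with CPER_SEV_FATAL would
panic.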

> 
> In the native case, we only have the PCIe severity and don't have the
> CPER severity at all, and I suspect that unless there's uncontained
> data corruption, we would rather handle even the most severe PCIe
> fatal error by disabling the specific device(s) instead of panicking
> and restarting the whole machine.
> 
> So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
> unless the platform wants to force the OS to panic, e.g., maybe the
> platform knows about data corruption and/or the vendor wants the OS to
> panic as part of a reliability story.

So back to the original question, I think your above comments are clear enough.

> 
> Presumably the platform has already logged the error, and I assume the
> platform *could* restart without even returning to the OS, but maybe
> it wants the OS to do a crashdump or shutdown in a more orderly way.
> 

If the platform resets the system without even returning to the OS, the
reset is not visible to the end user. IMHO, that is always a bad choice.
The OS can provide enhanced debuggability, for example:

- providing details about the runtime context through crashdump
- saving error information to persistent storage

Thank you for your patience and valuable feedback. It is greatly appreciated
and truly helpful.

Best Regards and Cheers.
Shuai

[1] https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#hardware-errors-and-error-sources