[PATCH] PCI/AER: Add option to panic on unrecoverable errors

Breno Leitao leitao at debian.org
Sat Feb 7 05:23:11 AEDT 2026


When a device lacks an error_detected callback, AER recovery fails and
the device is left in a disconnected state. This can mask serious
hardware issues during development and testing.

Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
instead, making such failures immediately visible. The parameter
defaults to false to preserve existing behavior.

Signed-off-by: Breno Leitao <leitao at debian.org>
---
In environments where all hardware must be fully operational, silently
leaving a device in a disconnected state after an AER recovery failure
is unacceptable. This is common in high-reliability systems, production
servers, and testing infrastructure where a degraded system should not
continue running.

This patch adds a module parameter that allows administrators to enforce
a strict policy: if a device cannot recover from an AER error, the
kernel panics instead of continuing with degraded hardware. This ensures
that hardware failures are immediately visible and can trigger
appropriate remediation (restart, failover, alerting).
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
 drivers/pci/pcie/err.c                          | 3 +++
 drivers/pci/pcie/portdrv.c                      | 7 +++++++
 drivers/pci/pcie/portdrv.h                      | 1 +
 4 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1058f2a6d6a8c..ff95c24280e3c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5240,6 +5240,15 @@ Kernel parameters
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcieportdrv.aer_unrecoverable_fatal=
+			[PCIE] Panic on unrecoverable AER errors:
+		0	Log the error and leave the device in a disconnected
+			state (default).
+		1	Panic the kernel when a device cannot recover from an
+			AER error (no error_detected callback). Useful for
+			high-reliability systems where degraded hardware is
+			unacceptable.
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..788484791902e 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 			pci_info(dev, "can't recover (no error_detected callback)\n");
+			if (aer_unrecoverable_fatal)
+				panic("AER: %s: no error_detected callback\n",
+				      pci_name(dev));
 		} else {
 			vote = PCI_ERS_RESULT_NONE;
 		}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 38a41ccf79b9a..a411f60ff50ce 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -22,6 +22,13 @@
 #include "../pci.h"
 #include "portdrv.h"
 
+#ifdef CONFIG_PCIEAER
+bool aer_unrecoverable_fatal;
+module_param(aer_unrecoverable_fatal, bool, 0644);
+MODULE_PARM_DESC(aer_unrecoverable_fatal,
+		 "Panic if a device cannot recover from an AER error (default: false)");
+#endif
+
 /*
  * The PCIe Capability Interrupt Message Number (PCIe r3.1, sec 7.8.2) must
  * be one of the first 32 MSI-X entries.  Per PCI r3.0, sec 6.8.3.1, MSI
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index bd29d1cc7b8bd..6c67b18de93c9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -29,6 +29,7 @@ extern bool pcie_ports_dpc_native;
 
 #ifdef CONFIG_PCIEAER
 int pcie_aer_init(void);
+extern bool aer_unrecoverable_fatal;
 #else
 static inline int pcie_aer_init(void) { return 0; }
 #endif

---
base-commit: 6bd9ed02871f22beb0e50690b0c3caf457104f7c
change-id: 20260206-pci-362cf172187f

Best regards,
--  
Breno Leitao <leitao at debian.org>



More information about the Linuxppc-dev mailing list