[PATCH] PCI/AER: Add option to panic on unrecoverable errors
Keith Busch
kbusch at kernel.org
Sat Feb 7 16:55:13 AEDT 2026
On Fri, Feb 06, 2026 at 09:53:39PM +0100, Lukas Wunner wrote:
> On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> > On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > > Are there any other similar flags you already use that we could
> > > piggy-back on? E.g., if we raised the level to KERN_WARNING, maybe
> > > the existing "panic_on_warn" would be enough?
> >
> > There are many KERN_WARNING messages that don't rise to the level of
> > warranting a 'panic', so we don't want to enable such an option in
> > production. It looks like panic_on_warn was introduced for developer
> > debugging.
>
> panic_on_warn springs into action on WARN() splats, not arbitrary
> messages with KERN_WARNING severity. Also, sysctl kernel.warn_limit
> may be used to grant a certain number of panic-free WARNs.
Okay, but the warn panic param still isn't an option for production.
> FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
> by this patch feels somewhat oddly named. Something like
> "pci.panic_on_fatal" might be clearer and more succinct.
Naming is hard; thanks for the suggestion.
> > I agree the current INFO level is too low for the generic unrecovered
> > condition, though.
>
> At least for unbound devices, I think 918b4053184c went way too far.
> I think an unbound device should generally be considered recoverable
> through a reset.
Yes, I agree, especially considering the generic probe saves a
checkpoint of the state that we can restore to that is consistent with
the kernel's view. There's no clear reason to fail recovery when there's
no bound driver, so changing that behavior is a good idea.
> As for bound devices whose drivers lack pci_error_handlers, it has been
> painful in practice that they're considered unrecoverable wholesale.
Yes, it gets tricky when there is a bound driver; there's no telling
whether or not it may initiate a broken transaction with cascading
consequences for the rest of the system if anything in the chain is not
cooperating with the error recovery orchestration. I don't know if there
is a best default action, so allowing it to be user defined seems okay.
> E.g. GPUs often expose an audio device as well as telemetry devices,
> all arranged below an integrated PCIe switch. All of these devices
> need drivers with pci_error_handlers in order for the GPU to be
> recoverable. In some cases, dummy callbacks were added to render
> the whole thing recoverable.
This experience sounds familiar, and it really does appear that a hard
reboot is the best outcome in many cases because orchestrating all the
components to recover is not going to happen. Hence the reboot param.
> So I wouldn't consider 918b4053184c to have been a universally successful
> approach and I fear that this patch goes even further.
If anyone goes through the effort of fixing that, will it be considered?
You told me in Vienna LPC '24 that you'd help resolve the pci hotplug
deadlocks that have been plaguing pci for the last 10 years, but not a
single comment has appeared despite multiple complete and validated
solutions being offered.