[PATCH 1/2] PCI: Ensure error recoverability at all times
Rafael J. Wysocki
rafael at kernel.org
Fri Nov 14 08:03:42 AEDT 2025
On Thu, Nov 13, 2025 at 9:49 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
>
> On Sun, Oct 12, 2025 at 3:30 PM Lukas Wunner <lukas at wunner.de> wrote:
> >
> > When the PCI core gained power management support in 2002, it introduced
> > pci_save_state() and pci_restore_state() helpers to restore Config Space
> > after a D3hot or D3cold transition, which implies a Soft or Fundamental
> > Reset (PCIe r7.0 sec 5.8):
> >
> > https://git.kernel.org/tglx/history/c/a5287abe398b
> >
> > In 2006, EEH and AER were introduced to recover from errors by performing
> > a reset. Because errors can occur at any time, drivers began calling
> > pci_save_state() on probe to ensure recoverability.
> >
> > In 2009, recoverability was foiled by commit c82f63e411f1 ("PCI: check
> > saved state before restore"): It amended pci_restore_state() to bail out
> > if the "state_saved" flag has been cleared. The flag is cleared by
> > pci_restore_state() itself, hence a saved state is now allowed to be
> > restored only once and is then invalidated. That doesn't seem to make
> > sense because the saved state should be good enough to be reused.
> >
> > Soon after, drivers began to work around this behavior by calling
> > pci_save_state() immediately after pci_restore_state(), see e.g. commit
> > b94f2d775a71 ("igb: call pci_save_state after pci_restore_state").
> > Hilariously, two drivers even set the "saved_state" flag to true before
> > invoking pci_restore_state(), see ipr_reset_restore_cfg_space() and
> > e1000_io_slot_reset().
> >
> > Despite these workarounds, recoverability at all times is not guaranteed:
> > E.g. when a PCIe port goes through a runtime suspend and resume cycle,
> > the "saved_state" flag is cleared by:
> >
> > pci_pm_runtime_resume()
> > pci_pm_default_resume_early()
> > pci_restore_state()
> >
> > ... and hence on a subsequent AER event, the port's Config Space cannot be
> > restored. Riana reports a recovery failure of a GPU-integrated PCIe
> > switch and has root-caused it to the behavior of pci_restore_state().
> > Another workaround would be necessary, namely calling pci_save_state() in
> > pcie_port_device_runtime_resume().
> >
> > The motivation of commit c82f63e411f1 was to prevent restoring state if
> > pci_save_state() hasn't been called before. But that can be achieved by
> > saving state already on device addition, after Config Space has been
> > initialized. A desirable side effect is that devices become recoverable
> > even if no driver gets bound. This renders the commit unnecessary, so
> > revert it.
> >
> > Reported-by: Riana Tauro <riana.tauro at intel.com> # off-list
> > Tested-by: Riana Tauro <riana.tauro at intel.com>
> > Signed-off-by: Lukas Wunner <lukas at wunner.de>
> > ---
> > Proof that removing the check in pci_restore_state() makes no
> > difference for the PCI core:
> >
> > * The only relevant invocations of pci_restore_state() are in
> > drivers/pci/pci-driver.c
> > * One invocation is in pci_restore_standard_config(), which is
> > always called conditionally on "if (pci_dev->state_saved)".
> > So the check at the beginning of pci_restore_state() which
> > this patch removes is an unnecessary duplication.
> > * Another invocation is in pci_pm_default_resume_early(), which
> > is called from pci_pm_resume_noirq(), pci_pm_restore_noirq()
> > and pci_pm_runtime_resume(). Those functions are only called
> > after prior calls to pci_pm_suspend_noirq(), pci_pm_freeze_noirq(),
> > and pci_pm_runtime_suspend(), respectively. And all of them
> > call pci_save_state(). So the "if (!dev->state_saved)" check
> > in pci_restore_state() never evaluates to true.
> > * A third invocation is in pci_pm_thaw_noirq(). It is only called
> > after a prior call to pci_pm_freeze_noirq(), which invokes
> > pci_save_state(). So likewise the "if (!dev->state_saved)" check
> > in pci_restore_state() never evaluates to true.
> >
> > drivers/pci/bus.c | 7 +++++++
> > drivers/pci/pci.c | 3 ---
> > drivers/pci/probe.c | 2 --
> > 3 files changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index f26aec6..4318568 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -358,6 +358,13 @@ void pci_bus_add_device(struct pci_dev *dev)
> > pci_bridge_d3_update(dev);
> >
> > /*
> > + * Save config space for error recoverability. Clear state_saved
> > + * to detect whether drivers invoked pci_save_state() on suspend.
> > + */
> > + pci_save_state(dev);
> > + dev->state_saved = false;
> > +
> > + /*
> > * If the PCI device is associated with a pwrctrl device with a
> > * power supply, create a device link between the PCI device and
> > * pwrctrl device. This ensures that pwrctrl drivers are probed
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index b14dd06..2f0da5d 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1855,9 +1855,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
> > */
> > void pci_restore_state(struct pci_dev *dev)
> > {
> > - if (!dev->state_saved)
> > - return;
> > -
>
> So after this change, is there any mechanism to clear state_saved
> after a suspend-resume cycle so that the next cycle is not confused by
> seeing it set?
Never mind, this hasn't changed.
So I agree with Bjorn that it would be good to expand the new comment
in pci_bus_add_device() because it doesn't really explain much in its
current form. Or maybe even split it to say "Save config space for
error recoverability" before pci_save_state(dev) and then explain why
state_saved is cleared after it?
Apart from this and modulo possible changelog adjustments
Reviewed-by: Rafael J. Wysocki (Intel) <rafael at kernel.org>
More information about the Linuxppc-dev
mailing list