[PATCH 1/2] PCI: Ensure error recoverability at all times

Rafael J. Wysocki rafael at kernel.org
Fri Nov 14 08:03:42 AEDT 2025


On Thu, Nov 13, 2025 at 9:49 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
>
> On Sun, Oct 12, 2025 at 3:30 PM Lukas Wunner <lukas at wunner.de> wrote:
> >
> > When the PCI core gained power management support in 2002, it introduced
> > pci_save_state() and pci_restore_state() helpers to restore Config Space
> > after a D3hot or D3cold transition, which implies a Soft or Fundamental
> > Reset (PCIe r7.0 sec 5.8):
> >
> >   https://git.kernel.org/tglx/history/c/a5287abe398b
> >
> > In 2006, EEH and AER were introduced to recover from errors by performing
> > a reset.  Because errors can occur at any time, drivers began calling
> > pci_save_state() on probe to ensure recoverability.
> >
> > In 2009, recoverability was foiled by commit c82f63e411f1 ("PCI: check
> > saved state before restore"):  It amended pci_restore_state() to bail out
> > if the "state_saved" flag has been cleared.  The flag is cleared by
> > pci_restore_state() itself, hence a saved state is now allowed to be
> > restored only once and is then invalidated.  That doesn't seem to make
> > sense because the saved state should be good enough to be reused.
> >
> > Soon after, drivers began to work around this behavior by calling
> > pci_save_state() immediately after pci_restore_state(), see e.g. commit
> > b94f2d775a71 ("igb: call pci_save_state after pci_restore_state").
> > Hilariously, two drivers even set the "saved_state" flag to true before
> > invoking pci_restore_state(), see ipr_reset_restore_cfg_space() and
> > e1000_io_slot_reset().
> >
> > Despite these workarounds, recoverability at all times is not guaranteed:
> > E.g. when a PCIe port goes through a runtime suspend and resume cycle,
> > the "saved_state" flag is cleared by:
> >
> >   pci_pm_runtime_resume()
> >     pci_pm_default_resume_early()
> >       pci_restore_state()
> >
> > ... and hence on a subsequent AER event, the port's Config Space cannot be
> > restored.  Riana reports a recovery failure of a GPU-integrated PCIe
> > switch and has root-caused it to the behavior of pci_restore_state().
> > Another workaround would be necessary, namely calling pci_save_state() in
> > pcie_port_device_runtime_resume().
> >
> > The motivation of commit c82f63e411f1 was to prevent restoring state if
> > pci_save_state() hasn't been called before.  But that can be achieved by
> > saving state already on device addition, after Config Space has been
> > initialized.  A desirable side effect is that devices become recoverable
> > even if no driver gets bound.  This renders the commit unnecessary, so
> > revert it.
> >
> > Reported-by: Riana Tauro <riana.tauro at intel.com> # off-list
> > Tested-by: Riana Tauro <riana.tauro at intel.com>
> > Signed-off-by: Lukas Wunner <lukas at wunner.de>
> > ---
> > Proof that removing the check in pci_restore_state() makes no
> > difference for the PCI core:
> >
> > * The only relevant invocations of pci_restore_state() are in
> >   drivers/pci/pci-driver.c
> > * One invocation is in pci_restore_standard_config(), which is
> >   always called conditionally on "if (pci_dev->state_saved)".
> >   So the check at the beginning of pci_restore_state() which
> >   this patch removes is an unnecessary duplication.
> > * Another invocation is in pci_pm_default_resume_early(), which
> >   is called from pci_pm_resume_noirq(), pci_pm_restore_noirq()
> >   and pci_pm_runtime_resume().  Those functions are only called
> >   after prior calls to pci_pm_suspend_noirq(), pci_pm_freeze_noirq(),
> >   and pci_pm_runtime_suspend(), respectively.  And all of them
> >   call pci_save_state().  So the "if (!dev->state_saved)" check
> >   in pci_restore_state() never evaluates to true.
> > * A third invocation is in pci_pm_thaw_noirq().  It is only called
> >   after a prior call to pci_pm_freeze_noirq(), which invokes
> >   pci_save_state().  So likewise the "if (!dev->state_saved)" check
> >   in pci_restore_state() never evaluates to true.
> >
> >  drivers/pci/bus.c   | 7 +++++++
> >  drivers/pci/pci.c   | 3 ---
> >  drivers/pci/probe.c | 2 --
> >  3 files changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index f26aec6..4318568 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -358,6 +358,13 @@ void pci_bus_add_device(struct pci_dev *dev)
> >         pci_bridge_d3_update(dev);
> >
> >         /*
> > +        * Save config space for error recoverability.  Clear state_saved
> > +        * to detect whether drivers invoked pci_save_state() on suspend.
> > +        */
> > +       pci_save_state(dev);
> > +       dev->state_saved = false;
> > +
> > +       /*
> >          * If the PCI device is associated with a pwrctrl device with a
> >          * power supply, create a device link between the PCI device and
> >          * pwrctrl device.  This ensures that pwrctrl drivers are probed
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index b14dd06..2f0da5d 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1855,9 +1855,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
> >   */
> >  void pci_restore_state(struct pci_dev *dev)
> >  {
> > -       if (!dev->state_saved)
> > -               return;
> > -
>
> So after this change, is there any mechanism to clear state_saved
> after a suspend-resume cycle so that the next cycle is not confused by
> seeing it set?

Never mind, this hasn't changed.

So I agree with Bjorn that it would be good to expand the new comment
in pci_bus_add_device() because it doesn't really explain much in its
current form.  Or maybe even split it to say "Save config space for
error recoverability" before pci_save_state(dev) and then explain why
state_saved is cleared after it?

Apart from this and modulo possible changelog adjustments

Reviewed-by: Rafael J. Wysocki (Intel) <rafael at kernel.org>


More information about the Linuxppc-dev mailing list