[PATCH 1/2] PCI: Ensure error recoverability at all times

Rafael J. Wysocki rafael at kernel.org
Fri Nov 14 07:49:06 AEDT 2025


On Sun, Oct 12, 2025 at 3:30 PM Lukas Wunner <lukas at wunner.de> wrote:
>
> When the PCI core gained power management support in 2002, it introduced
> pci_save_state() and pci_restore_state() helpers to restore Config Space
> after a D3hot or D3cold transition, which implies a Soft or Fundamental
> Reset (PCIe r7.0 sec 5.8):
>
>   https://git.kernel.org/tglx/history/c/a5287abe398b
>
> In 2006, EEH and AER were introduced to recover from errors by performing
> a reset.  Because errors can occur at any time, drivers began calling
> pci_save_state() on probe to ensure recoverability.
>
> In 2009, recoverability was foiled by commit c82f63e411f1 ("PCI: check
> saved state before restore"):  It amended pci_restore_state() to bail out
> if the "state_saved" flag has been cleared.  The flag is cleared by
> pci_restore_state() itself, hence a saved state is now allowed to be
> restored only once and is then invalidated.  That doesn't seem to make
> sense because the saved state should be good enough to be reused.
>
> Soon after, drivers began to work around this behavior by calling
> pci_save_state() immediately after pci_restore_state(), see e.g. commit
> b94f2d775a71 ("igb: call pci_save_state after pci_restore_state").
> Hilariously, two drivers even set the "saved_state" flag to true before
> invoking pci_restore_state(), see ipr_reset_restore_cfg_space() and
> e1000_io_slot_reset().
>
> Despite these workarounds, recoverability at all times is not guaranteed:
> E.g. when a PCIe port goes through a runtime suspend and resume cycle,
> the "saved_state" flag is cleared by:
>
>   pci_pm_runtime_resume()
>     pci_pm_default_resume_early()
>       pci_restore_state()
>
> ... and hence on a subsequent AER event, the port's Config Space cannot be
> restored.  Riana reports a recovery failure of a GPU-integrated PCIe
> switch and has root-caused it to the behavior of pci_restore_state().
> Another workaround would be necessary, namely calling pci_save_state() in
> pcie_port_device_runtime_resume().
>
> The motivation of commit c82f63e411f1 was to prevent restoring state if
> pci_save_state() hasn't been called before.  But that can be achieved by
> saving state already on device addition, after Config Space has been
> initialized.  A desirable side effect is that devices become recoverable
> even if no driver gets bound.  This renders the commit unnecessary, so
> revert it.
>
> Reported-by: Riana Tauro <riana.tauro at intel.com> # off-list
> Tested-by: Riana Tauro <riana.tauro at intel.com>
> Signed-off-by: Lukas Wunner <lukas at wunner.de>
> ---
> Proof that removing the check in pci_restore_state() makes no
> difference for the PCI core:
>
> * The only relevant invocations of pci_restore_state() are in
>   drivers/pci/pci-driver.c
> * One invocation is in pci_restore_standard_config(), which is
>   always called conditionally on "if (pci_dev->state_saved)".
>   So the check at the beginning of pci_restore_state() which
>   this patch removes is an unnecessary duplication.
> * Another invocation is in pci_pm_default_resume_early(), which
>   is called from pci_pm_resume_noirq(), pci_pm_restore_noirq()
>   and pci_pm_runtime_resume().  Those functions are only called
>   after prior calls to pci_pm_suspend_noirq(), pci_pm_freeze_noirq(),
>   and pci_pm_runtime_suspend(), respectively.  And all of them
>   call pci_save_state().  So the "if (!dev->state_saved)" check
>   in pci_restore_state() never evaluates to true.
> * A third invocation is in pci_pm_thaw_noirq().  It is only called
>   after a prior call to pci_pm_freeze_noirq(), which invokes
>   pci_save_state().  So likewise the "if (!dev->state_saved)" check
>   in pci_restore_state() never evaluates to true.
>
>  drivers/pci/bus.c   | 7 +++++++
>  drivers/pci/pci.c   | 3 ---
>  drivers/pci/probe.c | 2 --
>  3 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index f26aec6..4318568 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -358,6 +358,13 @@ void pci_bus_add_device(struct pci_dev *dev)
>         pci_bridge_d3_update(dev);
>
>         /*
> +        * Save config space for error recoverability.  Clear state_saved
> +        * to detect whether drivers invoked pci_save_state() on suspend.
> +        */
> +       pci_save_state(dev);
> +       dev->state_saved = false;
> +
> +       /*
>          * If the PCI device is associated with a pwrctrl device with a
>          * power supply, create a device link between the PCI device and
>          * pwrctrl device.  This ensures that pwrctrl drivers are probed
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b14dd06..2f0da5d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1855,9 +1855,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
>   */
>  void pci_restore_state(struct pci_dev *dev)
>  {
> -       if (!dev->state_saved)
> -               return;
> -

So after this change, is there any mechanism to clear state_saved
after a suspend-resume cycle so that the next cycle is not confused by
seeing it set?

>         pci_restore_pcie_state(dev);
>         pci_restore_pasid_state(dev);
>         pci_restore_pri_state(dev);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index c83e75a..c7c7a3d 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2747,8 +2747,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>
>         pci_reassigndev_resource_alignment(dev);
>
> -       dev->state_saved = false;
> -
>         pci_init_capabilities(dev);
>
>         /*
> --


More information about the Linuxppc-dev mailing list