[PATCH 1/2] PCI: Ensure error recoverability at all times
Rafael J. Wysocki
rafael at kernel.org
Fri Nov 14 07:49:06 AEDT 2025
On Sun, Oct 12, 2025 at 3:30 PM Lukas Wunner <lukas at wunner.de> wrote:
>
> When the PCI core gained power management support in 2002, it introduced
> pci_save_state() and pci_restore_state() helpers to restore Config Space
> after a D3hot or D3cold transition, which implies a Soft or Fundamental
> Reset (PCIe r7.0 sec 5.8):
>
> https://git.kernel.org/tglx/history/c/a5287abe398b
>
> In 2006, EEH and AER were introduced to recover from errors by performing
> a reset. Because errors can occur at any time, drivers began calling
> pci_save_state() on probe to ensure recoverability.
>
> In 2009, recoverability was foiled by commit c82f63e411f1 ("PCI: check
> saved state before restore"): It amended pci_restore_state() to bail out
> if the "state_saved" flag has been cleared. The flag is cleared by
> pci_restore_state() itself, hence a saved state is now allowed to be
> restored only once and is then invalidated. That doesn't seem to make
> sense because the saved state should be good enough to be reused.
>
> Soon after, drivers began to work around this behavior by calling
> pci_save_state() immediately after pci_restore_state(), see e.g. commit
> b94f2d775a71 ("igb: call pci_save_state after pci_restore_state").
> Hilariously, two drivers even set the "saved_state" flag to true before
> invoking pci_restore_state(), see ipr_reset_restore_cfg_space() and
> e1000_io_slot_reset().
>
> Despite these workarounds, recoverability at all times is not guaranteed:
> E.g. when a PCIe port goes through a runtime suspend and resume cycle,
> the "saved_state" flag is cleared by:
>
> pci_pm_runtime_resume()
> pci_pm_default_resume_early()
> pci_restore_state()
>
> ... and hence on a subsequent AER event, the port's Config Space cannot be
> restored. Riana reports a recovery failure of a GPU-integrated PCIe
> switch and has root-caused it to the behavior of pci_restore_state().
> Another workaround would be necessary, namely calling pci_save_state() in
> pcie_port_device_runtime_resume().
>
> The motivation of commit c82f63e411f1 was to prevent restoring state if
> pci_save_state() hasn't been called before. But that can be achieved by
> saving state already on device addition, after Config Space has been
> initialized. A desirable side effect is that devices become recoverable
> even if no driver gets bound. This renders the commit unnecessary, so
> revert it.
>
> Reported-by: Riana Tauro <riana.tauro at intel.com> # off-list
> Tested-by: Riana Tauro <riana.tauro at intel.com>
> Signed-off-by: Lukas Wunner <lukas at wunner.de>
> ---
> Proof that removing the check in pci_restore_state() makes no
> difference for the PCI core:
>
> * The only relevant invocations of pci_restore_state() are in
> drivers/pci/pci-driver.c
> * One invocation is in pci_restore_standard_config(), which is
> always called conditionally on "if (pci_dev->state_saved)".
> So the check at the beginning of pci_restore_state() which
> this patch removes is an unnecessary duplication.
> * Another invocation is in pci_pm_default_resume_early(), which
> is called from pci_pm_resume_noirq(), pci_pm_restore_noirq()
> and pci_pm_runtime_resume(). Those functions are only called
> after prior calls to pci_pm_suspend_noirq(), pci_pm_freeze_noirq(),
> and pci_pm_runtime_suspend(), respectively. And all of them
> call pci_save_state(). So the "if (!dev->state_saved)" check
> in pci_restore_state() never evaluates to true.
> * A third invocation is in pci_pm_thaw_noirq(). It is only called
> after a prior call to pci_pm_freeze_noirq(), which invokes
> pci_save_state(). So likewise the "if (!dev->state_saved)" check
> in pci_restore_state() never evaluates to true.
>
> drivers/pci/bus.c | 7 +++++++
> drivers/pci/pci.c | 3 ---
> drivers/pci/probe.c | 2 --
> 3 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index f26aec6..4318568 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -358,6 +358,13 @@ void pci_bus_add_device(struct pci_dev *dev)
> pci_bridge_d3_update(dev);
>
> /*
> + * Save config space for error recoverability. Clear state_saved
> + * to detect whether drivers invoked pci_save_state() on suspend.
> + */
> + pci_save_state(dev);
> + dev->state_saved = false;
> +
> + /*
> * If the PCI device is associated with a pwrctrl device with a
> * power supply, create a device link between the PCI device and
> * pwrctrl device. This ensures that pwrctrl drivers are probed
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b14dd06..2f0da5d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1855,9 +1855,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
> */
> void pci_restore_state(struct pci_dev *dev)
> {
> - if (!dev->state_saved)
> - return;
> -
So after this change, is there any mechanism to clear state_saved
after a suspend-resume cycle so that the next cycle is not confused by
seeing it set?
> pci_restore_pcie_state(dev);
> pci_restore_pasid_state(dev);
> pci_restore_pri_state(dev);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index c83e75a..c7c7a3d 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2747,8 +2747,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>
> pci_reassigndev_resource_alignment(dev);
>
> - dev->state_saved = false;
> -
> pci_init_capabilities(dev);
>
> /*
> --
More information about the Linuxppc-dev
mailing list