[PATCH 1/5] PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors
Sathyanarayanan Kuppuswamy
sathyanarayanan.kuppuswamy at linux.intel.com
Mon Aug 18 02:10:49 AEST 2025
On 8/17/25 6:17 AM, Lukas Wunner wrote:
> On Thu, Aug 14, 2025 at 12:29:25PM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 8/14/25 2:36 AM, Lukas Wunner wrote:
>>> On Thu, Aug 14, 2025 at 09:56:09AM +0200, Niklas Schnelle wrote:
>>>> On Wed, 2025-08-13 at 07:11 +0200, Lukas Wunner wrote:
>>>>> @@ -233,6 +228,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>> pci_walk_bridge(bridge, report_mmio_enabled, &status);
>>>>> }
>>>>> + if (status == PCI_ERS_RESULT_NEED_RESET ||
>>>>> + state == pci_channel_io_frozen) {
>>>>> + if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
>>>>> + pci_warn(bridge, "subordinate device reset failed\n");
>>>>> + goto failed;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> if (status == PCI_ERS_RESULT_NEED_RESET) {
>>>>> /*
>>>>> * TODO: Should call platform-specific
>>>> I wonder if it might make sense to merge the reset into the above
>>>> existing if.
>>> There are drivers such as drivers/bus/mhi/host/pci_generic.c which
>>> return PCI_ERS_RESULT_RECOVERED from ->error_detected(). So they
>>> fall through directly to the ->resume() stage. They're doing this
>>> even in the pci_channel_io_frozen case (i.e. for Fatal Errors).
>>>
>>> But for DPC we must call reset_subordinates() to bring the link back up.
>>> And for Fatal Errors, Documentation/PCI/pcieaer-howto.rst suggests that
>>> we must likewise call it because the link may be unreliable.
>> For fatal errors, since we already ignore the value returned by
>> ->error_detected() (by calling reset_subordinates() unconditionally), why
>> not update status accordingly in report_frozen_detected() and notify the
>> driver about the reset?
>>
>> That way, the reset logic could be unified under a single if
>> (status == PCI_ERS_RESULT_NEED_RESET) condition.
>>
>> Checking the drivers/bus/mhi/host/pci_generic.c implementation, it looks
>> like calling slot_reset callback looks harmless.
> Unfortunately it's not harmless:
>
> mhi_pci_slot_reset() calls pci_enable_device(). But a corresponding
> call to pci_disable_device() is only performed before in
> mhi_pci_error_detected() if that function returns
> PCI_ERS_RESULT_NEED_RESET.
>
> So there would be an enable_cnt imbalance if I'd change the logic to
> overwrite the driver's vote with PCI_ERS_RESULT_NEED_RESET in the
> pci_channel_io_frozen case and call its ->slot_reset() callback.
>
> The approach taken by this patch is to minimize risk, avoid any changes
> to drivers, make do with minimal changes to pcie_do_recovery() and
> limit the behavioral change.
>
> I think overriding status = PCI_ERS_RESULT_NEED_RESET and calling drivers'
> ->slot_reset() would have to be done in a separate patch on top and would
> require going through all drivers again to see which ones need to be
> amended.
>
> Also, note that report_frozen_detected() is too early to set
> "status = PCI_ERS_RESULT_NEED_RESET". That needs to happen after the
> ->mmio_enabled() step, so that drivers get a chance to examine the
> device even in the pci_channel_io_frozen case before a reset is
> performed. (The ->mmio_enabled() step is only performed if "status" is
> PCI_ERS_RESULT_CAN_RECOVER.)
>
> So then the code would look like this:
>
> if (state == pci_channel_io_frozen)
> status = PCI_ERS_RESULT_NEED_RESET;
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
> pci_warn(bridge, "subordinate device reset failed\n");
> goto failed;
> }
>
> status = PCI_ERS_RESULT_RECOVERED;
> pci_dbg(bridge, "broadcast slot_reset message\n");
> pci_walk_bridge(bridge, report_slot_reset, &status);
> }
>
> ... which isn't very different from the present patch:
>
> if (status == PCI_ERS_RESULT_NEED_RESET ||
> state == pci_channel_io_frozen) {
> if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
> pci_warn(bridge, "subordinate device reset failed\n");
> goto failed;
> }
> }
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> status = PCI_ERS_RESULT_RECOVERED;
> pci_dbg(bridge, "broadcast slot_reset message\n");
> pci_walk_bridge(bridge, report_slot_reset, &status);
> }
>
> ... except that this patch avoids touching any drivers.
Makes sense. Thanks for the clarification.
>
> Thanks,
>
> Lukas
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
More information about the Linuxppc-dev
mailing list