[PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event()

Oliver O'Halloran oohall at gmail.com
Tue Sep 17 14:20:11 AEST 2019


On Tue, Sep 17, 2019 at 11:01 AM Sam Bobroff <sbobroff at linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:15:55PM +1000, Oliver O'Halloran wrote:
> > When a device is surprise removed while undergoing IO we will probably
> > get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
> > is detected we send a recovery event to the EEH worker thread which will
> > notify drivers, and perform recovery as needed.
> >
> > In the event of a hot-remove we don't want recovery to occur since there
> > isn't a device to recover. The recovery process is fairly long due to
> > the number of wait states (required by PCIe) which causes problems when
> > devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).
> >
> > To determine if we need to skip the recovery process we can use the
> > get_adapter_status() operation of the hotplug_slot to determine if the
> > slot contains a device or not, and if the slot is empty we can skip
> > recovery entirely.
> >
> > One thing to note is that the slot being EEH frozen does not prevent the
> > hotplug driver from working. We don't have the EEH recovery thread
> > remove any of the devices since it's assumed that the hotplug driver
> > will handle tearing down the slot state.
> >
> > Signed-off-by: Oliver O'Halloran <oohall at gmail.com>
>
> Looks good, but some comments, below.
>
> > ---
> >  arch/powerpc/kernel/eeh_driver.c | 60 ++++++++++++++++++++++++++++++++
> >  1 file changed, 60 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 18a69fac4d80..52ce7584af43 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/irq.h>
> >  #include <linux/module.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci_hotplug.h>
> >  #include <asm/eeh.h>
> >  #include <asm/eeh_event.h>
> >  #include <asm/ppc-pci.h>
> > @@ -769,6 +770,46 @@ static void eeh_pe_cleanup(struct eeh_pe *pe)
> >       }
> >  }
> >
> > +/**
> > + * eeh_check_slot_presence - Check if a device is still present in a slot
> > + * @pdev: pci_dev to check
> > + *
> > + * This function may return a false positive if we can't determine the slot's
> > + * presence state. This might happen for PCIe slots if the PE containing
> > + * the upstream bridge is also frozen, or the bridge is part of the same PE
> > + * as the device.
> > + *
> > + * This shouldn't happen often, but you might see it if you hotplug a PCIe
> > + * switch.
> > + */
>
> I don't think the function name is very good; it does check the slot but
> it doesn't tell you what a true result means -- but I don't see an
> obviously great alternative either. If it can return false positives, it's
> really testing for empty so maybe 'eeh_slot_definitely_empty()' or
> 'eeh_slot_maybe_populated()'?

I don't see a better name either. I thought the meaning was fairly
clear when looked at in the context of the caller though.
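
For anyone reading along, the semantics under discussion can be sketched
in plain userspace C. This is only an illustration, not the patch: the
tri-state enum, struct mock_dev and every function name below are
hypothetical, and the real check queries the hotplug slot rather than a
stored field. The point is that an unknown state counts as "present", so
recovery is only skipped when every slot is known to be empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tri-state result of querying a slot; not a kernel type. */
enum slot_state { SLOT_EMPTY, SLOT_POPULATED, SLOT_UNKNOWN };

/* Hypothetical stand-in for a device behind a hotplug slot. */
struct mock_dev {
	enum slot_state state;
};

/*
 * Returns false only when the slot is known to be empty. An unknown
 * state is reported as present, i.e. a possible false positive.
 */
static bool slot_maybe_populated(const struct mock_dev *dev)
{
	return dev->state != SLOT_EMPTY;
}

/* Count the devices that might still be present. */
static size_t count_maybe_present(const struct mock_dev *devs, size_t n)
{
	size_t i, count = 0;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (slot_maybe_populated(&devs[i]))
			count++;

	return count;
}

/* Recovery is only worth attempting if something might still be there. */
static bool should_recover(const struct mock_dev *devs, size_t n)
{
	return count_maybe_present(devs, n) > 0;
}
```

With two known-empty slots should_recover() returns false, while a single
slot in the unknown state is enough to let recovery proceed.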

> > +     /*
> > +      * When devices are hot-removed we might get an EEH due to
> > +      * a driver attempting to touch the MMIO space of a removed
> > +      * device. In this case we don't have a device to recover
> > +      * so suppress the event if we can't find any present devices.
> > +      *
> > +      * The hotplug driver should take care of tearing down the
> > +      * device itself.
> > +      */
> > +     eeh_for_each_pe(pe, tmp_pe)
> > +             eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> > +                     if (eeh_slot_presence_check(edev->pdev))
> > +                             devices++;
>
> In other parts of the EEH code we do a get_device() on edev->pdev before
> passing it around, it might be good to do that here too.

I don't see any calls to get_device() (or pci_dev_get()) in
arch/powerpc/kernel/.

I agree that we probably should be taking a ref to the pci_dev, but
IIRC you were working on a series to do just that, so I figured I
should keep things in line with what's there currently.
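
As a rough picture of what taking a ref around the check would look like,
here is a userspace sketch. It is not kernel code: struct mock_pdev,
mock_dev_get() and mock_dev_put() are hypothetical stand-ins for a
pci_dev and pci_dev_get()/pci_dev_put(), and treating a missing device as
"not present" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted pci_dev. */
struct mock_pdev {
	int refcount;
	bool present;
};

/* Take a reference; passes NULL through unchanged. */
static struct mock_pdev *mock_dev_get(struct mock_pdev *pdev)
{
	if (pdev)
		pdev->refcount++;
	return pdev;
}

/* Drop a reference taken with mock_dev_get(). */
static void mock_dev_put(struct mock_pdev *pdev)
{
	if (!pdev)
		return;
	assert(pdev->refcount > 0);
	pdev->refcount--;
}

/*
 * Hold a reference for the duration of the presence check so the
 * device cannot go away underneath us, then drop it again.
 * Assumption: no device at all is treated as "not present".
 */
static bool checked_presence(struct mock_pdev *pdev)
{
	struct mock_pdev *ref = mock_dev_get(pdev);
	bool present;

	if (!ref)
		return false;

	present = ref->present;
	mock_dev_put(ref);

	return present;
}
```

The reference count is the same before and after the call, so the caller's
own accounting is left untouched.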

> > +     if (!devices)
> > +             goto out; /* nothing to recover */
>
> Does this handle an empty, but frozen, PHB correctly? (Can that happen?)

Probably not, but I don't think we handle that case well (at all?)
currently. In order to start the recovery process we need something to
flag that an error has occurred on the PHB, and without a device being
present I don't see where that would come from. It might work on P8,
where we have PHB error interrupts, but I don't think those can fire
if the PHB is fenced.

Oliver

