[PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check

Tue Sep 17 14:23:09 AEST 2019

On Tue, Sep 17, 2019 at 1:36 PM Oliver O'Halloran <oohall at gmail.com> wrote:
>
> On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff <sbobroff at linux.ibm.com> wrote:
> >
> > On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> > > Detecting a frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
> > > response. When performing EEH testing using the EEH error injection feature
> > > available on some platforms there is no simple way to kick off the kernel's
> > > recovery process since any accesses from userspace (usually /dev/mem) will
> > > bypass the MMIO helpers in the kernel which check if a 0xFF response is due
> > > to an EEH freeze or not.
> > >
> > > If a device contains a 0xFF byte in its config space it's possible to
> > > trigger the recovery process via config space read from userspace, but this
> > > is not a reliable method. If a driver is bound to the device and in use it
> > > will frequently trigger the MMIO check, but this is also inconsistent.
> > >
> > > To solve these problems this patch adds a debugfs file called
> > > "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
> > > eeh_dev_check_failure() on it. This is the same check that's done when the
> > > kernel gets a 0xFF result from a config or MMIO read with the added
> > > benefit that it can be reliably triggered from userspace.
> > >
> > > Signed-off-by: Oliver O'Halloran <oohall at gmail.com>
> >
> > Looks good, and I tested it with the next patch and it seems to work.
> >
> > But I think you should make it clear that this does not work with
> > the hardware "EEH error injection" facility accessible via debugfs in
> > err_injct (that doesn't seem clear to me from the commit message).
>
> It's not intended to be a separate mechanism in the long term. I'm
> planning on converting this interface to make use of the platform defined
> error injection mechanism once I can find out how to use the PAPR ones
> reliably. The idea is to use this as a generic "cause an EEH to happen
> on this device" interface for userspace which we can use in test
> scripts and the like.

Urgh, I'm tired and thought this was the eeh_debugfs_break patch.

This (the _check) debugfs interface does work with the HW error
injection facilities. After the HW injects an error the PE is frozen,
but the kernel doesn't notice until something runs
eeh_dev_check_failure(). This interface gives userspace a reliable way
to do that rather than relying on drivers doing MMIO, or somewhere in
config space containing a 0xFF.
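
For illustration, a userspace test helper could drive the check roughly
like this. It is only a sketch: the debugfs path, the hex formatting of
the <domain>:<bus>:<dev>.<fn> string, and both helper names are
assumptions made for the example, not something taken from the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed location; the patch only names the file "eeh_dev_check". */
#define EEH_DEV_CHECK_PATH "/sys/kernel/debug/powerpc/eeh_dev_check"

/*
 * Format a PCI address as <domain>:<bus>:<dev>.<fn>.
 * Hex fields are an assumption based on the usual PCI address notation.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int format_pci_addr(char *buf, size_t len, unsigned int domain,
                           unsigned int bus, unsigned int dev,
                           unsigned int fn)
{
    int n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, fn);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the address string to the check file so the kernel runs its
 * EEH check on that device. Returns 0 on success, -1 on any I/O error.
 */
static int eeh_request_check(const char *path, const char *addr)
{
    FILE *f = fopen(path, "w");
    int rc;

    if (!f)
        return -1;

    rc = (fputs(addr, f) == EOF) ? -1 : 0;

    /* The write may only be flushed (and fail) at close time. */
    if (fclose(f) != 0)
        rc = -1;

    return rc;
}
```

A test script would inject the error with the HW facility first, then
call format_pci_addr() followed by
eeh_request_check(EEH_DEV_CHECK_PATH, addr) as root with debugfs mounted.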

Oliver