[Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.

Fri Nov 9 14:46:55 AEDT 2018

On Fri, Nov 9, 2018 at 2:30 AM Koltsov Dmitriy <d.koltsov at yadro.com> wrote:
>
> Hi, Oliver.
>
> Your explanation of the EEH/NVMe issue looks close to the truth.
>
> I've applied a simple patch to ./drivers/nvme/host/pci.c:
>
> @@ -29,6 +29,7 @@
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
> +#include <linux/delay.h>
>
>  #include "nvme.h"
>
> @@ -2073,11 +2074,14 @@
>
>  static int nvme_pci_enable(struct nvme_dev *dev)
>  {
> +    static int num=0;
>      int result = -ENOMEM;
>      struct pci_dev *pdev = to_pci_dev(dev->dev);
> +    num=num+1;
>
>      if (pci_enable_device_mem(pdev))
>          return result;
> +    msleep(num);
>
>      pci_set_master(pdev);
>
>
> and found an approximate lower bound on the delay that removes the EEH/NVMe issue.
> This delay is about 1 millisecond (see the code above). So it is a stable phenomenon:
> when each NVMe disk waits at least 1 (one) millisecond relative to the others before
> enabling the upstream bridge, the EEH/NVMe issue is not reproducible. Hence my
> hypothesis is that, due to the BAR alignment patch (see previous message), enabling the
> bridge takes significantly longer than in the alignment=0 case.

> So, in line with your
> hypothesis about several threads "pinging" the PHB, it turns out that one of those threads
> begins to write to the PHB before the necessary time (~1 ms) has passed.

No, only the first thread ever touches the hardware (PCI config space,
really). The first thread will set an in-memory flag which is seen by
the other threads and taken to mean that the bridge is already
enabled.
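
To make that ordering concrete, here is a minimal sketch of the window as a
deterministic model rather than real kernel code. The struct, the function
names and the 0x1 "sane" register value are all hypothetical; the only things
taken from this thread are the in-memory flag, the config space write, and
the 0xffffffff read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one upstream bridge shared by several NVMe devices. */
struct model_bridge {
    bool flag_enabled; /* in-memory "already enabled" marker */
    bool hw_enabled;   /* true once the config space write has landed */
};

/* First thread, step 1: publish the in-memory flag. */
void first_thread_set_flag(struct model_bridge *b)
{
    b->flag_enabled = true;
}

/* First thread, step 2: the config space write that really enables the bridge. */
void first_thread_write_config(struct model_bridge *b)
{
    b->hw_enabled = true;
}

/*
 * Any other thread: it trusts the flag and never touches the bridge itself.
 * Returns the modelled register value: 0xffffffff if the bridge is not yet
 * forwarding MMIO, 0x1 (an arbitrary sane value) otherwise. Returns 0 if the
 * flag is not set, a case this model does not cover.
 */
uint32_t other_thread_read_csts(const struct model_bridge *b)
{
    if (!b->flag_enabled)
        return 0;
    return b->hw_enabled ? 0x1u : 0xffffffffu;
}
```

Calling first_thread_set_flag(), then other_thread_read_csts(), then
first_thread_write_config() reproduces the bad interleaving in the model: the
read comes back as all ones. Swapping the last two calls gives the sane value.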

> A HW error follows, and readl() of NVME_REG_CSTS returns 0xffffffff in nvme_pci_enable().
> (With align=0 the PHB enable procedure takes significantly less than 1 ms, so the
> other threads, which don't enable the PHB themselves, see no conflict: by the time
> they try to write to the PHB, it is already really enabled.)
> I don't see any contradictions in this sequence of events leading
> to the EEH/NVMe issue.
>
> So, there are two questions:
>
> Q.1. Does my hypothesis about the 1 ms delay, and about the difference in the duration of
> the PHB enable procedure between the two cases (align=0 and align=64K), look close to the truth?

*shrug*

Probably? Even if there is a difference in timing I don't see why that
is a problem. You might be taking (or not taking) a page fault due to
the alignment patch, or maybe there's some extra state inside of linux
that needed to be updated for some reason. It could be anything.
Fundamentally, the problem here is that Linux has a race condition.
Debugging the exact set of circumstances required to trigger the race
is sometimes worth it, but I don't think this is one of those cases.

If you honestly believe that there is a problem here then can you
provide us with some actual data rather than speculating about HW
bugs? The Linux ftrace (function_graph is useful) infrastructure
should allow you to do accurate measurements of the whole process and
get a breakdown of what the slow parts are. Once you have a good
overview of what code paths are being exercised in each case then you
can do more detailed measurements of specific functions.
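
As a rough starting point, something like the following could set up the
function_graph tracer from C. It is only a sketch: tracefs_write() and
trace_function_graph() are hypothetical helpers, and it assumes tracefs is
mounted at the root you pass in (usually /sys/kernel/tracing, or
/sys/kernel/debug/tracing on older setups) and that the function you name is
listed in available_filter_functions.

```c
#include <stdio.h>

/* Write one string to a tracefs control file; returns 0 on success. */
int tracefs_write(const char *root, const char *file, const char *value)
{
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", root, file);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Select function_graph, restrict it to one function's call graph, start tracing. */
int trace_function_graph(const char *root, const char *func)
{
    if (tracefs_write(root, "tracing_on", "0"))
        return -1;
    if (tracefs_write(root, "current_tracer", "function_graph"))
        return -1;
    if (tracefs_write(root, "set_graph_function", func))
        return -1;
    return tracefs_write(root, "tracing_on", "1");
}
```

Run as root with, for example,
trace_function_graph("/sys/kernel/tracing", "pci_enable_device_mem"),
reproduce the probe, then read the "trace" file in the same directory. The
per-function durations in that output are what would show where the time goes
in the align=0 and align=64K cases.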

> Q.2. If the answer to Q.1 is 'yes': are there procedures in the FW or HW implementation of
> the POWER8 PHB that may extend the execution time of the PHB enable procedure because of the
> alignment change to 64K? Maybe in OPAL or on-chip? Or is the only "suspect area" for the
> EEH/NVMe issue in the Linux kernel procedures?

I don't see any reason to suspect OPAL or the chip here.

Oliver

>
>
>
> Regards,
> Dmitriy.
>