<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <samp>Hi, Oliver.<br>

      <br>

      Your version of EEH/NVMe-issue looks close to be true.<br>

      <br>

      I've applied simple patch to ./drivers/nvme/host/pci.c file:<br>

      <br>

    </samp>

    <blockquote>

      <blockquote><samp>@@ -29,6 +29,7 @@</samp><br>

        <samp> #include <linux/types.h></samp><br>

        <samp> #include <linux/io-64-nonatomic-lo-hi.h></samp><br>

        <samp> #include <linux/sed-opal.h></samp><br>

        <samp>+#include <linux/delay.h></samp><br>

        <samp> </samp><br>

        <samp> #include "nvme.h"</samp><br>

        <samp> </samp><br>

        <samp>@@ -2073,11 +2074,14 @@</samp><br>

        <samp> </samp><br>

        <samp> static int nvme_pci_enable(struct nvme_dev *dev)</samp><br>

        <samp> {</samp><br>

        <samp>+    static int num=0;</samp><br>

        <samp>     int result = -ENOMEM;</samp><br>

        <samp>     struct pci_dev *pdev = to_pci_dev(dev->dev);</samp><br>

        <samp>+    num=num+1;</samp><br>

        <samp> </samp><br>

        <samp>     if (pci_enable_device_mem(pdev))</samp><br>

        <samp>         return result;</samp><br>

        <samp>+    msleep(num);</samp><br>

        <samp> </samp><br>

        <samp>     pci_set_master(pdev);</samp><samp></samp><br>

      </blockquote>

    </blockquote>

    <samp></samp><samp><br>

      and find out approximate minimal estimation of delay which removes

      the EEH/NVMe-issue.<br>

      This delay is about 1 millisecond (see in the code above). So,

      it's a stable phenomena<br>

      that when all nvme disks have minimum mutual interval of 1(one)

      millisecond (each) to<br>

      enable upstream bridge then EEH/NVMe-issue is not reproducible.

      Hence, I have the<br>

      hypothesis that due to bar alignment patch (see prev message) the

      bridge enabling procedure<br>

      lasts significantly more time than in case of alignment=0. So, in

      correspondence with your<br>

      hypothesis about several threads "pinging" PHB, it turns out that

      one of that threads<br>

      begins to write to PHB, while necessary time is not passed yet (~1

      ms). Then, of course,<br>

      HW error follows and readl() of NVME_REG_CSTS returns 0xffffffff

      in nvme_pci_enable().<br>

      (And when align=0 the PHB enable procedure duration is

      significantly less than 1ms so<br>

      other threads which don't enable PHB themselves - has no conflict,

      because by the time<br>

      they tries to write to PHB - PHB is already really enabled).<br>

      I don't see here contradictions in such version of events sequence

      leading<br>

      to EEH/NVMe-issue.<br>

      <br>

      So, there are two questions:<br>

      <br>

      Q.1. Does my hypothesis about delay=1ms and about difference of

      PHB enable procedure<br>

      duration in two cases (align=0 and align=64K) looks close to be

      true ?<br>

      <br>

      Q.2. If answer is 'yes' to Q.1 - are there some procedures in FW

      or HW implementation in<br>

      POWER8 PHB that may "extend" phb enable procedure time execution

      because of alignment change<br>

      to 64K ? May be in OPAL or on-chip ? Or the only "suspected area"

      for EEH/NVMe-issue -<br>

      is in linux kernel procedures ?<br>

      <br>

      <br>

      <br>

      <br>

      Regards,<br>

      Dmitriy.<br>

    </samp><br>

    <div class="moz-cite-prefix">On 05.11.2018 07:30, Oliver wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAOSf1CEAqOgekkz3iq53n2mcn0MqFAUp1rbTAHdyDBFgajFpBw@mail.gmail.com">

      <pre wrap="">On Tue, Oct 30, 2018 at 1:28 AM Koltsov Dmitriy <a class="moz-txt-link-rfc2396E" href="mailto:d.koltsov@yadro.com"><d.koltsov@yadro.com></a> wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">

Hello.

There is an EEH/NVMe-issue on SUT ppc64le POWER8-based server Yadro VESNIN (with minimum 2 NVMe disks) with

Linux OS kernel 4.15.17/4.15.18/4.16.7 (OS - Ubuntu 16.04.5 LTS(hwe)/18.04.1 LTS or Fedora 27).

The issue is described here -

<a class="moz-txt-link-freetext" href="https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6">https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6</a>

I've already had some discussion about it with IBM and now we know about suggested workaround (above).

</pre>

      </blockquote>

      <pre wrap="">

Given this is a linux issue the discussion should go to linuxppc-dev

(+cc) directly rather than the skiboot list.

</pre>

      <blockquote type="cite">

        <pre wrap="">But for know there are imho several important facts that remained "uncovered" in this discussion.

The facts are that -

1) We found out that main reason of the issue is near evolution process of vanilla linux kernel taken as a base

for Ubuntu releases, namely - the issue is not reproducible till Linux v.4.11 vanilla release (01.05.2017) and becomes

reproducible from the closest release - Linux v.4.12-rc1 (13.05.2017). Further research showed that there is the commit

responsible for appearance of the issue - its name is "Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci",

see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d</a>.

More definitely - there are only two patches (of author Yongji Xie, in this commit) which enable the issue - they are:

"PCI: Add pcibios_default_alignment() for arch-specific alignment control",

see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94</a>

and

"powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned"

see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82</a>.

If we remove the code strings of these patches from 4.15.18 kernel in Ubuntu 18.04.1 LTS (or in OS Fedora 27) - the issue disappears and

everything, regarding EEH/NVMe-issue, works fine.

2) Finally, we found out the exact point in kernel source which begins the abnormal behaviour during boot process and starts the sequence

with EEH/NVMe-issue symptoms - stacktraces like show below

[   16.922094] nvme 0000:07:00.0: Using 64-bit DMA iommu bypass

[   16.922164] EEH: Frozen PHB#21-PE#fd detected

[   16.922171] EEH: PE location: S002103, PHB location: N/A

[   16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted 4.15.18.ppc64le #1

[   16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass

[   16.922194] Workqueue: nvme-wq nvme_reset_work [nvme]

[   16.922198] Call Trace:

[   16.922208] [c00001ffb07cba10] [c000000000ba5b3c] dump_stack+0xb0/0xf4 (unreliable)

[   16.922219] [c00001ffb07cba50] [c00000000003f4e8] eeh_dev_check_failure+0x598/0x5b0

[   16.922227] [c00001ffb07cbaf0] [c00000000003f58c] eeh_check_failure+0x8c/0xd0

[   16.922236] [c00001ffb07cbb30] [d00000003c443970] nvme_reset_work+0x398/0x1890 [nvme]

[   16.922240] nvme 0022:04:00.0: enabling device (0140 -> 0142)

[   16.922247] [c00001ffb07cbc80] [c000000000136e08] process_one_work+0x248/0x540

[   16.922253] [c00001ffb07cbd20] [c000000000137198] worker_thread+0x98/0x5f0

[   16.922260] [c00001ffb07cbdc0] [c00000000014100c] kthread+0x1ac/0x1c0

[   16.922268] [c00001ffb07cbe30] [c00000000000bca0] ret_from_kernel_thread+0x5c/0xbc.

and part of nonseen NVMe-disks in /dev/-tree. This point is in nvme_pci_enable() function in readl()-execution string -

<a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088</a>.

When 64K alignment is not set by the Yongji Xie patch, this point is handled successfully. But when the 64K alignment is set from this patch,

those readl() execution give us 0xffffffff inside the code execution of this function (inside eeh_readl()) - in_le32() returns 0xffffffff here -

<a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379</a>.

3) Btw, from linux kernel documentation we know that for pci error recovery technique -

"STEP 0: Error Event

-------------------

A PCI bus error is detected by the PCI hardware.  On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff,

all writes are ignored."

Also we know that 2)step-readl() gets virtual address of NVME_REG_CSTS register. I suppose that this readl() operation concerns to MMIO

read mechanism. And in the same kernel documentation there are some explanation -

               CPU                  CPU                  Bus

             Virtual              Physical             Address

             Address              Address               Space

              Space                Space

            +-------+             +------+             +------+

            |       |             |MMIO  |   Offset    |      |

            |       |  Virtual    |Space |   applied   |      |

          C +-------+ --------> B +------+ ----------> +------+ A

            |       |  mapping    |      |   by host   |      |

  +-----+   |       |             |      |   bridge    |      |   +--------+

  |     |   |       |             +------+             |      |   |        |

  | CPU |   |       |             | RAM  |             |      |   | Device |

  |     |   |       |             |      |             |      |   |        |

  +-----+   +-------+             +------+             +------+   +--------+

            |       |  Virtual    |Buffer|   Mapping   |      |

          X +-------+ --------> Y +------+ <---------- +------+ Z

            |       |  mapping    | RAM  |   by IOMMU

            |       |             |      |

            |       |             |      |

            +-------+             +------+

"During the enumeration process, the kernel learns about I/O devices and

their MMIO space and the host bridges that connect them to the system.  For

example, if a PCI device has a BAR, the kernel reads the bus address (A)

from the BAR and converts it to a CPU physical address (B).  The address B

is stored in a struct resource and usually exposed via /proc/iomem.  When a

driver claims a device, it typically uses ioremap() to map physical address

B at a virtual address (C).  It can then use, e.g., ioread32(C), to access

the device registers at bus address A."

So, we can conclude from this scheme and readl() execution string above that

readl() execution must step through C->B->A path from virtual address to physical and then

to bus address. I've added printk strings to some places in linux kernel, like nvme_dev_map()

and so, to check the physical memory allocations, virtual memory allocations, and ioremap()

results. Everything is ok in printk outputs even when EEH/NVMe-issue is reproduced.

Hence, the hypothesis of EEH/NVMe-issue is in the B->A mapping which must be done,

regarding the scheme above, - by PHB3 on POWER8 chip. When alignment=0, this process

works fine, but with alignment=64K - this translation to bus addresses reproduce EEH/NVMe-issue,

when more than one NVMe-disk are enabled in parallel with pcie bridge. Based on another commit

but for phb4 - "phb4: Only clear some PHB config space registers on errors"

<a class="moz-txt-link-freetext" href="https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9">https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9</a>

i suppose that the main reason may be in race in skiboot mechanism (or like firmware/hardware), which

frequently doesn't have enough time for B->A translation when alignment is ON and =64K -

i.e. in some entity which implement the pcie logic according to B->A translation with align=64K

(may be bound with coherent cache layer).

Question. Could you comment this hypothesis ? Is it far from to be true from your point of view ?

May be this letter is enough to start the research in this direction to find the exact reason of

mentioned EEH/NVMe-issue ?

</pre>

      </blockquote>

      <pre wrap="">

Hi there,

So I spent a fair amount of time debugging that issue and I don't

really see anything here that suggests there is some kind of

underlying translation issue. To elaborate, the race being worked

around in db2173198b95 is due to this code inside

pci_enable_device_flags():

         if (atomic_inc_return(&dev->enable_cnt) > 1)

                 return 0;               /* already enabled */

If two threads are probing two different PCI devices with a common

upstream bridge they will both call pci_enable_device_flags() on their

common bridge (e.g. pcie switch upstream port). Since it uses an

atomic increment only one thread will attempt to enable the bridge and

the other thread will return from the function assuming the bridge is

already enabled. Keep in mind that a bridge will always forward PCI

configuration space accesses so the remainder of the PCI device enable

process will succeed and the 2nd thread well return to the driver

which assumes that MMIOs are now permitted since the device is

enabled. If the first thread has not updated the PCI_COMMAND register

of the upstream bridge to allow forwarding of memory cycles then any

MMIOs done by the 2nd thread will result in an Unsupported Request

error which invokes EEH.

To my eye, everything you have mentioned here is completely consistent

with the underlying cause being the bridge enable race. I don't know

why the BAR alignment patch would expose that race, but I don't see

any reason to believe there is some other address translation issue.

</pre>

      <blockquote type="cite">

        <pre wrap="">(btw, i understand that another reason may be in the format of memory allocated for pcie resources

and its concrete field values. But according to things said above - i precisely tend to the hypothesis above).

ps: there are some other insteresting facts about experiments with this EEH/NVMe-issue - like for

example - this issue is not reproducible with CONFIG_PCIEASPM='y' (in kernel config), which is always set

to 'y' when CONFIG_PCIEPORTBUS='y' (D3hot, D3cold actions with some pauses in the appropriate

ASPM code may give necessary time for B->A translation).

</pre>

      </blockquote>

      <pre wrap="">

As brief look at pci_device_enable() and friends suggests that

additional work happens in the enable step when ASPM is enabled which

helps mask the bridge enable race.

</pre>

      <blockquote type="cite">

        <pre wrap="">

Regards,

Dmitriy.

_______________________________________________

Skiboot mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Skiboot@lists.ozlabs.org">Skiboot@lists.ozlabs.org</a>

<a class="moz-txt-link-freetext" href="https://lists.ozlabs.org/listinfo/skiboot">https://lists.ozlabs.org/listinfo/skiboot</a>

</pre>

      </blockquote>

    </blockquote>

    <br>

  </body>

</html>