<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Hello.</p>

    There is an EEH/NVMe-issue on SUT ppc64le POWER8-based server Yadro

    VESNIN (with minimum 2 NVMe disks) with<br>

    Linux OS kernel 4.15.17/4.15.18/4.16.7 (OS - Ubuntu 16.04.5

    LTS(hwe)/18.04.1 LTS or Fedora 27).<br>

    <br>

    The issue is described here -<br>

    <br>

    <br>

    <blockquote>

      <blockquote><a class="moz-txt-link-freetext"

href="https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6">https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6</a><br>

      </blockquote>

    </blockquote>

    <br>

    I've already had some discussion about it with IBM and now we know

    about suggested workaround (above).<br>

    <br>

    But for know there are imho several important facts that remained

    "uncovered" in this discussion.<br>

    <br>

    The facts are that -<br>

    <br>

    1) We found out that main reason of the issue is near evolution

    process of vanilla linux kernel taken as a base <br>

    for Ubuntu releases, namely - the issue is not reproducible till

    Linux v.4.11 vanilla release (01.05.2017) and becomes <br>

    reproducible from the closest release - Linux v.4.12-rc1

    (13.05.2017). Further research showed that there is the commit <br>

    responsible for appearance of the issue - its name is "Merge tag

    'pci-v4.12-changes' of

    git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci", <br>

    see <a class="moz-txt-link-freetext"

href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d</a>.

    <br>

    More definitely - there are only two patches (of author Yongji Xie,

    in this commit) which enable the issue - they are: <br>

    <br>

    "PCI: Add pcibios_default_alignment() for arch-specific alignment

    control", <br>

    see <a class="moz-txt-link-freetext"

href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94</a>

    <br>

    <br>

    and <br>

    "powerpc/powernv: Override pcibios_default_alignment() to force PCI

    devices to be page aligned" <br>

    see <a class="moz-txt-link-freetext"

href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82</a>.

    <br>

    <br>

    If we remove the code strings of these patches from 4.15.18 kernel

    in Ubuntu 18.04.1 LTS (or in OS Fedora 27) - the issue disappears

    and<br>

    everything, regarding EEH/NVMe-issue, works fine.<br>

    <br>

    2) Finally, we found out the exact point in kernel source which

    begins the abnormal behaviour during boot process and starts the

    sequence<br>

    with EEH/NVMe-issue symptoms - stacktraces like show below<br>

    <br>

    <blockquote>

      <blockquote>[   16.922094] nvme 0000:07:00.0: Using 64-bit DMA

        iommu bypass <br>

        [   16.922164] EEH: Frozen PHB#21-PE#fd detected <br>

        [   16.922171] EEH: PE location: S002103, PHB location: N/A <br>

        [   16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted

        4.15.18.ppc64le #1 <br>

        [   16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass

        <br>

        [   16.922194] Workqueue: nvme-wq nvme_reset_work [nvme] <br>

        [   16.922198] Call Trace: <br>

        [   16.922208] [c00001ffb07cba10] [c000000000ba5b3c]

        dump_stack+0xb0/0xf4 (unreliable) <br>

        [   16.922219] [c00001ffb07cba50] [c00000000003f4e8]

        eeh_dev_check_failure+0x598/0x5b0 <br>

        [   16.922227] [c00001ffb07cbaf0] [c00000000003f58c]

        eeh_check_failure+0x8c/0xd0 <br>

        [   16.922236] [c00001ffb07cbb30] [d00000003c443970]

        nvme_reset_work+0x398/0x1890 [nvme] <br>

        [   16.922240] nvme 0022:04:00.0: enabling device (0140 ->

        0142) <br>

        [   16.922247] [c00001ffb07cbc80] [c000000000136e08]

        process_one_work+0x248/0x540 <br>

        [   16.922253] [c00001ffb07cbd20] [c000000000137198]

        worker_thread+0x98/0x5f0 <br>

        [   16.922260] [c00001ffb07cbdc0] [c00000000014100c]

        kthread+0x1ac/0x1c0 <br>

        [   16.922268] [c00001ffb07cbe30] [c00000000000bca0]

        ret_from_kernel_thread+0x5c/0xbc.<br>

      </blockquote>

    </blockquote>

    <br>

    and part of nonseen NVMe-disks in /dev/-tree. This point is in

    nvme_pci_enable() function in readl()-execution string -<br>

    <br>

    <a class="moz-txt-link-freetext"

href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088</a>.<br>

    <br>

    When 64K alignment is not set by the Yongji Xie patch, this point is

    handled successfully. But when the 64K alignment is set from this

    patch,<br>

    those readl() execution give us 0xffffffff inside the code execution

    of this function (inside eeh_readl()) - in_le32() returns 0xffffffff

    here -<br>

    <br>

    <a class="moz-txt-link-freetext"

href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379</a>.<br>

    <br>

    3) Btw, from linux kernel documentation we know that for pci error

    recovery technique -<br>

    <br>

    "STEP 0: Error Event<br>

    -------------------<br>

    A PCI bus error is detected by the PCI hardware.  On powerpc, the

    slot is isolated, in that all I/O is blocked: all reads return

    0xffffffff,<br>

    all writes are ignored."<br>

    <br>

    Also we know that 2)step-readl() gets virtual address of

    NVME_REG_CSTS register. I suppose that this readl() operation

    concerns to MMIO<br>

    read mechanism. And in the same kernel documentation there are some

    explanation -<br>

    <br>

    <pre>               CPU                  CPU                  Bus

             Virtual              Physical             Address

             Address              Address               Space

              Space                Space

            +-------+             +------+             +------+

            |       |             |MMIO  |   Offset    |      |

            |       |  Virtual    |Space |   applied   |      |

          C +-------+ --------> B +------+ ----------> +------+ A

            |       |  mapping    |      |   by host   |      |

  +-----+   |       |             |      |   bridge    |      |   +--------+

  |     |   |       |             +------+             |      |   |        |

  | CPU |   |       |             | RAM  |             |      |   | Device |

  |     |   |       |             |      |             |      |   |        |

  +-----+   +-------+             +------+             +------+   +--------+

            |       |  Virtual    |Buffer|   Mapping   |      |

          X +-------+ --------> Y +------+ <---------- +------+ Z

            |       |  mapping    | RAM  |   by IOMMU

            |       |             |      |

            |       |             |      |

            +-------+             +------+</pre>

    <br>

    "During the enumeration process, the kernel learns about I/O devices

    and<br>

    their MMIO space and the host bridges that connect them to the

    system.  For<br>

    example, if a PCI device has a BAR, the kernel reads the bus address

    (A)<br>

    from the BAR and converts it to a CPU physical address (B).  The

    address B<br>

    is stored in a struct resource and usually exposed via /proc/iomem. 

    When a<br>

    driver claims a device, it typically uses ioremap() to map physical

    address<br>

    B at a virtual address (C).  It can then use, e.g., ioread32(C), to

    access<br>

    the device registers at bus address A."<br>

    <br>

    So, we can conclude from this scheme and readl() execution string

    above that<br>

    readl() execution must step through C->B->A path from virtual

    address to physical and then<br>

    to bus address. I've added printk strings to some places in linux

    kernel, like nvme_dev_map()<br>

    and so, to check the physical memory allocations, virtual memory

    allocations, and ioremap()<br>

    results. Everything is ok in printk outputs even when EEH/NVMe-issue

    is reproduced.<br>

    <br>

    Hence, the hypothesis of EEH/NVMe-issue is in the B->A mapping

    which must be done,<br>

    regarding the scheme above, - by PHB3 on POWER8 chip. When

    alignment=0, this process<br>

    works fine, but with alignment=64K - this translation to bus

    addresses reproduce EEH/NVMe-issue,<br>

    when more than one NVMe-disk are enabled in parallel with pcie

    bridge. Based on another commit<br>

    but for phb4 - "phb4: Only clear some PHB config space registers on

    errors"<br>

    <br>

    <blockquote>

      <blockquote><a class="moz-txt-link-freetext"

href="https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9">https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9</a><br>

      </blockquote>

    </blockquote>

    <br>

    i suppose that the main reason may be in race in skiboot mechanism

    (or like firmware/hardware), which<br>

    frequently doesn't have enough time for B->A translation when

    alignment is ON and =64K -<br>

    i.e. in some entity which implement the pcie logic according to

    B->A translation with align=64K<br>

    (may be bound with coherent cache layer).<br>

    <br>

    <br>

    <br>

    Question. Could you comment this hypothesis ? Is it far from to be

    true from your point of view ?<br>

    May be this letter is enough to start the research in this direction

    to find the exact reason of<br>

    mentioned EEH/NVMe-issue ?<br>

    <br>

    (btw, i understand that another reason may be in the format of

    memory allocated for pcie resources<br>

    and its concrete field values. But according to things said above -

    i precisely tend to the hypothesis above).<br>

    <br>

    <br>

    <br>

    <br>

    ps: there are some other insteresting facts about experiments with

    this EEH/NVMe-issue - like for<br>

    example - this issue is not reproducible with CONFIG_PCIEASPM='y'

    (in kernel config), which is always set<br>

    to 'y' when CONFIG_PCIEPORTBUS='y' (D3hot, D3cold actions with some

    pauses in the appropriate<br>

    ASPM code may give necessary time for B->A translation).<br>

    <br>

    <br>

    <br>

    <br>

    Regards,<br>

    Dmitriy.

  </body>

</html>