[Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.

Fri Nov 9 02:30:13 AEDT 2018

Hi, Oliver.

Your version of EEH/NVMe-issue looks close to be true.

I've applied simple patch to ./drivers/nvme/host/pci.c file:

        @@ -29,6 +29,7 @@
          #include <linux/types.h>
          #include <linux/io-64-nonatomic-lo-hi.h>
          #include <linux/sed-opal.h>
        +#include <linux/delay.h>

          #include "nvme.h"

        @@ -2073,11 +2074,14 @@

          static int nvme_pci_enable(struct nvme_dev *dev)
          {
        +    static int num=0;
              int result = -ENOMEM;
              struct pci_dev *pdev = to_pci_dev(dev->dev);
        +    num=num+1;

              if (pci_enable_device_mem(pdev))
                  return result;
        +    msleep(num);

              pci_set_master(pdev);

and find out approximate minimal estimation of delay which removes the 
EEH/NVMe-issue.
This delay is about 1 millisecond (see in the code above). So, it's a 
stable phenomena
that when all nvme disks have minimum mutual interval of 1(one) 
millisecond (each) to
enable upstream bridge then EEH/NVMe-issue is not reproducible. Hence, I 
have the
hypothesis that due to bar alignment patch (see prev message) the bridge 
enabling procedure
lasts significantly more time than in case of alignment=0. So, in 
correspondence with your
hypothesis about several threads "pinging" PHB, it turns out that one of 
that threads
begins to write to PHB, while necessary time is not passed yet (~1 ms). 
Then, of course,
HW error follows and readl() of NVME_REG_CSTS returns 0xffffffff in 
nvme_pci_enable().
(And when align=0 the PHB enable procedure duration is significantly 
less than 1ms so
other threads which don't enable PHB themselves - has no conflict, 
because by the time
they tries to write to PHB - PHB is already really enabled).
I don't see here contradictions in such version of events sequence leading
to EEH/NVMe-issue.

So, there are two questions:

Q.1. Does my hypothesis about delay=1ms and about difference of PHB 
enable procedure
duration in two cases (align=0 and align=64K) looks close to be true ?

Q.2. If answer is 'yes' to Q.1 - are there some procedures in FW or HW 
implementation in
POWER8 PHB that may "extend" phb enable procedure time execution because 
of alignment change
to 64K ? May be in OPAL or on-chip ? Or the only "suspected area" for 
EEH/NVMe-issue -
is in linux kernel procedures ?

Regards,
Dmitriy.

On 05.11.2018 07:30, Oliver wrote:
> On Tue, Oct 30, 2018 at 1:28 AM Koltsov Dmitriy <d.koltsov at yadro.com> wrote:
>> Hello.
>>
>> There is an EEH/NVMe-issue on SUT ppc64le POWER8-based server Yadro VESNIN (with minimum 2 NVMe disks) with
>> Linux OS kernel 4.15.17/4.15.18/4.16.7 (OS - Ubuntu 16.04.5 LTS(hwe)/18.04.1 LTS or Fedora 27).
>>
>> The issue is described here -
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6
>>
>> I've already had some discussion about it with IBM and now we know about suggested workaround (above).
> Given this is a linux issue the discussion should go to linuxppc-dev
> (+cc) directly rather than the skiboot list.
>
>> But for know there are imho several important facts that remained "uncovered" in this discussion.
>>
>> The facts are that -
>>
>> 1) We found out that main reason of the issue is near evolution process of vanilla linux kernel taken as a base
>> for Ubuntu releases, namely - the issue is not reproducible till Linux v.4.11 vanilla release (01.05.2017) and becomes
>> reproducible from the closest release - Linux v.4.12-rc1 (13.05.2017). Further research showed that there is the commit
>> responsible for appearance of the issue - its name is "Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci",
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d.
>> More definitely - there are only two patches (of author Yongji Xie, in this commit) which enable the issue - they are:
>>
>> "PCI: Add pcibios_default_alignment() for arch-specific alignment control",
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94
>>
>> and
>> "powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned"
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82.
>>
>> If we remove the code strings of these patches from 4.15.18 kernel in Ubuntu 18.04.1 LTS (or in OS Fedora 27) - the issue disappears and
>> everything, regarding EEH/NVMe-issue, works fine.
>>
>> 2) Finally, we found out the exact point in kernel source which begins the abnormal behaviour during boot process and starts the sequence
>> with EEH/NVMe-issue symptoms - stacktraces like show below
>>
>> [   16.922094] nvme 0000:07:00.0: Using 64-bit DMA iommu bypass
>> [   16.922164] EEH: Frozen PHB#21-PE#fd detected
>> [   16.922171] EEH: PE location: S002103, PHB location: N/A
>> [   16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted 4.15.18.ppc64le #1
>> [   16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass
>> [   16.922194] Workqueue: nvme-wq nvme_reset_work [nvme]
>> [   16.922198] Call Trace:
>> [   16.922208] [c00001ffb07cba10] [c000000000ba5b3c] dump_stack+0xb0/0xf4 (unreliable)
>> [   16.922219] [c00001ffb07cba50] [c00000000003f4e8] eeh_dev_check_failure+0x598/0x5b0
>> [   16.922227] [c00001ffb07cbaf0] [c00000000003f58c] eeh_check_failure+0x8c/0xd0
>> [   16.922236] [c00001ffb07cbb30] [d00000003c443970] nvme_reset_work+0x398/0x1890 [nvme]
>> [   16.922240] nvme 0022:04:00.0: enabling device (0140 -> 0142)
>> [   16.922247] [c00001ffb07cbc80] [c000000000136e08] process_one_work+0x248/0x540
>> [   16.922253] [c00001ffb07cbd20] [c000000000137198] worker_thread+0x98/0x5f0
>> [   16.922260] [c00001ffb07cbdc0] [c00000000014100c] kthread+0x1ac/0x1c0
>> [   16.922268] [c00001ffb07cbe30] [c00000000000bca0] ret_from_kernel_thread+0x5c/0xbc.
>>
>>
>>
>> and part of nonseen NVMe-disks in /dev/-tree. This point is in nvme_pci_enable() function in readl()-execution string -
>>
>> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088.
>>
>> When 64K alignment is not set by the Yongji Xie patch, this point is handled successfully. But when the 64K alignment is set from this patch,
>> those readl() execution give us 0xffffffff inside the code execution of this function (inside eeh_readl()) - in_le32() returns 0xffffffff here -
>>
>> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379.
>>
>> 3) Btw, from linux kernel documentation we know that for pci error recovery technique -
>>
>> "STEP 0: Error Event
>> -------------------
>> A PCI bus error is detected by the PCI hardware.  On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff,
>> all writes are ignored."
>>
>> Also we know that 2)step-readl() gets virtual address of NVME_REG_CSTS register. I suppose that this readl() operation concerns to MMIO
>> read mechanism. And in the same kernel documentation there are some explanation -
>>
>>                 CPU                  CPU                  Bus
>>               Virtual              Physical             Address
>>               Address              Address               Space
>>                Space                Space
>>
>>              +-------+             +------+             +------+
>>              |       |             |MMIO  |   Offset    |      |
>>              |       |  Virtual    |Space |   applied   |      |
>>            C +-------+ --------> B +------+ ----------> +------+ A
>>              |       |  mapping    |      |   by host   |      |
>>    +-----+   |       |             |      |   bridge    |      |   +--------+
>>    |     |   |       |             +------+             |      |   |        |
>>    | CPU |   |       |             | RAM  |             |      |   | Device |
>>    |     |   |       |             |      |             |      |   |        |
>>    +-----+   +-------+             +------+             +------+   +--------+
>>              |       |  Virtual    |Buffer|   Mapping   |      |
>>            X +-------+ --------> Y +------+ <---------- +------+ Z
>>              |       |  mapping    | RAM  |   by IOMMU
>>              |       |             |      |
>>              |       |             |      |
>>              +-------+             +------+
>>
>>
>> "During the enumeration process, the kernel learns about I/O devices and
>> their MMIO space and the host bridges that connect them to the system.  For
>> example, if a PCI device has a BAR, the kernel reads the bus address (A)
>> from the BAR and converts it to a CPU physical address (B).  The address B
>> is stored in a struct resource and usually exposed via /proc/iomem.  When a
>> driver claims a device, it typically uses ioremap() to map physical address
>> B at a virtual address (C).  It can then use, e.g., ioread32(C), to access
>> the device registers at bus address A."
>>
>> So, we can conclude from this scheme and readl() execution string above that
>> readl() execution must step through C->B->A path from virtual address to physical and then
>> to bus address. I've added printk strings to some places in linux kernel, like nvme_dev_map()
>> and so, to check the physical memory allocations, virtual memory allocations, and ioremap()
>> results. Everything is ok in printk outputs even when EEH/NVMe-issue is reproduced.
>>
>> Hence, the hypothesis of EEH/NVMe-issue is in the B->A mapping which must be done,
>> regarding the scheme above, - by PHB3 on POWER8 chip. When alignment=0, this process
>> works fine, but with alignment=64K - this translation to bus addresses reproduce EEH/NVMe-issue,
>> when more than one NVMe-disk are enabled in parallel with pcie bridge. Based on another commit
>> but for phb4 - "phb4: Only clear some PHB config space registers on errors"
>>
>> https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9
>>
>> i suppose that the main reason may be in race in skiboot mechanism (or like firmware/hardware), which
>> frequently doesn't have enough time for B->A translation when alignment is ON and =64K -
>> i.e. in some entity which implement the pcie logic according to B->A translation with align=64K
>> (may be bound with coherent cache layer).
>>
>> Question. Could you comment this hypothesis ? Is it far from to be true from your point of view ?
>> May be this letter is enough to start the research in this direction to find the exact reason of
>> mentioned EEH/NVMe-issue ?
> Hi there,
>
> So I spent a fair amount of time debugging that issue and I don't
> really see anything here that suggests there is some kind of
> underlying translation issue. To elaborate, the race being worked
> around in db2173198b95 is due to this code inside
> pci_enable_device_flags():
>
>           if (atomic_inc_return(&dev->enable_cnt) > 1)
>                   return 0;               /* already enabled */
>
> If two threads are probing two different PCI devices with a common
> upstream bridge they will both call pci_enable_device_flags() on their
> common bridge (e.g. pcie switch upstream port). Since it uses an
> atomic increment only one thread will attempt to enable the bridge and
> the other thread will return from the function assuming the bridge is
> already enabled. Keep in mind that a bridge will always forward PCI
> configuration space accesses so the remainder of the PCI device enable
> process will succeed and the 2nd thread well return to the driver
> which assumes that MMIOs are now permitted since the device is
> enabled. If the first thread has not updated the PCI_COMMAND register
> of the upstream bridge to allow forwarding of memory cycles then any
> MMIOs done by the 2nd thread will result in an Unsupported Request
> error which invokes EEH.
>
> To my eye, everything you have mentioned here is completely consistent
> with the underlying cause being the bridge enable race. I don't know
> why the BAR alignment patch would expose that race, but I don't see
> any reason to believe there is some other address translation issue.
>
>> (btw, i understand that another reason may be in the format of memory allocated for pcie resources
>> and its concrete field values. But according to things said above - i precisely tend to the hypothesis above).
>>
>>
>>
>>
>> ps: there are some other insteresting facts about experiments with this EEH/NVMe-issue - like for
>> example - this issue is not reproducible with CONFIG_PCIEASPM='y' (in kernel config), which is always set
>> to 'y' when CONFIG_PCIEPORTBUS='y' (D3hot, D3cold actions with some pauses in the appropriate
>> ASPM code may give necessary time for B->A translation).
> As brief look at pci_device_enable() and friends suggests that
> additional work happens in the enable step when ASPM is enabled which
> helps mask the bridge enable race.
>
>> Regards,
>> Dmitriy.
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot at lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/skiboot/attachments/20181108/91a6181b/attachment-0001.html>