<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<samp>Hi, Oliver.<br>
<br>
Your explanation of the EEH/NVMe issue looks close to the truth.<br>
<br>
I've applied a simple patch to drivers/nvme/host/pci.c:<br>
<br>
</samp>
<blockquote>
<blockquote><samp>@@ -29,6 +29,7 @@</samp><br>
<samp> #include <linux/types.h></samp><br>
<samp> #include <linux/io-64-nonatomic-lo-hi.h></samp><br>
<samp> #include <linux/sed-opal.h></samp><br>
<samp>+#include <linux/delay.h></samp><br>
<samp> </samp><br>
<samp> #include "nvme.h"</samp><br>
<samp> </samp><br>
<samp>@@ -2073,11 +2074,14 @@</samp><br>
<samp> </samp><br>
<samp> static int nvme_pci_enable(struct nvme_dev *dev)</samp><br>
<samp> {</samp><br>
<samp>+ static int num=0;</samp><br>
<samp> int result = -ENOMEM;</samp><br>
<samp> struct pci_dev *pdev = to_pci_dev(dev->dev);</samp><br>
<samp>+ num=num+1;</samp><br>
<samp> </samp><br>
<samp> if (pci_enable_device_mem(pdev))</samp><br>
<samp> return result;</samp><br>
<samp>+ msleep(num);</samp><br>
<samp> </samp><br>
<samp> pci_set_master(pdev);</samp><br>
</blockquote>
</blockquote>
<samp><br>
and found an approximate lower bound for the delay that makes the EEH/NVMe issue
disappear (the static counter makes each successive nvme_pci_enable() call sleep one
millisecond longer than the previous one, so the probes are staggered).<br>
That delay is about 1 millisecond (see the code above). So it is a stable effect: when
the nvme disks each enable their upstream bridge at least 1 (one) millisecond apart,
the EEH/NVMe issue is not reproducible. Hence my hypothesis: because of the BAR
alignment patch (see the previous message), the bridge enable procedure takes
significantly longer than in the alignment=0 case. So, in line with your hypothesis
about several threads "pinging" the PHB, one of those threads starts writing to the
PHB before the necessary time (~1 ms) has passed. A hardware error then follows, and
readl() of NVME_REG_CSTS returns 0xffffffff in nvme_pci_enable().<br>
(And when align=0, the PHB enable procedure takes significantly less than 1 ms, so the
other threads, which don't enable the PHB themselves, see no conflict: by the time they
write to the PHB, it is already really enabled.)<br>
I don't see any contradiction in this version of the sequence of events leading to the
EEH/NVMe issue.<br>
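To check this more directly than with the msleep() sweep above, one could time the
enable path and log the upstream bridge's enable_cnt from nvme_pci_enable(). Below is
only a rough experimental sketch of what I mean (the placement and the messages are my
own, not existing kernel code):<br>
</samp>
<blockquote>
<pre>
/*
 * Experimental instrumentation sketch -- not existing kernel code.
 * Measures how long pci_enable_device_mem() takes (which includes the
 * upstream bridge enable when this thread is the one that performs it)
 * and logs the bridge's enable_cnt, so racing probe threads show up in
 * dmesg. Needs #include <linux/ktime.h> and #include <linux/pci.h>.
 */
static int nvme_pci_enable(struct nvme_dev *dev)
{
        int result = -ENOMEM;
        struct pci_dev *pdev = to_pci_dev(dev->dev);
        struct pci_dev *bridge = pci_upstream_bridge(pdev);
        ktime_t start = ktime_get();

        if (pci_enable_device_mem(pdev))
                return result;

        if (bridge)
                dev_info(dev->dev,
                         "bridge %s enable_cnt=%d, enable took %lld us\n",
                         pci_name(bridge),
                         atomic_read(&bridge->enable_cnt),
                         ktime_us_delta(ktime_get(), start));

        /* ... rest of nvme_pci_enable() unchanged ... */
</pre>
</blockquote>
<samp>
If the logged enable time is well above ~1 ms with align=64K and well below it with
align=0, while enable_cnt shows that another thread enabled the bridge, that would
support the hypothesis without relying on the msleep() values.<br>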
<br>
So, there are two questions:<br>
<br>
Q.1. Does my hypothesis about the ~1 ms delay, and about the difference in PHB enable
procedure duration between the two cases (align=0 and align=64K), look plausible to you?<br>
<br>
Q.2. If the answer to Q.1 is 'yes': are there procedures in the POWER8 PHB firmware or
hardware implementation that may "extend" the PHB enable procedure execution time when
the alignment is changed to 64K? Maybe in OPAL, or on-chip? Or is the only "suspect area"
for the EEH/NVMe issue in the Linux kernel code?<br>
<br>
<br>
<br>
<br>
Regards,<br>
Dmitriy.<br>
</samp><br>
<div class="moz-cite-prefix">On 05.11.2018 07:30, Oliver wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOSf1CEAqOgekkz3iq53n2mcn0MqFAUp1rbTAHdyDBFgajFpBw@mail.gmail.com">
<pre wrap="">On Tue, Oct 30, 2018 at 1:28 AM Koltsov Dmitriy <a class="moz-txt-link-rfc2396E" href="mailto:d.koltsov@yadro.com"><d.koltsov@yadro.com></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">
Hello.
There is an EEH/NVMe issue on the SUT, a ppc64le POWER8-based Yadro VESNIN server (with at least 2 NVMe disks), running
Linux kernel 4.15.17/4.15.18/4.16.7 (OS: Ubuntu 16.04.5 LTS (hwe)/18.04.1 LTS or Fedora 27).
The issue is described here -
<a class="moz-txt-link-freetext" href="https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6">https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6</a>
I've already had some discussion about it with IBM, and now we know about the suggested workaround (above).
</pre>
</blockquote>
<pre wrap="">
Given this is a linux issue the discussion should go to linuxppc-dev
(+cc) directly rather than the skiboot list.
</pre>
<blockquote type="cite">
<pre wrap="">But for now there are, imho, several important facts that have remained "uncovered" in this discussion.
The facts are:
1) We found out that the root cause of the issue lies in the evolution of the vanilla Linux kernel taken as a base
for the Ubuntu releases, namely: the issue is not reproducible up to the Linux v4.11 vanilla release (01.05.2017) and becomes
reproducible from the very next release, Linux v4.12-rc1 (13.05.2017). Further research showed that there is a commit
responsible for the appearance of the issue - "Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci",
see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d</a>.
More precisely, there are only two patches in this commit (by Yongji Xie) that enable the issue - they are:
"PCI: Add pcibios_default_alignment() for arch-specific alignment control",
see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94</a>
and
"powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned"
see <a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82</a>.
If we remove the code of these patches from the 4.15.18 kernel in Ubuntu 18.04.1 LTS (or in Fedora 27), the issue disappears and
everything related to the EEH/NVMe issue works fine.
2) Finally, we found the exact point in the kernel source where the abnormal behaviour during the boot process begins and the sequence
of EEH/NVMe-issue symptoms starts - stack traces like the one shown below
[ 16.922094] nvme 0000:07:00.0: Using 64-bit DMA iommu bypass
[ 16.922164] EEH: Frozen PHB#21-PE#fd detected
[ 16.922171] EEH: PE location: S002103, PHB location: N/A
[ 16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted 4.15.18.ppc64le #1
[ 16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass
[ 16.922194] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 16.922198] Call Trace:
[ 16.922208] [c00001ffb07cba10] [c000000000ba5b3c] dump_stack+0xb0/0xf4 (unreliable)
[ 16.922219] [c00001ffb07cba50] [c00000000003f4e8] eeh_dev_check_failure+0x598/0x5b0
[ 16.922227] [c00001ffb07cbaf0] [c00000000003f58c] eeh_check_failure+0x8c/0xd0
[ 16.922236] [c00001ffb07cbb30] [d00000003c443970] nvme_reset_work+0x398/0x1890 [nvme]
[ 16.922240] nvme 0022:04:00.0: enabling device (0140 -> 0142)
[ 16.922247] [c00001ffb07cbc80] [c000000000136e08] process_one_work+0x248/0x540
[ 16.922253] [c00001ffb07cbd20] [c000000000137198] worker_thread+0x98/0x5f0
[ 16.922260] [c00001ffb07cbdc0] [c00000000014100c] kthread+0x1ac/0x1c0
[ 16.922268] [c00001ffb07cbe30] [c00000000000bca0] ret_from_kernel_thread+0x5c/0xbc.
and some of the NVMe disks missing from the /dev tree. This point is the readl() call in the nvme_pci_enable() function -
<a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088</a>.
When the 64K alignment is not set by the Yongji Xie patches, this point is handled successfully. But when the 64K alignment is set by these patches,
that readl() gives us 0xffffffff inside this function (inside eeh_readl()) - in_le32() returns 0xffffffff here -
<a class="moz-txt-link-freetext" href="https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379</a>.
3) Btw, from the Linux kernel documentation we know the following about the PCI error recovery technique -
"STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored."
Also we know that the readl() from step 2) takes the virtual address of the NVME_REG_CSTS register. I suppose this readl() goes through the MMIO
read mechanism. And the same kernel documentation gives some explanation -
               CPU                  CPU                  Bus
             Virtual              Physical             Address
             Address              Address               Space
              Space                Space

            +-------+             +------+             +------+
            |       |             |MMIO  |   Offset    |      |
            |       |  Virtual    |Space |   applied   |      |
          C +-------+ --------> B +------+ ----------> +------+ A
            |       |  mapping    |      |   by host   |      |
  +-----+   |       |             |      |   bridge    |      |   +--------+
  |     |   |       |             +------+             |      |   |        |
  | CPU |   |       |             | RAM  |             |      |   | Device |
  |     |   |       |             |      |             |      |   |        |
  +-----+   +-------+             +------+             +------+   +--------+
            |       |  Virtual    |Buffer|   Mapping   |      |
          X +-------+ --------> Y +------+ <---------- +------+ Z
            |       |  mapping    | RAM  |   by IOMMU
            |       |             |      |
            |       |             |      |
            +-------+             +------+
"During the enumeration process, the kernel learns about I/O devices and
their MMIO space and the host bridges that connect them to the system. For
example, if a PCI device has a BAR, the kernel reads the bus address (A)
from the BAR and converts it to a CPU physical address (B). The address B
is stored in a struct resource and usually exposed via /proc/iomem. When a
driver claims a device, it typically uses ioremap() to map physical address
B at a virtual address (C). It can then use, e.g., ioread32(C), to access
the device registers at bus address A."
So, from this scheme and the readl() call mentioned above we can conclude that
the readl() must step through the C->B->A path, from virtual address to physical address and then
to bus address. I've added printk calls to several places in the kernel, such as nvme_dev_map(),
to check the physical memory allocations, virtual memory allocations, and ioremap()
results. Everything looks fine in the printk output even when the EEH/NVMe issue is reproduced.
Hence the hypothesis that the EEH/NVMe issue lies in the B->A mapping, which, according to
the scheme above, must be done by the PHB3 on the POWER8 chip. With alignment=0 this process
works fine, but with alignment=64K this translation to bus addresses reproduces the EEH/NVMe issue
when more than one NVMe disk is enabled in parallel with the PCIe bridge. Based on another commit,
for phb4 - "phb4: Only clear some PHB config space registers on errors"
<a class="moz-txt-link-freetext" href="https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9">https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9</a>
I suppose that the main reason may be a race in some skiboot (or similar firmware/hardware) mechanism, which
frequently doesn't have enough time for the B->A translation when the alignment is on and equal to 64K -
i.e. in whatever entity implements the PCIe B->A translation logic with align=64K
(possibly tied to the coherent cache layer).
Question: could you comment on this hypothesis? Is it far from the truth from your point of view?
Maybe this letter is enough to start research in this direction and find the exact reason for the
mentioned EEH/NVMe issue?
</pre>
</blockquote>
<pre wrap="">
Hi there,
So I spent a fair amount of time debugging that issue and I don't
really see anything here that suggests there is some kind of
underlying translation issue. To elaborate, the race being worked
around in db2173198b95 is due to this code inside
pci_enable_device_flags():
        if (atomic_inc_return(&dev->enable_cnt) > 1)
                return 0; /* already enabled */
If two threads are probing two different PCI devices with a common
upstream bridge they will both call pci_enable_device_flags() on their
common bridge (e.g. pcie switch upstream port). Since it uses an
atomic increment only one thread will attempt to enable the bridge and
the other thread will return from the function assuming the bridge is
already enabled. Keep in mind that a bridge will always forward PCI
configuration space accesses so the remainder of the PCI device enable
process will succeed and the 2nd thread will return to the driver
which assumes that MMIOs are now permitted since the device is
enabled. If the first thread has not updated the PCI_COMMAND register
of the upstream bridge to allow forwarding of memory cycles then any
MMIOs done by the 2nd thread will result in an Unsupported Request
error which invokes EEH.
To my eye, everything you have mentioned here is completely consistent
with the underlying cause being the bridge enable race. I don't know
why the BAR alignment patch would expose that race, but I don't see
any reason to believe there is some other address translation issue.
</pre>
<blockquote type="cite">
<pre wrap="">(Btw, I understand that another cause could lie in the layout of the memory allocated for PCIe resources
and its concrete field values. But given everything said above, I lean firmly towards the hypothesis above.)
PS: there are some other interesting facts from experiments with this EEH/NVMe issue - for
example, the issue is not reproducible with CONFIG_PCIEASPM='y' (in the kernel config), which is always set
to 'y' when CONFIG_PCIEPORTBUS='y' (the D3hot/D3cold actions, with some pauses in the appropriate
ASPM code, may give the necessary time for the B->A translation).
</pre>
</blockquote>
<pre wrap="">
A brief look at pci_enable_device() and friends suggests that
additional work happens in the enable step when ASPM is enabled, which
helps mask the bridge enable race.
</pre>
<blockquote type="cite">
<pre wrap="">
Regards,
Dmitriy.
_______________________________________________
Skiboot mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Skiboot@lists.ozlabs.org">Skiboot@lists.ozlabs.org</a>
<a class="moz-txt-link-freetext" href="https://lists.ozlabs.org/listinfo/skiboot">https://lists.ozlabs.org/listinfo/skiboot</a>
</pre>
</blockquote>
</blockquote>
<br>
</body>
</html>