<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hello.</p>
There is an EEH/NVMe issue on the SUT, a ppc64le POWER8-based Yadro
VESNIN server (with at least 2 NVMe disks), running<br>
Linux kernel 4.15.17/4.15.18/4.16.7 (OS: Ubuntu 16.04.5
LTS (HWE) / 18.04.1 LTS, or Fedora 27).<br>
<br>
The issue is described here -<br>
<br>
<br>
<blockquote>
<blockquote><a class="moz-txt-link-freetext"
href="https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6">https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=db2173198b9513f7add8009f225afa1f1c79bcc6</a><br>
</blockquote>
</blockquote>
<br>
I have already had some discussion about it with IBM, and we now know
about the suggested workaround (above).<br>
<br>
However, in my opinion there are several important facts that remained
uncovered in that discussion.<br>
<br>
The facts are these:<br>
<br>
1) We found out that the root cause of the issue lies in the evolution
of the vanilla Linux kernel taken as a base <br>
for the Ubuntu releases, namely: the issue is not reproducible up to the
Linux v4.11 vanilla release (01.05.2017) and becomes <br>
reproducible starting from the very next release, Linux v4.12-rc1
(13.05.2017). Further research showed that the commit <br>
responsible for the appearance of the issue is "Merge tag
'pci-v4.12-changes' of
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci", <br>
see <a class="moz-txt-link-freetext"
href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d</a>.
<br>
More precisely, there are only two patches in this commit (by Yongji
Xie) which enable the issue; they are: <br>
<br>
"PCI: Add pcibios_default_alignment() for arch-specific alignment
control", <br>
see <a class="moz-txt-link-freetext"
href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94</a>
<br>
<br>
and <br>
"powerpc/powernv: Override pcibios_default_alignment() to force PCI
devices to be page aligned" <br>
see <a class="moz-txt-link-freetext"
href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82</a>.
<br>
<br>
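A rough sketch of what these two patches do, as I understand them (a
simplified view, not the exact diff; in the real patches the powerpc
side is wired through a ppc_md hook):<br>
<pre>/* Generic PCI core: a new weak hook; by default no extra alignment
 * is forced on PCI BARs. */
resource_size_t __weak pcibios_default_alignment(void)
{
	return 0;
}

/* powerpc/powernv: override the hook so that every PCI BAR is at
 * least page aligned (64K pages on these distro kernels, hence the
 * 64K alignment discussed below). */
static resource_size_t pnv_pci_default_alignment(void)
{
	return PAGE_SIZE;
}</pre>
<br>
<br>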
If we remove the code introduced by these patches from the 4.15.18 kernel
of Ubuntu 18.04.1 LTS (or of Fedora 27), the issue disappears
and<br>
everything related to the EEH/NVMe issue works fine.<br>
<br>
2) Finally, we found the exact point in the kernel source where the
abnormal behaviour begins during the boot process and which starts the
sequence<br>
of EEH/NVMe-issue symptoms: stack traces like the one shown below<br>
<br>
<blockquote>
<blockquote>[ 16.922094] nvme 0000:07:00.0: Using 64-bit DMA
iommu bypass <br>
[ 16.922164] EEH: Frozen PHB#21-PE#fd detected <br>
[ 16.922171] EEH: PE location: S002103, PHB location: N/A <br>
[ 16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted
4.15.18.ppc64le #1 <br>
[ 16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass
<br>
[ 16.922194] Workqueue: nvme-wq nvme_reset_work [nvme] <br>
[ 16.922198] Call Trace: <br>
[ 16.922208] [c00001ffb07cba10] [c000000000ba5b3c]
dump_stack+0xb0/0xf4 (unreliable) <br>
[ 16.922219] [c00001ffb07cba50] [c00000000003f4e8]
eeh_dev_check_failure+0x598/0x5b0 <br>
[ 16.922227] [c00001ffb07cbaf0] [c00000000003f58c]
eeh_check_failure+0x8c/0xd0 <br>
[ 16.922236] [c00001ffb07cbb30] [d00000003c443970]
nvme_reset_work+0x398/0x1890 [nvme] <br>
[ 16.922240] nvme 0022:04:00.0: enabling device (0140 ->
0142) <br>
[ 16.922247] [c00001ffb07cbc80] [c000000000136e08]
process_one_work+0x248/0x540 <br>
[ 16.922253] [c00001ffb07cbd20] [c000000000137198]
worker_thread+0x98/0x5f0 <br>
[ 16.922260] [c00001ffb07cbdc0] [c00000000014100c]
kthread+0x1ac/0x1c0 <br>
[ 16.922268] [c00001ffb07cbe30] [c00000000000bca0]
ret_from_kernel_thread+0x5c/0xbc.<br>
</blockquote>
</blockquote>
<br>
and some NVMe disks missing from the /dev tree. This point is the
readl() call in the nvme_pci_enable() function:<br>
<br>
<a class="moz-txt-link-freetext"
href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/pci.c?id=Ubuntu-4.15.0-29.31#n2088</a>.<br>
<br>
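For reference, the check at that line looks roughly like this (quoted
from memory; the exact line number differs between kernel versions):<br>
<pre>	/* drivers/nvme/host/pci.c, nvme_pci_enable(): the first MMIO read of
	 * the controller status register. On a frozen PE every MMIO read
	 * returns 0xffffffff, so this check fails and the probe bails out. */
	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
		result = -ENODEV;
		goto disable;
	}</pre>
<br>
<br>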
When the 64K alignment is not set by the Yongji Xie patches, this point is
handled successfully. But when the 64K alignment is set by these
patches,<br>
that readl() returns 0xffffffff inside the code executed by this function
(inside eeh_readl()): in_le32() returns 0xffffffff
here -<br>
<br>
<a class="moz-txt-link-freetext"
href="https://git.launchpad.net/%7Eubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379">https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/arch/powerpc/include/asm/eeh.h?id=Ubuntu-4.15.0-29.31#n379</a>.<br>
<br>
3) By the way, from the Linux kernel documentation we know the following
about the PCI error recovery technique:<br>
<br>
"STEP 0: Error Event<br>
-------------------<br>
A PCI bus error is detected by the PCI hardware. On powerpc, the
slot is isolated, in that all I/O is blocked: all reads return
0xffffffff,<br>
all writes are ignored."<br>
<br>
We also know that the readl() from step 2) takes the virtual address of
the NVME_REG_CSTS register. I suppose that this readl() operation
goes through the MMIO<br>
read mechanism. The same kernel documentation gives this
explanation:<br>
<br>
<pre> CPU CPU Bus
Virtual Physical Address
Address Address Space
Space Space
+-------+ +------+ +------+
| | |MMIO | Offset | |
| | Virtual |Space | applied | |
C +-------+ --------> B +------+ ----------> +------+ A
| | mapping | | by host | |
+-----+ | | | | bridge | | +--------+
| | | | +------+ | | | |
| CPU | | | | RAM | | | | Device |
| | | | | | | | | |
+-----+ +-------+ +------+ +------+ +--------+
| | Virtual |Buffer| Mapping | |
X +-------+ --------> Y +------+ <---------- +------+ Z
| | mapping | RAM | by IOMMU
| | | |
| | | |
+-------+ +------+</pre>
<br>
"During the enumeration process, the kernel learns about I/O devices
and<br>
their MMIO space and the host bridges that connect them to the
system. For<br>
example, if a PCI device has a BAR, the kernel reads the bus address
(A)<br>
from the BAR and converts it to a CPU physical address (B). The
address B<br>
is stored in a struct resource and usually exposed via /proc/iomem.
When a<br>
driver claims a device, it typically uses ioremap() to map physical
address<br>
B at a virtual address (C). It can then use, e.g., ioread32(C), to
access<br>
the device registers at bus address A."<br>
<br>
So, from this scheme and the readl() call above we can conclude
that<br>
the readl() must step through the C->B->A path, from the virtual
address to the physical address and then<br>
to the bus address. I have added printk() calls to some places in the
kernel, such as nvme_dev_map()<br>
and others, to check the physical memory allocations, virtual memory
allocations, and ioremap()<br>
results. Everything looks fine in the printk output even when the
EEH/NVMe issue is reproduced.<br>
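<br>
The debug printks were of this kind (my own hypothetical additions, not
upstream code), e.g. at the end of nvme_dev_map() in
drivers/nvme/host/pci.c:<br>
<pre>	/* B: the CPU physical address range of BAR0 (struct resource),
	 * C: the virtual address returned by ioremap(), stored in dev->bar */
	printk(KERN_INFO "nvme %s: BAR0 %pR mapped at virt %p\n",
	       pci_name(pdev), &pdev->resource[0], dev->bar);</pre>
<br>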
<br>
Hence, the hypothesis is that the EEH/NVMe issue lies in the B->A
mapping which, according to the scheme above, must be done<br>
by the PHB3 on the POWER8 chip. When alignment=0, this
process<br>
works fine, but with alignment=64K this translation to bus
addresses reproduces the EEH/NVMe issue<br>
when more than one NVMe disk is enabled in parallel behind the PCIe
bridge. Based on another commit,<br>
this time for PHB4 - "phb4: Only clear some PHB config space registers on
errors" -<br>
<blockquote>
<blockquote><a class="moz-txt-link-freetext"
href="https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9">https://github.com/open-power/skiboot/commit/5a8c57b6f2fcb46f1cac1840062a542c597996f9</a><br>
</blockquote>
</blockquote>
<br>
I suppose that the main reason may be a race in the skiboot mechanism
(or in the firmware/hardware), which<br>
frequently does not have enough time for the B->A translation when the
alignment is enabled and equals 64K,<br>
i.e. in some entity which implements the PCIe logic for the
B->A translation with align=64K<br>
(possibly bound to the coherent cache layer).<br>
<br>
<br>
<br>
Question: could you comment on this hypothesis? Is it far from the
truth from your point of view?<br>
Maybe this letter is enough to start research in this direction
and find the exact reason for the<br>
EEH/NVMe issue mentioned above?<br>
<br>
(By the way, I understand that another reason may lie in the layout of
the memory allocated for the PCIe resources<br>
and its concrete field values. But given what is said above,
I strongly lean towards the hypothesis above.)<br>
<br>
<br>
<br>
<br>
PS: there are some other interesting facts from experiments with
this EEH/NVMe issue. For<br>
example, the issue is not reproducible with CONFIG_PCIEASPM=y
(in the kernel config), which is always set<br>
to 'y' when CONFIG_PCIEPORTBUS=y (the D3hot/D3cold actions, with some
pauses in the corresponding<br>
ASPM code, may give the necessary time for the B->A translation).<br>
<br>
<br>
<br>
<br>
Regards,<br>
Dmitriy.
</body>
</html>