[Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.
d.koltsov at yadro.com
Tue Oct 30 01:27:46 AEDT 2018
There is an EEH/NVMe issue on our system under test, a ppc64le POWER8-based
Yadro VESNIN server (with at least 2 NVMe disks), running
Linux kernel 4.15.17/4.15.18/4.16.7 (OS: Ubuntu 16.04.5
LTS (hwe)/18.04.1 LTS or Fedora 27).
The issue is described here -
I've already discussed it with IBM, and we now know about the
suggested workaround (above).
But for now there are, imho, several important facts that remained
"uncovered" in this discussion.
The facts are:
1) We found that the issue traces back to the evolution of the vanilla
Linux kernel used as the base for Ubuntu releases. Specifically, the issue
is not reproducible up to the Linux v4.11 vanilla release (01.05.2017) and
becomes reproducible from the next release, Linux v4.12-rc1 (13.05.2017).
Further research showed that a single commit is
responsible for the appearance of the issue - its name is "Merge tag
More precisely, only two patches (by Yongji Xie, in
this commit) trigger the issue. They are:
"PCI: Add pcibios_default_alignment() for arch-specific alignment control",
"powerpc/powernv: Override pcibios_default_alignment() to force PCI
devices to be page aligned"
If we remove the lines added by these patches from the 4.15.18 kernel in
Ubuntu 18.04.1 LTS (or in Fedora 27), the issue disappears and
everything related to the EEH/NVMe issue works fine.
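
To make the effect of forcing page alignment more concrete, here is a minimal user-space sketch in C. It is not the kernel code from the patches above. The helper names align_up() and effective_bar_size() are hypothetical, and the 64K page size is an assumption chosen to match the alignment discussed in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size (64K); an illustrative value, not read from the system. */
#define MODEL_PAGE_SIZE UINT64_C(0x10000)

/* Round addr up to the next multiple of align.
 * align must be a non-zero power of two. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Space a BAR effectively occupies when a minimum alignment is forced.
 * align == 0 models "no forced alignment": the BAR keeps its own size.
 * A BAR smaller than align is treated as occupying a full align-sized slot. */
uint64_t effective_bar_size(uint64_t bar_size, uint64_t align)
{
    if (align == 0 || bar_size >= align)
        return bar_size;
    return align;
}
```

In this model, a BAR of an assumed size of 16K (0x4000) keeps its size with alignment 0, but occupies a whole 64K slot (0x10000) once the alignment is forced to MODEL_PAGE_SIZE, and its start address moves to the next 64K boundary.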
2) Finally, we found the exact point in the kernel source where
the abnormal behaviour begins during boot and starts the sequence
of EEH/NVMe issue symptoms: stack traces like the one shown below
[ 16.922094] nvme 0000:07:00.0: Using 64-bit DMA iommu bypass
[ 16.922164] EEH: Frozen PHB#21-PE#fd detected
[ 16.922171] EEH: PE location: S002103, PHB location: N/A
[ 16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted
[ 16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass
[ 16.922194] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 16.922198] Call Trace:
[ 16.922208] [c00001ffb07cba10] [c000000000ba5b3c]
[ 16.922219] [c00001ffb07cba50] [c00000000003f4e8]
[ 16.922227] [c00001ffb07cbaf0] [c00000000003f58c]
[ 16.922236] [c00001ffb07cbb30] [d00000003c443970]
[ 16.922240] nvme 0022:04:00.0: enabling device (0140 -> 0142)
[ 16.922247] [c00001ffb07cbc80] [c000000000136e08]
[ 16.922253] [c00001ffb07cbd20] [c000000000137198]
[ 16.922260] [c00001ffb07cbdc0] [c00000000014100c]
[ 16.922268] [c00001ffb07cbe30] [c00000000000bca0]
and some of the NVMe disks missing from the /dev tree. This point is in
the nvme_pci_enable() function, at the readl() call -
When the 64K alignment is not set by the Yongji Xie patch, this point is
handled successfully. But when the 64K alignment is set by this patch,
the readl() call gives us 0xffffffff: further down in the execution of
this function (inside eeh_readl()), in_le32() returns 0xffffffff.
3) Btw, from the Linux kernel documentation on the PCI error
recovery technique we know that -
"STEP 0: Error Event
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored."
We also know that the readl() from point 2) gets the virtual address of the
NVME_REG_CSTS register. I suppose that this readl() operation goes through
the MMIO read mechanism. And in the same kernel documentation there are some
               CPU                  CPU                  Bus
             Virtual              Physical             Address
             Address              Address               Space

            +-------+             +------+             +------+
            |       |             |MMIO  |   Offset    |      |
            |       |  Virtual    |Space |   applied   |      |
          C +-------+ --------> B +------+ ----------> +------+ A
            |       |  mapping    |      |   by host   |      |
  +-----+   |       |             |      |   bridge    |      |   +--------+
  |     |   |       |             +------+             |      |   |        |
  | CPU |   |       |             | RAM  |             |      |   | Device |
  |     |   |       |             |      |             |      |   |        |
  +-----+   +-------+             +------+             +------+   +--------+
            |       |  Virtual    |Buffer|   Mapping   |      |
          X +-------+ --------> Y +------+ <---------- +------+ Z
            |       |  mapping    | RAM  |   by IOMMU
            |       |             |      |
            |       |             |      |
"During the enumeration process, the kernel learns about I/O devices and
their MMIO space and the host bridges that connect them to the system. For
example, if a PCI device has a BAR, the kernel reads the bus address (A)
from the BAR and converts it to a CPU physical address (B). The address B
is stored in a struct resource and usually exposed via /proc/iomem. When a
driver claims a device, it typically uses ioremap() to map physical address
B at a virtual address (C). It can then use, e.g., ioread32(C), to access
the device registers at bus address A."
So, we can conclude from this scheme and the readl() call above that
the readl() must go through the C->B->A path: from the virtual address to
the physical address and then to the bus address. I've added printk lines
in several places in the Linux kernel, such as nvme_dev_map(), to check the
physical memory allocations, the virtual memory allocations, and the
ioremap() results. Everything is ok in the printk output even when the
EEH/NVMe issue is reproduced.
Hence, my hypothesis is that the EEH/NVMe issue lies in the B->A mapping,
which, according to the scheme above, must be done by PHB3 on the POWER8
chip. With alignment=0 it works fine, but with alignment=64K this
translation to bus addresses appears to go wrong
when more than one NVMe disk is enabled in parallel behind the PCIe bridge.
Based on another commit
but for phb4 - "phb4: Only clear some PHB config space registers on errors"
I suppose that the main cause may be a race in the skiboot mechanism (or
something similar in firmware/hardware), which
frequently doesn't have enough time for the B->A translation when alignment
is on and set to 64K,
i.e. in some entity which implements the PCIe logic for the B->A
translation with align=64K
(possibly tied to the coherent cache layer).
Question: could you comment on this hypothesis? Is it far from the truth
from your point of view?
Maybe this letter is enough to start research in this direction, to
find the exact cause of the
EEH/NVMe issue mentioned above?
(btw, I understand that another cause may be the format of the memory
allocated for PCIe resources
and its concrete field values. But given the points above, I
clearly lean towards the hypothesis above).
ps: there are some other interesting facts from experiments with this
EEH/NVMe issue. For
example, the issue is not reproducible with CONFIG_PCIEASPM='y' (in the
kernel config), which is always set
to 'y' when CONFIG_PCIEPORTBUS='y' (the D3hot and D3cold actions, with some
pauses in the corresponding
ASPM code, may give the necessary time for the B->A translation).