[Skiboot] Important details about race condition in EEH/NVMe-issue on ppc64le.
d.koltsov at yadro.com
Fri Nov 9 02:30:13 AEDT 2018
Your account of the EEH/NVMe issue looks close to the truth.
I've applied a simple patch to drivers/nvme/host/pci.c:
@@ -29,6 +29,7 @@
@@ -2073,11 +2074,14 @@
 static int nvme_pci_enable(struct nvme_dev *dev)
 {
+	static int num = 0;
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
and found an approximate minimal estimate of the delay which removes the
issue. This delay is about 1 millisecond (see the code above). So it
appears that when all NVMe disks keep a mutual interval of at least
1 (one) millisecond each before enabling the upstream bridge, the
EEH/NVMe issue is not reproducible. Hence, I hypothesize that, due to
the BAR alignment patch (see prev message), the bridge enable procedure
lasts significantly longer than in the alignment=0 case. So, in line
with your hypothesis about several threads "pinging" the PHB, it turns
out that one of them begins to write to the PHB while the necessary time
(~1 ms) has not yet passed. Then, of course, a HW error follows and
readl() of NVME_REG_CSTS returns 0xffffffff in nvme_pci_enable().
(And when align=0 the PHB enable procedure takes significantly less
than 1 ms, so the other threads, which don't enable the PHB themselves,
have no conflict, because by the time they try to write to the PHB, it
is already really enabled.)
I don't see any contradictions in this version of the sequence of
events leading to the issue.
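To make the intent of the experiment concrete, here is a minimal userspace sketch of the spacing scheme. The counter mirrors the `static int num` from the excerpted patch; the function name and the per-slot arithmetic are mine, for illustration only, not the actual patch code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of the delay experiment: each nvme_pci_enable() call gets its
 * own slot, spaced ~1 ms apart, so upstream-bridge enables never overlap.
 * This mirrors the intent of the patch above, not its exact code. */
static atomic_int nvme_enable_seq;   /* like "static int num" above */

/* How many milliseconds this caller should sleep before enabling:
 * device 0 waits 0 ms, device 1 waits 1 ms, device 2 waits 2 ms, ... */
static int nvme_enable_delay_ms(void)
{
    return atomic_fetch_add(&nvme_enable_seq, 1) * 1 /* ms per slot */;
}
```

In the real patch the returned value would feed an msleep()-style delay before the bridge enable; the atomic counter is what makes the slots unique even when several probe threads race.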
So, there are two questions:
Q.1. Does my hypothesis about the ~1 ms delay and about the difference
in PHB enable duration between the two cases (align=0 and align=64K)
look close to the truth?
Q.2. If the answer to Q.1 is 'yes' - are there procedures in the FW or
HW of the POWER8 PHB that may "extend" the PHB enable execution time
because of the alignment change to 64K? Maybe in OPAL or on-chip? Or is
the only "suspect area" in Linux kernel procedures?
On 05.11.2018 07:30, Oliver wrote:
> On Tue, Oct 30, 2018 at 1:28 AM Koltsov Dmitriy <d.koltsov at yadro.com> wrote:
>> There is an EEH/NVMe issue on the SUT, a ppc64le POWER8-based Yadro VESNIN server (with a minimum of 2 NVMe disks), with
>> Linux kernel 4.15.17/4.15.18/4.16.7 (OS - Ubuntu 16.04.5 LTS (hwe)/18.04.1 LTS or Fedora 27).
>> The issue is described here -
>> I've already had some discussion about it with IBM and now we know about the suggested workaround (above).
> Given this is a linux issue the discussion should go to linuxppc-dev
> (+cc) directly rather than the skiboot list.
>> But for now there are, imho, several important facts that remained "uncovered" in this discussion.
>> The facts are these:
>> 1) We found out that the root of the issue lies in the evolution of the vanilla Linux kernel taken as a base
>> for Ubuntu releases, namely: the issue is not reproducible up to the Linux v4.11 vanilla release (2017-05-01) and becomes
>> reproducible from the very next release, Linux v4.12-rc1 (2017-05-13). Further research showed that the commit
>> responsible for the appearance of the issue is "Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci",
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=857f8640147c9fb43f20e43cbca6452710e1ca5d.
>> More precisely, there are only two patches (by Yongji Xie, within this commit) which trigger the issue:
>> "PCI: Add pcibios_default_alignment() for arch-specific alignment control",
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=0a701aa6378496ea54fb065c68b41d918e372e94
>> "powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned"
>> see https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/commit/?id=382746376993cfa6d6c4e546c67384201c0f3a82.
>> If we remove the code of these patches from the 4.15.18 kernel in Ubuntu 18.04.1 LTS (or in Fedora 27), the issue disappears and
>> everything regarding the EEH/NVMe issue works fine.
>> 2) Finally, we found the exact point in the kernel source where the abnormal behaviour begins during the boot process, starting the sequence
>> with the EEH/NVMe-issue symptoms - stack traces like the one shown below
>> [ 16.922094] nvme 0000:07:00.0: Using 64-bit DMA iommu bypass
>> [ 16.922164] EEH: Frozen PHB#21-PE#fd detected
>> [ 16.922171] EEH: PE location: S002103, PHB location: N/A
>> [ 16.922177] CPU: 97 PID: 590 Comm: kworker/u770:0 Not tainted 4.15.18.ppc64le #1
>> [ 16.922182] nvme 0000:08:00.0: Using 64-bit DMA iommu bypass
>> [ 16.922194] Workqueue: nvme-wq nvme_reset_work [nvme]
>> [ 16.922198] Call Trace:
>> [ 16.922208] [c00001ffb07cba10] [c000000000ba5b3c] dump_stack+0xb0/0xf4 (unreliable)
>> [ 16.922219] [c00001ffb07cba50] [c00000000003f4e8] eeh_dev_check_failure+0x598/0x5b0
>> [ 16.922227] [c00001ffb07cbaf0] [c00000000003f58c] eeh_check_failure+0x8c/0xd0
>> [ 16.922236] [c00001ffb07cbb30] [d00000003c443970] nvme_reset_work+0x398/0x1890 [nvme]
>> [ 16.922240] nvme 0022:04:00.0: enabling device (0140 -> 0142)
>> [ 16.922247] [c00001ffb07cbc80] [c000000000136e08] process_one_work+0x248/0x540
>> [ 16.922253] [c00001ffb07cbd20] [c000000000137198] worker_thread+0x98/0x5f0
>> [ 16.922260] [c00001ffb07cbdc0] [c00000000014100c] kthread+0x1ac/0x1c0
>> [ 16.922268] [c00001ffb07cbe30] [c00000000000bca0] ret_from_kernel_thread+0x5c/0xbc.
>> and some NVMe disks missing from the /dev tree. This point is the readl() call in the nvme_pci_enable() function.
>> When the 64K alignment is not set by the Yongji Xie patch, this point is passed successfully. But when the 64K alignment is set by this patch,
>> that readl() returns 0xffffffff inside this function (inside eeh_readl(), in_le32() returns 0xffffffff here).
>> 3) Btw, from the Linux kernel documentation we know this about the PCI error recovery technique:
>> "STEP 0: Error Event
>> A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff,
>> all writes are ignored."
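The all-ones behaviour quoted here is exactly what a driver can key on. A minimal illustration; the helper name is mine, not the kernel's (the real check lives in eeh_check_failure() / eeh_readl()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* On powerpc, a frozen (EEH-isolated) PE returns all-ones for every MMIO
 * read while all writes are dropped. A driver can therefore treat a
 * 0xffffffff readback as "device possibly isolated" and ask EEH to
 * check. Illustrative helper only, not kernel API. */
static bool mmio_val_looks_frozen(uint32_t val)
{
    return val == 0xffffffffu;   /* every bit set, as on isolation */
}
```

This is why the stack trace above fires from nvme_reset_work(): the readl() of NVME_REG_CSTS comes back as all-ones and the EEH check is invoked.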
>> Also we know that the step-2 readl() gets the virtual address of the NVME_REG_CSTS register. I suppose this readl() operation uses the MMIO
>> read mechanism. The same kernel documentation gives an explanation:
>> CPU CPU Bus
>> Virtual Physical Address
>> Address Address Space
>> Space Space
>> +-------+ +------+ +------+
>> | | |MMIO | Offset | |
>> | | Virtual |Space | applied | |
>> C +-------+ --------> B +------+ ----------> +------+ A
>> | | mapping | | by host | |
>> +-----+ | | | | bridge | | +--------+
>> | | | | +------+ | | | |
>> | CPU | | | | RAM | | | | Device |
>> | | | | | | | | | |
>> +-----+ +-------+ +------+ +------+ +--------+
>> | | Virtual |Buffer| Mapping | |
>> X +-------+ --------> Y +------+ <---------- +------+ Z
>> | | mapping | RAM | by IOMMU
>> | | | |
>> | | | |
>> +-------+ +------+
>> "During the enumeration process, the kernel learns about I/O devices and
>> their MMIO space and the host bridges that connect them to the system. For
>> example, if a PCI device has a BAR, the kernel reads the bus address (A)
>> from the BAR and converts it to a CPU physical address (B). The address B
>> is stored in a struct resource and usually exposed via /proc/iomem. When a
>> driver claims a device, it typically uses ioremap() to map physical address
>> B at a virtual address (C). It can then use, e.g., ioread32(C), to access
>> the device registers at bus address A."
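The C -> B -> A path in the quoted text amounts to two fixed offsets. A toy userspace model of that arithmetic, with all addresses invented for illustration (in reality the B -> A step is applied by the PHB hardware, not by software):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the C -> B -> A path from the kernel doc quoted above.
 * All addresses are made up for illustration. */
typedef struct {
    uint64_t bus_addr;    /* A: address in PCI bus space (from the BAR) */
    uint64_t phys_addr;   /* B: CPU physical address (struct resource)  */
    uint64_t virt_base;   /* C: what ioremap() returned for phys_addr   */
} bar_map_t;

/* C -> B: undo the virtual mapping set up by ioremap() */
static uint64_t virt_to_phys_off(const bar_map_t *m, uint64_t virt)
{
    return m->phys_addr + (virt - m->virt_base);
}

/* B -> A: apply the host-bridge offset (done by the PHB in hardware) */
static uint64_t phys_to_bus(const bar_map_t *m, uint64_t phys)
{
    return m->bus_addr + (phys - m->phys_addr);
}
```

The point of the model: a readl() at virtual address C only ever reaches the device if both offsets resolve correctly, which is why a fault anywhere on this path surfaces as an MMIO error.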
>> So, from this scheme and the readl() execution above we can conclude that
>> readl() must step through the C->B->A path from virtual address to physical address and then
>> to bus address. I've added printk statements at some places in the Linux kernel, like nvme_dev_map(),
>> to check the physical memory allocations, virtual memory allocations, and ioremap()
>> results. Everything looks ok in the printk output even when the EEH/NVMe issue is reproduced.
>> Hence, the hypothesis is that the EEH/NVMe issue lies in the B->A mapping which, per the
>> scheme above, must be done by the PHB3 on the POWER8 chip. When alignment=0, this process
>> works fine, but with alignment=64K this translation to bus addresses reproduces the EEH/NVMe issue
>> when more than one NVMe disk is enabled in parallel with the PCIe bridge. Based on another commit,
>> though for phb4 - "phb4: Only clear some PHB config space registers on errors" -
>> I suppose that the main reason may be a race in a skiboot mechanism (or similar firmware/hardware) which
>> frequently doesn't have enough time for the B->A translation when alignment is on and equal to 64K,
>> i.e. in some entity which implements the PCIe logic for the B->A translation with align=64K
>> (maybe bound to a coherent cache layer).
>> Question. Could you comment on this hypothesis? Is it far from the truth from your point of view?
>> Maybe this letter is enough to start research in this direction to find the exact reason for the
>> mentioned EEH/NVMe issue?
> Hi there,
> So I spent a fair amount of time debugging that issue and I don't
> really see anything here that suggests there is some kind of
> underlying translation issue. To elaborate, the race being worked
> around in db2173198b95 is due to this code inside pci_enable_device_flags():
> if (atomic_inc_return(&dev->enable_cnt) > 1)
> return 0; /* already enabled */
> If two threads are probing two different PCI devices with a common
> upstream bridge they will both call pci_enable_device_flags() on their
> common bridge (e.g. pcie switch upstream port). Since it uses an
> atomic increment only one thread will attempt to enable the bridge and
> the other thread will return from the function assuming the bridge is
> already enabled. Keep in mind that a bridge will always forward PCI
> configuration space accesses so the remainder of the PCI device enable
> process will succeed and the 2nd thread will return to the driver
> which assumes that MMIOs are now permitted since the device is
> enabled. If the first thread has not updated the PCI_COMMAND register
> of the upstream bridge to allow forwarding of memory cycles then any
> MMIOs done by the 2nd thread will result in an Unsupported Request
> error which invokes EEH.
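Oliver's scenario can be condensed into a deterministic userspace sketch. The types and helper names are mine; only the enable_cnt logic mirrors pci_enable_device_flags(), and the "all-ones on a not-yet-enabled bridge" behaviour stands in for the Unsupported Request / EEH path:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_int enable_cnt;   /* mirrors pci_dev->enable_cnt           */
    int mem_enabled;         /* PCI_COMMAND memory-space enable bit   */
} bridge_t;

/* First caller "owns" the (slow) enable; later callers return early,
 * exactly like the atomic_inc_return() check quoted above. */
static int bridge_enable_begin(bridge_t *b)
{
    if (atomic_fetch_add(&b->enable_cnt, 1) + 1 > 1)
        return 0;   /* "already enabled" -- but maybe not finished! */
    return 1;       /* this caller must complete the enable */
}

/* The PCI_COMMAND write that actually allows memory cycles. */
static void bridge_enable_finish(bridge_t *b)
{
    b->mem_enabled = 1;
}

/* If memory cycles are not yet forwarded, reads come back all-ones,
 * which is what EEH then sees on powerpc. */
static uint32_t mmio_read32(const bridge_t *b)
{
    return b->mem_enabled ? 0x1234u : 0xffffffffu;
}
```

Interleaving the calls as two probe threads would (thread 1 begins the enable, thread 2 "sees" it enabled and issues an MMIO before thread 1 finishes) reproduces the window: the second reader gets all-ones until bridge_enable_finish() has run.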
> To my eye, everything you have mentioned here is completely consistent
> with the underlying cause being the bridge enable race. I don't know
> why the BAR alignment patch would expose that race, but I don't see
> any reason to believe there is some other address translation issue.
>> (btw, I understand that another reason may lie in the format of the memory allocated for PCIe resources
>> and its concrete field values. But given the things said above, I strongly lean toward the hypothesis above.)
>> ps: there are some other interesting facts about experiments with this EEH/NVMe issue - for
>> example, the issue is not reproducible with CONFIG_PCIEASPM='y' (in the kernel config), which is always set
>> to 'y' when CONFIG_PCIEPORTBUS='y' (D3hot/D3cold actions with some pauses in the appropriate
>> ASPM code may give the necessary time for the B->A translation).
> A brief look at pci_enable_device() and friends suggests that
> additional work happens in the enable step when ASPM is enabled which
> helps mask the bridge enable race.
>> Skiboot mailing list
>> Skiboot at lists.ozlabs.org