[PATCH v3 0/6] powerpc/powernv/pci: Discover surprise-hotplugged PCIe devices during rescan

Sergey Miroshnichenko s.miroshnichenko at yadro.com
Sat Sep 15 06:42:38 AEST 2018


Hello Oliver,

On 9/12/18 12:49 PM, Oliver wrote:
> On Tue, Sep 11, 2018 at 9:56 PM, Sergey Miroshnichenko
> <s.miroshnichenko at yadro.com> wrote:
>> This patchset allows hotplugged PCIe devices to be enumerated during a bus
>> rescan being issued via sysfs on PowerNV platforms, when the "Presence
>> Detect Changed" interrupt is not available.
> 
> Seems to be on par with the sysfs slot power hack that pnv_php uses.
> 

Yes, ours is just for manual initiation of rescan, which helps us with
reliable detection of a bridge hotplug in our particular config.
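
For reference, a minimal sketch of what "manual initiation of rescan" looks
like from userspace. It assumes the standard global sysfs rescan attribute
/sys/bus/pci/rescan; the helper name trigger_pci_rescan is only illustrative
and is not part of this patchset. On a real system it needs root.

```python
from pathlib import Path

# Standard sysfs attribute that asks the kernel to rescan all PCI buses.
DEFAULT_RESCAN_PATH = "/sys/bus/pci/rescan"


def trigger_pci_rescan(rescan_path=DEFAULT_RESCAN_PATH):
    """Write "1" to the rescan attribute, like `echo 1 > /sys/bus/pci/rescan`.

    rescan_path is a parameter only so the helper can be exercised against
    an ordinary file; it is an assumption of this sketch.
    """
    Path(rescan_path).write_text("1\n")
```

Calling trigger_pci_rescan() with no arguments after inserting a device is
the equivalent of the shell one-liner shown in the docstring.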

>> As a first part of our work on adding support for hotplugging PCIe bridges
>> full of devices (without special requirement such as Hot-Plug Controller,
>> reservation of bus numbers and memory regions by firmware, etc.), this
>> series is intended to solve the first two of the problems listed below:
>>
>> I   PowerNV doesn't discover new hotplugged PCIe devices
>> II  EEH is falsely triggered when poking empty slots during the PCIe rescan
> 
> We avoid this problem in pnv_php by having OPAL do the rescan, and
> Linux requests an FDT fragment of everything under the slot. I don't
> think it's a great system, but it keeps firmware and the OS on the
> same page.
> 

So if we re-enumerate the PCIe topology from Linux, we must then
synchronize with the firmware? How would you recommend approaching that
for PowerNV and OPAL? Is there a list of criteria somewhere that we can
use to ensure that they are properly synced?

>> III The PCI subsystem is not prepared to runtime changes of BAR addresses
>> IV  Device drivers don't track changes of their BAR addresses
>> V   BARs of working devices don't move to make space for new ones
> 
> I'm having a really hard time figuring out what would make this
> necessary. Keep in mind that each PHB has its own set of bus numbers
> and its own MMIO space, so it's not like you're short on either.
> 
> How are you planning on making this sort of live-device-migration work? And what
> are you trying to do that makes the added complexity worth it?
> 

With the "pci=realloc" command line argument and with the
PCI_REASSIGN_ALL_BUS flag, the kernel doesn't rely on the bus numbers
and BAR addresses provided by firmware (OPAL via FDT in our case,
BIOS/UEFI/Coreboot for x86_64), but re-enumerates the PCIe topology on
its own, and it arranges BARs quite compactly.

Let's say we have two bridges plugged into neighboring ports of the
root/PHB, and each of them has a few NVMe drives inserted and several
empty slots when the system boots. Linux makes their bridge windows
adjacent, so if we plug a new NVMe drive into the first of them, there
will simply be no free space for its BARs.

Without considering memory pre-allocation, the only way we see to free
some space for new BARs is to move existing BARs of the second bridge
(in this example).
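
To make the example concrete, here is a toy model of the situation. It is
only a sketch: the base address, window sizes and function names are made
up for illustration, it ignores alignment rules and prefetchable vs.
non-prefetchable windows, and it is not how the kernel's resource
allocator is implemented.

```python
MiB = 1 << 20


def pack_windows(base, sizes):
    """Place bridge windows back to back from base; ends are exclusive."""
    windows = []
    cursor = base
    for size in sizes:
        windows.append((cursor, cursor + size))
        cursor += size
    return windows


def free_space_after(windows, index, limit):
    """Bytes between the end of window `index` and its next neighbor."""
    end = windows[index][1]
    if index + 1 < len(windows):
        next_start = windows[index + 1][0]
    else:
        next_start = limit
    return next_start - end


def grow_window(windows, index, extra, limit):
    """Grow window `index` by `extra` bytes, moving later windows up."""
    start, end = windows[index]
    grown = list(windows[:index])
    grown.append((start, end + extra))
    for s, e in windows[index + 1:]:
        grown.append((s + extra, e + extra))
    if grown[-1][1] > limit:
        raise ValueError("not enough address space under this root port")
    return grown


# Hypothetical layout: two bridges, each sized for the drives present at boot.
BASE = 0x8000_0000
LIMIT = BASE + 16 * MiB
before = pack_windows(BASE, [2 * MiB, 2 * MiB])
room = free_space_after(before, 0, LIMIT)       # no gap: windows are adjacent
after = grow_window(before, 0, 1 * MiB, LIMIT)  # second bridge's window moves
```

Here `room` is zero because the windows are packed with no gap, so the only
way to give the first bridge another 1 MiB is to shift the second bridge's
window (and therefore the BARs of its working devices) upwards.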

We've implemented a "firmware-independent" proof-of-concept (not
flawless, though, as you and Ben pointed out) and verified on
PowerNV+OPAL and x86_64 that a running NVMe drive with an ongoing "fio"
benchmark always survives BAR movement during hotplug, provided a patch
that pauses the NVMe Linux driver during rescan is applied. The only
visible effect is that bandwidth temporarily drops to 0 for a second or
two, until the NVMe drive restarts. The same goes for a network adapter:
an SSH connection just freezes for a while.

This patchset is the first part of our work, and we've just published [1]
the second part (on BAR movement and pausing the drivers) for the
community to review, discuss and validate.

[1] https://www.spinics.net/lists/linux-pci/msg76211.html

Best regards,
Serge

>> Tested on:
>>  - POWER8 PowerNV+OPAL ppc64le (our Vesnin server) w/ and w/o pci=realloc;
>>  - POWER8 IBM 8247-42L (pSeries);
>>  - POWER8 IBM 8247-42L (PowerNV+OPAL) w/ and w/o pci=realloc.
>>
>> Changes since v2:
>>  - Don't reassign bus numbers on PowerNV by default (to retain the default
>>    behavior), but only when pci=realloc is passed;
>>  - Less code affected;
>>  - pci_add_device_node_info is refactored with add_one_dev_pci_data;
>>  - Minor code cleanup.
>>
>> Changes since v1:
>>  - Fixed build for ppc64le and ppc64be when CONFIG_PCI_IOV is disabled;
>>  - Fixed build for ppc64e when CONFIG_EEH is disabled;
>>  - Fixed code style warnings.
>>
>> Sergey Miroshnichenko (6):
>>   powerpc/pci: Access PCI config space directly w/o pci_dn
>>   powerpc/pci: Create pci_dn on demand
>>   powerpc/pci: Use DT to create pci_dn for root bridges only
>>   powerpc/powernv/pci: Enable reassigning the bus numbers
>>   PCI/powerpc/eeh: Add pcibios hooks for preparing to rescan
>>   powerpc/pci: Reduce code duplication in pci_add_device_node_info
>>
>>  arch/powerpc/include/asm/eeh.h               |   2 +
>>  arch/powerpc/kernel/eeh.c                    |  12 ++
>>  arch/powerpc/kernel/pci_dn.c                 | 119 ++++++++++++++-----
>>  arch/powerpc/kernel/rtas_pci.c               |  97 ++++++++++-----
>>  arch/powerpc/platforms/powernv/eeh-powernv.c |  22 ++++
>>  arch/powerpc/platforms/powernv/pci.c         |  64 ++++++----
>>  drivers/pci/probe.c                          |  14 +++
>>  include/linux/pci.h                          |   2 +
>>  8 files changed, 253 insertions(+), 79 deletions(-)
>>
>> --
>> 2.17.1
>>

