[Skiboot] PowerNV PCIe hotplug support?

Timothy Pearson tpearson at raptorengineering.com
Thu May 16 02:10:12 AEST 2024



----- Original Message -----
> From: "Mahesh J Salgaonkar" <mahesh at linux.ibm.com>
> To: "Dan Horák" <dan at danny.cz>
> Cc: "skiboot" <skiboot at lists.ozlabs.org>
> Sent: Tuesday, May 14, 2024 11:40:39 PM
> Subject: Re: [Skiboot] PowerNV PCIe hotplug support?

> On 2024-05-14 18:09:17 Tue, Dan Horák wrote:
>> On Thu, 28 Dec 2023 13:31:53 -0600 (CST)
>> Timothy Pearson <tpearson at raptorengineering.com> wrote:
>> 
>> > Resend to skiboot since I think skiboot is actually responsible for trying to
>> > bring the links back up / providing new DT information to Linux:
>> > 
>> > I've been evaluating some new options for our POWER9-based hardware in the NVMe
>> > space, and would like some clarification on the current status of PCIe hotplug
>> > for the PowerNV platforms.
>> > 
>> > From what I understand, the pnv_php driver provides the basic hotplug
>> > functionality on PowerNV.  What I'm not clear on is to what extent this is
>> > intended to flow downstream to attached PCIe switches.
>> > 
>> > I have a test setup here that consists of a PMC 8533 switch and several
>> > downstream NVMe drives, with the switch attached directly to the PHB4 root
>> > port.  After loading the pnv_php module, I can disconnect the downstream NVMe
>> > devices either by using echo 0 on /sys/bus/pci/slots/Snnnnnnn/power or by
>> > doing a physical surprise unplug; however, nothing I try can induce a newly
>> > plugged device to train and be detected on the bus.  Even trying an echo 0 and
>> > subsequent echo 1 to /sys/bus/pci/slots/Snnnnnnn/power only results in the
>> > device going offline; there seems to be no way to bring the device back online
>> > short of a reboot.
>> > 
>> > Hotplug of other devices connected directly to the PHB4 appears to work properly
>> > (I can online and offline via the power node); the issue seems to be restricted
>> > to downstream devices connected to the (theoretically hotplug capable) PMC 8533
>> > switch.
>> > 
>> > Is this the intended behavior for downstream (non-IBM) PCIe ports?  Raptor can
>> > provide resources to assist in a fix if needed, but I would like to understand
>> > if this is a bug or an unimplemented feature first, and if the latter what the
>> > main issues are likely to be in implementation.
>> 
>> isn't
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2024-May/271751.html
>> fixing the issue mentioned here?
> 
> Yes, that series is targeted to fix the issue that Timothy reported.
> 
> Timothy, did you get a chance to try the fix?
> 
> Thanks,
> -Mahesh.

I had another kernel engineer here at Raptor try the fix, since I'm on another project at the moment.  He reported that the issue is partially but not fully fixed, and he is currently discussing it directly with Krishna Kumar:

> I've tested your series and can confirm that it resolves the crash and
> failure to re-attach that is encountered when hotplugging the bridge's
> slot itself. Re-attaching now works and all devices downstream of the
> bridge are correctly re-attached as well.
> 
> However, on my test setup with a Samsung NVMe SSD behind a PMC 8533
> switch, offlining and then re-attaching the downstream device's slot
> itself still doesn't work.
> 
> # lspci -tv
> +-[0000:00]---00.0-[01-26]----00.0-[02-26]--+-00.0-[03-07]--
> |                                           +-01.0-[08-0c]--
> |                                           +-02.0-[0d]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> |                                           +-03.0-[0e-12]--
> |                                           +-04.0-[13-17]--
> |                                           +-05.0-[18-1c]--
> |                                           +-06.0-[1d-21]--
> |                                           \-07.0-[22-26]--
> 
> # cat /sys/bus/pci/slots/S00000d/address    # Downstream NVMe device
> 0000:0d:00
> # echo 0 > /sys/bus/pci/slots/S00000d/power # This works fine
> # cat /sys/bus/pci/slots/S00000d/power
> 0
> # echo 1 > /sys/bus/pci/slots/S00000d/power # This fails
> # cat /sys/bus/pci/slots/S00000d/power
> 0
> 
> As you can see, the device still remains offline even after attempting
> to online it.
> 
> Strangely enough, hotplugging the parent bridge's slot itself still
> works and manages to bring the NVMe back online.
> 
> # cat /sys/bus/pci/slots/S00000d/power # NVMe still stuck offline
> 0
> # cat /sys/bus/pci/slots/SLOT1\ PCIE\ 4.0\ X16/address # bridge's slot
> 0000:01:00
> # echo 0 > /sys/bus/pci/slots/SLOT1\ PCIE\ 4.0\ X16/power
> # echo 1 > /sys/bus/pci/slots/SLOT1\ PCIE\ 4.0\ X16/power
> # cat /sys/bus/pci/slots/S00000d/power # NVMe is now back online
> 1
> 
> I will continue investigating this, but if you have any ideas why this
> might be happening, I'd be curious to hear your thoughts.
> 
> Thanks,
> Shawn
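
For anyone who wants to repeat the off/on sequence quoted above on their own hardware, here is a minimal sketch of it as a shell function.  It is not part of the patch series or of pnv_php.  The function name cycle_slot and the SLOTS_DIR and SETTLE_SECS variables are hypothetical names chosen for illustration, and the one-second settle delay is an assumption rather than a measured value.  It only wraps the same echo 0 / echo 1 / cat steps on the slot's power node and reports what the node reads back.

```shell
#!/bin/sh
# Hypothetical helper, not from the patch series: power-cycle one PCI
# hotplug slot through sysfs and report whether it reads back as online.
# SLOTS_DIR can be overridden, e.g. to point at a scratch directory.
SLOTS_DIR="${SLOTS_DIR:-/sys/bus/pci/slots}"

cycle_slot() {
    slot="$1"
    power="$SLOTS_DIR/$slot/power"

    # Bail out early if the slot has no writable power node.
    if [ ! -w "$power" ]; then
        echo "no writable power node for slot '$slot'" >&2
        return 2
    fi

    # Offline the slot, wait briefly, then try to bring it back.
    echo 0 > "$power" || true
    sleep "${SETTLE_SECS:-1}"     # settle time is an assumed value
    echo 1 > "$power" || true
    sleep "${SETTLE_SECS:-1}"

    # Trust the value read back, not the exit status of the write.
    state=$(cat "$power")
    if [ "$state" = "1" ]; then
        echo "$slot: online"
        return 0
    fi
    echo "$slot: still offline (power=$state)"
    return 1
}
```

Run as root, something like "cycle_slot S00000d" would exercise the downstream slot, and quoting the argument ("SLOT1 PCIE 4.0 X16") would exercise the bridge's slot.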


