[Linuxppc-users] Fedora 28-1.1 taking 30 seconds to discover/enable PCIe adapter after link disable/enable

Steve Pittman skywalker at alum.mit.edu
Tue Jun 5 15:24:44 AEST 2018


Mike,

From one-on-one conversations with you, I know that Broadcom is testing prototype adapters and that an important test is to disable and re-enable the link between the host and the adapter under test thousands of times to make sure negotiations are always successful.  On x86 platforms, the adapter always comes up at the maximum width and rate, and it is possible to do about 10 of these cycles in a second.

When I explained to a colleague what you are attempting to do, the colleague wrote:

You should just be able to set the Retrain Link bit (bit 5) in the PCIe Link Control register in the PCIe config space.  In our PHB this is at offset 0x58, but it will be at a different offset depending on where the PCI Express capability is located in the Broadcom device.

To hit this in our PHB you can just do:
  setpci -s <PCI BDFN> 0x58.w=0x20

You should then be able to check the width and speed with lspci

Doing it here on one of my Romulus systems, I see this:

root@ozrom1:~# setpci -s 0001:00:00.0 0x58.w=0x20
root@ozrom1:~# lspci -s 0001:00:00.0 -vv | grep LnkSta
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
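
To compare speed and width across many retrain cycles, the LnkSta line can be reduced to two fields. This is only a minimal sketch, assuming the lspci output format shown above; link_state is a hypothetical helper name, not part of pciutils.

```shell
# Hypothetical helper: read `lspci -vv` output on stdin and print "<speed> <width>"
# taken from the LnkSta line (the LnkSta2 line does not match and is ignored).
link_state() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*Width \(x[0-9]*\).*/\1 \2/p'
}

# Sample input copied from the output above; on real hardware, pipe in
# `lspci -s 0001:00:00.0 -vv` instead.
printf '\t\tLnkSta:\tSpeed 8GT/s, Width x8, TrErr- Train- SlotClk- DLActive+\n' | link_state
```

For the sample line this prints "8GT/s x8", which is easy to compare against the expected value in a loop.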

You can also set bit 4 (Link Disable) and the link will go down, but then you enter EEH recovery, which I don't think you want.
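
As an aside, writing a bare value such as 0x20 or 0018 replaces the whole 16-bit register. setpci also accepts a value:mask form that changes only the bits selected by the mask. The sketch below models that update in shell arithmetic so the resulting register value can be checked before touching hardware. apply_mask is a hypothetical helper name, and the starting values 0x0008 and 0x0018 are taken from Mike's steps quoted further down, not from the Romulus system.

```shell
# Hypothetical helper: model what a masked config write does to a 16-bit register.
# usage: apply_mask CURRENT VALUE MASK   (hex or decimal, e.g. 0x0008)
# Only the bits selected by MASK are taken from VALUE; the rest keep CURRENT.
apply_mask() {
    printf '%04x\n' $(( ($1 & ~$3) | ($2 & $3) ))
}

apply_mask 0x0008 0x20 0x20   # set bit 5 (Retrain Link), keep the other bits
apply_mask 0x0018 0x00 0x10   # clear bit 4 (Link Disable), keep the other bits
```

The two calls print 0028 and 0008. On the PHB described above, the equivalent masked write for a retrain would be:
  setpci -s <PCI BDFN> 0x58.w=0x20:0x20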

Another option to explore is to hack skiboot to retry link training at boot.  This has been done in the past, but it is a bit trickier because it requires changing the firmware.

I hope that helps.

Steve

On Wed, 2018-05-30 at 12:13 -0600, Mike Bieker wrote:
> Hi Brian,
>
>
>
> Thanks for looking at this!  I will check dmesg for the same.
>
>
>
> Mike
>
>
>
> > *From:* Brian King [mailto:brking at linux.ibm.com]
> > *Sent:* Wednesday, May 30, 2018 12:06 PM
> > *To:* Mike Bieker; linuxppc-users at lists.ozlabs.org
> > *Subject:* Re: [Linuxppc-users] Fedora 28-1.1 taking 30 seconds to
> > discover/enable PCIe adapter after link disable/enable
> > 
> > 
> > 
> > Hi Mike,
> > 
> > When I try that on a Power 9 system of mine, the act of doing the link
> > disable results in the PHB going into EEH
> > state, which is essentially the PHB going into a frozen state due to an
> > unexpected error of some sort. Lots of things
> > can cause this - bad DMA address, PCIe link errors, etc. In this case it's
> > the act of disabling the link.
> > If you check dmesg, my guess is that you will see errors related to EEH.
> > The kernel will then attempt to
> > recover from this state. In fact, what I see on my system is that I don't even
> > need to clear the link disable state,
> > as the act of going through EEH recovery in the kernel ends up clearing it.
> > 
> > Thanks,
> > 
> > Brian
> >
> > On 05/30/2018 11:00 AM, Mike Bieker wrote:
> >
> > On an x86 system, discovering/enabling a PCIe adapter after PCIe link
> > disable/enable takes less than a second.  However, on Power Systems it
> > takes 30 seconds or more.
> >
> >
> >
> > Here is the process we are using to test:
> >
> > 1)        Boot system and verify that link is up between IBM Root Port and
> > our Atlas PCIe Gen4x16 switch with no errors - 'lspci -s 034:01:00.0 -vvv'
> >
> > 2)        Set Link Disable bit (Bit 4) in PCIe Link Control register of
> > Root Port - 'setpci -s 034:00:00.0 58.w=0018'.
> >
> > 3)        Verify that link is disabled between Root and Atlas - 'setpci -s
> > 034:00:00.0 58.w' should show that link disable bit is set.  Can also
> > execute 'lspci' and see that link is down between Root Port and Atlas.
> >
> > 4)        Clear Link Disable bit in PCIe Link Control register of Root Port
> > - 'setpci -s 034:00:00.0 58.w=0008'
> >
> > 5)        Wait 5 seconds - 'sleep 5'
> >
> > 6)        Check that the link between Root Port and Atlas is enabled and
> > at proper rate and width (Gen4x16) - 'lspci -s 034:01:00.0 -vvv'.  This is
> > where the error occurs because link is not up.  If I keep trying lspci, after
> > 30 to 60 seconds the port returns valid data.  Why does Fedora on Power
> > Systems take so long to link up and discover the adapter after link
> > disable/enable?
> >
> >
> >
> > Thanks,
> >
> > Mike
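
Steps 5 and 6 in the original procedure use a fixed 'sleep 5' followed by a single check. To measure how long recovery actually takes, the check can be polled instead. The following is only a sketch: wait_until and link_is_up are hypothetical helper names, and treating DLActive+ in the Root Port's LnkSta line as "link is up" is an assumption that may not hold for every port type.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds or TIMEOUT_SECONDS elapse.
# Prints the number of seconds waited; returns 1 on timeout.
wait_until() {
    timeout=$1
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "$waited"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$waited"
}

# Hypothetical check (assumption): the link counts as up once lspci
# reports DLActive+ for the given device.
link_is_up() {
    lspci -s "$1" -vv 2>/dev/null | grep -q 'DLActive+'
}
```

For example, running 'wait_until 90 link_is_up 034:00:00.0' right after clearing Link Disable would print how many whole seconds passed before the link came back. The one-second polling interval is far too coarse for the x86 case, where a full cycle completes in about a tenth of a second.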

