[Linuxppc-users] Fedora 28-1.1 taking 30 seconds to discover/enable PCIe adapter after link disable/enable

Mike Bieker mike.bieker at broadcom.com
Wed Jun 6 01:17:09 AEST 2018


I've also tested retrains on the system and those worked fine.  For our HW
validation we run thousands of iterations of each of the following:
- Hot Resets (Secondary Bus Resets)
- Link Retrains
- Speed Changes
- Power Management L1
- Link Disable/Enable

I believe both the Link Disable/Enable and the Hot Reset tests on the PPC
system failed in the same way: it took 30 to 60 seconds for the link to
come back up after the reset/disable.  We really need to be able to run
these tests to validate our HW and SerDes, so if there is some hack that
avoids the delay, that would be great.

Thanks!
Mike

-----Original Message-----
From: Benjamin Herrenschmidt [mailto:benh at au1.ibm.com]
Sent: Tuesday, June 05, 2018 6:09 AM
To: Michael Neuling; Steve Pittman; Mike Bieker
Cc: Linuxppc-users; Brian King; linuxppc-users at lists.ozlabs.org
Subject: Re: [Linuxppc-users] Fedora 28-1.1 taking 30 seconds to
discover/enable PCIe adapter after link disable/enable

On Tue, 2018-06-05 at 17:07 +1000, Michael Neuling wrote:
> On Tue, 2018-06-05 at 16:12 +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-06-05 at 01:24 -0400, Steve Pittman wrote:
> > > Mike,
> > >
> > > From one-on-one conversations with you, I know that Broadcom is
> > > testing prototype adapters and that an important test is to disable
> > > and reenable the link between the host and the adapter under test
> > > thousands of times to make sure negotiations are always successful.
> > > On X86 platforms, the adapter always comes up at the maximum width
> > > and rate and it is possible to do about 10 of these cycles in a
> > > second.
> >
> > Right, the main problem is that if you bring the link down, the EEH
> > error handling will kick in.
> >
> > For "lab" work we could try to provide firmware settings to disable
> > EEH on link down I suppose in a FW update. Mikey what do you reckon ?
>
> We could probably do a putscom hack from userspace to set up the FIR
> actions to the right values.

I think those are MMIO-based registers.

> Mikey
>
> >
> >
> > >  When I explained to a colleague of mine what you are attempting to
> > > do, the colleague wrote:
> > >
> > > You should just be able to hit the link retraining bit (bit 5) in the
> > > PCIe Link Control / Status Register in the PCIe config space.  In our
> > > PHB this is at offset 0x58 but it'll be at a different offset
> > > depending on where the PCI Express capability structure sits in the
> > > Broadcom device.
> > >
> > > To hit this in our PHB you can just do:
> > >   setpci -s <PCI BDFN> 0x58.w=0x20
> > >
> > > You should then be able to check the width and speed with lspci
> > >
> > > Doing it here on one of my Romulus I see this.
> > >
> > > root@ozrom1:~# setpci -s 0001:00:00.0 0x58.w=0x20
> > > root@ozrom1:~# lspci -s 0001:00:00.0 -vv | grep LnkSta
> > >         LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
> > >         LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
> > >
> > > You can also set bit 4 and the link will go down, but then you enter
> > > EEH recovery which I don't think you want.
> > >
> > > The other option which could be explored is to hack skiboot to retry
> > > the training on boot.  This has been done in the past but this is a
> > > bit trickier since one must change the firmware.
> > >
> > > I hope that helps?
> > >
> > > Steve
> > >
> > > On Wed, 2018-05-30 at 12:13 -0600, Mike Bieker wrote:
> > > > Hi Brian,
> > > >
> > > >
> > > >
> > > > Thanks for looking at this!  I will check dmesg for the same.
> > > >
> > > >
> > > >
> > > > Mike
> > > >
> > > >
> > > >
> > > > > *From:* Brian King [mailto:brking at linux.ibm.com]
> > > > > *Sent:* Wednesday, May 30, 2018 12:06 PM
> > > > > *To:* Mike Bieker; linuxppc-users at lists.ozlabs.org
> > > > > > *Subject:* Re: [Linuxppc-users] Fedora 28-1.1 taking 30 seconds
> > > > > > to discover/enable PCIe adapter after link disable/enable
> > > > >
> > > > >
> > > > >
> > > > > Hi Mike,
> > > > >
> > > > > > When I try that on a Power 9 system of mine, the act of doing the
> > > > > > link disable results in the PHB going into EEH state, which is
> > > > > > essentially the PHB going into a frozen state due to an unexpected
> > > > > > error of some sort. Lots of things can cause this: bad DMA
> > > > > > addresses, PCIe link errors, etc. In this case it's the act of
> > > > > > disabling the link.
> > > > > > If you check dmesg, my guess is that you will see errors related
> > > > > > to EEH. The kernel will then attempt to recover from this state.
> > > > > > In fact, what I see on my system is that I don't even need to
> > > > > > clear the link disable state, as the act of going through EEH
> > > > > > recovery in the kernel ends up clearing it.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Brian
> > > > >
> > > > > On 05/30/2018 11:00 AM, Mike Bieker wrote:
> > > > >
> > > > > > On an x86 system, discovering/enabling a PCIe adapter after a PCIe
> > > > > > link disable/enable takes less than a second.  However, on Power
> > > > > > Systems it takes 30 seconds or more.
> > > > >
> > > > >
> > > > >
> > > > > Here is the process we are using to test:
> > > > >
> > > > > > 1) Boot the system and verify that the link is up between the IBM
> > > > > >    Root Port and our Atlas PCIe Gen4x16 switch with no errors:
> > > > > >    'lspci -s 034:01:00.0 -vvv'
> > > > > >
> > > > > > 2) Set the Link Disable bit (bit 4) in the PCIe Link Control
> > > > > >    register of the Root Port: 'setpci -s 034:00:00.0 58.w=0018'
> > > > > >
> > > > > > 3) Verify that the link is disabled between Root Port and Atlas:
> > > > > >    'setpci -s 034:00:00.0 58.w' should show that the Link Disable
> > > > > >    bit is set.  Can also execute 'lspci' and see that the link is
> > > > > >    down between Root Port and Atlas.
> > > > > >
> > > > > > 4) Clear the Link Disable bit in the PCIe Link Control register of
> > > > > >    the Root Port: 'setpci -s 034:00:00.0 58.w=0008'
> > > > > >
> > > > > > 5) Wait 5 seconds: 'sleep 5'
> > > > > >
> > > > > > 6) Check that the link between Root Port and Atlas is enabled and
> > > > > >    at the proper rate and width (Gen4x16):
> > > > > >    'lspci -s 034:01:00.0 -vvv'
> > > > > >    This is where the error occurs, because the link is not up.  If
> > > > > >    I keep trying lspci, the port returns valid data after 30 to 60
> > > > > >    seconds.
> > > > > >
> > > > > > Why does Fedora on Power Systems take so long to link up and
> > > > > > discover the adapter after link disable/enable?
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Mike
> > >
> >
> > _______________________________________________
> > Linuxppc-users mailing list
> > Linuxppc-users at lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-users
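
For anyone following the setpci values quoted in this thread, here is a minimal Python sketch (not from any of the posters) that decodes a 16-bit Link Control value into the flag bits it sets. Bits 4 (Link Disable) and 5 (Retrain Link) are the ones named in the thread; the other bit names are taken from the PCI Express capability layout as I understand it. The names `LINK_CONTROL_BITS`, `decode_link_control` and `with_bit` are made up for illustration.

```python
# Flag bits in the PCIe Link Control register (PCI Express capability + 0x10).
# Bits 4 and 5 are the ones discussed in the thread; the rest are listed so
# that values such as 0x0018 decode completely.
LINK_CONTROL_BITS = {
    3: "Read Completion Boundary",
    4: "Link Disable",
    5: "Retrain Link",
    6: "Common Clock Configuration",
    7: "Extended Synch",
}


def decode_link_control(value):
    """Return the names of the flag bits set in a Link Control value."""
    return [name for bit, name in sorted(LINK_CONTROL_BITS.items())
            if value & (1 << bit)]


def with_bit(value, bit, enabled):
    """Return value with the given bit set (enabled=True) or cleared."""
    mask = 1 << bit
    return (value | mask) if enabled else (value & ~mask)


if __name__ == "__main__":
    # Values that appear in the setpci commands in this thread.
    for raw in (0x0020, 0x0018, 0x0008):
        print(f"{raw:#06x}: {decode_link_control(raw)}")
```

This shows that 0018 and 0008 in the test procedure differ only in bit 4, and that writing a bare 0x20 sets the retrain bit while writing zeros to every other bit in the register. A read-modify-write (or setpci's value:mask form) would leave the other bits alone.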

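To put a number on the "30 to 60 seconds" reported above, one could poll lspci and time how long the link takes to report the expected speed and width again. The sketch below is only an illustration under assumptions: it parses the LnkSta line format shown earlier in the thread, and the helper names `parse_lnksta` and `wait_for_link` are hypothetical. The function that actually runs lspci is passed in as a callable, so the parsing and timing logic can be tried without hardware.

```python
import re
import time

# Matches e.g. "LnkSta: Speed 8GT/s, Width x8, ..." and tolerates a
# trailing annotation such as "(ok)" after the speed.
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+GT/s)[^,]*,\s*Width\s+x(\d+)")


def parse_lnksta(lspci_text):
    """Return (speed, width) from the first LnkSta line, or None if absent."""
    match = LNKSTA_RE.search(lspci_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def wait_for_link(read_lspci, expected, timeout_s=60.0, interval_s=0.5):
    """Poll read_lspci() until the link reports `expected`.

    Returns the elapsed time in seconds, or None if timeout_s passes first.
    """
    start = time.monotonic()
    while True:
        if parse_lnksta(read_lspci()) == expected:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)
```

On a real system, `read_lspci` could wrap something like `subprocess.run(["lspci", "-s", bdf, "-vv"], capture_output=True, text=True).stdout`, with `bdf` set to the device under test and `expected` set to the speed and width that device should train to.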
