[Linuxppc-users] Fedora 28-1.1 taking 30 seconds to discover/enable PCIe adapter after link disable/enable

Michael Neuling mikey at neuling.org
Tue Jun 5 17:07:48 AEST 2018


On Tue, 2018-06-05 at 16:12 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2018-06-05 at 01:24 -0400, Steve Pittman wrote:
> > Mike,
> > 
> > From one-on-one conversations with you, I know that Broadcom is
> > testing prototype adapters and that an important test is to disable
> > and reenable the link between the host and the adapter under test
> > thousands of times to make sure negotiations are always successful. 
> > On X86 platforms, the adapter always comes up at the maximum width
> > and rate and it is possible to do about 10 of these cycles in a
> > second.
> 
> Right, the main problem is that if you bring the link down, the EEH
> error handling will kick in.
> 
> For "lab" work we could try to provide firmware settings to disable
> EEH on link down, I suppose, in a FW update. Mikey, what do you reckon?

We could probably do a putscom hack from userspace to set up the FIR actions
with the right values.

Mikey

> 
> 
> >  When I explained to a colleague of mine what you are attempting to
> > do, the colleague wrote:
> > 
> > You should just be able to hit the link retraining bit (bit 5) in the
> > PCIe Link Control / Status Register in the PCIe config space.  In our
> > PHB this is at offset 0x58 but it'll be at a different offset
> > depending on where the extended capabilities are in the broadcom
> > device.
> > 
> > To hit this in our PHB you can just do:
> >   setpci -s <PCI BDFN> 0x58.w=0x20
> > 
> > You should then be able to check the width and speed with lspci
> > 
> > Doing it here on one of my Romulus I see this.
> > 
> > root@ozrom1:~# setpci -s 0001:00:00.0 0x58.w=0x20
> > root@ozrom1:~# lspci -s 0001:00:00.0 -vv | grep LnkSta
> > 		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
> > 		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
> > 
> > You can also set bit 4 and the link will go down, but then you enter
> > EEH recovery which I don't think you want.
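
For scripted runs it can be handy to read the Link Status word directly
rather than grepping lspci. Below is a minimal sketch of decoding it in
POSIX shell. The bit layout (speed in bits 3:0, width in bits 9:4, Link
Training in bit 11, DL Link Active in bit 13) is the standard PCIe Link
Status layout. The function name decode_lnksta is made up for this example,
and the sample value 0xe083 was constructed by hand to match the lspci
output above, not read from hardware.

```shell
#!/bin/sh
# decode_lnksta VALUE: decode a 16-bit PCIe Link Status word.
# VALUE must carry a 0x prefix (setpci prints bare hex, so add it yourself).
decode_lnksta() {
    lnksta=$(( $1 ))
    speed_code=$(( lnksta & 0xF ))          # bits 3:0  Current Link Speed
    width=$(( (lnksta >> 4) & 0x3F ))       # bits 9:4  Negotiated Link Width
    training=$(( (lnksta >> 11) & 1 ))      # bit 11    Link Training
    dl_active=$(( (lnksta >> 13) & 1 ))     # bit 13    Data Link Layer Link Active
    case "$speed_code" in
        1) speed="2.5GT/s" ;;
        2) speed="5GT/s" ;;
        3) speed="8GT/s" ;;
        4) speed="16GT/s" ;;
        *) speed="unknown($speed_code)" ;;
    esac
    echo "speed=$speed width=x$width training=$training dl_active=$dl_active"
}

# Hand-built sample matching "Speed 8GT/s, Width x8, Train- DLActive+".
decode_lnksta 0xe083
```

On the PHB, Link Status should sit two bytes after Link Control, so
something like decode_lnksta "0x$(setpci -s 0001:00:00.0 0x5a.w)" ought to
work there. The 0x5a offset is inferred from the 0x58 above and is not
something I have checked; it will differ on other devices.
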
> > 
> > The other option which could be explored is to hack skiboot to retry
> > the training on boot.  This has been done in the past but this is a
> > bit trickier since one must change the firmware.
> > 
> > I hope that helps?
> > 
> > Steve
> > 
> > On Wed, 2018-05-30 at 12:13 -0600, Mike Bieker wrote:
> > > Hi Brian,
> > > 
> > > Thanks for looking at this!  I will check dmesg for the same.
> > > 
> > > Mike
> > > 
> > > > *From:* Brian King [mailto:brking at linux.ibm.com]
> > > > *Sent:* Wednesday, May 30, 2018 12:06 PM
> > > > *To:* Mike Bieker; linuxppc-users at lists.ozlabs.org
> > > > *Subject:* Re: [Linuxppc-users] Fedora 28-1.1 taking 30 seconds to
> > > > discover/enable PCIe adapter after link disable/enable
> > > > 
> > > > Hi Mike,
> > > > 
> > > > When I try that on a Power 9 system of mine, the act of doing the link
> > > > disable results in the PHB going into EEH state, which is essentially
> > > > the PHB going into a frozen state due to an unexpected error of some
> > > > sort. Lots of things can cause this - bad DMA address, PCIe link
> > > > errors, etc. In this case it's the act of disabling the link.
> > > > If you check dmesg, my guess is that you will see errors related to
> > > > EEH. The kernel will then attempt to recover from this state. In fact,
> > > > what I see on my system is that I don't even need to clear the link
> > > > disable state, as the act of going through EEH recovery in the kernel
> > > > ends up clearing it.
> > > > 
> > > > Thanks,
> > > > 
> > > > Brian
> > > > 
> > > > On 05/30/2018 11:00 AM, Mike Bieker wrote:
> > > > 
> > > > On an x86 system, discovering/enabling a PCIe adapter after PCIe link
> > > > disable/enable takes less than a second.  However, on Power Systems it
> > > > takes 30 seconds or more.
> > > > 
> > > > Here is the process we are using to test:
> > > > 
> > > > 1) Boot the system and verify that the link is up between the IBM Root
> > > >    Port and our Atlas PCIe Gen4x16 switch with no errors:
> > > >    'lspci -s 034:01:00.0 -vvv'
> > > > 
> > > > 2) Set the Link Disable bit (bit 4) in the PCIe Link Control register
> > > >    of the Root Port: 'setpci -s 034:00:00.0 58.w=0018'
> > > > 
> > > > 3) Verify that the link is disabled between the Root Port and Atlas:
> > > >    'setpci -s 034:00:00.0 58.w' should show that the Link Disable bit
> > > >    is set.  Can also execute 'lspci' and see that the link is down
> > > >    between the Root Port and Atlas.
> > > > 
> > > > 4) Clear the Link Disable bit in the PCIe Link Control register of the
> > > >    Root Port: 'setpci -s 034:00:00.0 58.w=0008'
> > > > 
> > > > 5) Wait 5 seconds: 'sleep 5'
> > > > 
> > > > 6) Check that the link between the Root Port and Atlas is enabled and
> > > >    at the proper rate and width (Gen4x16):
> > > >    'lspci -s 034:01:00.0 -vvv'.  This is where the error occurs,
> > > >    because the link is not up.  If I keep trying lspci, after 30 to 60
> > > >    seconds the port returns valid data.  Why does Fedora on Power
> > > >    Systems take so long to link up and discover the adapter after
> > > >    link disable/enable?
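
If it helps to put a number on the delay, steps 2 to 6 above can be sketched
as a small POSIX shell script that polls instead of sleeping a fixed 5
seconds. This is only a rough sketch under a few assumptions: the helper
names (poll_until, link_is_up, cycle_link) and the 90 second limit are made
up for the example, and it checks DLActive+ on the Root Port rather than
reading the downstream device as step 6 does, which only works if the port
reports Data Link Layer Link Active. On Power the time it reports will
mostly be EEH recovery time, per Brian's explanation.

```shell
#!/bin/sh
# poll_until LIMIT CMD...: run CMD about once per second until it succeeds.
# Prints the number of failed attempts so far (roughly seconds waited);
# returns 1 if CMD has still not succeeded after LIMIT attempts.
poll_until() {
    limit=$1; shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        elapsed=$(( elapsed + 1 ))
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
    done
    echo "$elapsed"
}

# link_is_up BDF: true if lspci shows DLActive+ in LnkSta for that port.
link_is_up() {
    lspci -s "$1" -vv 2>/dev/null | grep -q 'DLActive+'
}

# cycle_link ROOT_BDF: one disable/enable cycle using the register values
# from steps 2 and 4, then wait for the link to come back. Needs root.
cycle_link() {
    root=$1
    setpci -s "$root" 58.w=0018 || return 1   # step 2: set Link Disable (bit 4)
    setpci -s "$root" 58.w=0008 || return 1   # step 4: clear Link Disable
    poll_until 90 link_is_up "$root"
}
```

Running cycle_link 034:00:00.0 in a loop and logging the printed value would
then give a per-cycle recovery time to compare against x86.
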
> > > > 
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > Mike
> > 
> > _______________________________________________
> > Linuxppc-users mailing list
> > Linuxppc-users at lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-users
> > 
> 
> _______________________________________________
> Linuxppc-users mailing list
> Linuxppc-users at lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-users

