[PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
Benjamin Herrenschmidt
benh at kernel.crashing.org
Wed Jan 9 16:09:02 AEDT 2019
On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
>
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?
No, it's 100% reproducible on systems with that specific card model, not
just one card instance, and possibly on other systems/cards as well. I'll
let David & Alexey comment further on that.
> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?
Nope.
> Every time or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.
We are IBM :-) So far, it seems that the card is doing something
not quite right, but we don't know what. We might need to engage
Mellanox themselves.
Cheers,
Ben.