[PATCH 1/2] PCI: Add pci_set_vpd_timeout() to set VPD access timeout

Tue Nov 22 11:34:36 AEDT 2016

Hi Bjorn,

> On Nov 21, 2016, at 4:05 PM, Bjorn Helgaas <helgaas at kernel.org> wrote:
> 
> Hi Matthew,
> 
> On Mon, Nov 21, 2016 at 03:09:49PM -0600, Matthew R. Ochs wrote:
>> The PCI core uses a fixed 50ms timeout when waiting for VPD accesses to
>> complete. When an access does not complete within this period, a warning
>> is logged and an error returned to the caller.
>> 
>> While this default timeout is valid for most hardware, some devices can
>> experience longer access delays under certain circumstances. For example,
>> one of the IBM CXL Flash devices can take up to ~120ms in a worst-case
>> scenario. These types of devices can benefit from an extended timeout that
>> is specific to their hardware constraints.
>> 
>> To support per-device VPD access timeouts, pci_set_vpd_timeout() is added
>> as an exported service. PCI devices will continue to default with the 50ms
>> timeout and use a per-device timeout when a driver calls this new service.
> 
> Can you include a pointer to something in the spec that's behind the
> default 50ms timeout, or did somebody just pull that number out of the
> air?

AFAIK the PCI spec is silent on VPD access timeouts.

The current 50ms timeout can be traced to commit 1120f8b8169f ("PCI: handle
long delays in VPD access"), where the timeout was increased to accommodate
specific hardware. Prior to that, the wait depended on whether the access was a
read or a write, with a write waiting up to a maximum of 10ms.
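
To make the behaviour under discussion concrete, here is a rough userspace
model of a polled wait with a default timeout and an optional per-device
override. It is only a sketch and not the kernel code: struct vpd_model,
vpd_model_set_timeout() and vpd_model_wait() are hypothetical names, and time
is simulated in whole milliseconds rather than measured in jiffies.

```c
#include <assert.h>
#include <errno.h>

#define VPD_DEFAULT_TIMEOUT_MS 50u	/* the fixed value used today */

/* Hypothetical device model, not a kernel structure. */
struct vpd_model {
	unsigned int ready_at_ms;	/* simulated time at which the access completes */
	unsigned int timeout_ms;	/* 0 means "use the default" */
};

/* Stand-in for a per-device override such as the proposed service. */
static void vpd_model_set_timeout(struct vpd_model *dev, unsigned int ms)
{
	assert(dev);
	dev->timeout_ms = ms;
}

/* Poll once per simulated millisecond: 0 on completion, -ETIMEDOUT otherwise. */
static int vpd_model_wait(const struct vpd_model *dev)
{
	unsigned int limit = dev->timeout_ms ? dev->timeout_ms
					     : VPD_DEFAULT_TIMEOUT_MS;
	unsigned int now;

	for (now = 0; now <= limit; now++) {
		if (now >= dev->ready_at_ms)
			return 0;
	}
	return -ETIMEDOUT;
}
```

In this model a device that needs 120ms fails the wait with the 50ms default
and succeeds once its limit is raised to, say, 125ms, while a device that
completes in 10ms is unaffected either way.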

> I'm wondering how we know 50ms or 120ms or 250ms or whatever is the
> right number.  What bad things would happen if we just increased the
> timeout from 50 to 125ms for *all* devices?

You're asking the right questions. The timeout chosen for the CXL flash device
was derived through instrumentation, and this was only after witnessing VPD
timeout messages in the kernel log at random times.

I originally thought about proposing a blanket increase, but figured the scope
might be too broad. There are two downsides I see with simply replacing 50ms
with a larger value:

 - Raising the timeout bar [potentially] raises the total time it takes to complete
a VPD access. One would hope that scenarios where every access times out
are very rare and that the max limits are only rarely encountered.

 - It's difficult to settle on a single 'catch all' value. What might be fine for h/w
A may end up not working for h/w B (as is the case here). That said, given that
50ms has served as the value for roughly eight years, I think this point doesn't
carry much weight.
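
To put a rough number on the first point, the sketch below bounds the time
spent waiting during a VPD read when every access takes the full timeout
period. It assumes one wait per 4-byte access (the width of the VPD data
register); vpd_worst_case_ms() is a hypothetical helper for illustration and
not something in the PCI core.

```c
#include <stddef.h>

#define VPD_BYTES_PER_ACCESS 4u	/* assumption: one dword per VPD access */

/*
 * Upper bound, in milliseconds, on the time spent waiting while reading
 * `len` bytes of VPD when each access takes up to `timeout_ms`.
 */
static unsigned long vpd_worst_case_ms(size_t len, unsigned int timeout_ms)
{
	/* Round up: a partial dword still costs one full access. */
	size_t accesses = (len + VPD_BYTES_PER_ACCESS - 1) / VPD_BYTES_PER_ACCESS;

	return (unsigned long)accesses * timeout_ms;
}
```

Under that assumption, a 256-byte read is bounded by 64 * 50ms = 3.2s today
and by 64 * 125ms = 8s with a 125ms limit. The bound only matters when nearly
every access is slow, which should be the rare case.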

> 
> I don't really want to end up with a bunch of device-specific quirks
> here.  If we have a quirk to work around one defective device, that's
> one thing.  If the spec allows a huge variation in VPD access time,
> that might be something we want to handle 

I agree 100% and would be more than happy to submit a patch that
simply increases the value.

-matt