Proposal: configurable per-sensor error behavior in phosphor-hwmon
msbarth at linux.ibm.com
Thu Jul 18 00:36:38 AEST 2019
On 7/16/19 4:19 PM, Patrick Venture wrote:
> On Mon, Jul 15, 2019 at 9:10 AM Matt Spinler <mspinler at linux.ibm.com> wrote:
>> On 7/15/2019 9:45 AM, Matthew Barth wrote:
>>> This is a great proposal, just a few concerns/notes.
>>> On 7/12/19 5:27 PM, Kun Yi wrote:
>>>> Hi there,
>>>> Current phosphor-hwmon code is filled with preprocessor macros to
>>>> branch error condition for sysfs reads, and it seems to me that
>>>> adding a per-sensor configuration would solve two issues at least:
>>>> 1. code can be greatly simplified
>>>> 2. we can code more flexible sensor reading behavior
>>>> Why 2) is needed: with many types of sensors that BMC controls,
>>>> having an one-size-fits-all policy will always have cases that it
>>>> can't handle. Each flaky sensor is flaky in its own way.
>>>> Rough proposal on how this will work:
>>>> add properties to each sensor group's configuration file:
>>>> "error behavior": can be one of
>>>> - always keep
>>>> - remove from D-Bus on error
>>> There is a REMOVERCS device config file option that can be configured
>>> to remove an individual sensor or any sensor of the device when a
>>> given set of return codes occur when attempting to read the sensor.
>>>> "error condition": can be combination of
>>>> - certain sysfs return codes
>>> REMOVERCS combines this error condition to the behavior of removing
>>> the sensor from Dbus. I'd be interested in how these types of
>>> bahavior-to-conditions will be mapped within the device's config file.
>>>> - timeout
>>> In the case of phosphor-hwmon, isnt a timeout condition similar to
>>> error retries since a timeout condition is presented as a ETIMEDOUT
>>> return code on the sensor.
>>>> - invalid value
>>> This is another area I'd be interested to hear more on, how would one
>>> go about defining when a value would be invalid? Or is this a simple,
>>> negative values are invalid for a sensor that should always return a
>>> positive value?
>>>> "error retries": number of retries before declaring the sensor has an
>>> This would be great to have configurable per sensors, however a
>>> possible issue here would be allowing too many retries causing hwmon
>>> to take too long. So this should be capped or controlled in someway
>>> with the delay between reads as well. Right now a sensor is allowed to
>>> be retried 10x's with a 100ms delay between each attempt.
>>>> Happy to hear any feedback.
> I think the idea in general is interesting -- it may be ugly to try
> and set the values per sensor, versus a group of sensors in a config
> -- see INTERVAL, which is set for all sensors in a file. I realize
> that some things are sensor specific... so maybe a mixture? I'd just
> hate to see us need to specify the error handling for every sensor if
> only one is different, and the others aren't default. Although, maybe
> that can be done nicely -- have some default you can set in a config
> once and then override per sensor...
FWIW, REMOVERCS is partially implemented to how you describe. A
REMOVERCS entry can be added to a devices config either per sensor or
for all sensors provided by the device. The return codes defined per
sensor are added to the list for the sensor when its created and any
other/additional return codes are added to the same sensor if defined as
a general config option on the device.
ex.) REMOVERCS_fan1 = 7,11 and REMOVERCS = 13 within the same device
config would remove fan1 when return codes 7,11or13 occur when
attempting to read it.
>> I like this idea. Hopefully the defaults can be kept the same as they
>> are today so users of
>> today's default settings wouldn't have to change their config files to
>> keep things the same.
>> Another thing we've been thinking about adding is the ability for
>> sensors to only be
>> read when the power is on, though there's still some invention
>> required as to how
>> to represent this state on D-Bus.
More information about the openbmc