Proposal: configurable per-sensor error behavior in phosphor-hwmon
venture at google.com
Wed Jul 17 07:19:58 AEST 2019
On Mon, Jul 15, 2019 at 9:10 AM Matt Spinler <mspinler at linux.ibm.com> wrote:
> On 7/15/2019 9:45 AM, Matthew Barth wrote:
> > This is a great proposal, just a few concerns/notes.
> > On 7/12/19 5:27 PM, Kun Yi wrote:
> >> Hi there,
> >> Current phosphor-hwmon code is filled with preprocessor macros to
> >> branch error condition for sysfs reads, and it seems to me that
> >> adding a per-sensor configuration would solve two issues at least:
> >> 1. code can be greatly simplified
> >> 2. we can code more flexible sensor reading behavior
> >> Why 2) is needed: with many types of sensors that BMC controls,
> >> having an one-size-fits-all policy will always have cases that it
> >> can't handle. Each flaky sensor is flaky in its own way.
> >> Rough proposal on how this will work:
> >> add properties to each sensor group's configuration file:
> >> "error behavior": can be one of
> >> - always keep
> >> - remove from D-Bus on error
> > There is a REMOVERCS device config file option that can be configured
> > to remove an individual sensor or any sensor of the device when a
> > given set of return codes occur when attempting to read the sensor.
> >> "error condition": can be combination of
> >> - certain sysfs return codes
> > REMOVERCS combines this error condition to the behavior of removing
> > the sensor from Dbus. I'd be interested in how these types of
> > bahavior-to-conditions will be mapped within the device's config file.
> >> - timeout
> > In the case of phosphor-hwmon, isnt a timeout condition similar to
> > error retries since a timeout condition is presented as a ETIMEDOUT
> > return code on the sensor.
> >> - invalid value
> > This is another area I'd be interested to hear more on, how would one
> > go about defining when a value would be invalid? Or is this a simple,
> > negative values are invalid for a sensor that should always return a
> > positive value?
> >> "error retries": number of retries before declaring the sensor has an
> >> error
> > This would be great to have configurable per sensors, however a
> > possible issue here would be allowing too many retries causing hwmon
> > to take too long. So this should be capped or controlled in someway
> > with the delay between reads as well. Right now a sensor is allowed to
> > be retried 10x's with a 100ms delay between each attempt.
> >> Happy to hear any feedback.
I think the idea in general is interesting -- it may be ugly to try
and set the values per sensor, versus a group of sensors in a config
-- see INTERVAL, which is set for all sensors in a file. I realize
that some things are sensor specific... so maybe a mixture? I'd just
hate to see us need to specify the error handling for every sensor if
only one is different, and the others aren't default. Although, maybe
that can be done nicely -- have some default you can set in a config
once and then override per sensor...
> I like this idea. Hopefully the defaults can be kept the same as they
> are today so users of
> today's default settings wouldn't have to change their config files to
> keep things the same.
> Another thing we've been thinking about adding is the ability for
> sensors to only be
> read when the power is on, though there's still some invention
> required as to how
> to represent this state on D-Bus.
> >> Regards,
> >> Kun
More information about the openbmc