Proposal: configurable per-sensor error behavior in phosphor-hwmon

Patrick Venture venture at google.com
Thu Jul 18 00:39:35 AEST 2019


On Wed, Jul 17, 2019 at 7:37 AM Matthew Barth <msbarth at linux.ibm.com> wrote:
>
>
>
> On 7/16/19 4:19 PM, Patrick Venture wrote:
> > On Mon, Jul 15, 2019 at 9:10 AM Matt Spinler <mspinler at linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 7/15/2019 9:45 AM, Matthew Barth wrote:
> >>> This is a great proposal, just a few concerns/notes.
> >>>
> >>> On 7/12/19 5:27 PM, Kun Yi wrote:
> >>>> Hi there,
> >>>>
> >>>> Current phosphor-hwmon code is filled with preprocessor macros to
> >>>> branch error condition for sysfs reads, and it seems to me that
> >>>> adding a per-sensor configuration would solve two issues at least:
> >>>> 1. code can be greatly simplified
> >>>> 2. we can code more flexible sensor reading behavior
> >>>>
> >>>> Why 2) is needed: with many types of sensors that BMC controls,
> >>>> having an one-size-fits-all policy will always have cases that it
> >>>> can't handle. Each flaky sensor is flaky in its own way.
> >>>>
> >>>> Rough proposal on how this will work:
> >>>>
> >>>> add properties to each sensor group's configuration file:
> >>>>
> >>>> "error behavior": can be one of
> >>>> - always keep
> >>>> - remove from D-Bus on error
> >>> There is a REMOVERCS device config file option that can be configured
> >>> to remove an individual sensor or any sensor of the device when a
> >>> given set of return codes occur when attempting to read the sensor.
> >>>>
> >>>> "error condition":  can be combination of
> >>>> - certain sysfs return codes
> >>> REMOVERCS combines this error condition to the behavior of removing
> >>> the sensor from Dbus. I'd be interested in how these types of
> >>> bahavior-to-conditions will be mapped within the device's config file.
> >>>
> >>>> - timeout
> >>> In the case of phosphor-hwmon, isnt a timeout condition similar to
> >>> error retries since a timeout condition is presented as a ETIMEDOUT
> >>> return code on the sensor.
> >>>> - invalid value
> >>> This is another area I'd be interested to hear more on, how would one
> >>> go about defining when a value would be invalid? Or is this a simple,
> >>> negative values are invalid for a sensor that should always return a
> >>> positive value?
> >>>>
> >>>> "error retries": number of retries before declaring the sensor has an
> >>>> error
> >>> This would be great to have configurable per sensors, however a
> >>> possible issue here would be allowing too many retries causing hwmon
> >>> to take too long. So this should be capped or controlled in someway
> >>> with the delay between reads as well. Right now a sensor is allowed to
> >>> be retried 10x's with a 100ms delay between each attempt.
> >>>>
> >>>> Happy to hear any feedback.
> >
> > I think the idea in general is interesting -- it may be ugly to try
> > and set the values per sensor, versus a group of sensors in a config
> > -- see INTERVAL, which is set for all sensors in a file.  I realize
> > that some things are sensor specific... so maybe a mixture?  I'd just
> > hate to see us need to specify the error handling for every sensor if
> > only one is different, and the others aren't default.  Although, maybe
> > that can be done nicely -- have some default you can set in a config
> > once and then override per sensor...
> >
>
> FWIW, REMOVERCS is partially implemented to how you describe. A
> REMOVERCS entry can be added to a devices config either per sensor or
> for all sensors provided by the device. The return codes defined per
> sensor are added to the list for the sensor when its created and any
> other/additional return codes are added to the same sensor if defined as
> a general config option on the device.
>
> ex.) REMOVERCS_fan1 = 7,11 and REMOVERCS = 13 within the same device
> config would remove fan1 when return codes 7,11or13 occur when
> attempting to read it.

Thanks for clarifying the behavior.

>
> >>>>
> >>
> >> I like this idea.  Hopefully the defaults can be kept the same as they
> >> are today so users of
> >> today's default settings wouldn't have to change their config files to
> >> keep things the same.
> >>
> >> Another thing we've been thinking about adding is the ability for
> >> sensors to only be
> >> read when the power is on,   though there's still some invention
> >> required as to how
> >> to represent this state on D-Bus.
> >>
> >>>> Regards,
> >>>> Kun
> >>>
> >>
> >
>


More information about the openbmc mailing list