Proposal: configurable per-sensor error behavior in phosphor-hwmon

Sat Jul 20 01:00:00 AEST 2019

Thanks for the feedback! I started working on refactoring the code a
little and now has more ideas. To avoid designing over email, I will
write a succinct design doc on several proposed improvements:
1. The feature I mentioned, customizable error behavior.
2. Change the underlying read to be asynchronous and non-blocking
3. Some general refactor directions. e.g. better abstraction of the
sensor interface

Comments inline in response to previous questions:

On Wed, Jul 17, 2019 at 7:37 AM Matthew Barth <msbarth at linux.ibm.com> wrote:
>
>
>
> On 7/16/19 4:19 PM, Patrick Venture wrote:
> > On Mon, Jul 15, 2019 at 9:10 AM Matt Spinler <mspinler at linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 7/15/2019 9:45 AM, Matthew Barth wrote:
> >>> This is a great proposal, just a few concerns/notes.
> >>>
> >>> On 7/12/19 5:27 PM, Kun Yi wrote:
> >>>> Hi there,
> >>>>
> >>>> Current phosphor-hwmon code is filled with preprocessor macros to
> >>>> branch error condition for sysfs reads, and it seems to me that
> >>>> adding a per-sensor configuration would solve two issues at least:
> >>>> 1. code can be greatly simplified
> >>>> 2. we can code more flexible sensor reading behavior
> >>>>
> >>>> Why 2) is needed: with many types of sensors that BMC controls,
> >>>> having an one-size-fits-all policy will always have cases that it
> >>>> can't handle. Each flaky sensor is flaky in its own way.
> >>>>
> >>>> Rough proposal on how this will work:
> >>>>
> >>>> add properties to each sensor group's configuration file:
> >>>>
> >>>> "error behavior": can be one of
> >>>> - always keep
> >>>> - remove from D-Bus on error
> >>> There is a REMOVERCS device config file option that can be configured
> >>> to remove an individual sensor or any sensor of the device when a
> >>> given set of return codes occur when attempting to read the sensor.
> >>>>

Matthew: Ack. My point is to fold the REMOVERCS into one of the
configurable options.

> >>>> "error condition":  can be combination of
> >>>> - certain sysfs return codes
> >>> REMOVERCS combines this error condition to the behavior of removing
> >>> the sensor from Dbus. I'd be interested in how these types of
> >>> bahavior-to-conditions will be mapped within the device's config file.
> >>>
> >>>> - timeout
> >>> In the case of phosphor-hwmon, isnt a timeout condition similar to
> >>> error retries since a timeout condition is presented as a ETIMEDOUT
> >>> return code on the sensor.

Matthew: That's what I thought originally, but defining a timeout in
userspace actually gives better control. Relying on ETIMEOUT returned
from the kernel is not optimal since the kernel driver could have a
long timeout (I think SMBus kernel driver timeout usually is around
10-20s), a longer timeout (some of the regmap_read* functions times
out in 2 mins in my experimentation), or possibly no timeout at all
(think about a bug in kernel driver!)

> >>>> - invalid value
> >>> This is another area I'd be interested to hear more on, how would one
> >>> go about defining when a value would be invalid? Or is this a simple,
> >>> negative values are invalid for a sensor that should always return a
> >>> positive value?
> >>>>

Matthew: I'm thinking the upper/lower thresholds can be defined in the
configuration file. Arguably this definition has some overlap with the
sysfs crit/crit_alarm attributes, and it would be nicer to have a
single representation. Maybe phosphor-hwmon should just faithfully
represent what sysfs reports and leave interpretation to the receiving
end.

> >>>> "error retries": number of retries before declaring the sensor has an
> >>>> error
> >>> This would be great to have configurable per sensors, however a
> >>> possible issue here would be allowing too many retries causing hwmon
> >>> to take too long. So this should be capped or controlled in someway
> >>> with the delay between reads as well. Right now a sensor is allowed to
> >>> be retried 10x's with a 100ms delay between each attempt.

Matthew: definitely a good point. I find this 10x to be quite
arbitrary though. What if the sensor is just not available at the
moment, until 10 seconds later? For example, a sensor on a pluggable
board that's not powered on until say 1 minute after BMC is up?

I haven't thought carefully but presumably the hot-pluggability is
something that phosphor-hwmon can handle with the help of several
D-Bus tunables.

> >>>>
> >>>> Happy to hear any feedback.
> >
> > I think the idea in general is interesting -- it may be ugly to try
> > and set the values per sensor, versus a group of sensors in a config
> > -- see INTERVAL, which is set for all sensors in a file.  I realize
> > that some things are sensor specific... so maybe a mixture?  I'd just
> > hate to see us need to specify the error handling for every sensor if
> > only one is different, and the others aren't default.  Although, maybe
> > that can be done nicely -- have some default you can set in a config
> > once and then override per sensor...
> >
>
> FWIW, REMOVERCS is partially implemented to how you describe. A
> REMOVERCS entry can be added to a devices config either per sensor or
> for all sensors provided by the device. The return codes defined per
> sensor are added to the list for the sensor when its created and any
> other/additional return codes are added to the same sensor if defined as
> a general config option on the device.
>
> ex.) REMOVERCS_fan1 = 7,11 and REMOVERCS = 13 within the same device
> config would remove fan1 when return codes 7,11or13 occur when
> attempting to read it.
>

Patrick/Mattew: good point. In my mind by default most options could
be per sensor-set, but allowing a sensor to override the set default
doesn't sound hard. (like what REMOVERCS already did)

> >>>>
> >>
> >> I like this idea.  Hopefully the defaults can be kept the same as they
> >> are today so users of
> >> today's default settings wouldn't have to change their config files to
> >> keep things the same.
> >>

Matt: I think so. I agree at least for now, backward compatibility is
needed in terms of default behavior given the same configuration
files.

> >> Another thing we've been thinking about adding is the ability for
> >> sensors to only be
> >> read when the power is on,   though there's still some invention
> >> required as to how
> >> to represent this state on D-Bus.

Matt: I'm thinking a more general case, where entire *sets* of sensors
can appear later in the game. I know entity manager handles some of
that, but seems to me the sensor part is not figured out.
(someone pointed out https://github.com/openbmc/dbus-sensors before,
which I have to check)

> >>
> >>>> Regards,
> >>>> Kun
> >>>
> >>
> >
>

-- 
Regards,
Kun