Notion of Flakiness in Sensor Readings

Patrick Venture venture at google.com
Tue Oct 16 01:34:39 AEDT 2018


On Mon, Oct 15, 2018 at 6:40 AM Matthew Barth <msbarth at linux.ibm.com> wrote:
>
>
>
> On 10/12/2018 04:32 PM, Patrick Venture wrote:
> > On Fri, Oct 12, 2018 at 2:28 PM Matt Spinler
> > <mspinler at linux.vnet.ibm.com> wrote:
> >>
> >>
> >>
> >> On 10/12/2018 4:23 PM, Patrick Venture wrote:
> >>> On Fri, Oct 12, 2018 at 2:09 PM James Feist <james.feist at linux.intel.com> wrote:
> >>>> On 10/12/2018 01:53 PM, Patrick Venture wrote:
> >>>>> Currently, there are a few approaches in phosphor-hwmon on how to
> >>>>> handle sensors being present sometimes, or flaky on reads.  We've
> >>>>> talked in the past about how to handle a read failure that should be
> >>>>> ignored.  A real failure that is flaky or temporary.
> >>>>>
> >>>>> I was thinking of just adding a property to the Sensor.Value interface
> >>>>> to report information.  Basically, someone reading the value needs to
> >>>>> know if it's valid.  The idea of a range check for validity doesn't
> >>>>> work as cleanly for this specific purpose, in my opinion.  The min/max
> >>>>> can be used cleanly to determine the linearization though!
> >>>>>
> >>>>> I was thinking perhaps a boolean, that you can check to see if the
> >>>>> value should be used or trusted.  Then, I thought, perhaps, a enum
> >>>>> with a series of states, starting with the states of "Valid',
> >>>>> "Invalid."  Not being able to think of a third state, I fell back onto
> >>>>> a boolean.
> >>
> >> If the hwmon driver for your device provides a _fault attribute, and
> >> that gets set, the code is already
> >> set up to create an OperationalStatus interface on that sensor instance
> >> and set the Functional
> >> property to false.
> >
> > I'd like to drop this such that the failure behavior can be returning
> > a NaN double.
> > https://github.com/openbmc/phosphor-hwmon/blob/master/hwmonio.cpp#L127
> >
> > For the drivers that don't have the fault attribute, and in the case
> > where phosphor-host-ipmid doesn't check for that kind of failure.
> >
> IMHO, we should handle this by removing the sensor from dbus instead of
> using a special value type if we're unable to use the OperationalStatus.
> In either case, the code utilizing this sensor would need to
> specifically handle this NaN value or handle having the sensor removed
> from dbus. It seems more generically, to me, to have hwmon remove the
> sensor from dbus instead of using a NaN double so this can be used
> across other sensor drivers that may not return a double type.

In our infrastructure if we removed the sensor from dbus we'd have to
report an invalid reading but not a sensor missing reading.  So, to
us, if it's NaN or not on dbus we have to treat it the same way
basically, so fewer code changes to support it is more or less the
goal.  Although right now phosphor-host-ipmid isn't set up nicely to
deal with NaN(). I have a downstream change that I needed to introduce
so that it excepts on that specific type of failure and then returns
invalid reading (or something akin to that).

> >>
> >>>> For Valid / Invalid we've produced nan on D-Bus to know that the sensor
> >>>> is in an invalid state (like power off for tach values). Although I'm
> >>>> not sure how this would be produced using the int64 interface, for the
> >>>> double std::nan will go over d-bus correctly.
> >>> So if i just wait until we're using double, the issue basically
> >>> resolves itself? :D :D
> >>>
> >>>>> Thoughts?
> >>>>>
> >>>>> Patrick
> >>>>>
> >>
> >
>


More information about the openbmc mailing list