Notion of Flakiness in Sensor Readings

Tue Oct 16 13:34:26 AEDT 2018

On Mon, Oct 15, 2018 at 7:30 PM Brad Bishop <bradleyb at fuzziesquirrel.com> wrote:
>
>
>
> > On Oct 15, 2018, at 10:15 PM, Patrick Venture <venture at google.com> wrote:
> >
> > On Mon, Oct 15, 2018 at 6:14 PM Brad Bishop <bradleyb at fuzziesquirrel.com> wrote:
> >>
> >>
> >>
> >>> On Oct 12, 2018, at 5:32 PM, Patrick Venture <venture at google.com> wrote:
> >>>
> >>> On Fri, Oct 12, 2018 at 2:28 PM Matt Spinler
> >>> <mspinler at linux.vnet.ibm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/12/2018 4:23 PM, Patrick Venture wrote:
> >>>>> On Fri, Oct 12, 2018 at 2:09 PM James Feist <james.feist at linux.intel.com> wrote:
> >>>>>> On 10/12/2018 01:53 PM, Patrick Venture wrote:
> >>>>>>> Currently, there are a few approaches in phosphor-hwmon on how to
> >>>>>>> handle sensors being present sometimes, or flaky on reads.  We've
> >>>>>>> talked in the past about how to handle a read failure that should be
> >>>>>>> ignored.  A real failure that is flaky or temporary.
> >>>>>>>
> >>>>>>> I was thinking of just adding a property to the Sensor.Value interface
> >>>>>>> to report information.  Basically, someone reading the value needs to
> >>>>>>> know if it's valid.  The idea of a range check for validity doesn't
> >>>>>>> work as cleanly for this specific purpose, in my opinion.  The min/max
> >>>>>>> can be used cleanly to determine the linearization though!
> >>>>>>>
> >>>>>>> I was thinking perhaps a boolean, that you can check to see if the
> >>>>>>> value should be used or trusted.  Then, I thought, perhaps, a enum
> >>>>>>> with a series of states, starting with the states of "Valid',
> >>>>>>> "Invalid."  Not being able to think of a third state, I fell back onto
> >>>>>>> a boolean.
> >>>>
> >>>> If the hwmon driver for your device provides a _fault attribute, and
> >>>> that gets set, the code is already
> >>>> set up to create an OperationalStatus interface on that sensor instance
> >>>> and set the Functional
> >>>> property to false.
> >>>
> >>> I'd like to drop this such that the failure behavior can be returning
> >>> a NaN double.
> >>
> >> One benefit I see here is that everything is in the same DBus interface
> >> (the same property even), but with OperationalStatus you have two and
> >> thus more DBus calls, if you are polling?  Is that it or are there others?
> >>
> >>> https://github.com/openbmc/phosphor-hwmon/blob/master/hwmonio.cpp#L127
> >>>
> >>> For the drivers that don't have the fault attribute,
> >>
> >> It seems like OperationalStatus could be extended to emulate easily here.
> >>
> >>> and in the case where phosphor-host-ipmid doesn't check for that kind of
> >>> failure.
> >>
> >> So is there no value in consistency of interface?  A lot of things
> >> can be ‘not working’ on a BMC besides sensors - do you want to model that
> >> in the DBus API differently for each domain (which seems like the direction
> >> you are headed)?
> >>
> >> Further this model falls apart if you ever want something besides _working_
> >> and _not_working_ for possible states.  I realize YAGNI, but 1 - I have seen that
> >> in the past, 2 - it seems trivial to implement, and 3 - the down-side of
> >> OperationalStatus is not apparent to me.
> >
> > I have no issue with OperationalStatus for this -- I believe Matt's
> > comment was that it's available gratis if the driver supports it.
>
> Correct.  But I think it could be extended to emulate it for drivers that
> don’t - similar to the thresholds.
>
> > Since this driver may not (doesn't, but i haven't checked recently), I
> > haven't dug back into that use case.
> >
> > The idea of a sensor falling off and being added to the dbus every
> > other time the sensor needs to be read falls over for at least one key
> > reason: 1) You can't read the sensor on demand anymore.
>
> I agree - I don’t think we want to do this anymore, when we have
> options like OperationalStatus or NaN.
>
> >
> > I brought all this up because I want to get away from -errno patches,
> > and I want the error path to follow the natural state of things.
> > Reading a dbus double and seeing NaN() may be enough of an error, and
> > you can check OperationalStatus or other interfaces for more
> > information if desired.
>
> That SGTM.  I thought you wanted to drop OperationalStatus completely.

No, I definitely want to keep it.  And the idea of emulating it was
sort of the idea going in -- although I hadn't thought of it when I
started this chain - - the but general idea that I can use something
like NaN and get the results for the flakiness situation with few
changes in phosphor-host-ipmid, I'm pretty happy with it.

There may be other types of errors that warrant using OperationStatus,
but that is a second check, so that's something that could be avoided
for some errors.

I think we're all on the same page.

>
> >
> >>
> >>>
> >>>>
> >>>>>> For Valid / Invalid we've produced nan on D-Bus to know that the sensor
> >>>>>> is in an invalid state (like power off for tach values). Although I'm
> >>>>>> not sure how this would be produced using the int64 interface, for the
> >>>>>> double std::nan will go over d-bus correctly.
> >>>>> So if i just wait until we're using double, the issue basically
> >>>>> resolves itself? :D :D
> >>>>>
> >>>>>>> Thoughts?
> >>>>>>>
> >>>>>>> Patrick