Notion of Flakiness in Sensor Readings

Brad Bishop bradleyb at fuzziesquirrel.com
Tue Oct 16 13:30:04 AEDT 2018



> On Oct 15, 2018, at 10:15 PM, Patrick Venture <venture at google.com> wrote:
> 
> On Mon, Oct 15, 2018 at 6:14 PM Brad Bishop <bradleyb at fuzziesquirrel.com> wrote:
>> 
>> 
>> 
>>> On Oct 12, 2018, at 5:32 PM, Patrick Venture <venture at google.com> wrote:
>>> 
>>> On Fri, Oct 12, 2018 at 2:28 PM Matt Spinler
>>> <mspinler at linux.vnet.ibm.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 10/12/2018 4:23 PM, Patrick Venture wrote:
>>>>> On Fri, Oct 12, 2018 at 2:09 PM James Feist <james.feist at linux.intel.com> wrote:
>>>>>> On 10/12/2018 01:53 PM, Patrick Venture wrote:
>>>>>>> Currently, there are a few approaches in phosphor-hwmon on how to
>>>>>>> handle sensors being present sometimes, or flaky on reads.  We've
>>>>>>> talked in the past about how to handle a read failure that should be
>>>>>>> ignored.  A real failure that is flaky or temporary.
>>>>>>> 
>>>>>>> I was thinking of just adding a property to the Sensor.Value interface
>>>>>>> to report information.  Basically, someone reading the value needs to
>>>>>>> know if it's valid.  The idea of a range check for validity doesn't
>>>>>>> work as cleanly for this specific purpose, in my opinion.  The min/max
>>>>>>> can be used cleanly to determine the linearization though!
>>>>>>> 
>>>>>>> I was thinking perhaps a boolean, that you can check to see if the
>>>>>>> value should be used or trusted.  Then, I thought, perhaps, a enum
>>>>>>> with a series of states, starting with the states of "Valid',
>>>>>>> "Invalid."  Not being able to think of a third state, I fell back onto
>>>>>>> a boolean.
>>>> 
>>>> If the hwmon driver for your device provides a _fault attribute, and
>>>> that gets set, the code is already
>>>> set up to create an OperationalStatus interface on that sensor instance
>>>> and set the Functional
>>>> property to false.
>>> 
>>> I'd like to drop this such that the failure behavior can be returning
>>> a NaN double.
>> 
>> One benefit I see here is that everything is in the same DBus interface
>> (the same property even), but with OperationalStatus you have two and
>> thus more DBus calls, if you are polling?  Is that it or are there others?
>> 
>>> https://github.com/openbmc/phosphor-hwmon/blob/master/hwmonio.cpp#L127
>>> 
>>> For the drivers that don't have the fault attribute,
>> 
>> It seems like OperationalStatus could be extended to emulate easily here.
>> 
>>> and in the case where phosphor-host-ipmid doesn't check for that kind of
>>> failure.
>> 
>> So is there no value in consistency of interface?  A lot of things
>> can be ‘not working’ on a BMC besides sensors - do you want to model that
>> in the DBus API differently for each domain (which seems like the direction
>> you are headed)?
>> 
>> Further this model falls apart if you ever want something besides _working_
>> and _not_working_ for possible states.  I realize YAGNI, but 1 - I have seen that
>> in the past, 2 - it seems trivial to implement, and 3 - the down-side of
>> OperationalStatus is not apparent to me.
> 
> I have no issue with OperationalStatus for this -- I believe Matt's
> comment was that it's available gratis if the driver supports it.

Correct.  But I think it could be extended to emulate it for drivers that
don’t - similar to the thresholds.

> Since this driver may not (doesn't, but i haven't checked recently), I
> haven't dug back into that use case.
> 
> The idea of a sensor falling off and being added to the dbus every
> other time the sensor needs to be read falls over for at least one key
> reason: 1) You can't read the sensor on demand anymore.

I agree - I don’t think we want to do this anymore, when we have
options like OperationalStatus or NaN.

> 
> I brought all this up because I want to get away from -errno patches,
> and I want the error path to follow the natural state of things.
> Reading a dbus double and seeing NaN() may be enough of an error, and
> you can check OperationalStatus or other interfaces for more
> information if desired.

That SGTM.  I thought you wanted to drop OperationalStatus completely.

> 
>> 
>>> 
>>>> 
>>>>>> For Valid / Invalid we've produced nan on D-Bus to know that the sensor
>>>>>> is in an invalid state (like power off for tach values). Although I'm
>>>>>> not sure how this would be produced using the int64 interface, for the
>>>>>> double std::nan will go over d-bus correctly.
>>>>> So if i just wait until we're using double, the issue basically
>>>>> resolves itself? :D :D
>>>>> 
>>>>>>> Thoughts?
>>>>>>> 
>>>>>>> Patrick


More information about the openbmc mailing list