[PATCH linux dev-4.10] Revert "drivers/hwmon/occ: Add temperature fault attribute and VRM temp alarm"

Joel Stanley joel at jms.id.au
Thu Oct 5 14:53:25 AEDT 2017


On Thu, Oct 5, 2017 at 11:31 AM, Brad Bishop
<bradleyb at fuzziesquirrel.com> wrote:
>
>> On Oct 4, 2017, at 11:04 AM, Eddie James <eajames at linux.vnet.ibm.com> wrote:
>>
>>
>>
>> On 10/03/2017 11:38 PM, Joel Stanley wrote:
>>> On Wed, Oct 4, 2017 at 4:44 AM, Eddie James <eajames at linux.vnet.ibm.com> wrote:
>>>> From: "Edward A. James" <eajames at us.ibm.com>
>>>>
>>>> This reverts commit e55423ee10a5057338d24383c00e813436a126ea.
>>>>
>>>> Apologies for pushing this up so early... Userspace applications aren't
>>>> ready for this change. The hwmon polling application cannot accept EGAIN
>>>> yet, and we can't be returning apparent errors if the sensor is
>>>> temporarily unavailable.
>>> Can you go into some more detail here please.
>>>
>>> It looks like these changes are within the hwmon ABI, so userspace
>>> that supports hwmon devices shouldn't break.
>>
>> Sure. With this change, the driver returns -EAGAIN when reading temperature sensors that show 0 temperature. Temperature 0 is how the OCC indicates that the sensor is not available at the moment. So, we should retry in a bit. But the hwmon application currently sees any negative errno as an error,
>
> Actually the userspace retries for a configurable period on eio, etimedout,
> ebadmsg, eagain, and exnio.  It also exits cleanly on enoent.
>
> If any of these errnos don’t go away after the configured number of retry
> attempts, an error is logged.  Any other errnos and an error is logged
> immediately. If this list needs improvement I’d love to hear about it.
>
> ebadmsg and enxio are observed when i2c devices are unplugged with transfers
> in various stages of flight.  They occur just before the driver is unbound
> after the presence gpio toggle on these devices is noticed and processed
> (killing the hwmon userspace daemon cleanly).
>
> eio, etimedout seem to be bus type errors that appear somewhat infrequently,
> but occur nevertheless.
>
> we all know what eagain means…
>
> It isn’t so much that the hwmon userspace doesn’t support this change,
> its just the resulting behavior difference at the other end of the
> hwmon abi <-> dbus api translation is not optimal, right now.
>
> 1 - Without this change, the hwmon userspace reads a value of 0 out of
> the sysfs attribute and happily reports that as the temp at the DBus level.
>
> With this change, the hwmon userspace would read the attribute, get an
> error and retry for a bit.  After a number of retries, the error is
> logged and the sensor is marked faulted at a dbus level.
>
> The issue is the consumer of the sensor at the dbus level.  Today, the
> dbus consumer happens to translate the sensor value of zero into the
> desired behavior.  If we submit this change the sensor will be put
> into faulted state and the consuming application doesn’t have support
> for that yet.
>
> We have open issues to enhance the application to support faulted sensors,
> its just not implemented yet.
>
> I hope this helps a decision to be made.

Thanks for the background. In general we want to write and test our
userspace such that it can cope with all of the return codes that the
kernel may throw at it. I look forward to helping us in this area in
the future.

We will take this patch as a once-off to keep the OCC train moving.
I've added your notes into the commit message.

Applied to dev-4.10.

Cheers,

Joel


More information about the openbmc mailing list