[PATCH linux dev-4.13 1/2] hwmon (occ): Remove temperature fault attribute and VRM temp alarm

Matt Spinler mspinler at linux.vnet.ibm.com
Fri Mar 9 06:15:58 AEDT 2018



On 3/1/2018 6:39 PM, Joel Stanley wrote:
> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames at linux.vnet.ibm.com> wrote:
>> From: "Edward A. James" <eajames at us.ibm.com>
>>
>> This removes the EAGAIN return for "sensor not ready" and the temperature
>> fault and vrm temp alarms attributes.
>>
>> Apologies for pushing this up so early... Userspace applications aren't
>> ready for this change. The hwmon polling application cannot accept EGAIN
>> yet, and we can't be returning apparent errors if the sensor is
>> temporarily unavailable.
> 
> Okay. Brad's email is from October 5th, 2017. That's 148 days ago.
> 
> Any ETA on when the userspace plans to get some love?
> 

We have 4 issues in the backlog to address the OCC GPU sensors returning
either a 0 or 0xFF:
https://github.com/openbmc/openbmc/issues/2327
https://github.com/openbmc/openbmc/issues/2329
https://github.com/openbmc/openbmc/issues/2222
https://github.com/openbmc/openbmc/issues/2223

As fan control is already doing the right thing when it reads the values
as they are now, we deprioritized the work quite a bit so it still
isn't scheduled and may not be until later systems.


> Cheers,
> 
> Joel
> 
> 
>>
>> Brad Bishop <bradleyb at fuzziesquirrel.com> wrote:
>>> userspace retries for a configurable period on eio, etimedout,
>>> ebadmsg, eagain, and exnio.  It also exits cleanly on enoent.
>>>
>>> If any of these errnos don’t go away after the configured number of retry
>>> attempts, an error is logged.  Any other errnos and an error is logged
>>> immediately. If this list needs improvement I’d love to hear about it.
>>>
>>> ebadmsg and enxio are observed when i2c devices are unplugged with transfers
>>> in various stages of flight.  They occur just before the driver is unbound
>>> after the presence gpio toggle on these devices is noticed and processed
>>> (killing the hwmon userspace daemon cleanly).
>>>
>>> eio, etimedout seem to be bus type errors that appear somewhat infrequently,
>>> but occur nevertheless.
>>>
>>> we all know what eagain means...
>>>
>>> It isn’t so much that the hwmon userspace doesn’t support this change,
>>> its just the resulting behavior difference at the other end of the
>>> hwmon abi <-> dbus api translation is not optimal, right now.
>>>
>>> 1 - Without this change, the hwmon userspace reads a value of 0 out of
>>> the sysfs attribute and happily reports that as the temp at the DBus level.
>>>
>>> With this change, the hwmon userspace would read the attribute, get an
>>> error and retry for a bit.  After a number of retries, the error is
>>> logged and the sensor is marked faulted at a dbus level.
>>>
>>> The issue is the consumer of the sensor at the dbus level.  Today, the
>>> dbus consumer happens to translate the sensor value of zero into the
>>> desired behavior.  If we submit this change the sensor will be put
>>> into faulted state and the consuming application doesn’t have support
>>> for that yet.
>>>
>>> We have open issues to enhance the application to support faulted sensors,
>>> its just not implemented yet.
>>
>> OpenBMC-Staging-Count: 1
>> Signed-off-by: Eddie James <eajames at us.ibm.com>
>> Signed-off-by: Joel Stanley <joel at jms.id.au>
>> ---
>>   drivers/hwmon/occ/common.c | 33 +++------------------------------
>>   1 file changed, 3 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
>> index 7c97a4c..a5c4df0 100644
>> --- a/drivers/hwmon/occ/common.c
>> +++ b/drivers/hwmon/occ/common.c
>> @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev,
>>                  val = get_unaligned_be32(&temp->sensor_id);
>>                  break;
>>          case 1:
>> -               val = temp->value;
>> -               if (val == OCC_TEMP_SENSOR_FAULT)
>> -                       return -EREMOTEIO;
>> -
>> -               if (temp->fru_type != OCC_FRU_TYPE_VRM) {
>> -                       /* sensor not ready */
>> -                       if (val == 0)
>> -                               return -EAGAIN;
>> -
>> -                       val *= 1000;    /* millidegrees */
>> -               }
>> +               val = temp->value * 1000;
>>                  break;
>>          case 2:
>>                  val = temp->fru_type;
>>                  break;
>> -       case 3:
>> -               val = temp->value == OCC_TEMP_SENSOR_FAULT;
>> -               break;
>>          default:
>>                  return -EINVAL;
>>          }
>> @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>                  num_attrs += (sensors->temp.num_sensors * 2);
>>                  break;
>>          case 2:
>> -               num_attrs += (sensors->temp.num_sensors * 4);
>> +               num_attrs += (sensors->temp.num_sensors * 3);
>>                  show_temp = occ_show_temp_2;
>>                  break;
>>          default:
>> @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>
>>          for (i = 0; i < sensors->temp.num_sensors; ++i) {
>>                  s = i + 1;
>> -               temp = ((struct temp_sensor_2 *)sensors->temp.data) + i;
>>
>>                  snprintf(attr->name, sizeof(attr->name), "temp%d_label", s);
>>                  attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>>                                               0, i);
>>                  attr++;
>>
>> -               if (sensors->temp.version > 1 &&
>> -                   temp->fru_type == OCC_FRU_TYPE_VRM) {
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_alarm", s);
>> -               } else {
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_input", s);
>> -               }
>> -
>> +               snprintf(attr->name, sizeof(attr->name), "temp%d_input", s);
>>                  attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>>                                               1, i);
>>                  attr++;
>> @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>                                                       show_temp, NULL, 2, i);
>>                          attr++;
>>
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_fault", s);
>> -                       attr->sensor = OCC_INIT_ATTR(attr->name, 0444,
>> -                                                    show_temp, NULL, 3, i);
>> -                       attr++;
>>                  }
>>          }
>>
>> --
>> 1.8.3.1
>>
> 



More information about the openbmc mailing list