[PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
mpe at ellerman.id.au
Mon Feb 20 16:03:50 AEDT 2017
Stewart Smith <stewart at linux.vnet.ibm.com> writes:
> Vipin K Parashar <vipin at linux.vnet.ibm.com> writes:
>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>> Vipin K Parashar <vipin at linux.vnet.ibm.com> writes:
>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>> that read the FIR of any core which is sleeping or offline.
>>> Do we know why Linux is causing that to happen?
>> This issue was originally seen when running STAF (Software Test
>> Automation Framework) stress tests and off-lining some cores
>> while the stress tests were running.
>> It can also be re-created by off-lining a few cores and then using
>> one of the methods below.
>> 1. Executing the Linux "sensors" command
>> 2. Reading the contents of the file /sys/class/hwmon/hwmon0/tempX_input,
>> where X is an offline CPU.
>> It's the "opal_get_sensor_data" Linux API that triggers the
>> OPAL call "opal_sensor_read", which performs XSCOM ops here.
>> If the core is found sleeping/offline, Linux prints an
>> "opal_error_code: Unexpected OPAL error" message on the console.
>> Currently Linux isn't aware of the OPAL_WRONG_STATE return code
>> from OPAL. Thus it prints the "Unexpected OPAL error" message, the
>> same as it would log for any unknown OPAL return code.
>> Seeing this error on the console has been a concern for Test and
>> would puzzle a real user as well. This patch makes Linux aware of the
>> OPAL_WRONG_STATE return code from OPAL and stops it printing the
>> "Unexpected OPAL error" message on the console for OPAL calls that
>> fail with OPAL_WRONG_STATE.
> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
> return the xscom_read return code in the event of an error.
> I would argue that converting to EIO in that instance is probably
> correct... or EAGAIN? EAGAIN may be more correct in the situation where
> the core is just sleeping.
> What kind of offlining are you doing?
> Arguably, the correct behaviour would be to remove said sensors when the
> core is offline.
Right, that would be ideal. There appear to be at least two other hwmon
drivers that are CPU hotplug aware (coretemp and via-cputemp).
But perhaps it's not possible to work out which sensors are attached to
which CPU etc., I haven't looked in detail.
In that case changing just opal_get_sensor_data() to handle
OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
asked to read a sensor on an offline CPU and we aren't able to detect
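
To make the suggestion concrete, here is a rough userspace sketch of the idea, not the actual kernel change. The function names opal_rc_to_errno() and sensor_read_result() are hypothetical stand-ins, the OPAL code values are assumed to match the kernel's opal-api.h, and returning -EIO (rather than -EAGAIN, which was also floated above) is an arbitrary choice for illustration.

```c
#include <errno.h>
#include <stdio.h>

/* OPAL return codes; values assumed to mirror the kernel's opal-api.h. */
#define OPAL_SUCCESS       0
#define OPAL_PARAMETER    -1
#define OPAL_BUSY         -2
#define OPAL_HARDWARE     -6
#define OPAL_WRONG_STATE -14

/*
 * Hypothetical stand-in for a generic OPAL-to-errno mapping. Any code
 * it does not know about is logged, which is the noisy path that the
 * thread is complaining about.
 */
int opal_rc_to_errno(int rc)
{
	switch (rc) {
	case OPAL_SUCCESS:
		return 0;
	case OPAL_PARAMETER:
		return -EINVAL;
	case OPAL_BUSY:
		return -EBUSY;
	case OPAL_HARDWARE:
		return -EIO;
	default:
		fprintf(stderr, "Unexpected OPAL error %d\n", rc);
		return -EIO;
	}
}

/*
 * Hypothetical tail of a sensor read: turn the OPAL return code into
 * an errno for the caller.
 */
int sensor_read_result(int opal_rc)
{
	/*
	 * We might be asked to read a sensor on an offline or sleeping
	 * CPU. That is expected, so fail the read quietly instead of
	 * logging an "unexpected" error. -EIO is an assumption here.
	 */
	if (opal_rc == OPAL_WRONG_STATE)
		return -EIO;

	return opal_rc_to_errno(opal_rc);
}
```

With this shape only the sensor path treats OPAL_WRONG_STATE as expected; every other caller of the generic mapping would still log it as an unknown code.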