phosphor-hwmon bottleneck potential

Patrick Williams patrick at stwcx.xyz
Sat May 6 06:35:38 AEST 2017


On Fri, May 05, 2017 at 11:37:06AM -0700, Nancy Yuen wrote:
> 1. General design issue if time between reads of a sensor is dependent on
> the number of sensors in the system.

I don't see any general design issue with phosphor-hwmon or dbus being
talked about here.  There is one specific driver that has a pretty large
scaling factor no matter how you read sensor values out of it.  We've
talked elsewhere in the thread about a potential driver change to
improve it.  Most hwmon drivers do not have any issue.

> And drivers can fail, due to misbehaving code or hardware.

hwmon drivers should never block userspace forever.  If they do, that is
a serious driver bug.  We could defend against it by enhancing
phosphor-hwmon to use non-blocking IO, assuming the kernel supports it,
but that seems like a lot of code for a non-existent problem.

Hardware failures are already handled by the hwmon drivers, reported
back to phosphor-hwmon as errnos on read, and dealt with.
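
As a rough illustration of what "errnos on read" looks like from
userspace, here is a minimal Python sketch.  It is not the
phosphor-hwmon code; read_sensor and describe are hypothetical helper
names, and the (value, errno) return convention is an assumption made
for the example.

```python
import errno
import os


def read_sensor(path):
    """Read one hwmon-style sysfs attribute.

    Returns (value, None) on success, or (None, errno_code) when the
    open or read fails.  The return convention is an assumption for
    this sketch only.
    """
    try:
        with open(path) as f:
            return int(f.read().strip()), None
    except OSError as e:
        # A failing driver surfaces here as an errno (EIO, ENXIO, ...).
        return None, e.errno


def describe(code):
    """Turn an errno code into a readable string for logging."""
    return "%s (%s)" % (errno.errorcode.get(code, "UNKNOWN"), os.strerror(code))
```

A caller would point read_sensor at an attribute such as
/sys/class/hwmon/hwmon0/temp1_input (an example path; the hwmon number
varies by system) and mark the sensor as in error when the second
element of the result is not None.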

> In this design, sensor reports could be
> significantly delayed if one sensor/driver were bad or misbehaving.

The problem of one sensor reading not working is the same in all three
of these potential designs:
    1. One big loop that reads each hwmon sysfs entry for the whole
       system in sequence.
    2. One big program with N threads that, with a stampeding herd,
       attempt to read all hwmon sysfs at the same instant.
    3. M processes with N hwmon sysfs reads in sequence.
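
To make designs 1 and 2 concrete, here is a small Python simulation.
It is only a sketch: fake_read stands in for a sysfs read and simply
sleeps for a configurable delay, and all names are hypothetical.  It
records how long after the start of a polling pass each reading
becomes available.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(delay):
    """Stand-in for a sysfs read that the driver delays by `delay` seconds."""
    time.sleep(delay)
    return time.monotonic()


def sequential(delays):
    """Design 1: one loop reading every sensor in turn."""
    start = time.monotonic()
    return {name: fake_read(d) - start for name, d in delays.items()}


def threaded(delays):
    """Design 2: one thread per sensor, all started at once."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = {name: pool.submit(fake_read, d) for name, d in delays.items()}
        return {name: f.result() - start for name, f in futures.items()}
```

Calling both with something like {"fan0": 0.0, "slow": 0.2, "temp0": 0.0}
shows that the reading from "slow" is available no sooner than the
driver delay in either layout, which is the part of the problem no
userspace arrangement can remove.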

In any of these designs, if the driver delays your read for 8 seconds,
your data is delayed and stale.  I think our expectation is that a
fan-control algorithm is using dbus signals to keep track of the most
recent value, and if it doesn't have up-to-date data by the time it
wants to make a decision, it would deal with it in whatever way it sees
fit.  Most likely that means either using old data or treating that
sensor as in error.
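
One way a consumer could track "most recent value" and notice stale
data is sketched below in Python.  This is an assumption about how a
fan-control application might be structured, not a description of any
existing one; SensorCache and max_age are hypothetical names, and the
dbus signal handling itself is left out (update would be called from
whatever handler receives the value change).

```python
import time


class SensorCache:
    """Keep the latest value per sensor along with when it arrived."""

    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age      # seconds before a value counts as stale
        self.clock = clock          # injectable so the logic can be tested
        self._values = {}

    def update(self, name, value):
        """Record a new value, e.g. from a value-changed signal handler."""
        self._values[name] = (value, self.clock())

    def get(self, name):
        """Return (value, fresh); value is None if the sensor was never seen."""
        if name not in self._values:
            return None, False
        value, stamp = self._values[name]
        return value, (self.clock() - stamp) <= self.max_age
```

When get reports fresh as False, the control loop decides the policy:
keep using the old value, or treat the sensor as in error.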

If you chose design #2 and then expanded on it by adding a thermal
control loop in yet another thread, you'd still have the exact same
problems to deal with.  The only difference is that it is now all in
one process, using shared memory to communicate the cached sensor
values instead of using dbus between processes.

-- 
Patrick Williams