[PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds

Andrew Jeffery andrew at aj.id.au
Mon Oct 26 11:45:51 AEDT 2020



On Tue, 1 Sep 2020, at 15:51, Joel Stanley wrote:
> On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew at aj.id.au> wrote:
> >
> > Hello,
> >
> > This series works around reliability problems with the MAX31785 fan controller
> > as observed in the field on some POWER systems.
> >
> > I'm the first to admit the patches are not elegant, so feedback there is
> > appreciated.
> 
> The way you implemented the loop took me several goes to grok. I
> finally got there.
> 
> 
> >
> > Separately, our previous workarounds have run aground upstream as the hwmon
> > maintainer was unable to reproduce our observations on the MAX31785 evaluation
> > kit. I've recently received an evaluation kit, so I plan on putting some of
> > these issues to the test myself. Ultimately this will help determine whether we
> > have issues with our fan card designs or whether the controller itself is at
> > fault (I have to admit, given some of the failures, it's hard to see how the
> > controller might not be at fault). Basically, this paragraph is my excuse for
> > not pushing these patches further upstream for the moment; I will re-evaluate
> > once I have the results from testing against the evkit.
> 
> I would post this series upstream so Guneter has some context for
> future patches that come out of your investigation.
> 
> >
> > In the mean time, these patches resolve issues we've seen in some system
> > deployments. Taken together, I've put the driver through an unbind/bind loop
> > of over 20,000 iterations with no "loss" of fans, where prior to the series we
> > typically achieved only a few hundred. This feels like a significant
> > improvement, so please consider merging despite their ugliness.
> 
> Do you want these in dev-5.4, or both 5.4 and 5.8?

A bit late to respond, but can we apply these to 5.8 as well? I've experimented
with applying the same workaround as we did for the UCD90320, but it looks
like a bad idea: We're not actually monitoring the UCD90320 sensors at runtime
and so this hides the scheduling latency/jank that the I2C throttling patches
introduce. This jank is exposed by the fact that unlike the UCD90320 we do
monitor the MAX31785 sensors. Decreasing the jank by increasing the upper
bound of usleep_range() to e.g. a full tick leads to unacceptable latencies when
reading the MAX31785 sensor data (one read blows out to ~2.8 seconds due
to the caching strategy implemented in pmbus_core.c).

Cheers,

Andrew


More information about the openbmc mailing list