[PATCH] hwmon: (peci/dimmtemp) Do not provide fake thresholds data

Tue Jan 28 14:34:21 AEDT 2025

On 1/27/25 11:10, Paul Fertser wrote:
> On Mon, Jan 27, 2025 at 10:39:44AM -0800, Guenter Roeck wrote:
>> On 1/27/25 10:30, Paul Fertser wrote:
>>> Hi Guenter,
>>>
>>> On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote:
>>>> On 1/27/25 08:40, Winiarska, Iwona wrote:
>>>>> On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
>>>>>> When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
>>>>>> critical thresholds for particular DIMM the driver should return an
>>>>>> error to the userspace instead of giving it stale (best case) or wrong
>>>>>> (the structure contains all zeros after kzalloc() call) data.
>>>>>>
>>>>>> The issue can be reproduced by binding the peci driver while the host is
>>>>>> fully booted and idle, this makes PECI interaction unreliable enough.
>>>>>>
>>>>>> Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
>>>>>> Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
>>>>>> Cc: stable at vger.kernel.org
>>>>>> Signed-off-by: Paul Fertser <fercerpav at gmail.com>
>>>>>
>>>>> Hi!
>>>>>
>>>>> Thank you for the patch.
>>>>> Did you have a chance to test it with OpenBMC dbus-sensors?
>>>>> In general, the change looks okay to me, but since it modifies the behavior
>>>>> (applications will need to handle this, and returning an error will happen more
>>>>> often) we need to confirm that it does not cause any regressions for userspace.
>>>>>
>>>>
>>>> I would also like to understand if the error is temporary or permanent.
>>>> If it is permanent, the attributes should not be created in the first
>>>> place. It does not make sense to have limit attributes which always report
>>>> -ENODATA.
>>>
>>> The error is temporary. The underlying reason is that when host CPUs
>>> go to deep enough idle sleep state (probably C6) they stop responding
>>> to PECI requests from BMC. Once something starts running the CPU
>>> leaves C6 and starts responding and all the temperature data
>>> (including the thresholds) becomes available again.
>>>
>>
>> Thanks.
>>
>> Next question: Is there evidence that the thresholds change while the CPU
>> is in a deep sleep state (or, in other words, that they are indeed stale) ?
>> Because if not it would be (much) better to only report -ENODATA if the
>> thresholds are uninitialized, and it would be even better than that if the
>> limits are read during initialization (and not updated at all) if they do
>> not change dynamically.
> 
>>From BMC point of view when getting a timeout there is little
> difference between the host not answering being in idle deep sleep
> state and between host being completely powered off. Now I can imagine
> a server system where BMC keeps running and the server has its DIMMs
> physically changed to a different model with different threshold.
> 
> Whether it's realistic scenario and whether it's worth caching the
> thresholds in the kernel I hope Iwona can clarify. In my current
> opinion the added complexity isn't worth it, the PECI operation needs
> to be reliable enough anyway for BMC to monitor at least the CPU
> temperatures once a second to feed this essential data to the cooling
> fans control loop. And if we can read CPU temperatures we can also
> read DIMM thresholds when we need them and worse case retry a few
> times while starting up the daemon.
> 

Makes sense.

Applied.

Thanks,
Guenter