[Skiboot] hwmon sensors cleanup

Stewart Smith stewart at linux.vnet.ibm.com
Thu Jan 18 11:03:15 AEDT 2018


Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
> On 01/16/2018 11:29 AM, Stewart Smith wrote:
>> Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
>>>>> Core 0:                      +37.0°C
>>>>> Core 4:                      +37.0°C
>>>>> Core 8:                      +36.0°C
>
>>>>> Chip 0 Core 8:               +35.0°C  (lowest = +34.0°C, highest = +45.0°C)
>>>>> Chip 0 Core 12:              +35.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>>>> Chip 0 Core 16:              +37.0°C  (lowest = +35.0°C, highest = +47.0°C)
>>>>> Chip 0 Core 20:              +34.0°C  (lowest = +33.0°C, highest = +45.0°C)
>
>>>>
>>>> So we have two sets of temperature sensors. One set is from measuring
>>>> the per-core DTS directly, and the other set is from the OCC measuring
>>>> the same per-core DTS. We should probably not be doubling up here.
>>>>
>>>
>>> Yup. Which one do we remove?
>>> OPAL DTS: Works even when OCC is down, but adds extra overhead while reading,
>>> and takes longer to read when deeper idle states are enabled.
>>>
>>> OCC DTS: Not functional when OCC is down.
>> 
>> We could be "Smart" about it and go the 'fast' (OCC) method unless the
>> OCC is down... I wouldn't have a problem with us doing that (assuming we
>> have a sane way to do that)
>> 
>> 
>
> Maybe we can decide to add the (scomed) DTS sensors after we have initialized
> OCC sensors?

That could work.
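
A rough sketch of that init-time decision, to make the idea concrete. All
names here (temp_source, core_temp_sensor, select_core_temp_sources) are made
up for illustration and are not existing skiboot code. It assumes we already
know, per core, whether the OCC exported a temperature sensor, and it only
covers the choice made at init. It does not handle the OCC going down later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a core temperature reading comes from (illustrative names). */
enum temp_source {
	TEMP_SRC_OCC,	/* OCC sensor data, the fast path */
	TEMP_SRC_DTS,	/* direct (SCOMed) DTS read, works without OCC */
};

struct core_temp_sensor {
	unsigned int core_id;
	enum temp_source source;
};

/*
 * Meant to run after the OCC sensors have been initialised.
 * occ_has_core[i] says whether the OCC exported a temperature sensor
 * for cores[i]. Cores the OCC covers use the OCC sensor; only the rest
 * get a DTS sensor, so no core is reported twice.
 *
 * Returns the number of cores that fell back to the DTS.
 */
size_t select_core_temp_sources(struct core_temp_sensor *cores,
				const bool *occ_has_core, size_t n_cores)
{
	size_t i, n_dts = 0;

	assert(cores && occ_has_core);

	for (i = 0; i < n_cores; i++) {
		if (occ_has_core[i]) {
			cores[i].source = TEMP_SRC_OCC;
		} else {
			cores[i].source = TEMP_SRC_DTS;
			n_dts++;
		}
	}

	return n_dts;
}
```

If the OCC is down at boot, every entry in occ_has_core would be false and
we'd end up with exactly the DTS sensors we have today.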

>>>>> Chip 8 DIMM 12 :             +35.0°C  (lowest = +35.0°C, highest = +35.0°C)
>>>>> Chip 8 DIMM 13 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 DIMM 14 :             +35.0°C  (lowest = +35.0°C, highest = +36.0°C)
>>>>> Chip 8 DIMM 15 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>
>>>> A lot of these are zero because the DIMM itself isn't populated. Witherspoon and
>>>> Zaius don't even have sockets for 16 DIMMs on each socket, so why even
>>>> report them here?
>>>
>>> Possibly there is a way to filter out configured DIMMs using DT information with
>>> the DIMM <id> provided by OCC, like we do for deconfigured cores. Or should we
>>> just filter them out if the sensor value is zero?
>> 
>> We should have properties in the DT for each DIMM due to VPD and all
>> that jazz, so we should be able to work this out on boot and remove
>> based on actual presence.
>> 
>> I don't think just checking for 0 is safe, as that may legitimately be
>> some kind of error condition (or somebody putting their computer in a
>> freezer).
>>
>
> Agree. I will need to check, though, whether the OCC-provided DIMM <id> in the
> sensor name matches the DIMM <id> in the device-tree.

If it doesn't, we should probably work out why and fix it. Otherwise it
just sounds like it's asking for trouble :)
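
Assuming the ids do line up, the filter could be as simple as the sketch
below. The dimm_loc struct and dimm_is_present() are hypothetical, and the
"present" table stands in for whatever we would build from the device-tree
at boot. The point is that presence decides, not the sensor reading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One populated DIMM, as discovered at boot (illustrative layout). */
struct dimm_loc {
	unsigned int chip_id;
	unsigned int dimm_id;
};

/*
 * Returns true if the (chip, DIMM) pair named by an OCC sensor is in
 * the table of populated DIMMs. A reading of 0 plays no part here, so
 * a genuinely cold or faulty DIMM still gets its sensor.
 */
bool dimm_is_present(const struct dimm_loc *present, size_t n_present,
		     unsigned int chip_id, unsigned int dimm_id)
{
	size_t i;

	assert(present || n_present == 0);

	for (i = 0; i < n_present; i++) {
		if (present[i].chip_id == chip_id &&
		    present[i].dimm_id == dimm_id)
			return true;
	}

	return false;
}
```

We'd only register a DIMM temperature sensor when this returns true.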

>>>>> Chip 0 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 0 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 0 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 0 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 0 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 0 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>> Chip 8 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>>
>>>> I think that these should probably be witherspoon specific. If we
>>>> have support
>> 
>> We should maybe only have these if GPUs are present, which we should
>> know on boot too.
>
> Yeah this is a better way to filter out the GPU sensors.
>
>> 
>>>>> APSS 11 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>>> APSS 12 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>>> APSS 13 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>>> APSS 14 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>>> APSS 15 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>>
>>>> Should these even be here? I don't believe pass2 romulus even has an
>>>> APSS so it shouldn't be appearing here. Really we should have more
>>>> useful names for each APSS channel.
>>>
>>> Ideally it would have been good if OCC had not added these sensors to main
>>> memory on boxes where APSS is not present. Corresponding component power
>>> sensors, like system power and memory power, which point to one of the
>>> APSS channels, will also be zero.
>>>
>>> So probably we should filter these based on 0 sensor values in OPAL.
>> 
>> Is there a more reliable way to filter these out? I really don't like
>> just filtering out based on a sensor reading of 0.
>> 
>
> Martha Broyles from OCC team did confirm that '0' valued power sensors can be
> considered as unavailable sensors. Either we use this method to filter or we
> need to check if there is any HDAT property that tells us that APSS is not
> available?
>
> Or filtering out APSS sensors for Zaius platform is a possible way but not a
> complete solution. Since OCC is not providing a clear way to identify
> deconfigured sensors we will have to hardcode this in OPAL using
> sensor names.

Ok. Maybe we could go with Martha's guidance here and check on zero
(probably add a comment in the code that "Martha said this was okay" so
we know that it's intentional rather than an accident).
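
Something along these lines, as a sketch only. The sensor_kind enum and
sensor_looks_available() are invented names, not existing skiboot code. It
applies the zero check to power sensors only, since zero is not a safe
"not present" signal for temperatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sensor classes, not the real OCC sensor types. */
enum sensor_kind {
	SENSOR_KIND_TEMP,
	SENSOR_KIND_POWER,
};

bool sensor_looks_available(enum sensor_kind kind, uint64_t sample)
{
	assert(kind == SENSOR_KIND_TEMP || kind == SENSOR_KIND_POWER);

	/*
	 * Martha Broyles (OCC team) confirmed that a power sensor
	 * reading 0 can be treated as unavailable, so filtering on
	 * zero here is intentional and not an accident.
	 */
	if (kind == SENSOR_KIND_POWER)
		return sample != 0;

	/*
	 * A temperature of 0 may be a real error condition (or a
	 * machine in a freezer), so never hide those on value alone.
	 */
	return true;
}
```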

-- 
Stewart Smith
OPAL Architect, IBM.


