[Skiboot] hwmon sensors cleanup

Shilpasri G Bhat shilpa.bhat at linux.vnet.ibm.com
Wed Jan 17 15:59:43 AEDT 2018


Hi,

On 01/16/2018 11:29 AM, Stewart Smith wrote:
> Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
>>>> Core 0:                      +37.0°C
>>>> Core 4:                      +37.0°C
>>>> Core 8:                      +36.0°C

>>>> Chip 0 Core 8:               +35.0°C  (lowest = +34.0°C, highest = +45.0°C)
>>>> Chip 0 Core 12:              +35.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>>> Chip 0 Core 16:              +37.0°C  (lowest = +35.0°C, highest = +47.0°C)
>>>> Chip 0 Core 20:              +34.0°C  (lowest = +33.0°C, highest = +45.0°C)

>>>
>>> So we have two sets of temperature sensors. One set is from measuring
>>> the per-core DTS directly, and the other set is from the OCC measuring
>>> the same per-core DTS. We should probably not be doubling up here.
>>>
>>
>> Yup. Which one do we remove?
>> OPAL DTS: Works even when OCC is down, but adds extra overhead on each read,
>> and reads will take longer when deeper idle states are enabled.
>>
>> OCC DTS: Not functional when OCC is down.
> 
> We could be "Smart" about it and go the 'fast' (OCC) method unless the
> OCC is down... I wouldn't have a problem with us doing that (assuming we
> have a sane way to do that)
> 
> 

Maybe we can decide whether to add the (SCOM-read) DTS sensors only after we
have initialized the OCC sensors?
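
A rough sketch of that ordering, only to illustrate the idea: prefer the OCC
reading for a core and fall back to the direct DTS read when OCC did not
provide one. The enum, struct, and function names here are hypothetical and
are not existing skiboot code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which backend exports a core's temperature sensor. */
enum temp_source { SRC_NONE, SRC_OCC, SRC_DTS };

/* Hypothetical per-core state, filled in after OCC sensor init has run. */
struct core_temp_sensors {
	bool occ_available;	/* OCC exposed a temperature sensor for this core */
	bool dts_available;	/* direct (SCOM) DTS read is possible */
};

/* Prefer the cheaper OCC sensor; use the SCOM DTS read only as a fallback. */
enum temp_source pick_core_temp_source(const struct core_temp_sensors *s)
{
	if (s == NULL)
		return SRC_NONE;
	if (s->occ_available)
		return SRC_OCC;
	if (s->dts_available)
		return SRC_DTS;
	return SRC_NONE;
}
```

This only covers the choice made at init time. Switching sources at runtime
when OCC goes down later would need additional handling.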

>>>> Chip 8 DIMM 12 :             +35.0°C  (lowest = +35.0°C, highest = +35.0°C)
>>>> Chip 8 DIMM 13 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 DIMM 14 :             +35.0°C  (lowest = +35.0°C, highest = +36.0°C)
>>>> Chip 8 DIMM 15 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>
>>> A lot of these are zero because the DIMM itself isn't populated. Witherspoon and
>>> Zaius don't even have sockets for 16 DIMMS on each socket, so why even
>>> report them here?
>>
>> Possibly there is a way to keep only the configured DIMMs, using DT information
>> together with the DIMM <id> provided by OCC, like we do for deconfigured cores.
>> Or should we just filter them out if the sensor value is zero?
> 
> We should have properties in the DT for each DIMM due to VPD and all
> that jazz, so we should be able to work this out on boot and remove
> based on actual presence.
> 
> I don't think just checking for 0 is safe, as that may legitimately be
> some kind of error condition (or somebody putting their computer in a
> freezer).
>

Agreed. I will need to check, though, whether the OCC-provided DIMM <id> in the
sensor name matches the DIMM <id> in the device tree.
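
If the ids do match, the filtering could look roughly like the sketch below.
It assumes the set of present DIMMs has already been collected from the
device tree into a bitmask at boot. The struct, the function name, and the
bitmask representation are all hypothetical, not existing skiboot code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one OCC DIMM temperature sensor. */
struct dimm_sensor {
	unsigned int dimm_id;	/* DIMM <id> taken from the OCC sensor name */
	uint32_t value;		/* last reading, in degrees C */
};

/*
 * Keep only sensors whose DIMM is marked present in present_mask
 * (bit N set means DIMM N is present). The array is compacted in
 * place and the number of sensors kept is returned. The reading
 * itself is deliberately not consulted, so a present DIMM that
 * reports 0 is still exported.
 */
size_t filter_present_dimms(struct dimm_sensor *sensors, size_t n,
			    uint32_t present_mask)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int id = sensors[i].dimm_id;

		if (id < 32 && (present_mask & (UINT32_C(1) << id)))
			sensors[kept++] = sensors[i];
	}
	return kept;
}
```

Filtering on presence rather than on the value avoids hiding a sensor that
reads zero because of an error condition.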


>>>> Chip 0 Nest:                 +35.0°C  (lowest = +34.0°C, highest = +40.0°C)
>>>> Chip 8 Nest:                 +37.0°C  (lowest = +35.0°C, highest = +42.0°C)
>>>
>>> Could we report this as the chip overall temperature?
>>
>> Yup.
> 
> Is that a fair statement on the sensor though? Is the position of the
> nest sensor good enough for a whole chip temperature?
> 
> I'm genuinely asking, I have no idea if it would be valid to assume that.
> 
>>>> Chip 0 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 0 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 0 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 0 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 0 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 0 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>> Chip 8 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>>
>>> I think that these should probably be witherspoon specific. If we
>>> have support
> 
> We should maybe only have these if GPUs are present, which we should
> know on boot too.

Yeah, this is a better way to filter out the GPU sensors.

> 
>>>> APSS 11 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>> APSS 12 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>> APSS 13 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>> APSS 14 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>> APSS 15 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>>
>>> Should these even be here? I don't believe pass2 romulus even has an
>>> APSS so it shouldn't be appearing here. Really we should have more
>>> useful names for each APSS channel.
>>
>> Ideally OCC would not add these sensors to main memory on boxes where APSS
>> is not present. The corresponding component power sensors, such as system power
>> and memory power, each point to one of the APSS channels, so they will also be
>> zero.
>>
>> So probably we should filter these based on 0 sensor values in OPAL.
> 
> Is there a more reliable way to filter these out? I really don't like
> just filtering out based on sensor reading of 0.
> 

Martha Broyles from the OCC team did confirm that power sensors with a value of
'0' can be considered unavailable. Either we use this method to filter, or we
need to check whether there is any HDAT property that tells us that APSS is not
available.

Filtering out the APSS sensors on the Zaius platform is another possibility,
but it is not a complete solution. Since OCC does not provide a clear way to
identify deconfigured sensors, we would have to hardcode this in OPAL using
sensor names.
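
A minimal sketch of the name-based approach, assuming the sensor name starts
with "APSS" as in the labels quoted above and that the platform code can tell
us whether an APSS exists. Both function names and the platform_has_apss flag
are hypothetical, not existing OPAL interfaces.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical: true if the sensor name begins with the "APSS" prefix. */
static bool is_apss_sensor(const char *name)
{
	return strncmp(name, "APSS", 4) == 0;
}

/*
 * Hypothetical export filter: hide APSS channel sensors on platforms
 * that have no APSS. All other sensors are exported unchanged.
 */
bool should_export_sensor(const char *name, bool platform_has_apss)
{
	if (!platform_has_apss && is_apss_sensor(name))
		return false;
	return true;
}
```

This does not cover the component power sensors that are derived from APSS
channels, which is one reason the name-based filter is not a complete
solution.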

Thanks and Regards,
Shilpa


