[Skiboot] hwmon sensors cleanup

Stewart Smith stewart at linux.vnet.ibm.com
Tue Jan 16 16:59:36 AEDT 2018


Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
>>> Core 0:                      +37.0°C
>>> Core 4:                      +37.0°C
>>> Core 8:                      +36.0°C
>>> Core 12:                     +37.0°C
>>> Core 16:                     +38.0°C
>>> Core 20:                     +37.0°C
>>> Core 24:                     +37.0°C
>>> Core 28:                     +37.0°C
>>> Core 32:                     +37.0°C
>>> Core 36:                     +37.0°C
>>> Core 40:                     +36.0°C
>>> Core 44:                     +36.0°C
>>> Core 48:                     +37.0°C
>>> Core 52:                     +37.0°C
>>> Core 56:                     +35.0°C
>>> Core 60:                     +35.0°C
>>> Core 64:                     +35.0°C
>>> Core 68:                     +37.0°C
>>> Core 72:                     +37.0°C
>>> Core 76:                     +39.0°C
>>> Core 80:                     +35.0°C
>>> Core 84:                     +34.0°C
>>> Core 88:                     +36.0°C
>>> Core 92:                     +36.0°C
>>> Core 96:                     +37.0°C
>>> Core 100:                    +38.0°C
>>> Core 104:                    +36.0°C
>>> Core 108:                    +37.0°C
>>> Core 112:                    +37.0°C
>>> Core 116:                    +35.0°C
>>> Core 120:                    +36.0°C
>>> Core 124:                    +37.0°C
>>> Chip 0 Core 0:               +36.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>> Chip 0 Core 4:               +37.0°C  (lowest = +34.0°C, highest = +45.0°C)
>>> Chip 0 Core 8:               +35.0°C  (lowest = +34.0°C, highest = +45.0°C)
>>> Chip 0 Core 12:              +35.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>> Chip 0 Core 16:              +37.0°C  (lowest = +35.0°C, highest = +47.0°C)
>>> Chip 0 Core 20:              +34.0°C  (lowest = +33.0°C, highest = +45.0°C)
>>> Chip 0 Core 24:              +36.0°C  (lowest = +34.0°C, highest = +45.0°C)
>>> Chip 0 Core 28:              +36.0°C  (lowest = +35.0°C, highest = +45.0°C)
>>> Chip 0 Core 32:              +37.0°C  (lowest = +35.0°C, highest = +45.0°C)
>>> Chip 0 Core 36:              +35.0°C  (lowest = +33.0°C, highest = +44.0°C)
>>> Chip 0 Core 40:              +35.0°C  (lowest = +33.0°C, highest = +43.0°C)
>>> Chip 0 Core 44:              +36.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>> Chip 0 Core 48:              +36.0°C  (lowest = +35.0°C, highest = +45.0°C)
>>> Chip 0 Core 52:              +35.0°C  (lowest = +34.0°C, highest = +46.0°C)
>>> Chip 0 Core 56:              +33.0°C  (lowest = +32.0°C, highest = +43.0°C)
>>> Chip 0 Core 60:              +34.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>> Chip 8 Core 64:              +35.0°C  (lowest = +32.0°C, highest = +42.0°C)
>>> Chip 8 Core 68:              +35.0°C  (lowest = +33.0°C, highest = +42.0°C)
>>> Chip 8 Core 72:              +36.0°C  (lowest = +34.0°C, highest = +43.0°C)
>>> Chip 8 Core 76:              +37.0°C  (lowest = +34.0°C, highest = +44.0°C)
>>> Chip 8 Core 80:              +35.0°C  (lowest = +33.0°C, highest = +43.0°C)
>>> Chip 8 Core 84:              +34.0°C  (lowest = +32.0°C, highest = +41.0°C)
>>> Chip 8 Core 88:              +35.0°C  (lowest = +31.0°C, highest = +42.0°C)
>>> Chip 8 Core 92:              +35.0°C  (lowest = +32.0°C, highest = +41.0°C)
>>> Chip 8 Core 96:              +36.0°C  (lowest = +33.0°C, highest = +43.0°C)
>>> Chip 8 Core 100:             +37.0°C  (lowest = +34.0°C, highest = +43.0°C)
>>> Chip 8 Core 104:             +36.0°C  (lowest = +33.0°C, highest = +42.0°C)
>>> Chip 8 Core 108:             +35.0°C  (lowest = +32.0°C, highest = +42.0°C)
>>> Chip 8 Core 112:             +34.0°C  (lowest = +29.0°C, highest = +41.0°C)
>>> Chip 8 Core 116:             +34.0°C  (lowest = +29.0°C, highest = +42.0°C)
>>> Chip 8 Core 120:             +36.0°C  (lowest = +33.0°C, highest = +43.0°C)
>>> Chip 8 Core 124:             +35.0°C  (lowest = +32.0°C, highest = +42.0°C)
>> 
>> So we have two sets of temperature sensors. One set is from measuring
>> the per-core DTS directly, and the other set is from the OCC measuring
>> the same per-core DTS. We should probably not be doubling up here.
>> 
>
> Yup. Which one do we remove?
> OPAL DTS: Works even when OCC is down, but adds extra overhead on each read,
> and with deeper idle states enabled it will take more time to read.
>
> OCC DTS: Not functional when OCC is down.

We could be smart about it and use the fast (OCC) method unless the OCC
is down, in which case we fall back to reading the DTS directly. I
wouldn't have a problem with us doing that (assuming we have a sane way
to tell that the OCC is down).
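
A minimal sketch of that fallback in C, to make the idea concrete. Every
name here (struct temp_backend, read_core_temp, the fake_* stand-ins and
the SENSOR_* codes) is hypothetical and made up for illustration; none of
it is existing skiboot API, and how we would actually detect "OCC is
down" is exactly the open question above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical return codes, for this sketch only. */
enum { SENSOR_OK = 0, SENSOR_ERR = -1 };

/* Hypothetical backend: one way of reading a per-core temperature. */
struct temp_backend {
	bool (*available)(void);                     /* e.g. "is the OCC up?" */
	int (*read)(uint32_t core, int32_t *temp_c); /* SENSOR_OK or SENSOR_ERR */
};

/*
 * Prefer the fast backend (OCC). Fall back to the slow one (direct DTS
 * read) when the fast one reports itself unavailable or its read fails.
 */
int read_core_temp(const struct temp_backend *fast,
		   const struct temp_backend *slow,
		   uint32_t core, int32_t *temp_c)
{
	if (fast->available() && fast->read(core, temp_c) == SENSOR_OK)
		return SENSOR_OK;
	return slow->read(core, temp_c);
}

/* Stand-in backends so the sketch can be exercised without hardware. */
bool fake_occ_up = true;

bool fake_occ_available(void) { return fake_occ_up; }
int fake_occ_read(uint32_t core, int32_t *temp_c)
{
	(void)core;
	*temp_c = 36; /* made-up reading */
	return SENSOR_OK;
}

bool fake_dts_available(void) { return true; }
int fake_dts_read(uint32_t core, int32_t *temp_c)
{
	(void)core;
	*temp_c = 37; /* made-up reading */
	return SENSOR_OK;
}

const struct temp_backend fake_occ = { fake_occ_available, fake_occ_read };
const struct temp_backend fake_dts = { fake_dts_available, fake_dts_read };
```

With something shaped like this we would export only one set of per-core
sensors, and the choice of source would be made per read rather than at
boot, so an OCC reset would not make the sensors disappear.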


>>> Chip 8 DIMM 12 :             +35.0°C  (lowest = +35.0°C, highest = +35.0°C)
>>> Chip 8 DIMM 13 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 DIMM 14 :             +35.0°C  (lowest = +35.0°C, highest = +36.0°C)
>>> Chip 8 DIMM 15 :              +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>> 
>> A lot of these are zero because the DIMM itself isn't populated. Witherspoon
>> and Zaius don't even have sockets for 16 DIMMs on each socket, so why even
>> report them here?
>
> Possibly there is a way to pick out the configured DIMMs using DT information
> with the DIMM <id> provided by OCC, like we do for deconfigured cores. Or
> should we just filter them out if the sensor value is zero?

We should have properties in the DT for each DIMM due to VPD and all
that jazz, so we should be able to work this out on boot and remove
based on actual presence.

I don't think just checking for 0 is safe, as that may legitimately be
some kind of error condition (or somebody putting their computer in a
freezer).
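
Roughly what I mean, as a hedged C sketch: decide which sensors to keep
from a presence record built at boot, and never look at the reading
itself. The present_mask bitmap below is an assumption standing in for
whatever we would derive from the DT/VPD, it covers a single chip for
simplicity, and struct dimm_sensor and filter_dimm_sensors are
hypothetical names, not existing code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sensor record, for this sketch only. */
struct dimm_sensor {
	uint32_t chip;
	uint32_t dimm;
	int32_t temp_c;
};

/* Assumption: bit N of present_mask is set when DIMM N is populated. */
static bool dimm_present(uint16_t present_mask, uint32_t dimm)
{
	return dimm < 16 && (present_mask & (1u << dimm)) != 0;
}

/*
 * Compact the array in place, keeping only sensors whose DIMM is
 * present. The reading is deliberately ignored, so a populated DIMM
 * that reports 0 is still kept. Returns the new element count.
 */
size_t filter_dimm_sensors(struct dimm_sensor *s, size_t n,
			   uint16_t present_mask)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (dimm_present(present_mask, s[i].dimm))
			s[kept++] = s[i];
	}
	return kept;
}
```

The point of doing it this way is that a 0 reading on a DIMM that really
is there still shows up, which is what you want if it indicates an error.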

>>> Chip 0 Nest:                 +35.0°C  (lowest = +34.0°C, highest = +40.0°C)
>>> Chip 8 Nest:                 +37.0°C  (lowest = +35.0°C, highest = +42.0°C)
>> 
>> Could we report this as the chip overall temperature?
>
> Yup.

Is that a fair statement on the sensor though? Is the position of the
nest sensor good enough for a whole chip temperature?

I'm genuinely asking, I have no idea if it would be valid to assume that.

>>> Chip 0 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 0 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 0 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 0 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 0 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 0 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 0 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 1 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 2 :                +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 0 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 1 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>>> Chip 8 GPU 2 MEM:             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
>> 
>> I think that these should probably be witherspoon specific. If we
>> have support

We should maybe only have these if GPUs are present, which we should
know on boot too.

>>> APSS 11 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>> APSS 12 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>> APSS 13 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>> APSS 14 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>>> APSS 15 :                     0.00 W  (lowest =   0.00 W, highest =   0.00 W)
>> 
>> Should these even be here? I don't believe pass2 romulus even has an
>> APSS so it shouldn't be appearing here. Really we should have more
>> useful names for each APSS channel.
>
> Ideally it would have been good if OCC had not added these sensors to main
> memory on boxes where APSS is not present. The corresponding component power
> sensors, such as system power and memory power, which point to one of the
> APSS channels, will also be zero.
>
> So probably we should filter these based on 0 sensor values in OPAL.

Is there a more reliable way to filter these out? I really don't like
just filtering out based on a sensor reading of 0.

-- 
Stewart Smith
OPAL Architect, IBM.


