[Skiboot] hwmon sensors cleanup
Shilpasri G Bhat
shilpa.bhat at linux.vnet.ibm.com
Thu Nov 30 00:43:05 AEDT 2017
Hi,
On 11/24/2017 11:21 AM, Oliver wrote:
> Hi all,
>
> Currently we provide a lot of sensors via the device-tree which are
> plugged into the kernel's hwmon interface via the powernv-sensor
> driver. Unfortunately a lot of these sensors are... not very useful.
> For example, take the output of `sensors` on a (pass 1) romulus:
>
>> Chip 0 VOLTDROOPCNTC04 0: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC05 4: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC06 8: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC07 12: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC12 16: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC13 20: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC14 24: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC15 28: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC16 32: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC17 36: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC18 40: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC19 44: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC20 48: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>> Chip 0 VOLTDROOPCNTC21 52: +0.00 V (lowest = +0.00 V, highest = +0.00 V)
>
> I don't think we should have a hwmon sensor for these. Maybe we can
> use that information to add a secondary attribute to something?
I agree.
>
>> Chip 0 Vdd Remote Sense: +6.51 V (lowest = +6.45 V, highest = +8.31 V)
>> Chip 0 Vdn Remote Sense: +9.01 V (lowest = +9.01 V, highest = +9.01 V)
>> Chip 8 Vdd Remote Sense: +8.04 V (lowest = +6.50 V, highest = +8.07 V)
>> Chip 8 Vdn Remote Sense: +9.01 V (lowest = +9.01 V, highest = +9.01 V)
> Are these the point of load voltages?
According to the OCC document these refer to "Vdn/Vdd Voltage at the remote
sense. (AVS reading adjusted for loadline)"
>
>> Chip 0 Vdd: +6.56 V (lowest = +6.56 V, highest = +8.36 V)
>> Chip 0 Vdn: +9.02 V (lowest = +9.02 V, highest = +9.02 V)
>> Chip 8 Vdd: +8.12 V (lowest = +6.67 V, highest = +8.12 V)
>> Chip 8 Vdn: +9.02 V (lowest = +9.02 V, highest = +9.02 V)
> The point of supply voltages?
"Processor Vdd/Vdn Voltage (read from AVSBus)"
>
> why are these duplicated?
>
>> Core 0: +37.0°C
>> Core 4: +37.0°C
>> Core 8: +36.0°C
>> Core 12: +37.0°C
>> Core 16: +38.0°C
>> Core 20: +37.0°C
>> Core 24: +37.0°C
>> Core 28: +37.0°C
>> Core 32: +37.0°C
>> Core 36: +37.0°C
>> Core 40: +36.0°C
>> Core 44: +36.0°C
>> Core 48: +37.0°C
>> Core 52: +37.0°C
>> Core 56: +35.0°C
>> Core 60: +35.0°C
>> Core 64: +35.0°C
>> Core 68: +37.0°C
>> Core 72: +37.0°C
>> Core 76: +39.0°C
>> Core 80: +35.0°C
>> Core 84: +34.0°C
>> Core 88: +36.0°C
>> Core 92: +36.0°C
>> Core 96: +37.0°C
>> Core 100: +38.0°C
>> Core 104: +36.0°C
>> Core 108: +37.0°C
>> Core 112: +37.0°C
>> Core 116: +35.0°C
>> Core 120: +36.0°C
>> Core 124: +37.0°C
>> Chip 0 Core 0: +36.0°C (lowest = +34.0°C, highest = +44.0°C)
>> Chip 0 Core 4: +37.0°C (lowest = +34.0°C, highest = +45.0°C)
>> Chip 0 Core 8: +35.0°C (lowest = +34.0°C, highest = +45.0°C)
>> Chip 0 Core 12: +35.0°C (lowest = +34.0°C, highest = +44.0°C)
>> Chip 0 Core 16: +37.0°C (lowest = +35.0°C, highest = +47.0°C)
>> Chip 0 Core 20: +34.0°C (lowest = +33.0°C, highest = +45.0°C)
>> Chip 0 Core 24: +36.0°C (lowest = +34.0°C, highest = +45.0°C)
>> Chip 0 Core 28: +36.0°C (lowest = +35.0°C, highest = +45.0°C)
>> Chip 0 Core 32: +37.0°C (lowest = +35.0°C, highest = +45.0°C)
>> Chip 0 Core 36: +35.0°C (lowest = +33.0°C, highest = +44.0°C)
>> Chip 0 Core 40: +35.0°C (lowest = +33.0°C, highest = +43.0°C)
>> Chip 0 Core 44: +36.0°C (lowest = +34.0°C, highest = +44.0°C)
>> Chip 0 Core 48: +36.0°C (lowest = +35.0°C, highest = +45.0°C)
>> Chip 0 Core 52: +35.0°C (lowest = +34.0°C, highest = +46.0°C)
>> Chip 0 Core 56: +33.0°C (lowest = +32.0°C, highest = +43.0°C)
>> Chip 0 Core 60: +34.0°C (lowest = +34.0°C, highest = +44.0°C)
>> Chip 8 Core 64: +35.0°C (lowest = +32.0°C, highest = +42.0°C)
>> Chip 8 Core 68: +35.0°C (lowest = +33.0°C, highest = +42.0°C)
>> Chip 8 Core 72: +36.0°C (lowest = +34.0°C, highest = +43.0°C)
>> Chip 8 Core 76: +37.0°C (lowest = +34.0°C, highest = +44.0°C)
>> Chip 8 Core 80: +35.0°C (lowest = +33.0°C, highest = +43.0°C)
>> Chip 8 Core 84: +34.0°C (lowest = +32.0°C, highest = +41.0°C)
>> Chip 8 Core 88: +35.0°C (lowest = +31.0°C, highest = +42.0°C)
>> Chip 8 Core 92: +35.0°C (lowest = +32.0°C, highest = +41.0°C)
>> Chip 8 Core 96: +36.0°C (lowest = +33.0°C, highest = +43.0°C)
>> Chip 8 Core 100: +37.0°C (lowest = +34.0°C, highest = +43.0°C)
>> Chip 8 Core 104: +36.0°C (lowest = +33.0°C, highest = +42.0°C)
>> Chip 8 Core 108: +35.0°C (lowest = +32.0°C, highest = +42.0°C)
>> Chip 8 Core 112: +34.0°C (lowest = +29.0°C, highest = +41.0°C)
>> Chip 8 Core 116: +34.0°C (lowest = +29.0°C, highest = +42.0°C)
>> Chip 8 Core 120: +36.0°C (lowest = +33.0°C, highest = +43.0°C)
>> Chip 8 Core 124: +35.0°C (lowest = +32.0°C, highest = +42.0°C)
>
> So we have two sets of temperature sensors. One set is from measuring
> the per-core DTS directly, and the other set is from the OCC measuring
> the same per-core DTS. We should probably not be doubling up here.
>
Yup. Which one do we remove?
OPAL DTS: Works even when the OCC is down, but adds extra overhead on each read,
and reads take longer when deeper idle states are enabled.
OCC DTS: Not functional when the OCC is down.
>> Chip 0 DIMM 0 : +38.0°C (lowest = +37.0°C, highest = +38.0°C)
>> Chip 0 DIMM 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 2 : +40.0°C (lowest = +39.0°C, highest = +40.0°C)
>> Chip 0 DIMM 3 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 4 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 5 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 6 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 7 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 8 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 9 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 10 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 11 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 12 : +40.0°C (lowest = +40.0°C, highest = +41.0°C)
>> Chip 0 DIMM 13 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 DIMM 14 : +40.0°C (lowest = +39.0°C, highest = +41.0°C)
>> Chip 0 DIMM 15 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 0 : +36.0°C (lowest = +35.0°C, highest = +36.0°C)
>> Chip 8 DIMM 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 2 : +36.0°C (lowest = +35.0°C, highest = +36.0°C)
>> Chip 8 DIMM 3 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 4 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 5 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 6 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 7 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 8 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 9 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 10 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 11 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 12 : +35.0°C (lowest = +35.0°C, highest = +35.0°C)
>> Chip 8 DIMM 13 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 DIMM 14 : +35.0°C (lowest = +35.0°C, highest = +36.0°C)
>> Chip 8 DIMM 15 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>
> A lot of these are zero because the DIMM itself isn't populated. Witherspoon and
> Zaius don't even have slots for 16 DIMMs on each socket, so why even
> report them here?
Possibly there is a way to filter out unpopulated DIMMs by matching the DIMM <id>
provided by the OCC against DT information, as we do for deconfigured cores. Or
should we just filter them out if the sensor value is zero?
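The two options can be sketched roughly as follows. This is only an illustration in Python, not skiboot code: the `DimmSensor` record, both helper names, and the `populated` set (standing in for whatever the device tree would tell us) are assumptions made up for the example. The readings are the first four Chip 0 DIMM lines from the output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmSensor:
    """One DIMM temperature sensor as exported by the OCC (hypothetical record)."""
    chip: int
    dimm: int
    current: float
    lowest: float
    highest: float


def filter_by_population(sensors, populated):
    """Option 1: keep only sensors whose (chip, dimm) pair is known to be populated."""
    return [s for s in sensors if (s.chip, s.dimm) in populated]


def filter_by_nonzero(sensors):
    """Option 2: keep only sensors that have reported a non-zero value at least once."""
    return [
        s for s in sensors
        if s.current != 0.0 or s.lowest != 0.0 or s.highest != 0.0
    ]


readings = [
    DimmSensor(0, 0, 38.0, 37.0, 38.0),
    DimmSensor(0, 1, 0.0, 0.0, 0.0),
    DimmSensor(0, 2, 40.0, 39.0, 40.0),
    DimmSensor(0, 3, 0.0, 0.0, 0.0),
]

# Hypothetical: the set of populated (chip, dimm) pairs derived from the DT.
populated = {(0, 0), (0, 2)}

by_dt = filter_by_population(readings, populated)
by_value = filter_by_nonzero(readings)
```

On this data both options keep DIMM 0 and DIMM 2. The zero-value check is simpler, but it would also hide a populated DIMM whose sensor is stuck at zero, which the DT-based check would not.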
>
>> Chip 0 Nest: +35.0°C (lowest = +34.0°C, highest = +40.0°C)
>> Chip 8 Nest: +37.0°C (lowest = +35.0°C, highest = +42.0°C)
>
> Could we report this as the chip overall temperature?
Yup.
>
>> Chip 0 GPU 0 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 0 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 0 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>> Chip 8 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
>
> I think that these should probably be witherspoon specific. If we have support
>
>> Chip 0 TEMPVDD: +40.0°C (lowest = +39.0°C, highest = +45.0°C)
>> Chip 8 TEMPVDD: +38.0°C (lowest = +37.0°C, highest = +42.0°C)
>
> What is this even? Temperature of the Vdd regulator? What about Vdn?
Yes, this is the VRM Vdd temperature.
>
>> Chip 0 Memory: 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> Chip 8 Memory: 0.00 W (lowest = 0.00 W, highest = 0.00 W)
> I'm not sure how you would even fill this out. Average DIMM temperature?
"Power consumption for Memory for this Processor read from APSS."
>
>> Chip 8 : 61.00 W (lowest = 47.00 W, highest = 126.00 W)
>> Chip 0 : 48.00 W (lowest = 45.00 W, highest = 129.00 W)
> Seems useful.
>
>> Chip 0 Vdd: 13.00 W (lowest = 10.00 W, highest = 93.00 W)
>> Chip 0 Vdn: 16.00 W (lowest = 16.00 W, highest = 18.00 W)
>> Chip 8 Vdd: 26.00 W (lowest = 12.00 W, highest = 90.00 W)
>> Chip 8 Vdn: 16.00 W (lowest = 16.00 W, highest = 19.00 W)
> This does make sense, but there isn't a whole lot of point to it.
>
>> Chip 0 GPU: 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> Chip 8 GPU: 0.00 W (lowest = 0.00 W, highest = 0.00 W)
> where did this one come from?
>
>> System: 0.00 W (lowest = 0.00 W, highest = 0.00 W)
> This should probably not be zero.
>
>> APSS 0 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 1 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 2 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 3 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 4 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 5 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 6 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 7 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 8 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 9 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 10 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 11 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 12 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 13 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 14 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>> APSS 15 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)
>
> Should these even be here? I don't believe pass2 romulus even has an
> APSS so it shouldn't be appearing here. Really we should have more
> useful names for each APSS channel.
Ideally the OCC would not add these sensors to main memory on boxes where the
APSS is not present. The corresponding component power sensors, such as system
power and memory power, each point to one of the APSS channels and will
therefore also be zero.
So we should probably filter these out in OPAL based on zero sensor values.
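As a rough illustration of the zero-value filter, here is a small Python sketch that works on `sensors`-style text rather than on the OCC sensor data OPAL actually reads. The regular expression and the `drop_all_zero` helper are assumptions for the example, and the sample lines are copied from the power readings quoted above.

```python
import re

# Matches power lines such as
#   "APSS 0 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)"
NUM = r"\+?(-?\d+(?:\.\d+)?)"
POWER_RE = re.compile(
    r"^(?P<label>.+?)\s*:\s*" + NUM + r"\s*W\s*"
    r"\(lowest\s*=\s*" + NUM + r"\s*W,\s*highest\s*=\s*" + NUM + r"\s*W\)$"
)


def drop_all_zero(lines):
    """Drop power sensors whose current, lowest and highest values are all zero."""
    kept = []
    for line in lines:
        m = POWER_RE.match(line.strip())
        if m is None:
            kept.append(line)  # not a power line, leave it untouched
            continue
        values = [float(m.group(i)) for i in (2, 3, 4)]
        if any(v != 0.0 for v in values):
            kept.append(line)
    return kept


sample = [
    "Chip 0 Memory: 0.00 W (lowest = 0.00 W, highest = 0.00 W)",
    "Chip 8 : 61.00 W (lowest = 47.00 W, highest = 126.00 W)",
    "Chip 0 Vdd: 13.00 W (lowest = 10.00 W, highest = 93.00 W)",
    "System: 0.00 W (lowest = 0.00 W, highest = 0.00 W)",
    "APSS 0 : 0.00 W (lowest = 0.00 W, highest = 0.00 W)",
]

kept = drop_all_zero(sample)
```

Here only the `Chip 8` and `Chip 0 Vdd` lines survive; the memory, system and APSS entries are dropped because they never reported anything but zero.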
>
>> Chip 0 Vdd: +2.05 A (lowest = +1.36 A, highest = +11.64 A)
>> Chip 0 Vdn: +1.85 A (lowest = +1.81 A, highest = +2.05 A)
>> Chip 8 Vdd: +3.00 A (lowest = +1.52 A, highest = +11.54 A)
>> Chip 8 Vdn: +1.89 A (lowest = +1.82 A, highest = +2.16 A)
>
> Maybe we should move towards a white-listing approach rather than just
> throwing in as much stuff as possible.
>
The sensor list populated by the OCC was originally a CORAL requirement for
profiling. But I do agree that we should not add sensors that are deconfigured
or not present.
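For discussion, a white-list combined with a presence check could look something like the Python sketch below. Everything in it is hypothetical: the patterns in `ALLOWED` are just examples picked from the labels in the output above, and the `present` flag stands in for whatever deconfiguration or population information OPAL has available.

```python
from fnmatch import fnmatchcase

# Hypothetical white-list of sensor labels worth exporting to hwmon.
ALLOWED = (
    "Chip * Core *",
    "Chip * Nest",
    "Chip * Vdd",
    "Chip * Vdn",
    "Chip * DIMM *",
)


def exportable(label, present):
    """Export a sensor only if its component is present and its label is white-listed."""
    if not present:
        return False
    return any(fnmatchcase(label, pattern) for pattern in ALLOWED)


candidates = [
    ("Chip 0 Nest", True),
    ("Chip 0 DIMM 0", True),
    ("Chip 0 DIMM 1", False),             # slot not populated
    ("Chip 0 VOLTDROOPCNTC04 0", True),   # present, but not white-listed
]

exported = [label for label, present in candidates if exportable(label, present)]
```

With these inputs only `Chip 0 Nest` and `Chip 0 DIMM 0` are exported.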
Thanks and Regards,
Shilpa
> Oliver
> _______________________________________________
> Skiboot mailing list
> Skiboot at lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
>