[Skiboot] [PATCH 1/2] sensors: occ: Fix the GPU detection code

Gautham R Shenoy ego at linux.vnet.ibm.com
Mon Apr 27 23:53:49 AEST 2020


Hello Frederic,

On Fri, Apr 24, 2020 at 06:29:57PM +0200, Frederic Barrat wrote:

[..snip..]
> >---
> >  hw/occ-sensor.c | 22 ++++++++++++++++++++--
> >  1 file changed, 20 insertions(+), 2 deletions(-)
> >
> >diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c
> >index 524d00f..a5d0974 100644
> >--- a/hw/occ-sensor.c
> >+++ b/hw/occ-sensor.c
> >@@ -521,8 +521,26 @@ bool occ_sensors_init(void)
> >  	dt_add_property_cells(sg, "#address-cells", 1);
> >  	dt_add_property_cells(sg, "#size-cells", 0);
> >-	if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu"))
> >-		has_gpu = true;
> >+	/*
> >+	 * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a
> >+	 * GPU NVlink.
> >+	 */
> >+	if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) {
> >+
> >+		for_each_chip(chip) {
> >+			int max_gpus_per_chip = 2, i;
> >+
> >+			for(i = 0; i < max_gpus_per_chip; i++) {
> >+				has_gpu = occ_get_gpu_presence(chip, i);
> >+
> >+				if (has_gpu)
> >+					break;
> >+			}
> >+
> >+			if (has_gpu)
> >+				break;
> >+		}
> >+	}
> 
> So on platforms other than Witherspoon, we skip the full check.
> On Witherspoon, we check the presence of each individual GPU slot.
> So I think that works.
> Out of curiosity, what's the behavior if we have GPUs but not on all slots
> (it may not be an official configuration, I don't know)? Would we see
> errors?

No, there won't be errors. The empty GPU slots would simply report a
value of 0 for temperature and power. Example below:

On a Witherspoon with only two GPUs:
=================================================================
Chip 0 GPU 0 :            +33.0°C  (lowest = +23.0°C, highest = +33.0°C)
Chip 0 GPU 1 :             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 GPU 2 :             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 GPU 0 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 GPU 1 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 GPU 2 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 8 GPU 0 :            +29.0°C  (lowest = +23.0°C, highest = +29.0°C)
Chip 8 GPU 1 :             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 8 GPU 2 :             +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 8 GPU 0 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 8 GPU 1 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 8 GPU 2 MEM:          +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
=================================================================
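
For illustration, the presence scan in the patch boils down to a nested
loop that stops at the first populated slot, so a partially populated
system like the one above still sets has_gpu. Here is a minimal
standalone sketch of that idea. It is not the skiboot code: struct
mock_chip, mock_gpu_presence() and any_gpu_present() are hypothetical
stand-ins (the first two for the chip iteration and
occ_get_gpu_presence()), and the slot count of 3 is an assumption taken
from the listing above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: three slots per chip, as in the sensor listing above. */
#define MAX_GPUS_PER_CHIP 3

/* Hypothetical stand-in for a chip and its GPU slot population. */
struct mock_chip {
	int id;
	bool gpu_present[MAX_GPUS_PER_CHIP];
};

/* Hypothetical stand-in for occ_get_gpu_presence(chip, gpu). */
static bool mock_gpu_presence(const struct mock_chip *chip, int gpu)
{
	return chip->gpu_present[gpu];
}

/*
 * Return true as soon as any slot on any chip reports a GPU.
 * Empty slots do not cause an error, they are simply skipped.
 */
static bool any_gpu_present(const struct mock_chip *chips, size_t nr_chips)
{
	size_t c;
	int i;

	for (c = 0; c < nr_chips; c++) {
		for (i = 0; i < MAX_GPUS_PER_CHIP; i++) {
			if (mock_gpu_presence(&chips[c], i))
				return true;
		}
	}

	return false;
}
```

With two mock chips (ids 0 and 8) that each have only slot 0 populated,
any_gpu_present() returns true after checking the very first slot. With
every slot empty it returns false.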
> 
> Reviewed-by: Frederic Barrat <fbarrat at linux.ibm.com>

Thanks!


> 
>   Fred
> 
> 
> 
> >  	for_each_chip(chip) {
> >  		struct occ_sensor_data_header *hb;
> >

Thanks and Regards
gautham.

