[Skiboot] [PATCH 1/2] sensors: occ: Fix the GPU detection code
Gautham R Shenoy
ego at linux.vnet.ibm.com
Mon Apr 27 23:53:49 AEST 2020
Hello Frederic,
On Fri, Apr 24, 2020 at 06:29:57PM +0200, Frederic Barrat wrote:
[..snip..]
> >---
> > hw/occ-sensor.c | 22 ++++++++++++++++++++--
> > 1 file changed, 20 insertions(+), 2 deletions(-)
> >
> >diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c
> >index 524d00f..a5d0974 100644
> >--- a/hw/occ-sensor.c
> >+++ b/hw/occ-sensor.c
> >@@ -521,8 +521,26 @@ bool occ_sensors_init(void)
> > 	dt_add_property_cells(sg, "#address-cells", 1);
> > 	dt_add_property_cells(sg, "#size-cells", 0);
> >-	if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu"))
> >-		has_gpu = true;
> >+	/*
> >+	 * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a
> >+	 * GPU NVlink.
> >+	 */
> >+	if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) {
> >+
> >+		for_each_chip(chip) {
> >+			int max_gpus_per_chip = 2, i;
> >+
> >+			for(i = 0; i < max_gpus_per_chip; i++) {
> >+				has_gpu = occ_get_gpu_presence(chip, i);
> >+
> >+				if (has_gpu)
> >+					break;
> >+			}
> >+
> >+			if (has_gpu)
> >+				break;
> >+		}
> >+	}
>
> So on platforms other than Witherspoon, we skip the full check.
> On Witherspoon, we check the presence of each individual GPU slot.
> So I think that works.
> Out of curiosity, what's the behavior if we have GPUs but not on all slots
> (it may not be an official configuration, I don't know)? Would we see
> errors?
No, there won't be errors. The empty GPU slots would simply report a
value of 0 for temperature and power. Example below:
On a Witherspoon with only two GPUs
=================================================================
Chip 0 GPU 0 : +33.0°C (lowest = +23.0°C, highest = +33.0°C)
Chip 0 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 0 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 0 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 0 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 0 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 8 GPU 0 : +29.0°C (lowest = +23.0°C, highest = +29.0°C)
Chip 8 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 8 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 8 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 8 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
Chip 8 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C)
=================================================================
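
For reference, the detection logic in the patch can be exercised on its
own with a small sketch like the one below. It is only an illustration
under stated assumptions: struct mock_chip, mock_gpu_presence() and
any_gpu_present() are hypothetical stand-ins (they are not skiboot
code), and the slot count is passed in as a parameter rather than taken
from the patch. The point it shows is that a single populated slot on
any chip is enough for detection, so a partially populated system is
still treated as having GPUs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a chip and its GPU slots (not a skiboot type). */
struct mock_chip {
	int id;
	bool gpu_present[3];
};

/* Hypothetical replacement for occ_get_gpu_presence() in this sketch. */
static bool mock_gpu_presence(const struct mock_chip *chip, int slot)
{
	return chip->gpu_present[slot];
}

/*
 * Return true as soon as any slot on any chip reports a GPU.
 * slots_per_chip is an assumption of this sketch and must not
 * exceed the size of gpu_present[].
 */
bool any_gpu_present(const struct mock_chip *chips, size_t nr_chips,
		     int slots_per_chip)
{
	for (size_t c = 0; c < nr_chips; c++) {
		for (int i = 0; i < slots_per_chip; i++) {
			if (mock_gpu_presence(&chips[c], i))
				return true;
		}
	}

	return false;
}
```

With two mock chips that each have only slot 0 populated, as in the
output above, any_gpu_present() returns true; with every slot empty it
returns false.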
>
> Reviewed-by: Frederic Barrat <fbarrat at linux.ibm.com>
Thanks!
>
> Fred
>
>
>
> > 	for_each_chip(chip) {
> > 		struct occ_sensor_data_header *hb;
> >
Thanks and Regards
gautham.