[Skiboot] [PATCH 1/2] sensors: occ: Fix the GPU detection code
Frederic Barrat
fbarrat at linux.ibm.com
Sat Apr 25 02:29:57 AEST 2020
Le 24/04/2020 à 08:59, Gautham R. Shenoy a écrit :
> From: "Gautham R. Shenoy" <ego at linux.vnet.ibm.com>
>
> commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu
> systems") assumes that presence of "ibm,power9-npu" compatible node
> indicates the presence of GPUs. However this is incorrect, as even
> OpenCAPI is supported via NPU. Thus ZZ systems, which have OpenCAPI
> connectors but not GPUs will have "ibm,power9-npu" compatible nodes.
> This results in OPAL creating device-tree entries for the GPU sensors
> on ZZ systems which don't even have GPUs.
>
> This patch fixes the GPU detection code in occ-sensors, by first
> checking for "ibm,ioda2-npu2-phb" compatible node which indicates the
> presence of nvlink. Only if such a node exists, do we check with the
> OCC for presence of GPUs on systems to confirm the presence of the
> GPU. Otherwise, we cut the GPU sensors.
>
> Thanks to Frederic Barrat <fbarrat at linux.ibm.com> for suggesting
> "ibm,ioda2-npu2-phb" for detecting the presence of nvlink GPUs.
>
> Fixes: commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu
> systems")
> Reported-by: Pavaman Subramaniyam <pavsubra at in.ibm.com>
> Tested-by: Pavaman Subramaniyam <pavsubra at in.ibm.com>
> Signed-off-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
> ---
> hw/occ-sensor.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c
> index 524d00f..a5d0974 100644
> --- a/hw/occ-sensor.c
> +++ b/hw/occ-sensor.c
> @@ -521,8 +521,26 @@ bool occ_sensors_init(void)
> dt_add_property_cells(sg, "#address-cells", 1);
> dt_add_property_cells(sg, "#size-cells", 0);
>
> - if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu"))
> - has_gpu = true;
> + /*
> + * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a
> + * GPU NVlink.
> + */
> + if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) {
> +
> + for_each_chip(chip) {
> + int max_gpus_per_chip = 2, i;
> +
> + for(i = 0; i < max_gpus_per_chip; i++) {
> + has_gpu = occ_get_gpu_presence(chip, i);
> +
> + if (has_gpu)
> + break;
> + }
> +
> + if (has_gpu)
> + break;
> + }
> + }
So on platform other than witherspoon, we skip the full check.
On witherspoon, we go check the presence of each individual GPU slots.
So I think that works.
Out of curiosity, what's the behavior if we have GPUs but not on all
slots (it may not be an official configuration, I don't know)? Would we
see errors?
Reviewed-by: Frederic Barrat <fbarrat at linux.ibm.com>
Fred
>
> for_each_chip(chip) {
> struct occ_sensor_data_header *hb;
>
More information about the Skiboot
mailing list