[Skiboot] [PATCH 16/16] npu2-opencapi: Log a warning when resetting a broken device

christophe lombard clombard at linux.vnet.ibm.com
Tue Sep 17 23:55:30 AEST 2019


On 09/09/2019 14:31, Frederic Barrat wrote:
> On P9, the NPU doesn't support recovery if the link goes down
> unexpectedly; recovery was not fully verified. We mark the device as broken
> when we receive an error interrupt from the NPU. However, there's
> nothing to prevent the OS from trying to reset the device. It may or
> may not work, it's unsupported territory, so let's log a message to
> make it clear, as it could help when debugging. We haven't hit any
> cases where the reset goes badly enough that we'd want to prevent it,
> so let it go for now. We can revisit later if we have evidence that
> it's causing more problems than it is worth.
> 
> Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>
> ---
>   hw/npu2-opencapi.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
> index 46aeb6d3..f044fdbf 100644
> --- a/hw/npu2-opencapi.c
> +++ b/hw/npu2-opencapi.c
> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
>   			OCAPIINF(dev, "no card detected\n");
>   			return OPAL_SUCCESS;
>   		}
> +		if (dev->flags & NPU2_DEV_BROKEN) {
> +			OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
> +			dev->flags &= ~NPU2_DEV_BROKEN;

Clearing the "broken" flag here marks the device as available again 
before the reset has actually run. You could clear it only once freset 
completes without error.
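
A rough, standalone sketch of the idea, not a tested change to 
npu2_opencapi_freset(): struct fake_dev, fake_do_reset() and 
fake_freset() are hypothetical stand-ins, and the constant values are 
only there so the example compiles on its own.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the skiboot definitions (assumed values). */
#define NPU2_DEV_BROKEN	0x1u
#define OPAL_SUCCESS	0
#define OPAL_HARDWARE	(-6)

/* Hypothetical, minimal device state: only the flags word matters here. */
struct fake_dev {
	uint32_t flags;
};

/* Hypothetical stand-in for the reset and link training sequence. */
static int64_t fake_do_reset(int link_trains)
{
	return link_trains ? OPAL_SUCCESS : OPAL_HARDWARE;
}

static int64_t fake_freset(struct fake_dev *dev, int link_trains)
{
	int64_t rc;

	/* Still warn up front, as the patch does. */
	if (dev->flags & NPU2_DEV_BROKEN)
		fprintf(stderr, "Resetting a device which hit a previous error\n");

	rc = fake_do_reset(link_trains);

	/* Clear the flag only once the reset has completed without error. */
	if (rc == OPAL_SUCCESS)
		dev->flags &= ~NPU2_DEV_BROKEN;

	return rc;
}
```

With this ordering, a reset that fails leaves NPU2_DEV_BROKEN set, so 
the device is never reported as usable on the strength of an attempt 
alone.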

> +		}
>   		slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
>   		/* fall-through */
>   	case OCAPI_SLOT_FRESET_INIT:
> 


