[Skiboot] [PATCH 16/16] npu2-opencapi: Log a warning when resetting a broken device
christophe lombard
clombard at linux.vnet.ibm.com
Tue Sep 17 23:55:30 AEST 2019
On 09/09/2019 14:31, Frederic Barrat wrote:
> On P9, the NPU doesn't support recovery if the link goes down
> unexpectedly. It was not fully verified. We mark the device as broken
> when we receive an error interrupt from the NPU. However, there's
> nothing to prevent the OS from trying to reset the device. It may or
> may not work, it's unsupported territory, so let's log a message to
> make it clear, as it could help when debugging. We haven't hit any
> cases where the reset goes badly enough that we'd want to prevent it,
> so let it go for now. We can revisit later if we have evidence that
> it's causing more problems than it is worth.
>
> Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>
> ---
> hw/npu2-opencapi.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
> index 46aeb6d3..f044fdbf 100644
> --- a/hw/npu2-opencapi.c
> +++ b/hw/npu2-opencapi.c
> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
> OCAPIINF(dev, "no card detected\n");
> return OPAL_SUCCESS;
> }
> + if (dev->flags & NPU2_DEV_BROKEN) {
> + OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
> + dev->flags &= ~NPU2_DEV_BROKEN;
Clearing the "broken" flag here marks the device as available again
before the reset has actually run. You could clear it only once freset
completes without error.
> + }
> slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
> /* fall-through */
> case OCAPI_SLOT_FRESET_INIT:
>
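To make the suggestion above concrete, here is a minimal standalone sketch,
not a patch against skiboot. All names (sketch_dev, SKETCH_DEV_BROKEN,
sketch_do_reset, sketch_freset, the return codes) are hypothetical stand-ins
for illustration only. The idea is to log the warning up front but leave the
flag set until the reset has reported success.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the skiboot definitions. */
#define SKETCH_DEV_BROKEN 0x1u
#define SKETCH_SUCCESS    0
#define SKETCH_HARDWARE   (-1)

struct sketch_dev {
	uint32_t flags;
	int reset_should_fail;	/* knob to simulate a failing reset */
};

/* Placeholder for the actual reset sequence. */
static int sketch_do_reset(struct sketch_dev *dev)
{
	return dev->reset_should_fail ? SKETCH_HARDWARE : SKETCH_SUCCESS;
}

static int sketch_freset(struct sketch_dev *dev)
{
	int rc;

	/* Warn early, but do not touch the flag yet. */
	if (dev->flags & SKETCH_DEV_BROKEN)
		fprintf(stderr, "Resetting a device which hit a previous error. "
			"Device recovery is not supported\n");

	rc = sketch_do_reset(dev);
	if (rc != SKETCH_SUCCESS)
		return rc;	/* reset failed: device stays marked broken */

	/* Reset went through: only now mark the device usable again. */
	dev->flags &= ~SKETCH_DEV_BROKEN;
	return SKETCH_SUCCESS;
}
```

The real npu2_opencapi_freset() is a multi-step state machine (the hunk
above falls through to OCAPI_SLOT_FRESET_INIT), so the flag would presumably
be cleared in whichever step finally reports success rather than in a single
function as sketched here.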