[Skiboot] [PATCH 16/16] npu2-opencapi: Log a warning when resetting a broken device
Frederic Barrat
fbarrat at linux.ibm.com
Wed Sep 18 01:29:36 AEST 2019
On 17/09/2019 at 15:55, christophe lombard wrote:
> On 09/09/2019 14:31, Frederic Barrat wrote:
>> On P9, the NPU doesn't support recovery if the link goes down
>> unexpectedly. It was not fully verified. We mark the device as broken
>> when we receive an error interrupt from the NPU. However, there's
>> nothing to prevent the OS from trying to reset the device. It may or
>> may not work; it's unsupported territory, so let's log a message to
>> make it clear, as it could help when debugging. We haven't hit any
>> cases where the reset goes badly enough that we'd want to prevent it,
>> so let it go for now. We can revisit later if we have evidence that
>> it's causing more problems than it is worth.
>>
>> Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>
>> ---
>> hw/npu2-opencapi.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
>> index 46aeb6d3..f044fdbf 100644
>> --- a/hw/npu2-opencapi.c
>> +++ b/hw/npu2-opencapi.c
>> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
>> OCAPIINF(dev, "no card detected\n");
>> return OPAL_SUCCESS;
>> }
>> + if (dev->flags & NPU2_DEV_BROKEN) {
>> + OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
>> + dev->flags &= ~NPU2_DEV_BROKEN;
>
> Removing the "broken" state means that the device is available. You
> could update the state only when freset exits without issue.
Good point, I don't think the state needs to be reset that early.
Fred
>> + }
>> slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
>> /* fall-through */
>> case OCAPI_SLOT_FRESET_INIT:
>>
>
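To illustrate the suggestion discussed above (clear the "broken" state only once the reset has completed without issue), here is a minimal standalone sketch. It is not the skiboot code: `struct fake_dev`, `DEV_BROKEN`, `fake_freset()` and the `link_trains` field are hypothetical stand-ins for the real device structure, `NPU2_DEV_BROKEN` and the freset state machine, and the whole reset sequence is collapsed into a single simulated success/failure check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for NPU2_DEV_BROKEN; not the skiboot definition. */
#define DEV_BROKEN 0x1u

/* Hypothetical, simplified device; the real structure lives in skiboot. */
struct fake_dev {
	uint32_t flags;
	bool link_trains;	/* simulates whether the reset sequence succeeds */
};

enum reset_result { RESET_OK = 0, RESET_FAILED = -1 };

/*
 * Simplified model of a reset: warn if the device was marked broken,
 * run the (simulated) reset, and clear the flag only on success.
 */
static int fake_freset(struct fake_dev *dev)
{
	assert(dev != NULL);

	if (dev->flags & DEV_BROKEN)
		fprintf(stderr, "Resetting a device which hit a previous error. "
			"Device recovery is not supported, so future behavior is undefined\n");

	/* The actual reset and link training steps would run here. */
	if (!dev->link_trains)
		return RESET_FAILED;	/* the device stays marked as broken */

	/* Reset completed without issue: the device is usable again. */
	dev->flags &= ~DEV_BROKEN;
	return RESET_OK;
}
```

With this ordering, a reset that fails partway leaves the flag set, so later code checking the flag still sees the device as unavailable.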