[Skiboot] [PATCH 16/16] npu2-opencapi: Log a warning when resetting a broken device

Frederic Barrat fbarrat at linux.ibm.com
Wed Sep 18 01:29:36 AEST 2019



On 17/09/2019 at 15:55, christophe lombard wrote:
> On 09/09/2019 14:31, Frederic Barrat wrote:
>> On P9, the NPU doesn't support recovery if the link goes down
>> unexpectedly; recovery was never fully verified. We mark the device as
>> broken when we receive an error interrupt from the NPU. However, there's
>> nothing to prevent the OS from trying to reset the device. It may or
>> may not work, and it's unsupported territory, so let's log a message to
>> make that clear, as it could help when debugging. We haven't hit any
>> cases where the reset goes badly enough that we'd want to prevent it,
>> so let it go for now. We can revisit later if we have evidence that
>> it's causing more problems than it is worth.
>>
>> Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>
>> ---
>>   hw/npu2-opencapi.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
>> index 46aeb6d3..f044fdbf 100644
>> --- a/hw/npu2-opencapi.c
>> +++ b/hw/npu2-opencapi.c
>> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
>>               OCAPIINF(dev, "no card detected\n");
>>               return OPAL_SUCCESS;
>>           }
>> +        if (dev->flags & NPU2_DEV_BROKEN) {
>> +            OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
>> +            dev->flags &= ~NPU2_DEV_BROKEN;
> 
> Clearing the "broken" flag means that the device is considered available
> again. You could clear it only once freset completes without error.


Good point, I don't think the state needs to be reset that early.
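
As a rough illustration of the suggestion, here is a minimal standalone
sketch, not the actual follow-up patch. It assumes the reset can be modelled
as a single call that returns a status code, whereas the real freset is a
multi-step state machine. The names demo_dev, demo_freset, reset_ok,
reset_fail, DEMO_SUCCESS and DEMO_ERROR are hypothetical, and the value
given to NPU2_DEV_BROKEN is a placeholder. The warning is still logged up
front, but the flag is only cleared once the reset has returned success.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the skiboot definitions; values are placeholders. */
#define DEMO_SUCCESS     0
#define DEMO_ERROR       (-1)
#define NPU2_DEV_BROKEN  0x1u

/* Hypothetical, stripped-down device: only the flags field matters here. */
struct demo_dev {
	uint32_t flags;
};

/* Hypothetical reset step, returning DEMO_SUCCESS or an error code. */
typedef int64_t (*demo_reset_fn)(struct demo_dev *dev);

static int64_t demo_freset(struct demo_dev *dev, demo_reset_fn do_reset)
{
	int64_t rc;

	/* Warn early, but leave the broken flag set for now. */
	if (dev->flags & NPU2_DEV_BROKEN)
		fprintf(stderr, "Resetting a device which hit a previous error. "
			"Device recovery is not supported, so future behavior "
			"is undefined\n");

	rc = do_reset(dev);
	if (rc != DEMO_SUCCESS)
		return rc;	/* reset failed: device stays marked broken */

	/* Reset completed without error: only now clear the broken flag. */
	dev->flags &= ~NPU2_DEV_BROKEN;
	return DEMO_SUCCESS;
}

/* Two trivial reset steps used to exercise both paths. */
static int64_t reset_ok(struct demo_dev *dev)
{
	(void)dev;
	return DEMO_SUCCESS;
}

static int64_t reset_fail(struct demo_dev *dev)
{
	(void)dev;
	return DEMO_ERROR;
}
```

Calling demo_freset(&dev, reset_fail) on a device with the flag set leaves
the flag in place, while demo_freset(&dev, reset_ok) clears it.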

   Fred



>> +        }
>>           slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
>>           /* fall-through */
>>       case OCAPI_SLOT_FRESET_INIT:
>>
> 


