[PATCH linux dev-4.13 2/4] fsi/occ: Add Retries on checksum errors

Wed May 23 00:06:34 AEST 2018

On 05/21/2018 07:36 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-05-21 at 13:58 -0500, Eddie James wrote:
>>>>     If it's the former then retrying independent of the OCC error
>>>> handling protocol is probably okay, but if we're trying to catch the
>>>> latter then maybe we should let this be handled as part of the OCC
>>>> error handling code?
>>>>
>>>> Eddie?
>> The checksum is part of the OCC response, so it's not a transport thing.
>> If we've gotten to checking the checksum then we've got a full response
>> that looks valid so far (reasonable length, etc).
>>
>> If we're trying to adhere to the OCC spec, then I'm of the opinion that
>> we shouldn't do any retries except for those handled in the occ-hwmon
>> driver.
> I don't see any in there ... what am I missing ?

Oh yeah, we moved the retries to userspace, sort of. Basically userspace
keeps polling every second, even when a poll fails, until the "error"
attribute gets set. That attribute is only set once we've tried and
failed to communicate a certain number of times
(OCC_ERROR_COUNT_THRESHOLD in the occ hwmon driver). Once userspace
sees the error attribute set, it has to act on it, e.g. by resetting
the OCC over IPMI.
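
To make that flow concrete, here is a rough sketch in Python. It is a
simulation, not the actual driver or userspace code. FakeOcc, monitor()
and reset_occ are hypothetical names, the threshold value of 2 is made up
for illustration, and I'm assuming a successful poll clears the failure
count.

```python
import time

# Assumed value for illustration only; the real threshold is defined
# in the occ hwmon driver.
OCC_ERROR_COUNT_THRESHOLD = 2


class FakeOcc:
    """Hypothetical stand-in for the driver-side error accounting."""

    def __init__(self, outcomes, threshold=OCC_ERROR_COUNT_THRESHOLD):
        self._outcomes = iter(outcomes)  # True = poll succeeded
        self._threshold = threshold
        self.error_count = 0
        self.error = 0  # models the "error" attribute

    def poll(self):
        ok = next(self._outcomes)
        if ok:
            # Assumption: a good response clears the failure count.
            self.error_count = 0
            return True
        self.error_count += 1
        if self.error_count >= self._threshold:
            self.error = 1
        return False


def monitor(occ, reset_occ, max_polls, interval=1.0, sleep=time.sleep):
    """Poll once per interval; a failed poll is not retried early.

    Returns the number of polls made. reset_occ is called once if the
    error attribute is seen set.
    """
    for polls in range(1, max_polls + 1):
        occ.poll()
        if occ.error:
            reset_occ()  # e.g. reset the OCC over IPMI
            return polls
        sleep(interval)
    return max_polls


if __name__ == "__main__":
    occ = FakeOcc([True, False, True, False, False, True])
    n = monitor(occ, lambda: print("resetting OCC"), max_polls=10,
                sleep=lambda s: None)  # no-op sleep keeps the demo fast
    print("polls before reset:", n)
```

The point is that a single failure (bad checksum or otherwise) is not
retried on the spot; it only counts toward the threshold, and recovery is
left to whoever is watching the error attribute.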

Thanks,
Eddie

>
> Cheers,
> Ben.
>