[RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()

Gavin Shan gwshan at linux.vnet.ibm.com
Fri Mar 3 16:45:14 AEDT 2017


On Fri, Mar 03, 2017 at 03:35:05PM +1100, Russell Currey wrote:
>On Fri, 2017-03-03 at 09:51 +0530, Vaibhav Jain wrote:
>> Hi Russell,
>> 
>> Vaibhav Jain <vaibhav at linux.vnet.ibm.com> writes:
>> 
>> > This patch introduces function eeh_pe_reset_freeze_counter() which can
>> > be used to reset the PE's freeze count variable outside eeh code. This
>> > is useful for devices that can acquire a different personality after
>> > a PERST event (e.g. FPGA adapters). Presently an existing freeze
>> > count for an adapter with personality N will be taken into account
>> > when the adapter acquires personality N+1.
>> > 
>> > By calling eeh_pe_reset_freeze_counter(), drivers can reset the freeze
>> > counter for an adapter once it has acquired a new personality and
>> > ideally won't be plagued by failures similar to the ones before.
>> > 
>> > Signed-off-by: Vaibhav Jain <vaibhav at linux.vnet.ibm.com>
>> > ---
>> 
>> Had a short chat with Gavin Shan about this patchset and he
>> prefers restoring the freeze_count on the eeh_pe once FRESET is done.
>> He expects the flow to be similar to the one below:
>> 
>> 1. module caches the value of freeze_count and resets it
>> 2. Issue warm reset
>> 3. During eeh error-detected callback module restores the freeze_count
>> from the cached value.
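
The three steps above can be sketched as a small user-space model. This is
only an illustration under stated assumptions: struct pe_model, struct
module_model and the module_*() helpers are hypothetical names, not kernel
or driver APIs, and the warm reset is assumed to show up as exactly one EEH
freeze on the PE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct eeh_pe; only the
 * freeze_count field discussed in this thread is modelled. */
struct pe_model {
	int freeze_count;
};

/* Hypothetical per-module state that holds the cached count. */
struct module_model {
	struct pe_model *pe;
	int cached_freeze_count;
};

/* Step 1: cache the current freeze_count, then reset it. */
static void module_cache_and_reset(struct module_model *m)
{
	assert(m != NULL && m->pe != NULL);
	m->cached_freeze_count = m->pe->freeze_count;
	m->pe->freeze_count = 0;
}

/* Step 2: placeholder for the warm reset. Assumption: the reset is
 * seen as one EEH freeze, so the model bumps the count once. */
static void module_issue_warm_reset(struct module_model *m)
{
	assert(m != NULL && m->pe != NULL);
	m->pe->freeze_count++;
}

/* Step 3: in the error-detected callback, restore the cached value so
 * the freeze caused by the reset itself is not counted. */
static void module_error_detected(struct module_model *m)
{
	assert(m != NULL && m->pe != NULL);
	m->pe->freeze_count = m->cached_freeze_count;
}
```

In this model the count after step 3 equals the count before step 1, so
only the freeze triggered by the reset is discarded.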
>> 
>> Russell, what do you think? 
>> 
>I thought about this but figured it didn't really make sense from a CAPI
>perspective.  If you're flashing the device, it is going to behave differently
>from before it was flashed, and it should be treated differently as
>a result (and thus restoring the freeze_count doesn't make much sense).
>

Nothing has changed on the PHB. This patch clears the error count
of the PHB PE, not of the PE for the CAPI device. We shouldn't clear the
error count of the PHB PE. Otherwise, it's not consistent.

>Consider a case where there's a buggy FPGA image on an adapter that's failed 4
>times in the past hour, and generally has frequent errors.  You decide to update
>it to something that's less buggy, so you flash the adapter.  The freeze_count
>gets cached and thus is restored to 4 after the flash.  Now even if the new
>image is less buggy and may only fail once an hour instead of multiple times, if
>it happens to fail within an hour of the earlier failures the device is now
>fenced and you need to reboot.
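
The scenario above can be modelled with a small user-space sketch. The
names (struct pe_window_model, record_failure(), WINDOW_SECS) and the
caller-supplied limit are hypothetical and chosen for illustration; the
actual threshold and window handling in the EEH code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* "Within an hour", as in the scenario above. */
#define WINDOW_SECS 3600L

/* Hypothetical model of a PE's failure accounting. */
struct pe_window_model {
	int freeze_count;    /* failures counted in the current window */
	long first_failure;  /* time (seconds) the current window began */
};

/* Record one failure at time 'now'. Returns true when the count
 * reaches 'limit', i.e. when this model would fence the device.
 * 'limit' is an illustrative parameter, not the kernel's value. */
static bool record_failure(struct pe_window_model *pe, long now, int limit)
{
	assert(pe != NULL && limit > 0);

	/* Start a new window on the first failure or after expiry. */
	if (pe->freeze_count == 0 || now - pe->first_failure > WINDOW_SECS) {
		pe->freeze_count = 0;
		pe->first_failure = now;
	}

	pe->freeze_count++;
	return pe->freeze_count >= limit;
}
```

With a count restored to 4 and a limit of 5, one more failure inside the
window fences the device in this model; with the count cleared to 0, the
same failure does not.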
>
>I don't mind either way - I just don't get the logic of restoring the count.
>

I don't get your point. The FPGA image isn't the only source of EEH errors.
Also, it's not related to the PHB PE's error count, which is what the patch
clears.

Cheers,
Gavin


