[RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()

Fri Mar 3 15:35:05 AEDT 2017

On Fri, 2017-03-03 at 09:51 +0530, Vaibhav Jain wrote:
> Hi Russell,
> 
> Vaibhav Jain <vaibhav at linux.vnet.ibm.com> writes:
> 
> > This patch introduces function eeh_pe_reset_freeze_counter() which can
> > be used to reset the PE's freeze count variable outside eeh code. This
> > is useful for devices that can acquire a different personality after
> > a PERST event (e.g. FPGA adapters). Presently an existing freeze
> > count for an adapter with personality N will be taken into account
> > when the adapter acquires personality N+1.
> > 
> > By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
> > counter for an adapter once it has acquired a new personality and
> > ideally won't be plagued by failures similar to the ones before.
> > 
> > Signed-off-by: Vaibhav Jain <vaibhav at linux.vnet.ibm.com>
> > ---
> 
> Had a short chat discussion with Gavin Shan on this patchset and he
> prefers restoring the freeze_count on the eeh_pe once FRESET is done.
> He expects the flow to be similar to the one below:
> 
> 1. module caches the value of freeze_count and resets it
> 2. Issue warm reset
> 3. During eeh error-detected callback module restores the freeze_count
> from the cached value.
> 
> Russell, what do you think? 
> 
I thought about this but figured it didn't really make sense from a CAPI
perspective.  If you're flashing the device, it is going to behave differently
from before it was flashed, and it should be treated differently as a result
(so restoring the freeze_count doesn't make much sense).

Consider a case where there's a buggy FPGA image on an adapter that's failed 4
times in the past hour, and generally has frequent errors.  You decide to update
it to something that's less buggy, so you flash the adapter.  The freeze_count
gets cached and thus is restored to 4 after the flash.  Now even if the new
image is less buggy and may only fail once an hour instead of multiple times, if
it happens to fail within an hour of the earlier failures the device is now
fenced and you need to reboot.
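
To make that scenario concrete, here is a toy model of the two policies.  It is
only a sketch, not kernel code: struct toy_pe, the toy_* helpers and
TOY_MAX_FREEZES are all made up for illustration, the threshold of 5 is an
assumption picked to match the example above, and the one-hour window is not
modelled (every failure is assumed to land inside the same hour).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical threshold: the 5th freeze inside the window fences the
 * device. Chosen to match the scenario in this mail, not read from the
 * kernel sources.
 */
#define TOY_MAX_FREEZES 5

/* Hypothetical stand-in for the per-PE state discussed in this thread. */
struct toy_pe {
	int freeze_count;	/* freezes seen in the current window */
	bool fenced;		/* true once the device has been fenced */
};

/* Record one freeze and fence the PE once the threshold is reached. */
static void toy_pe_freeze(struct toy_pe *pe)
{
	assert(pe);
	pe->freeze_count++;
	if (pe->freeze_count >= TOY_MAX_FREEZES)
		pe->fenced = true;
}

/*
 * Flash a new image onto the adapter.
 * restore_count == true  : cache, reset, then restore the old count.
 * restore_count == false : reset the count and leave it at zero.
 */
static void toy_pe_flash(struct toy_pe *pe, bool restore_count)
{
	int cached = pe->freeze_count;	/* cache the current count */

	assert(pe);
	pe->freeze_count = 0;		/* reset before the warm reset */
	/* the warm reset / reflash itself would happen here */
	if (restore_count)
		pe->freeze_count = cached;	/* restore afterwards */
}

/*
 * Old image fails 4 times, the adapter is flashed, then the new image
 * fails once. Returns true if the device ends up fenced.
 */
static bool toy_scenario(bool restore_count)
{
	struct toy_pe pe = { .freeze_count = 0, .fenced = false };
	int i;

	for (i = 0; i < 4; i++)
		toy_pe_freeze(&pe);

	toy_pe_flash(&pe, restore_count);
	toy_pe_freeze(&pe);

	return pe.fenced;
}
```

In this model toy_scenario(true) ends with the device fenced after a single
failure of the new image, while toy_scenario(false) leaves it running with a
count of 1.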

I don't mind either way - I just don't get the logic of restoring the count.