[PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure

Matthew R. Ochs mrochs at linux.vnet.ibm.com
Wed Sep 30 06:11:15 AEST 2015


> On Sep 28, 2015, at 8:25 PM, Daniel Axtens <dja at axtens.net> wrote:
> 
> "Matthew R. Ochs" <mrochs at linux.vnet.ibm.com> writes:
> 
> 
>> The process_sense() routine can perform a read capacity which
>> can take some time to complete. If an EEH occurs while waiting
>> on the read capacity, the EEH handler is unable to obtain the
>> context's mutex in order to put the context in an error state.
>> The EEH handler will sit and wait until the context is free,
>> but this wait can last longer than the EEH handler tolerates,
>> leading to a failed recovery.
> 
> I'm not quite clear on what you mean by the EEH handler timing
> out. AFAIK there's nothing in eehd and the EEH core that times out if a
> driver doesn't respond - indeed, it's pretty easy to hang eehd with a
> misbehaving driver.
> 
> Are you referring to your own internal timeouts?
> cxlflash_wait_for_pci_err_recovery and anything else that uses
> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?

Reading through this again, I can see how the description is misleading.
There is no timeout in the EEH core involved here. The problem is closely
related to the deadlock scenario described in "Fix to avoid potential
deadlock on EEH". Without this fix, you'd end up in a similar situation,
but deadlocked on the context mutex instead of the ioctl semaphore.
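
To illustrate the general pattern, here is a minimal userspace sketch
using pthreads. It is not the cxlflash code and not necessarily what the
patch does: struct ctx, slow_op(), mark_error() and run_demo() are
hypothetical names, and the sleep merely stands in for a slow command
such as a read capacity. The idea shown is that the slow path drops the
context lock before its long wait and re-checks the context state after
reacquiring it, so that an error handler can flag the context in the
meantime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a driver context; not the cxlflash struct. */
struct ctx {
	pthread_mutex_t lock;
	bool in_error;		/* set by the error handler */
	bool op_started;	/* set once the slow operation is under way */
};

static void sleep_ms(long ms)
{
	struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

	nanosleep(&ts, NULL);
}

/* Stand-in for a slow command issued while holding the context. */
static void *slow_op(void *arg)
{
	struct ctx *c = arg;
	bool failed;

	pthread_mutex_lock(&c->lock);
	c->op_started = true;
	/* Drop the lock before the long wait so a handler can get in. */
	pthread_mutex_unlock(&c->lock);

	sleep_ms(200);		/* simulated device latency */

	/* Reacquire and re-check: the state may have changed meanwhile. */
	pthread_mutex_lock(&c->lock);
	failed = c->in_error;
	pthread_mutex_unlock(&c->lock);

	return failed ? (void *)1 : NULL;
}

/* Stand-in for an error handler that needs the lock to flag the context. */
static bool mark_error(struct ctx *c, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (pthread_mutex_trylock(&c->lock) == 0) {
			c->in_error = true;
			pthread_mutex_unlock(&c->lock);
			return true;
		}
		sleep_ms(1);
	}
	return false;		/* lock was busy on every attempt */
}

/* Returns 0 if the handler flagged the context and slow_op noticed. */
int run_demo(void)
{
	struct ctx c = { .in_error = false, .op_started = false };
	pthread_t t;
	void *res = NULL;
	bool started = false;
	bool marked;
	int rc;

	rc = pthread_mutex_init(&c.lock, NULL);
	assert(rc == 0);
	if (pthread_create(&t, NULL, slow_op, &c) != 0)
		return -1;

	/* Wait until the slow operation has started and released the lock. */
	while (!started) {
		pthread_mutex_lock(&c.lock);
		started = c.op_started;
		pthread_mutex_unlock(&c.lock);
		if (!started)
			sleep_ms(1);
	}

	marked = mark_error(&c, 50);
	pthread_join(t, &res);
	pthread_mutex_destroy(&c.lock);

	return (marked && res != NULL) ? 0 : 1;
}
```

Calling run_demo() from a main() (build with -pthread) exercises both
paths. If slow_op() instead held the lock across the sleep, mark_error()
would be stuck retrying for the whole wait, which is the kind of stall
described above.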


