[PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure

Daniel Axtens dja at axtens.net
Thu Oct 1 09:53:06 AEST 2015


"Matthew R. Ochs" <mrochs at linux.vnet.ibm.com> writes:

>>> The process_sense() routine can perform a read capacity which
>>> can take some time to complete. If an EEH occurs while waiting
>>> on the read capacity, the EEH handler is unable to obtain the
>>> context's mutex in order to put the context in an error state.
>>> The EEH handler will sit and wait until the context is free,
>>> but this wait can last longer than the EEH handler tolerates,
>>> leading to a failed recovery.
>> 
>> I'm not quite clear on what you mean by the EEH handler timing
>> out. AFAIK there's nothing in eehd and the EEH core that times out if a
>> driver doesn't respond - indeed, it's pretty easy to hang eehd with a
>> misbehaving driver.
>> 
>> Are you referring to your own internal timeouts?
>> cxlflash_wait_for_pci_err_recovery and anything else that uses
>> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?
>
> Reading through this again I can see how this is misleading. This is
> actually similar and related to the deadlock scenario described in
> "Fix to avoid potential deadlock on EEH". Without this fix, you'd end
> up in a similar situation but deadlocked on the context mutex instead
> of the ioctl semaphore.

That makes _much_ more sense. If you could please revise the commit
message to explain that, you can include this in the next version:
Reviewed-by: Daniel Axtens <dja at axtens.net>

Regards,
Daniel
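
For readers following the thread, the blocking pattern described above can be
sketched in user space. This is only an analogy in Python, not the cxlflash
code: ctx_lock stands in for the context's mutex, time.sleep() for the read
capacity, and handler_can_lock() and all its parameters are hypothetical
names. The "drop the lock around the slow call" variant is one common way out
of this situation and is assumed here, not taken from the patch. The handler
uses a timeout only so the sketch terminates; a handler that blocks
unconditionally would simply wait.

```python
import threading
import time


def handler_can_lock(drop_lock_during_slow_op, slow_secs=0.3, handler_patience=0.1):
    """Return True if the error handler obtains the context lock in time."""
    ctx_lock = threading.Lock()      # stands in for the context's mutex
    in_slow_op = threading.Event()   # set once the worker reaches the slow call

    def worker():
        # Stands in for a sense-processing path that issues a slow command.
        ctx_lock.acquire()
        try:
            if drop_lock_during_slow_op:
                ctx_lock.release()   # let the handler in while we wait
            in_slow_op.set()
            time.sleep(slow_secs)    # stands in for the read capacity
            if drop_lock_during_slow_op:
                ctx_lock.acquire()   # retake before touching the context again
        finally:
            ctx_lock.release()

    t = threading.Thread(target=worker)
    t.start()
    in_slow_op.wait()

    # Stands in for the error handler trying to mark the context as failed.
    got = ctx_lock.acquire(timeout=handler_patience)
    if got:
        ctx_lock.release()
    t.join()
    return got
```

Calling handler_can_lock(False) models the lock being held across the slow
call, so the handler gives up; handler_can_lock(True) models releasing it
first, so the handler gets in straight away.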
