[PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes

Tue Sep 17 10:43:39 AEST 2019

On Tue, Sep 03, 2019 at 08:15:52PM +1000, Oliver O'Halloran wrote:
> When the last device in an eeh_pe is removed the eeh_pe structure itself
> (and any empty parents) are freed since they are no longer needed. This
> results in a crash when a hotplug driver is involved since the following
> may occur:
> 
> 1. Device is suprise removed.
> 2. Driver performs an MMIO, which fails and queues and eeh_event.
> 3. Hotplug driver receives a hotplug interrupt and removes any
>    pci_devs that were under the slot.
> 4. pci_dev is torn down and the eeh_pe is freed.
> 5. The EEH event handler thread processes the eeh_event and crashes
>    since the eeh_pe pointer in the eeh_event structure is no
>    longer valid.
> 
> Crashing is generally considered poor form. Instead of doing that use
> the fact PEs are marked as EEH_PE_INVALID to keep them around until the
> end of the recovery cycle, at which point we can safely prune any empty
> PEs.
> 
> Signed-off-by: Oliver O'Halloran <oohall at gmail.com>

I think this could use a bit more explanation about how the change
correct the problem, but if I understand it correctly, what it's doing
is (for the hotplug problem case):
- When the (problmatic) freeeze is detected during hot unplug, change
  eeh_dev_checK_failure() so that it marks PEs with EEH_PE_RECOVERING
  immediately rather than waiting until error recovery has started.
- Then as the hot unplug continues and the device is removed, and
  eeh_rmv_from_parent_pe() is called, change it so that it does not
  remove PEs that are marked EEH_PE_RECOVERING (which will be the ones
  just marked).
- Then as recovery starts, PE removal won't race between the hotplug
  code and the recovery code, and the crash is avoided.

If that's right, it looks fine to me, although as you said we might want
to revert it if/when we have something better.

I haven't got a test setup for the error case, but I've tried my usual
test cases and they all seem to be OK. The only place I can see that
might be disturbed by this change is where EEH_PE_RECOVERING affects
eeh_pe_reset_and_recover() (used when a PE is passed back from
a guest to the host), but the test case doesn't seem to be any worse.

Reviewed-by: Sam Bobroff <sbobroff at linux.ibm.com>
> ---
> Sam Bobroff is working on implementing proper refcounting for EEH PEs,
> but that's not going to be ready for a while and this is less broken
> than what we have now.
> ---
>  arch/powerpc/kernel/eeh_driver.c | 36 ++++++++++++++++++++++++++++++--
>  arch/powerpc/kernel/eeh_event.c  |  8 +++++++
>  arch/powerpc/kernel/eeh_pe.c     | 23 +++++++++++++++++++-
>  3 files changed, 64 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index a31cd32c4ce9..75266156943f 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -734,6 +734,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
>   */
>  #define MAX_WAIT_FOR_RECOVERY 300
>  
> +
> +/* Walks the PE tree after processing an event to remove any stale PEs.
> + *
> + * NB: This needs to be recursive to ensure the leaf PEs get removed
> + * before their parents do. Although this is possible to do recursively
> + * we don't since this is easier to read and we need to garantee
> + * the leaf nodes will be handled first.
> + */
> +static void eeh_pe_cleanup(struct eeh_pe *pe)
> +{
> +	struct eeh_pe *child_pe, *tmp;
> +
> +	list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child)
> +		eeh_pe_cleanup(child_pe);
> +
> +	if (pe->state & EEH_PE_KEEP)
> +		return;
> +
> +	if (!(pe->state & EEH_PE_INVALID))
> +		return;
> +
> +	if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) {
> +		list_del(&pe->child);
> +		kfree(pe);
> +	}
> +}
> +
>  /**
>   * eeh_handle_normal_event - Handle EEH events on a specific PE
>   * @pe: EEH PE - which should not be used after we return, as it may
> @@ -772,8 +799,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
>  
> -	eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
> -
>  	eeh_pe_update_time_stamp(pe);
>  	pe->freeze_count++;
>  	if (pe->freeze_count > eeh_max_freezes) {
> @@ -963,6 +988,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  			return;
>  		}
>  	}
> +
> +	/*
> +	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
> +	 * we don't want to modify the PE tree structure so we do it here.
> +	 */
> +	eeh_pe_cleanup(pe);
>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
>  }
>  
> @@ -1035,6 +1066,7 @@ void eeh_handle_special_event(void)
>  		 */
>  		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
>  		    rc == EEH_NEXT_ERR_FENCED_PHB) {
> +			eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
>  			eeh_handle_normal_event(pe);
>  		} else {
>  			pci_lock_rescan_remove();
> diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
> index 64cfbe41174b..e36653e5f76b 100644
> --- a/arch/powerpc/kernel/eeh_event.c
> +++ b/arch/powerpc/kernel/eeh_event.c
> @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
>  	}
>  	event->pe = pe;
>  
> +	/*
> +	 * Mark the PE as recovering before inserting it in the queue.
> +	 * This prevents the PE from being free()ed by a hotplug driver
> +	 * while the PE is sitting in the event queue.
> +	 */
> +	if (pe)
> +		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
> +
>  	/* We may or may not be called in an interrupt context */
>  	spin_lock_irqsave(&eeh_eventlist_lock, flags);
>  	list_add(&event->list, &eeh_eventlist);
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index 1a6254bcf056..177852e39a25 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -470,6 +470,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  {
>  	struct eeh_pe *pe, *parent, *child;
> +	bool keep, recover;
>  	int cnt;
>  
>  	pe = eeh_dev_to_pe(edev);
> @@ -490,10 +491,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  	 */
>  	while (1) {
>  		parent = pe->parent;
> +
> +		/* PHB PEs should never be removed */
>  		if (pe->type & EEH_PE_PHB)
>  			break;
>  
> -		if (!(pe->state & EEH_PE_KEEP)) {
> +		/*
> +		 * XXX: KEEP is set while resetting a PE. I don't think it's
> +		 * ever set without RECOVERING also being set. I could
> +		 * be wrong though so catch that with a WARN.
> +		 */
> +		keep = !!(pe->state & EEH_PE_KEEP);
> +		recover = !!(pe->state & EEH_PE_RECOVERING);
> +		WARN_ON(keep && !recover);
> +
> +		if (!keep && !recover) {
>  			if (list_empty(&pe->edevs) &&
>  			    list_empty(&pe->child_list)) {
>  				list_del(&pe->child);
> @@ -502,6 +514,15 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  				break;
>  			}
>  		} else {
> +			/*
> +			 * Mark the PE as invalid. At the end of the recovery
> +			 * process any invalid PEs will be garbage collected.
> +			 *
> +			 * We need to delay the free()ing of them since we can
> +			 * remove edev's while traversing the PE tree which
> +			 * might trigger the removal of a PE and we can't
> +			 * deal with that (yet).
> +			 */
>  			if (list_empty(&pe->edevs)) {
>  				cnt = 0;
>  				list_for_each_entry(child, &pe->child_list, child) {
> -- 
> 2.21.0
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20190917/bd53e380/attachment.sig>